From 98e72c1bc36c559318f781c3da7010c9f0da390a Mon Sep 17 00:00:00 2001 From: Martin Mares Date: Mon, 15 Jan 2024 22:08:34 +0100 Subject: [PATCH] English notes on Ukkonen's algorithm --- 10-suffix/ukkonen-en.tex | 180 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 180 insertions(+) create mode 100644 10-suffix/ukkonen-en.tex diff --git a/10-suffix/ukkonen-en.tex b/10-suffix/ukkonen-en.tex new file mode 100644 index 0000000..ef4de45 --- /dev/null +++ b/10-suffix/ukkonen-en.tex @@ -0,0 +1,180 @@ +\input ../sgr.tex + +\ehyph +\def\proof{\noindent {\sl Proof:} } + +\prednaska{10b}{Ukkonen's Suffix Tree Algorithm}{} + +\figure{baraba.epdf}{Suffixes of the word \uv{baraba}: trie, suffix tree, ST with a~dollar}{} + +\s{Observation:} Consider a~suffix tree for a~word~$\sigma$. Nodes of the tree (including hidden nodes) +correspond to prefixes of suffixes of~$\sigma$, that is the subwords of~$\sigma$. Leaves of the tree +are suffixes, which occur nowhere else in~$\sigma$ --- these are called the {\I non-nested} suffixes. +Internal nodes correspond to {\I branching subwords} --- subwords $\alpha\subseteq\sigma$ such that $\alpha a\subset\sigma$ +and $\alpha b\subseteq\sigma$ for two different symbols $a$ and~$b$. + +\h{Ukkonen's algorithm} + +\>Ukkonen described an algorithm \cite{ukkonen95line} for constructing the suffix tree without +the terminating dollar. It is an incremental algorithm. It starts with a~tree for an empty word +and it gradually appends new characters to the word, while updating the suffix tree. Each character +is added in amortized constant time. Therefore, it constructs the ST for a~word~$\sigma$ in time +$\O(\vert\sigma\vert)$. + +We will assume that edges to children of a~single node can be indexed by their initial +symbols --- this certainly holds if the alphabet is fixed and small. In case it is not, we can +use hash tables instead of arrays. + +\s{Observation:} When we extend a~word~$\sigma$ to~$\sigma a$, ST changes as follows: + +\numlist\ndotted +\:All existing nodes of the tree (including hidden ones) correspond to subwords of~$\sigma$. + These are also subwords of~$\sigma a$, so they occur as nodes of the new tree, too. +\:If $\beta$ was a~branching subword, it stays branching --- so internal nodes stay such. +\:Each new suffix~$\beta a$ is obtained by extending some original suffix~$\beta$. Here: + \itemize\ibull + \:If $\beta$ was non-nested (so a~leaf), $\beta a$~will be also non-nested. + So leaves stay leaves, but the labels of their edges must be extended by~$a$. + To make this efficient, we introduce {\I open edges,} whose labels index~$\sigma$ + from a~given position to the end. So leaves will take care of themselves. + \:If $\beta$ was a~nested suffix (i.e., an internal or hidden node), then: + \itemize\ibull + \:Either $\beta a$ is present in~$\sigma$. Then it is a~nested suffix of the new word + and the tree needs no change; + \:or $\beta a$ does not occur in~$\sigma$. Then we have to create a~new leaf + with an~open edge and possibly a~new internal node, under which the new leave will be connected. + \endlist + \endlist +\endlist + +This describes how the tree changes when a~new symbol is appended to~$\sigma$. +It remains to show how to do it efficiently. + +\s{Nested suffixes:} +First of all, we need to recognize which suffixes are nested and which are not. +It helps that nested suffixes form an interval: + +{\narrower +\s{Lemma:} If $\alpha$ is a~nested suffix of a~word~$\sigma$ and $\beta$ is a~suffix of~$\alpha$, +then $\beta$ is also a~nested suffix of~$\sigma$. + +\proof +The word $\sigma$ contains both $\alpha x$ and $\alpha y$ for some distinct symbols $x$ and~$y$. +Since $\alpha$ ends with~$\beta$, there must be both $\beta x$ and $\beta y$ somewhere in~$\sigma$, +so~$\beta$ is nested. +\qed + +} + +\>It therefore suffices to maintain the {\I longest nested suffix} of the word~$\sigma$. +We will all it the {\I active suffix} and denote it by $\alpha(\sigma)$. Each suffix $\beta\subseteq\sigma$ +is nested if and only if $\vert\beta\vert \le \vert\alpha(\sigma)\vert$. + +The active suffix demarcates a~boundary between non-nested and nested suffixes. +How does this boundary move when we extend $\sigma$ to $\sigma a$? +The answer is simple: + +{\narrower +\s{Lemma:} For every $\sigma$ and~$a$: $\alpha(\sigma a)$ is a~suffix of $\alpha(\sigma)a.$ + +\proof +Both $\alpha(\sigma a)$ and $\alpha(\sigma)a$ are suffixes of~$\sigma a$, so it suffices to compare +their lengths. The word $\beta := \hbox{\uv{$\alpha(\sigma a)$ without the final~$a$}}$ is a~nested suffix +of~$\sigma$, so $\vert\beta\vert \le \vert\alpha(\sigma)\vert$, which implies $\vert\alpha(\sigma a)\vert = \vert\beta a\vert \le \vert\alpha(\sigma)a\vert$. +\qed + +} + +\>In other words, the boundary can only move to the right or stay as it was. +This leads to the following idea. + +\s{Sketch of algorithm:} We maintain $\alpha=\alpha(\sigma)$ and when appending +a~new character~$a$, we check if $\alpha a$ stays nested. If it does, nothing changes. +If it doesn't, we add a~new leaf and possibly also a~new internal node; then we remove +the first character of~$\alpha$ and continue checking. + +\s{Observation:} Appending a~character to~$\sigma$ causes an~amortized constant number +of tree modifications. This holds because every modification shortens~$\alpha$, but $\alpha$ +grows only by a~single character per append to~$\sigma$. It remains to show how to +perform each modification in (amortized) constant time. In order to accomplish that, +we need a~suitable representation of the word~$\alpha$, which supports efficient +appends, cuts of the first character, and tests of existence of the corresponding node +in the tree. + +\h{Reference pairs} + +\s{Definition:} A~{\I reference pair} for a~word $\alpha\subseteq\sigma$ is a~pair +$(\pi,\tau)$, where $\pi$ is a~tree node, $\tau$ an arbitrary word and $\pi\tau=\alpha$. +Furthermore, we know that $\tau\subseteq\sigma$, so~$\tau$ can be represented by a~pair +of indices in~$\sigma$. + +A~reference pair is {\I canonical,} if there is no edge from the node~$\pi$ with a~label, +which is a~prefix of~$\tau$. (Please note that such an edge can be found by just comparing +the first character of its label with $\tau[0]$ and checking if the length of the label +is at most~$\vert\tau\vert$. It is not necessary to check if the rest of the label matches~$\tau[1:{}]$.) + +\s{Observation:} Each word $\alpha\subseteq\sigma$ has exactly one canonical reference pair +representing it. (Among all reference pairs representing~$\alpha$, it is the one with +the deepest possible node~$\pi$.) + +\s{Definition:} The {\I back edge} $\(\pi)$ leads from an internal node~$\pi$ to +an internal node, which represents $\pi[1:{}]$. (Let us observe that such a~node must exist: +every suffix of a~branching subword is also branching.) + +\s{Operations:} +We are going to represent the active suffix~$\alpha$ by a~reference pair $(\pi,\tau)$. We need +to perform the following operations on the pair: + +\itemize\ibull +\:{\I Append a~symbol~$a$:} We append~$a$ to~$\tau$. We obtain a~reference pair + for~$\alpha a$, but it is not necessarily canonical. Before appending, we can easily check + if $\alpha$ is present in the tree. +\:{\I Removing the first character:} If $\pi$ is not the tree root, we replace~$\pi$ + by $\(\pi)$ and keep~$\tau$. Otherwise $\pi$ is the empty string, so we remove + the first character from~$\tau$ (we just increment the corresponding start index). +\:{\I Canonicalize:} The previous two operations can produce a~non-canonical pair. + So we check if the pair is canonical: for $\tau\ne\varepsilon$, we test if there is an + edge from~$\pi$ indexed by~$\tau[0]$ is short enough to be a~prefix of~$\tau$. + If it is, we follow this edge, which makes $\pi$ longer and $\tau$ shorter; then we + repeat the test. Let us see that this takes amortized constant time, because every + step of canonicalization shortens~$\tau$, but $\tau$ grows only by a~single character + per append operation. +\endlist + +\>Now, all building blocks are ready and we can formulate the complete algorithm. + +\h{The full algorithm} + +\algo +\:{\I Input:} $\alpha=\alpha(\sigma)$ represented by a~canonical reference pair $(\pi,\tau)$, + the suffix tree $T$ for~$\sigma$, back edges \, and a~new symbol~$a$. +\:We check if $\alpha a$ is present in the tree, and create it if needed: +\::If $\tau=\varepsilon$: ($\alpha=\pi$ is an internal node) +\:::If there is an edge from the node~$\pi$ whose label starts with~$a$, then $\alpha a$ is present. +\:::If there is no such edge, it isn't present, so we create a~new open edge going from~$\pi$ + to a~new leaf. +\::Otherwise $\tau\ne\varepsilon$: ($\alpha$ is a~hidden node) +\:::We find an edge from~$\pi$ whose label starts with~$\tau$ (by checking $\tau[0]$). +\:::If $\tau$ in the label is followed by~$a$, then $\alpha a$ is present. +\:::Otherwise, it isn't present, so we subdivide the edge by creating a~new node + such that the edge from~$\pi$ to the new node is labeled by~$\tau$. The new node + will have an~open edge to a~new child. +\:If $\alpha a$ was not present, we remove the first character of~$\alpha$ and restart from step~2. +\:Now, we know that $\alpha a$ is present, so we update the reference pair to represent $\alpha a$. +\:We recalculate back edges (see below). +\:{\I Output:} $\alpha=\alpha(\sigma a)$ as a~canonical reference pair $(\pi,\tau)$, + the suffix tree $T$ for~$\sigma a$, back edges \. +\endalgo + +\s{Back edges:} +It remains to show how to make back edges of new internal nodes. +We can observe that if we create a~node, this node corresponds to the current~$\alpha$ +and its back edge goes to $\alpha[1:{}]$ --- this is the node that we create (or discover +that it already exists) in the next iteration of the main loop. +During the next iteration, this back edge will not be needed yet, because $\vert\pi\vert$ +is always decreasing. +Therefore we can delay creation of back edges by one iteration. +This way, we create back edges in constant time per node. + +\references +\bye -- 2.39.2