English notes on Ukkonen's algorithm

author Martin Mares <mj@ucw.cz>

Mon, 15 Jan 2024 21:08:34 +0000 (22:08 +0100)

committer Martin Mares <mj@ucw.cz>

Mon, 15 Jan 2024 21:08:34 +0000 (22:08 +0100)
author Martin Mares <mj@ucw.cz>
Mon, 15 Jan 2024 21:08:34 +0000 (22:08 +0100)
committer Martin Mares <mj@ucw.cz>
Mon, 15 Jan 2024 21:08:34 +0000 (22:08 +0100)
diff --git a/10-suffix/ukkonen-en.tex b/10-suffix/ukkonen-en.tex

new file mode 100644 (file)

index 0000000..ef4de45
--- /dev/null
+++ b/10-suffix/ukkonen-en.tex
@@ -0,0 +1,180 @@
+\input ../sgr.tex
+
+\ehyph
+\def\proof{\noindent {\sl Proof:} }
+
+\prednaska{10b}{Ukkonen's Suffix Tree Algorithm}{}
+
+\figure{baraba.epdf}{Suffixes of the word \uv{baraba}: trie, suffix tree, ST with a~dollar}{}
+
+\s{Observation:} Consider a~suffix tree for a~word~$\sigma$. Nodes of the tree (including hidden nodes)
+correspond to prefixes of suffixes of~$\sigma$, that is the subwords of~$\sigma$. Leaves of the tree
+are suffixes, which occur nowhere else in~$\sigma$ --- these are called the {\I non-nested} suffixes.
+Internal nodes correspond to {\I branching subwords} --- subwords $\alpha\subseteq\sigma$ such that $\alpha a\subset\sigma$
+and $\alpha b\subseteq\sigma$ for two different symbols $a$ and~$b$.
+
+\h{Ukkonen's algorithm}
+
+\>Ukkonen described an algorithm \cite{ukkonen95line} for constructing the suffix tree without
+the terminating dollar. It is an incremental algorithm. It starts with a~tree for an empty word
+and it gradually appends new characters to the word, while updating the suffix tree. Each character
+is added in amortized constant time. Therefore, it constructs the ST for a~word~$\sigma$ in time
+$\O(\vert\sigma\vert)$.
+
+We will assume that edges to children of a~single node can be indexed by their initial
+symbols --- this certainly holds if the alphabet is fixed and small. In case it is not, we can
+use hash tables instead of arrays.
+
+\s{Observation:} When we extend a~word~$\sigma$ to~$\sigma a$, ST changes as follows:
+
+\numlist\ndotted
+\:All existing nodes of the tree (including hidden ones) correspond to subwords of~$\sigma$.
+  These are also subwords of~$\sigma a$, so they occur as nodes of the new tree, too.
+\:If $\beta$ was a~branching subword, it stays branching --- so internal nodes stay such.
+\:Each new suffix~$\beta a$ is obtained by extending some original suffix~$\beta$. Here:
+  \itemize\ibull
+  \:If $\beta$ was non-nested (so a~leaf), $\beta a$~will be also non-nested.
+    So leaves stay leaves, but the labels of their edges must be extended by~$a$.
+    To make this efficient, we introduce {\I open edges,} whose labels index~$\sigma$
+    from a~given position to the end. So leaves will take care of themselves.
+  \:If $\beta$ was a~nested suffix (i.e., an internal or hidden node), then:
+    \itemize\ibull
+      \:Either $\beta a$ is present in~$\sigma$. Then it is a~nested suffix of the new word
+        and the tree needs no change;
+      \:or $\beta a$ does not occur in~$\sigma$. Then we have to create a~new leaf
+        with an~open edge and possibly a~new internal node, under which the new leave will be connected.
+    \endlist
+  \endlist
+\endlist
+
+This describes how the tree changes when a~new symbol is appended to~$\sigma$.
+It remains to show how to do it efficiently.
+
+\s{Nested suffixes:}
+First of all, we need to recognize which suffixes are nested and which are not.
+It helps that nested suffixes form an interval:
+
+{\narrower
+\s{Lemma:} If $\alpha$ is a~nested suffix of a~word~$\sigma$ and $\beta$ is a~suffix of~$\alpha$,
+then $\beta$ is also a~nested suffix of~$\sigma$.
+
+\proof
+The word $\sigma$ contains both $\alpha x$ and $\alpha y$ for some distinct symbols $x$ and~$y$.
+Since $\alpha$ ends with~$\beta$, there must be both $\beta x$ and $\beta y$ somewhere in~$\sigma$,
+so~$\beta$ is nested.
+\qed
+
+}
+
+\>It therefore suffices to maintain the {\I longest nested suffix} of the word~$\sigma$.
+We will all it the {\I active suffix} and denote it by $\alpha(\sigma)$. Each suffix $\beta\subseteq\sigma$
+is nested if and only if $\vert\beta\vert \le \vert\alpha(\sigma)\vert$.
+
+The active suffix demarcates a~boundary between non-nested and nested suffixes.
+How does this boundary move when we extend $\sigma$ to $\sigma a$?
+The answer is simple:
+
+{\narrower
+\s{Lemma:} For every $\sigma$ and~$a$: $\alpha(\sigma a)$ is a~suffix of $\alpha(\sigma)a.$
+
+\proof
+Both $\alpha(\sigma a)$ and $\alpha(\sigma)a$ are suffixes of~$\sigma a$, so it suffices to compare
+their lengths. The word $\beta := \hbox{\uv{$\alpha(\sigma a)$ without the final~$a$}}$ is a~nested suffix
+of~$\sigma$, so $\vert\beta\vert \le \vert\alpha(\sigma)\vert$, which implies $\vert\alpha(\sigma a)\vert = \vert\beta a\vert \le \vert\alpha(\sigma)a\vert$.
+\qed
+
+}
+
+\>In other words, the boundary can only move to the right or stay as it was.
+This leads to the following idea.
+
+\s{Sketch of algorithm:} We maintain $\alpha=\alpha(\sigma)$ and when appending
+a~new character~$a$, we check if $\alpha a$ stays nested. If it does, nothing changes.
+If it doesn't, we add a~new leaf and possibly also a~new internal node; then we remove
+the first character of~$\alpha$ and continue checking.
+
+\s{Observation:} Appending a~character to~$\sigma$ causes an~amortized constant number
+of tree modifications. This holds because every modification shortens~$\alpha$, but $\alpha$
+grows only by a~single character per append to~$\sigma$. It remains to show how to
+perform each modification in (amortized) constant time. In order to accomplish that,
+we need a~suitable representation of the word~$\alpha$, which supports efficient
+appends, cuts of the first character, and tests of existence of the corresponding node
+in the tree.
+
+\h{Reference pairs}
+
+\s{Definition:} A~{\I reference pair} for a~word $\alpha\subseteq\sigma$ is a~pair
+$(\pi,\tau)$, where $\pi$ is a~tree node, $\tau$ an arbitrary word and $\pi\tau=\alpha$.
+Furthermore, we know that $\tau\subseteq\sigma$, so~$\tau$ can be represented by a~pair
+of indices in~$\sigma$.
+
+A~reference pair is {\I canonical,} if there is no edge from the node~$\pi$ with a~label,
+which is a~prefix of~$\tau$. (Please note that such an edge can be found by just comparing
+the first character of its label with $\tau[0]$ and checking if the length of the label
+is at most~$\vert\tau\vert$. It is not necessary to check if the rest of the label matches~$\tau[1:{}]$.)
+
+\s{Observation:} Each word $\alpha\subseteq\sigma$ has exactly one canonical reference pair
+representing it. (Among all reference pairs representing~$\alpha$, it is the one with
+the deepest possible node~$\pi$.)
+
+\s{Definition:} The {\I back edge} $\<back>(\pi)$ leads from an internal node~$\pi$ to
+an internal node, which represents $\pi[1:{}]$. (Let us observe that such a~node must exist:
+every suffix of a~branching subword is also branching.)
+
+\s{Operations:}
+We are going to represent the active suffix~$\alpha$ by a~reference pair $(\pi,\tau)$. We need
+to perform the following operations on the pair:
+
+\itemize\ibull
+\:{\I Append a~symbol~$a$:} We append~$a$ to~$\tau$. We obtain a~reference pair
+  for~$\alpha a$, but it is not necessarily canonical. Before appending, we can easily check
+  if $\alpha$ is present in the tree.
+\:{\I Removing the first character:} If $\pi$ is not the tree root, we replace~$\pi$
+  by $\<back>(\pi)$ and keep~$\tau$. Otherwise $\pi$ is the empty string, so we remove
+  the first character from~$\tau$ (we just increment the corresponding start index).
+\:{\I Canonicalize:} The previous two operations can produce a~non-canonical pair.
+  So we check if the pair is canonical: for $\tau\ne\varepsilon$, we test if there is an
+  edge from~$\pi$ indexed by~$\tau[0]$ is short enough to be a~prefix of~$\tau$.
+  If it is, we follow this edge, which makes $\pi$ longer and $\tau$ shorter; then we
+  repeat the test. Let us see that this takes amortized constant time, because every
+  step of canonicalization shortens~$\tau$, but $\tau$ grows only by a~single character
+  per append operation.
+\endlist
+
+\>Now, all building blocks are ready and we can formulate the complete algorithm.
+
+\h{The full algorithm}
+
+\algo
+\:{\I Input:} $\alpha=\alpha(\sigma)$ represented by a~canonical reference pair $(\pi,\tau)$,
+    the suffix tree $T$ for~$\sigma$, back edges \<back>, and a~new symbol~$a$.
+\:We check if $\alpha a$ is present in the tree, and create it if needed:
+\::If $\tau=\varepsilon$: ($\alpha=\pi$ is an internal node)
+\:::If there is an edge from the node~$\pi$ whose label starts with~$a$, then $\alpha a$ is present.
+\:::If there is no such edge, it isn't present, so we create a~new open edge going from~$\pi$
+    to a~new leaf.
+\::Otherwise $\tau\ne\varepsilon$: ($\alpha$ is a~hidden node)
+\:::We find an edge from~$\pi$ whose label starts with~$\tau$ (by checking $\tau[0]$).
+\:::If $\tau$ in the label is followed by~$a$, then $\alpha a$ is present.
+\:::Otherwise, it isn't present, so we subdivide the edge by creating a~new node
+    such that the edge from~$\pi$ to the new node is labeled by~$\tau$. The new node
+    will have an~open edge to a~new child.
+\:If $\alpha a$ was not present, we remove the first character of~$\alpha$ and restart from step~2.
+\:Now, we know that $\alpha a$ is present, so we update the reference pair to represent $\alpha a$.
+\:We recalculate back edges (see below).
+\:{\I Output:} $\alpha=\alpha(\sigma a)$ as a~canonical reference pair $(\pi,\tau)$,
+  the suffix tree $T$ for~$\sigma a$, back edges \<back>.
+\endalgo
+
+\s{Back edges:}
+It remains to show how to make back edges of new internal nodes.
+We can observe that if we create a~node, this node corresponds to the current~$\alpha$
+and its back edge goes to $\alpha[1:{}]$ --- this is the node that we create (or discover
+that it already exists) in the next iteration of the main loop.
+During the next iteration, this back edge will not be needed yet, because $\vert\pi\vert$
+is always decreasing.
+Therefore we can delay creation of back edges by one iteration.
+This way, we create back edges in constant time per node.
+
+\references
+\bye
author	Martin Mares <mj@ucw.cz>
	Mon, 15 Jan 2024 21:08:34 +0000 (22:08 +0100)
committer	Martin Mares <mj@ucw.cz>
	Mon, 15 Jan 2024 21:08:34 +0000 (22:08 +0100)