10-suffix/ukkonen-en.tex

   1 \input ../sgr.tex
   2
   3 \ehyph
   4 \def\proof{\noindent {\sl Proof:} }
   5
   6 \prednaska{10b}{Ukkonen's Suffix Tree Algorithm}{}
   7
   8 \figure{baraba.epdf}{Suffixes of the word \uv{baraba}: trie, suffix tree, ST with a~dollar}{}
   9
  10 \s{Observation:} Consider a~suffix tree for a~word~$\sigma$. Nodes of the tree (including hidden nodes)
  11 correspond to prefixes of suffixes of~$\sigma$, that is the subwords of~$\sigma$. Leaves of the tree
  12 are suffixes, which occur nowhere else in~$\sigma$ --- these are called the {\I non-nested} suffixes.
  13 Internal nodes correspond to {\I branching subwords} --- subwords $\alpha\subseteq\sigma$ such that $\alpha a\subset\sigma$
  14 and $\alpha b\subseteq\sigma$ for two different symbols $a$ and~$b$.
  15
  16 \h{Ukkonen's algorithm}
  17
  18 \>Ukkonen described an algorithm \cite{ukkonen95line} for constructing the suffix tree without
  19 the terminating dollar. It is an incremental algorithm. It starts with a~tree for an empty word
  20 and it gradually appends new characters to the word, while updating the suffix tree. Each character
  21 is added in amortized constant time. Therefore, it constructs the ST for a~word~$\sigma$ in time
  22 $\O(\vert\sigma\vert)$.
  23
  24 We will assume that edges to children of a~single node can be indexed by their initial
  25 symbols --- this certainly holds if the alphabet is fixed and small. In case it is not, we can
  26 use hash tables instead of arrays.
  27
  28 \s{Observation:} When we extend a~word~$\sigma$ to~$\sigma a$, ST changes as follows:
  29
  30 \numlist\ndotted
  31 \:All existing nodes of the tree (including hidden ones) correspond to subwords of~$\sigma$.
  32   These are also subwords of~$\sigma a$, so they occur as nodes of the new tree, too.
  33 \:If $\beta$ was a~branching subword, it stays branching --- so internal nodes stay such.
  34 \:Each new suffix~$\beta a$ is obtained by extending some original suffix~$\beta$. Here:
  35   \itemize\ibull
  36   \:If $\beta$ was non-nested (so a~leaf), $\beta a$~will be also non-nested.
  37     So leaves stay leaves, but the labels of their edges must be extended by~$a$.
  38     To make this efficient, we introduce {\I open edges,} whose labels index~$\sigma$
  39     from a~given position to the end. So leaves will take care of themselves.
  40   \:If $\beta$ was a~nested suffix (i.e., an internal or hidden node), then:
  41     \itemize\ibull
  42       \:Either $\beta a$ is present in~$\sigma$. Then it is a~nested suffix of the new word
  43         and the tree needs no change;
  44       \:or $\beta a$ does not occur in~$\sigma$. Then we have to create a~new leaf
  45         with an~open edge and possibly a~new internal node, under which the new leave will be connected.
  46     \endlist
  47   \endlist
  48 \endlist
  49
  50 This describes how the tree changes when a~new symbol is appended to~$\sigma$.
  51 It remains to show how to do it efficiently.
  52
  53 \s{Nested suffixes:}
  54 First of all, we need to recognize which suffixes are nested and which are not.
  55 It helps that nested suffixes form an interval:
  56
  57 {\narrower
  58 \s{Lemma:} If $\alpha$ is a~nested suffix of a~word~$\sigma$ and $\beta$ is a~suffix of~$\alpha$,
  59 then $\beta$ is also a~nested suffix of~$\sigma$.
  60
  61 \proof
  62 The word $\sigma$ contains both $\alpha x$ and $\alpha y$ for some distinct symbols $x$ and~$y$.
  63 Since $\alpha$ ends with~$\beta$, there must be both $\beta x$ and $\beta y$ somewhere in~$\sigma$,
  64 so~$\beta$ is nested.
  65 \qed
  66
  67 }
  68
  69 \>It therefore suffices to maintain the {\I longest nested suffix} of the word~$\sigma$.
  70 We will all it the {\I active suffix} and denote it by $\alpha(\sigma)$. Each suffix $\beta\subseteq\sigma$
  71 is nested if and only if $\vert\beta\vert \le \vert\alpha(\sigma)\vert$.
  72
  73 The active suffix demarcates a~boundary between non-nested and nested suffixes.
  74 How does this boundary move when we extend $\sigma$ to $\sigma a$?
  75 The answer is simple:
  76
  77 {\narrower
  78 \s{Lemma:} For every $\sigma$ and~$a$: $\alpha(\sigma a)$ is a~suffix of $\alpha(\sigma)a.$
  79
  80 \proof
  81 Both $\alpha(\sigma a)$ and $\alpha(\sigma)a$ are suffixes of~$\sigma a$, so it suffices to compare
  82 their lengths. The word $\beta := \hbox{\uv{$\alpha(\sigma a)$ without the final~$a$}}$ is a~nested suffix
  83 of~$\sigma$, so $\vert\beta\vert \le \vert\alpha(\sigma)\vert$, which implies $\vert\alpha(\sigma a)\vert = \vert\beta a\vert \le \vert\alpha(\sigma)a\vert$.
  84 \qed
  85
  86 }
  87
  88 \>In other words, the boundary can only move to the right or stay as it was.
  89 This leads to the following idea.
  90
  91 \s{Sketch of algorithm:} We maintain $\alpha=\alpha(\sigma)$ and when appending
  92 a~new character~$a$, we check if $\alpha a$ stays nested. If it does, nothing changes.
  93 If it doesn't, we add a~new leaf and possibly also a~new internal node; then we remove
  94 the first character of~$\alpha$ and continue checking.
  95
  96 \s{Observation:} Appending a~character to~$\sigma$ causes an~amortized constant number
  97 of tree modifications. This holds because every modification shortens~$\alpha$, but $\alpha$
  98 grows only by a~single character per append to~$\sigma$. It remains to show how to
  99 perform each modification in (amortized) constant time. In order to accomplish that,
 100 we need a~suitable representation of the word~$\alpha$, which supports efficient
 101 appends, cuts of the first character, and tests of existence of the corresponding node
 102 in the tree.
 103
 104 \h{Reference pairs}
 105
 106 \s{Definition:} A~{\I reference pair} for a~word $\alpha\subseteq\sigma$ is a~pair
 107 $(\pi,\tau)$, where $\pi$ is a~tree node, $\tau$ an arbitrary word and $\pi\tau=\alpha$.
 108 Furthermore, we know that $\tau\subseteq\sigma$, so~$\tau$ can be represented by a~pair
 109 of indices in~$\sigma$.
 110
 111 A~reference pair is {\I canonical,} if there is no edge from the node~$\pi$ with a~label,
 112 which is a~prefix of~$\tau$. (Please note that such an edge can be found by just comparing
 113 the first character of its label with $\tau[0]$ and checking if the length of the label
 114 is at most~$\vert\tau\vert$. It is not necessary to check if the rest of the label matches~$\tau[1:{}]$.)
 115
 116 \s{Observation:} Each word $\alpha\subseteq\sigma$ has exactly one canonical reference pair
 117 representing it. (Among all reference pairs representing~$\alpha$, it is the one with
 118 the deepest possible node~$\pi$.)
 119
 120 \s{Definition:} The {\I back edge} $\<back>(\pi)$ leads from an internal node~$\pi$ to
 121 an internal node, which represents $\pi[1:{}]$. (Let us observe that such a~node must exist:
 122 every suffix of a~branching subword is also branching.)
 123
 124 \s{Operations:}
 125 We are going to represent the active suffix~$\alpha$ by a~reference pair $(\pi,\tau)$. We need
 126 to perform the following operations on the pair:
 127
 128 \itemize\ibull
 129 \:{\I Append a~symbol~$a$:} We append~$a$ to~$\tau$. We obtain a~reference pair
 130   for~$\alpha a$, but it is not necessarily canonical. Before appending, we can easily check
 131   if $\alpha$ is present in the tree.
 132 \:{\I Removing the first character:} If $\pi$ is not the tree root, we replace~$\pi$
 133   by $\<back>(\pi)$ and keep~$\tau$. Otherwise $\pi$ is the empty string, so we remove
 134   the first character from~$\tau$ (we just increment the corresponding start index).
 135 \:{\I Canonicalize:} The previous two operations can produce a~non-canonical pair.
 136   So we check if the pair is canonical: for $\tau\ne\varepsilon$, we test if there is an
 137   edge from~$\pi$ indexed by~$\tau[0]$ is short enough to be a~prefix of~$\tau$.
 138   If it is, we follow this edge, which makes $\pi$ longer and $\tau$ shorter; then we
 139   repeat the test. Let us see that this takes amortized constant time, because every
 140   step of canonicalization shortens~$\tau$, but $\tau$ grows only by a~single character
 141   per append operation.
 142 \endlist
 143
 144 \>Now, all building blocks are ready and we can formulate the complete algorithm.
 145
 146 \h{The full algorithm}
 147
 148 \algo
 149 \:{\I Input:} $\alpha=\alpha(\sigma)$ represented by a~canonical reference pair $(\pi,\tau)$,
 150     the suffix tree $T$ for~$\sigma$, back edges \<back>, and a~new symbol~$a$.
 151 \:We check if $\alpha a$ is present in the tree, and create it if needed:
 152 \::If $\tau=\varepsilon$: ($\alpha=\pi$ is an internal node)
 153 \:::If there is an edge from the node~$\pi$ whose label starts with~$a$, then $\alpha a$ is present.
 154 \:::If there is no such edge, it isn't present, so we create a~new open edge going from~$\pi$
 155     to a~new leaf.
 156 \::Otherwise $\tau\ne\varepsilon$: ($\alpha$ is a~hidden node)
 157 \:::We find an edge from~$\pi$ whose label starts with~$\tau$ (by checking $\tau[0]$).
 158 \:::If $\tau$ in the label is followed by~$a$, then $\alpha a$ is present.
 159 \:::Otherwise, it isn't present, so we subdivide the edge by creating a~new node
 160     such that the edge from~$\pi$ to the new node is labeled by~$\tau$. The new node
 161     will have an~open edge to a~new child.
 162 \:If $\alpha a$ was not present, we remove the first character of~$\alpha$ and restart from step~2.
 163 \:Now, we know that $\alpha a$ is present, so we update the reference pair to represent $\alpha a$.
 164 \:We recalculate back edges (see below).
 165 \:{\I Output:} $\alpha=\alpha(\sigma a)$ as a~canonical reference pair $(\pi,\tau)$,
 166   the suffix tree $T$ for~$\sigma a$, back edges \<back>.
 167 \endalgo
 168
 169 \s{Back edges:}
 170 It remains to show how to make back edges of new internal nodes.
 171 We can observe that if we create a~node, this node corresponds to the current~$\alpha$
 172 and its back edge goes to $\alpha[1:{}]$ --- this is the node that we create (or discover
 173 that it already exists) in the next iteration of the main loop.
 174 During the next iteration, this back edge will not be needed yet, because $\vert\pi\vert$
 175 is always decreasing.
 176 Therefore we can delay creation of back edges by one iteration.
 177 This way, we create back edges in constant time per node.
 178
 179 \references
 180 \bye