\chapter{Fine Details of Computation}

\section{Models and machines}
Traditionally, computer scientists use a~variety of computational models
as the formalism in which their algorithms are stated. If we were studying
NP-completeness, we could safely assume that all the models are equivalent,
possibly up to a~polynomial slowdown, which is negligible. In our case, the
differences between good and not-so-good algorithms are on a~much smaller
scale. In this chapter, we will replace the usual ``yardsticks'' by a~micrometer,
state our computation models carefully and develop a~repertoire of basic
data structures taking advantage of the fine details of the models.
We would like to keep the formalism close enough to the reality of contemporary
computers. This rules out Turing machines and similar sequentially addressed
models, but even the remaining models are subtly different. For example, some of them
allow indexing of arrays in constant time, while others have no such operation
and arrays have to be emulated with pointer structures, requiring $\Omega(\log n)$
time to access a~single element of an~$n$-element array. It is hard to say which
way is superior --- while most ``real'' computers have instructions for constant-time
indexing, it seems to be physically impossible to fulfil this promise regardless of
the size of memory. Indeed, at the level of logical gates, the depth of the
actual indexing circuits is logarithmic.
In recent decades, most researchers in the area of combinatorial algorithms
have been considering two computational models: the Random Access Machine and the Pointer
Machine. The former is closer to the programmer's view of a~real computer,
the latter is slightly more restricted and ``asymptotically safe.''
We will follow this practice and study our algorithms in both models.
The \df{Random Access Machine (RAM)} is not a~single model, but rather a~family
of closely related models, sharing the following properties.
(See Cook and Reckhow \cite{cook:ram} for one of the usual formal definitions
and Hagerup \cite{hagerup:wordram} for a~thorough description of the differences
between the RAM variants.)

The \df{memory} of the model is represented by an~array of \df{memory cells}
addressed by non-negative integers, each of them containing a~single non-negative integer.
The \df{program} is a~sequence of \df{instructions} of two basic kinds: calculation
instructions and control instructions.

\df{Calculation instructions} have two source arguments and one destination
argument, each \df{argument} being either an~immediate constant (not available
as destination), a~directly addressed memory cell (specified by its number)
or an~indirectly addressed memory cell (its address is stored in a~directly
addressed memory cell).

\df{Control instructions} include branches (to a~specific instruction in
the program), conditional branches (e.g., jump if two arguments specified as
in the calculation instructions are equal) and an~instruction to halt the program.

At the beginning of the computation, the memory contains the input data
in specified memory cells and arbitrary values in all other cells.
Then the program is executed one instruction at a~time. When it halts,
specified memory cells are interpreted as the program's output.
In the description of the RAM family, we have omitted several properties
on~purpose, because different members of the family define them differently.
These are: the size of the available integers, the time complexity of a~single
instruction, the space complexity of a~single memory cell and the repertoire
of operations available in calculation instructions.

If we impose no limits on the magnitude of the numbers and we assume that
arithmetic and logical operations work on them in constant time, we get
a~very powerful parallel computer --- we can emulate an~exponential number
of parallel processors using arithmetic and suddenly almost everything can be
computed in constant time, modulo encoding and decoding of input and output.
Such models are unrealistic and there are two basic ways to avoid this
behavior:
\:Keep unbounded numbers, but increase the costs of instructions: each instruction
consumes time proportional to the number of bits of the numbers it processes,
including memory addresses. Similarly, space usage is measured in bits,
counting not only the values, but also the addresses of the respective memory
cells.
\:Place a~limit on the size of the numbers --- define the \df{word size~$W$,}
the number of bits available in the memory cells --- and keep the cost of
instructions and memory cells constant. The word size must not be constant,
since we can address only~$2^W$ cells of memory. If the input of the algorithm
is stored in~$N$ cells, we need~$W\ge\log N$ just to be able to read the input.
On the other hand, we are interested in polynomial-time algorithms only, so $\Theta(\log N)$-bit
numbers should be sufficient. In practice, we pick~$W$ to be the larger of
$\Theta(\log N)$ and the size of the integers used in the algorithm's input and output.
We will call an integer stored in a~single memory cell a~\df{machine word.}
Both restrictions easily avoid the problems of unbounded parallelism. The first
choice is theoretically cleaner and Cook et al.~show nice correspondences to the
standard complexity classes, but the calculations of time and space complexity tend
to be somewhat tedious. What is more, when compared with the RAM with restricted
word size, the complexities are usually exactly $\Theta(W)$ times higher.
This does not hold in general (consider a~program which uses many small numbers
and $\O(1)$ large ones), but it is true for the algorithms we are interested in.
Therefore we will always assume that the operations have unit cost and we will make
sure that all numbers are limited by the available word size.
As for the choice of RAM operations, the following three instruction sets are often used:

\:\df{Word-RAM} --- allows the ``C-language operators'', i.e., addition,
subtraction, multiplication, division, remainder, bitwise {\sc and,} {\sc or,} exclusive
{\sc or} ({\sc xor}), negation ({\sc not}) and bitwise shifts ($\ll$ and~$\gg$).
\:\df{${\rm AC}^0$-RAM} --- allows all operations from the class ${\rm AC}^0$, i.e.,
those computable by constant-depth polynomial-size boolean circuits with unlimited
fan-in and fan-out. This includes all operations of the Word-RAM except for multiplication,
division and remainders, which are not computable in constant depth; many other operations,
like computing the Hamming weight (the number of bits set in a~given number), also lie
outside ${\rm AC}^0$, although they are easy to compute with the Word-RAM operators,
as sketched below.
\:Both restrictions at once.
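To give a~concrete flavour of the ``C-language operators,'' the following sketch
(an~illustrative C~fragment, not a~part of any of the formal models) computes the
Hamming weight of a~64-bit word using only shifts, bitwise operations, addition and
one multiplication --- the last being exactly the operation that the ${\rm AC}^0$-RAM lacks:

\begin{verbatim}
#include <stdint.h>

/* Hamming weight of a 64-bit word using only Word-RAM operators
 * (shifts, bitwise AND, addition, multiplication). */
static unsigned popcount64(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);          /* sums of pairs of bits */
    x = (x & 0x3333333333333333ULL)
      + ((x >> 2) & 0x3333333333333333ULL);              /* sums of 4-bit groups */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;          /* sums of bytes */
    return (unsigned) ((x * 0x0101010101010101ULL) >> 56);  /* add all bytes together */
}
\end{verbatim}

The number of elementary steps grows with the word size (one level per doubling of the
group width), so for a~general $W$-bit word this takes $\O(\log W)$ instructions; for any
fixed word size it is a~constant.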
Thorup discusses the usual techniques employed by RAM algorithms in~\cite{thorup:aczero}
and he shows that they work on both the Word-RAM and the ${\rm AC}^0$-RAM, but the combination
of the two restrictions is too weak. On the other hand, the intersection of~${\rm AC}^0$
with the instruction set of modern processors (e.g., adding some floating-point
operations and multimedia instructions available on Intel's Pentium~4~\cite{intel:pentium}) is already strong enough.

We will therefore use the Word-RAM instruction set, mentioning differences from the
${\rm AC}^0$-RAM where necessary.
When speaking of the \df{RAM,} we implicitly mean the version with numbers limited
by a~specified word size of $W$~bits, unit cost of operations and memory cells and the instruction
set of the Word-RAM. This corresponds to the usage in recent algorithmic literature,
although the authors rarely mention the details. In some cases, a~non-uniform variant
of the Word-RAM is considered as well (e.g., in~\cite{hagerup:dd}):

A~Word-RAM is called \df{weakly non-uniform} if it is equipped with $\O(1)$-time
access to a~constant number of word-sized constants, which depend only on the word
size. These are called \df{native constants} and they are available in fixed memory
cells when the program starts. (By analogy with high-level programming languages,
these constants can be thought of as computed at ``compile time.'')
The \df{Pointer Machine (PM)} also does not have any well-established definition. The
various kinds of pointer machines are mapped by Ben-Amram in~\cite{benamram:pm},
but unlike the RAMs, they turn out to be equivalent up to constant slowdown.
Our definition will be closely related to the \em{linking automaton} proposed
by Knuth in~\cite{knuth:fundalg}; we will only adapt it to use RAM-like
instructions instead of an~opaque control unit.
The PM works with two different types of data: \df{symbols} from a~finite alphabet
and \df{pointers}. The memory of the machine consists of a~fixed number of \df{registers}
(some of them hold a~single symbol each, the others hold a~single pointer each)
and an~arbitrary number of \df{cells}. The structure of all cells is the same: each of them
again contains a~fixed number of fields for symbols and pointers. Registers can be addressed
directly, the cells only via pointers --- either by using a~pointer stored in a~register,
or in a~cell pointed to by a~register (longer chains of pointers cannot be followed in
constant time).

We can therefore view the whole memory as a~directed graph, whose vertices
correspond to the cells (the registers are stored in a~single special cell).
The outgoing edges of each vertex correspond to pointer fields and they are
labelled with distinct labels drawn from a~finite set. In addition to that,
each vertex contains a~fixed number of symbols. The program can directly access
vertices within distance~2 from the register vertex.
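For readers who prefer code, a~cell of the PM can be pictured as the following
C~structure (an~added illustration only; the numbers of fields and the alphabet are
arbitrary examples, the model merely requires them to be fixed constants):

\begin{verbatim}
/* An illustrative layout of a PM cell: a fixed number of symbol fields
 * (drawn from a finite alphabet) and a fixed number of pointer fields,
 * which correspond to the labelled outgoing edges of the memory graph. */
enum symbol { SYM_A, SYM_B, SYM_C };          /* a finite alphabet */

#define SYMBOL_FIELDS  2
#define POINTER_FIELDS 3

struct cell {
    enum symbol  sym[SYMBOL_FIELDS];          /* symbol fields */
    struct cell *ptr[POINTER_FIELDS];         /* pointer fields */
};

/* The registers form a single special cell addressed directly by the
 * program; all other cells are reached by following at most two pointers. */
static struct cell registers;
\end{verbatim}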
The program is a~sequence of instructions of the following kinds:

\:\df{symbol instructions,} which read a~pair of symbols, apply an~arbitrary
function to them and write the result to a~symbol register or field;
\:\df{pointer instructions} for assignment of pointers to pointer registers/fields
and for creation of new memory cells (a~pointer to the new cell is assigned
immediately);
\:\df{control instructions} --- similarly to the RAM; conditional jumps can decide
on~arbitrary unary relations on symbols and compare pointers for equality.
Time and space complexity are defined in the straightforward way: all instructions
have unit cost and so do all memory cells.

Both input and output of the machine are passed in the form of a~linked structure
pointed to by a~designated register. For example, we can pass graphs back and forth
without having to encode them as strings of numbers or symbols. This is important,
because with the finite alphabet of the~PM, all symbolic representations of graphs
require super-linear space and therefore also time.

Compared to the RAM, the PM lacks two important capabilities: indexing of arrays
and arithmetic instructions. We can emulate both with poly-logarithmic slowdown,
but it will turn out that they are rarely needed in graph algorithms. We are
also going to prove that the RAM is strictly stronger, so we will prefer to
formulate our algorithms in the PM model and use the RAM only when necessary.
Every program for the Word-RAM with word size~$W$ can be translated to a~program
computing the same\foot{Given a~suitable encoding of inputs and outputs, of course.}
on the~PM with $\O(W^2)$ slowdown. If the RAM program does not
use multiplication, division and remainder operations, $\O(W)$~slowdown
is sufficient.

Represent the memory of the RAM by a~balanced binary search tree or by a~radix
trie of depth~$\O(W)$. Values are encoded as~linked lists of symbols pointed
to by the nodes of the tree. Both direct and indirect accesses to the memory
can therefore be done in~$\O(W)$ time. Use standard algorithms for arithmetic
on big numbers: $\O(W)$ per operation except for multiplication, division and
remainders, which take $\O(W^2)$.\foot{We could use more efficient arithmetic
algorithms, but the quadratic bound is good enough for our purposes.}
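The following C~sketch illustrates the heart of the construction --- the radix trie
addressed bit by bit. It is only a~model of the idea: it is written with ordinary
C~arithmetic, whereas on the PM the address would itself be kept as a~linked list of
bits and the loop would walk that list:

\begin{verbatim}
#include <stdlib.h>

#define W 64                    /* word size of the emulated RAM */

/* A trie node is an ordinary cell with two pointer fields.  The contents
 * of the emulated memory cell (a linked list of symbols on the PM) is
 * abbreviated here to a single pointer. */
struct node {
    struct node *child[2];
    void *value;
};

/* Find, and create if necessary, the node representing the given address.
 * The loop follows one pointer per address bit, i.e., O(W) steps. */
static struct node *ram_cell(struct node *root, unsigned long long addr)
{
    struct node *n = root;
    for (int i = W - 1; i >= 0; i--) {
        int bit = (addr >> i) & 1;
        if (!n->child[bit])
            n->child[bit] = calloc(1, sizeof(struct node));
        n = n->child[bit];
    }
    return n;
}
\end{verbatim}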
\FIXME{Add references.}
Every program for the PM running in polynomial time can be translated to a~program
computing the same on the Word-RAM with only $\O(1)$ slowdown.

Encode each cell of the PM's memory as $\O(1)$ integers. Store the encoded cells in
the memory of the RAM sequentially and use memory addresses as pointers. As the symbols
come from a~finite alphabet and there is only a~polynomial number of cells allocated during
execution of the program, $\O(\log N)$-bit integers suffice.
There are also \df{randomized} versions of both machines. These are equipped
with an~additional instruction for generating a~single random bit. The standard
techniques of design and analysis of randomized algorithms then apply (see for
example Motwani and Raghavan~\cite{motwani:randalg}).

\FIXME{Consult sources. Does it make more sense to generate random words at once on the RAM?}
There is one more interesting machine: the \df{Immutable Pointer Machine} (see
the description of LISP machines in \cite{benamram:pm}). It differs from the
ordinary PM by the inability to modify existing memory cells. Only the contents
of the registers are allowed to change. All cell modifications thus have to
be performed by creating a~copy of the particular cell with some fields changed.
This in turn requires the pointers to the cell to be updated, possibly triggering
a~cascade of cell copies. For example, when a~node of a~binary search tree is
updated, all nodes on the path from that node to the root have to be copied.
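To illustrate the cascade of copies, here is a~small C~sketch (an~added example, not
tied to any particular formalization of the machine) of insertion into an~immutable
binary search tree: the nodes on the search path are copied, everything else is shared,
and the old root keeps describing the previous version of the set:

\begin{verbatim}
#include <stdlib.h>

struct node {
    int key;
    struct node *left, *right;
};

static struct node *new_node(int key, struct node *left, struct node *right)
{
    struct node *n = malloc(sizeof(*n));
    n->key = key;
    n->left = left;
    n->right = right;
    return n;
}

/* Insert a key without modifying any existing node: the path from the root
 * to the insertion point is copied, the remaining subtrees are shared. */
static struct node *insert(struct node *root, int key)
{
    if (!root)
        return new_node(key, NULL, NULL);
    if (key < root->key)
        return new_node(root->key, insert(root->left, key), root->right);
    if (key > root->key)
        return new_node(root->key, root->left, insert(root->right, key));
    return root;                /* key already present, share the whole tree */
}
\end{verbatim}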
One of the advantages of this model is that the states of the machine are
persistent --- it is possible to return to a~previously visited state by recalling
the $\O(1)$ values of the registers (everything else could not have changed
since that time) and ``fork'' the computation. This corresponds to the semantics
of pure functional languages, e.g., Haskell.

Unless we are willing to accept a~logarithmic penalty in execution time and space
(in fact, our emulation of the Word-RAM on the PM can be easily made immutable),
the design of efficient algorithms for the immutable PM requires very different
techniques. Therefore, we will concentrate on the imperative models instead
and refer the interested reader to the thorough treatment of purely functional
data structures in Okasaki's monograph~\cite{okasaki:funcds}.
%--------------------------------------------------------------------------------

\section{Bucket sorting and contractions}\id{bucketsort}%

The Contractive Bor\o{u}vka's algorithm (\ref{contbor}) needed to contract a~given
set of edges in the current graph and flatten it afterwards, all in time $\O(m)$.
We have saved the technical details for this section, as they will be useful
in further algorithms, too.

As already suggested, the contractions can be performed by building an~auxiliary
graph and finding its connected components, so we will take care of the flattening
only.
On the RAM, it is sufficient to sort the edges lexicographically (each edge viewed
as an~ordered pair of vertex identifiers with the smaller of the identifiers placed
first). We can do that by a~two-pass bucket sort with~$n$ buckets corresponding
to the vertex identifiers. Parallel edges then become adjacent in the sorted order,
so they are easy to remove in a~single pass.
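A~possible C~rendering of the two passes follows (an~illustrative sketch with
hypothetical names; vertex identifiers are assumed to be $0,\ldots,n-1$ and every
edge to satisfy $u\le v$). The first pass sorts by the second component, the second
pass sorts stably by the first one, so the result is the lexicographic order:

\begin{verbatim}
#include <stdlib.h>

struct edge { unsigned u, v; };         /* endpoints, u <= v, in 0..n-1 */

/* One stable pass of bucket sort by the given key (counting variant). */
static void bucket_pass(struct edge *e, struct edge *tmp, unsigned m,
                        unsigned n, unsigned (*key)(const struct edge *))
{
    unsigned *count = calloc(n + 1, sizeof(unsigned));
    for (unsigned i = 0; i < m; i++)
        count[key(&e[i]) + 1]++;
    for (unsigned b = 1; b <= n; b++)   /* prefix sums = bucket boundaries */
        count[b] += count[b - 1];
    for (unsigned i = 0; i < m; i++)    /* place edges bucket by bucket */
        tmp[count[key(&e[i])]++] = e[i];
    for (unsigned i = 0; i < m; i++)
        e[i] = tmp[i];
    free(count);
}

static unsigned first_key(const struct edge *e)  { return e->u; }
static unsigned second_key(const struct edge *e) { return e->v; }

/* Lexicographic sort of m edges on n vertices in O(n + m) time. */
static void sort_edges(struct edge *e, unsigned m, unsigned n)
{
    struct edge *tmp = malloc(m * sizeof(*tmp));
    bucket_pass(e, tmp, m, n, second_key);
    bucket_pass(e, tmp, m, n, first_key);
    free(tmp);
}
\end{verbatim}

Here the buckets are realized implicitly by counting and prefix sums, which is just
another way of carrying out a~stable bucket sort in $\O(n+m)$ time and space.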
However, there is a~catch. Suppose that we use the standard representation
of graphs as adjacency lists whose heads are stored in an~array indexed by vertex
identifiers. When we contract and flatten the graph, the number of vertices decreases,
but if we inherit the original vertex identifiers, the arrays will still have the
same size. Hence we spend a~super-linear amount of time on scanning the arrays,
most of the time skipping unused entries.

To avoid this, we just renumber the vertices after each contraction to the component
identifiers from the auxiliary graph and we create a~new vertex array. This way,
the representation of the graph will be linear with respect to the size of the
current graph.
The pointer representation of graphs does not suffer from sparsity, since the vertices
are always identified by pointers to per-vertex structures. Each such structure
then contains all attributes associated with the vertex, including the head of the
adjacency list. However, we have to find a~way to perform bucket sorting
without indexing of arrays.

We will keep a~list of the per-vertex structures which defines the order on the vertices.
Each such structure will contain a~pointer to the head of the corresponding bucket,
again stored as a~list. Putting an~edge into a~bucket can then be done in constant time
and scanning all~$n$ buckets takes $\O(n+m)$ time.
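In C-like notation, the pointer-based bucketing could look as follows (again only
an~illustrative sketch with hypothetical structure names; note that no array is ever
indexed --- every step follows a~pointer):

\begin{verbatim}
struct edge;                            /* forward declaration */

/* Per-vertex structure: a member of the global vertex list which defines
 * the order of the buckets, and the head of this vertex's bucket. */
struct vertex {
    struct vertex *next;                /* next vertex in the list */
    struct edge   *bucket;              /* head of the bucket, a linked list */
    /* ... further attributes, e.g., the adjacency list ... */
};

struct edge {
    struct vertex *u, *v;               /* endpoints identified by pointers */
    struct edge   *next_in_bucket;
};

/* Put an edge into the bucket of vertex w: O(1). */
static void bucket_insert(struct vertex *w, struct edge *e)
{
    e->next_in_bucket = w->bucket;
    w->bucket = e;
}

/* Scan all buckets in the order given by the vertex list: O(n + m). */
static void scan_buckets(struct vertex *vertices, void (*visit)(struct edge *))
{
    for (struct vertex *w = vertices; w; w = w->next)
        for (struct edge *e = w->bucket; e; e = e->next_in_bucket)
            visit(e);
}
\end{verbatim}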
\FIXME{Add an example of locally determined orders, e.g., tree isomorphism?}

%--------------------------------------------------------------------------------

\section{Data structures on the RAM}
There are many data structures designed specifically for the RAM, taking
advantage of both indexing and arithmetic. In many cases, they surpass the known
lower bounds for the same problem on the~PM and often achieve constant time
per operation, at least when either the magnitude of the values or the size of
the data structure is suitably bounded.

A~classical result of this type is the tree of van Emde Boas~\cite{boas:vebt},
which represents a~subset of the integers $\{0,\ldots,U-1\}$, allowing insertion,
deletion and order operations (minimum, maximum, successor etc.) in time $\O(\log\log U)$,
regardless of the size of the subset. If we plug this structure into Jarn\'\i{}k's
algorithm (\ref{jarnik}), replacing the heap, we immediately get an~algorithm
for finding the MST in integer-weighted graphs in time $\O(m\log\log w_{max})$,
where $w_{max}$ is the maximum weight. We will show later that it is even
possible to achieve linear time complexity for arbitrary integer weights.
A~real breakthrough was made by Fredman and Willard, who introduced
fusion trees~\cite{fw:fusion}, which again perform membership and predecessor
operations on a~set of integers, but this time with complexity $\O(\log_W n)$
per operation on a~Word-RAM with $W$-bit words. This of course assumes that
each element of the set fits in a~single word. As $W$ is at least~$\log n$,
the operations take $\O(\log n/\log\log n)$ time and we are able to sort integers
in time~$o(n\log n)$. This was the beginning of a~long sequence of faster and
faster integer sorting algorithms, culminating with the work of Han and Thorup.
They have improved the time complexity to $\O(n\log\log n)$ deterministically~\cite{han:detsort}
and to expected $\O(n\sqrt{\log\log n})$ for randomized algorithms~\cite{hanthor:randsort},
both in linear space.

Despite the recent progress, the cornerstone of most RAM data structures
is still the representation of data structures by integers introduced by Fredman
and Willard. It will also form the basis for the rest of this chapter.

\FIXME{Add more history.}
%--------------------------------------------------------------------------------

\section{Bits and vectors}

In this rather technical section, we will show how the RAM can be used as a~vector
computer to operate in parallel on multiple elements, as long as these elements
fit in a~single machine word. At first sight this might seem useless, as we
cannot require large word sizes, but surprisingly often the elements are small
enough relative to the size of the algorithm's input and thus also relative to
the minimum possible word size. Also, as the following lemma shows, we can
easily emulate words that are a~constant factor longer:
\lemman{Multiple-precision calculations}
Given a~RAM with $W$-bit words, we can emulate all calculation and control
instructions of a~RAM with word size $kW$ in time depending only on~$k$.
(This is usually called \df{multiple-precision arithmetic.})

We split each word of the ``big'' machine into $W'$-bit blocks, where $W'=W/2$, and store these
blocks in $2k$ consecutive memory cells. Addition, subtraction, comparison and
bitwise logical operations can be performed block-by-block. Shifts by a~multiple
of~$W'$ are trivial; otherwise we can combine each block of the result from
shifted versions of two original blocks.
To multiply two numbers, we can use the elementary school algorithm using the $W'$-bit
blocks as digits in base $2^{W'}$ --- the product of any two blocks fits
in a~single word.

Division is harder, but Newton-Raphson iteration (see~\cite{ito:newrap})
converges to the quotient in a~constant number of iterations, each of them
involving $\O(1)$ multiple-precision additions and multiplications. A~good
starting approximation can be obtained by dividing the two most significant
(non-zero) blocks of both numbers.

Another approach to division is to use the improved elementary school algorithm described
by Knuth in~\cite{knuth:seminalg}. It uses $\O(k^2)$ steps, but the steps involve
calculation of the most significant bit set in a~word. We will show below that this
can be done in constant time, but we have to be careful to avoid division instructions.
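As a~small taste of the block-by-block processing, the following C~sketch (an~added
illustration; for concreteness $k=2$, $W=64$ and $W'=32$, so a~``big'' word occupies
four blocks) performs the addition; the other block-wise operations follow the same
pattern:

\begin{verbatim}
#include <stdint.h>

#define BLOCKS 4                        /* k = 2: a 2W-bit word in 4 blocks */

/* A word of the emulated 2W-bit machine: W'-bit blocks stored in separate
 * cells, least significant block first.  Each block occupies the lower
 * 32 bits of a 64-bit cell. */
typedef struct { uint64_t blk[BLOCKS]; } bigword;

/* Addition block-by-block with carry propagation; every step uses only
 * ordinary W-bit operations of the underlying RAM. */
static bigword big_add(bigword a, bigword b)
{
    bigword r;
    uint64_t carry = 0;
    for (int i = 0; i < BLOCKS; i++) {
        uint64_t s = a.blk[i] + b.blk[i] + carry;   /* at most W'+1 bits */
        r.blk[i] = s & 0xFFFFFFFFULL;               /* the low W' bits */
        carry = s >> 32;                            /* carry to the next block */
    }
    return r;                   /* final carry dropped: arithmetic mod 2^(2W) */
}
\end{verbatim}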
We will use boldface letters for vectors. The components of a~vector~${\bf x}$
will be denoted as $x_0,\ldots,x_{d-1}$.

\notan{Bit strings}\id{bitnota}%
We will work with binary representations of natural numbers as strings over the
alphabet $\{\0,\1\}$: we will use $\(x)$ for the number~$x$ written in binary and
$\(x)_b$ for the same padded to exactly $b$ bits by adding zeroes at the beginning.
Strings will be denoted by Greek letters and the usual conventions will be used
for string operations: $\alpha\beta$ for concatenation and $\alpha^k$ for the
string~$\alpha$ repeated $k$~times. When the meaning is clear from the context,
we will use $x$ and $\(x)$ interchangeably to avoid an~outbreak of symbols.
The \df{bitwise encoding} of a~vector ${\bf x}=(x_0,\ldots,x_{d-1})$ of~$b$-bit numbers
is an~integer $\0\(x_{d-1})_b\0\(x_{d-2})_b\0\ldots\0\(x_0)_b$.
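For example, for $d=3$, $b=4$ and ${\bf x}=(5,10,3)$, the encoding is
$\0\(3)_4\,\0\(10)_4\,\0\(5)_4 = \0\0\0\1\1\,\0\1\0\1\0\,\0\0\1\0\1$,
i.e., the number $3\cdot 2^{10} + 10\cdot 2^5 + 5 = 3397$.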
If we wish for the whole vector to fit in a~single word, we need $(b+1)d\le W$.
By using multiple-precision arithmetic, we can encode all vectors satisfying $bd=\O(W)$.
We will now describe how to translate simple vector manipulations to RAM operations on their
bitwise encodings.