ram.tex

   1 \ifx\endpart\undefined
   2 \input macros.tex
   3 \fi
   4
   5 \chapter{Fine Details of Computation}
   6
   7 \section{Models and machines}
   8
   9 Traditionally, computer scientists use a~variety of computational models
  10 for a~formalism in which their algorithms are stated. If we were studying
  11 NP-completeness, we could safely assume that all the models are equivalent,
  12 possibly up to polynomial slowdown which is negligible. In our case, the
  13 differences between good and not-so-good algorithms are on a~much smaller
  14 scale. In this chapter, we will replace the usual ``yardsticks'' by a~micrometer,
  15 state our computation models carefully and develop a repertoire of basic
  16 data structures taking advantage of the fine details of the models.
  17
  18 We would like to keep the formalism close enough to the reality of the contemporary
  19 computers. This rules out Turing machines and similar sequentially addressed
  20 models, but even the remaining models are subtly different. For example, some of them
  21 allow indexing of arrays in constant time, while the others have no such operation
  22 and arrays have to be emulated with pointer structures, requiring $\Omega(\log n)$
  23 time to access a~single element of an~$n$-element array. It is hard to say which
  24 way is superior --- while most ``real'' computers have instructions for constant-time
  25 indexing, it seems to be physically impossible to fulfil this promise regardless of
  26 the size of memory. Indeed, at the level of logical gates, the depth of the
  27 actual indexing circuits is logarithmic.
  28
  29 In recent decades, most researchers in the area of combinatorial algorithms
  30 have been considering two computational models: the Random Access Machine and the Pointer
  31 Machine. The former one is closer to the programmer's view of a~real computer,
  32 the latter one is slightly more restricted and ``asymptotically safe.''
  33 We will follow this practice and study our algorithms in both models.
  34
  35 \para
  36 The \df{Random Access Machine (RAM)} is not a~single model, but rather a~family
  37 of closely related models, sharing the following properties.
  38 (See Cook and Reckhow \cite{cook:ram} for one of the usual formal definitions
  39 and Hagerup \cite{hagerup:wordram} for a~thorough description of the differences
  40 between the RAM variants.)
  41
  42 The \df{memory} of the model is represented by an~array of \df{memory cells}
  43 addressed by non-negative integers, each of them containing a~single non-negative integer.
  44 The \df{program} is a~sequence of \df{instructions} of two basic kinds: calculation
  45 instructions and control instructions.
  46
  47 \df{Calculation instructions} have two source arguments and one destination
  48 argument, each \df{argument} being either an~immediate constant (not available
  49 as destination), a~directly addressed memory cell (specified by its number)
  50 or an~indirectly addressed memory cell (its address is stored in a~directly
  51 addressed memory cell).
  52
  53 \df{Control instructions} include branches (to a~specific instruction in
  54 the program), conditional branches (e.g., jump if two arguments specified as
  55 in the calculation instructions are equal) and an~instruction to halt the program.
  56
  57 At the beginning of the computation, the memory contains the input data
  58 in specified memory cells and arbitrary values in all other cells.
  59 Then the program is executed one instruction at a~time. When it halts,
  60 specified memory cells are interpreted as the program's output.
  61
  62 \para\id{wordsize}%
  63 In the description of the RAM family, we have omitted several properties
  64 on~purpose, because different members of the family define them differently.
  65 These are: the size of the available integers, the time complexity of a~single
  66 instruction, the space complexity of a~single memory cell and the repertoire
  67 of operations available in calculation instructions.
  68
  69 If we impose no limits on the magnitude of the numbers and we assume that
  70 arithmetic and logical operations work on them in constant time, we get
  71 a~very powerful parallel computer --- we can emulate an~exponential number
  72 of parallel processors using arithmetics and suddenly almost everything can be
  73 computed in constant time, modulo encoding and decoding of input and output.
  74 Such models are unrealistic and there are two basic possibilities how to
  75 avoid this behavior:
  76
  77 \numlist\ndotted
  78 \:Keep unbounded numbers, but increase costs of instructions: each instruction
  79   consumes time proportional to the number of bits of the numbers it processes,
  80   including memory addresses. Similarly, space usage is measured in bits,
  81   counting not only the values, but also the addresses of the respective memory
  82   cells.
  83 \:Place a~limit on the size of the numbers ---define the \df{word size~$W$,}
  84   the number of bits available in the memory cells--- and keep the cost of
  85   of instructions and memory cells constant. The word size must not be constant,
  86   since we can address only~$2^W$ cells of memory. If the input of the algorithm
  87   is stored in~$N$ cells, we need~$W\ge\log N$ just to be able to read the input.
  88   On the other hand, we are interested in polynomial-time algorithms only, so $\Theta(\log N)$-bit
  89   numbers should be sufficient. In practice, we pick~$w$ to be the larger of
  90   $\Theta(\log N)$ and the size of integers used in the algorithm's input and output.
  91   We will call an integer stored in a~single memory cell a~\df{machine word.}
  92 \endlist
  93
  94 Both restrictions easily avoid the problems of unbounded parallelism. The first
  95 choice is theoretically cleaner and Cook et al.~show nice correspondences to the
  96 standard complexity classes, but the calculations of time and space complexity tend
  97 to be somewhat tedious. What more, when compared with the RAM with restricted
  98 word size, the complexities are usually exactly $\Theta(w)$ times higher.
  99 This does not hold in general (consider a~program which uses many small numbers
 100 and $\O(1)$ large ones), but it is true for the algorithms we are interested in.
 101 Therefore we will always assume that the operations have unit cost and we make
 102 sure that all numbers are limited by the available word size.
 103
 104 \para
 105 As for the choice of RAM operations, the following three instruction sets are often used:
 106
 107 \itemize\ibull
 108 \:\df{Word-RAM} --- allows the ``C-language operators'', i.e., addition,
 109   subtraction, multiplication, division, remainder, bitwise {\sc and,} {\sc or,} exclusive
 110   {\sc or} ({\sc xor}), negation ({\sc not}) and bitwise shifts ($\ll$ and~$\gg$).
 111 \:\df{${\rm AC}^0$-RAM} --- allows all operations from the class ${\rm AC}^0$, i.e.,
 112   those computable by constant-depth polynomial-size boolean circuits with unlimited
 113   fan-in and fan-out. This includes all operations of the Word-RAM except for multiplication,
 114   division and remainders, and also many other operations like computing the Hamming
 115   weight (number of bits set in a~given number).
 116 \:Both restrictions at once.
 117 \endlist
 118
 119 Thorup discusses the usual techniques employed by RAM algorithms in~\cite{thorup:aczero}
 120 and he shows that they work on both Word-RAM and ${\rm AC}^0$-RAM, but the combination
 121 of the two restrictions is too weak. On the other hand, the intersection of~${\rm AC}^0$
 122 with the instruction set of modern processors (e.g., adding some floating-point
 123 operations and multimedia instructions available on the Intel's Pentium~4~\cite{intel:pentium}) is already strong enough.
 124
 125 We will therefore use the Word-RAM instruction set, mentioning differences from the
 126 ${\rm AC}^0$-RAM where necessary.
 127
 128 \nota
 129 When speaking of the \df{RAM,} we implicitly mean the version with numbers limited
 130 by a~specified word size of $W$~bits, unit cost of operations and memory cells and the instruction
 131 set of the Word-RAM. This corresponds to the usage in recent algorithmic literature,
 132 although the authors rarely mention the details. In some cases, a~non-uniform variant
 133 of the Word-RAM is considered as well (e.g., in~\cite{hagerup:dd}):
 134
 135 \defn
 136 A~Word-RAM is called \df{weakly non-uniform,} if it is equipped with $\O(1)$-time
 137 access to a~constant number of word-sized constants, which depend only on the word
 138 size. These are called \df{native constants} and they are available in fixed memory
 139 cells when the program starts. (By analogy with the high-level programming languages,
 140 these constants can be thought of as computed at ``compile time.'')
 141
 142 \para
 143 The \df{Pointer Machine (PM)} also does not have any well established definition. The
 144 various kinds of pointer machines are mapped by Ben-Amram in~\cite{benamram:pm},
 145 but unlike the RAM's they turn out to be equivalent up to constant slowdown.
 146 Our definition will be closely related to the \em{linking automaton} proposed
 147 by Knuth in~\cite{knuth:fundalg}, we will only adapt it to use RAM-like
 148 instructions instead of an~opaque control unit.
 149
 150 The PM works with two different types of data: \df{symbols} from a~finite alphabet
 151 and \df{pointers}. The memory of the machine consists of a~fixed amount of \df{registers}
 152 (some of them capable of storing a~single symbol, each of the others holds a~single pointer)
 153 and an~arbitrary amount of \df{cells}. The structure of all cells is the same: each of them
 154 again contains a~fixed number of fields for symbols and pointers. Registers can be addressed
 155 directly, the cells only via pointers --- either by using a~pointer stored in a~register,
 156 or in a~cell pointed to by a~register (longer chains of pointers cannot be followed in
 157 constant time).
 158
 159 We can therefore view the whole memory as a~directed graph, whose vertices
 160 correspond to the cells (the registers are stored in a~single special cell).
 161 The outgoing edges of each vertex correspond to pointer fields and they are
 162 labelled with distinct labels drawn from a~finite set. In addition to that,
 163 each vertex contains a~fixed amount of symbols. The program can directly access
 164 vertices within distance~2 from the register vertex.
 165
 166 The program is a~sequence of instructions of the following kinds:
 167
 168 \itemize\ibull
 169 \:\df{symbol instructions,} which read a~pair of symbols, apply an~arbitrary
 170   function on them and write the result to a~symbol register or field;
 171 \:\df{pointer instructions} for assignment of pointers to pointer registers/fields
 172   and for creation of new memory cells (a~pointer to the new cell is assigned
 173   immediately);
 174 \:\df{control instructions} --- similarly to the RAM; conditional jumps can decide
 175   on~arbitrary unary relations on symbols and compare pointers for equality.
 176 \endlist
 177
 178 Time and space complexity are defined in the straightforward way: all instructions
 179 have unit cost and so do all memory cells.
 180
 181 Both input and output of the machine are passed in the form of a~linked structure
 182 pointed to by a~designated register. For example, we can pass graphs back and forth
 183 without having to encode them as strings of numbers or symbols. This is important,
 184 because with the finite alphabet of the~PM, all symbolic representations of graphs
 185 require super-linear space and therefore also time.
 186
 187 \para
 188 Compared to the RAM, the PM lacks two important capabilities: indexing of arrays
 189 and arithmetic instructions. We can emulate both with poly-logarithmic slowdown,
 190 but it will turn out that they are rarely needed in graph algorithms. We are
 191 also going to prove that the RAM is strictly stronger, so we will prefer to
 192 formulate our algorithms in the PM model and use RAM only when necessary.
 193
 194 \thm
 195 Every program for the Word-RAM with word size~$W$ can be translated to a~program
 196 computing the same\foot{Given a~suitable encoding of inputs and outputs, of course.}
 197 on the~PM with $\O(W^2)$ slowdown. If the RAM program does not
 198 use multiplication, division and remainder operations, $\O(W)$~slowdown
 199 is sufficient.
 200
 201 \proofsketch
 202 Represent the memory of the RAM by a~balanced binary search tree or by a~radix
 203 trie of depth~$\O(W)$. Values are encoded as~linked lists of symbols pointed
 204 to by the nodes of the tree. Both direct and indirect accesses to the memory
 205 can therefore be done in~$\O(W)$ time. Use standard algorithms for arithmetic
 206 on big numbers: $\O(W)$ per operation except for multiplication, division and
 207 remainders which take $\O(W^2)$.\foot{We could use more efficient arithmetic
 208 algorithms, but the quadratic bound it good enough for our purposes.}
 209 \qed
 210
 211 \FIXME{Add references.}
 212
 213 \thm
 214 Every program for the PM running in polynomial time can be translated to a~program
 215 computing the same on the Word-RAM with only $\O(1)$ slowdown.
 216
 217 \proofsketch
 218 Encode each cell of the PM's memory to $\O(1)$ integers. Store the encoded cells to
 219 memory of the RAM sequentially and use memory addresses as pointers. As the symbols
 220 are finite and there is only a~polynomial number of cells allocated during execution
 221 of the program, $\O(\log N)$-bit integers suffice.
 222 \qed
 223
 224 \para
 225 There are also \df{randomized} versions of both machines. These are equipped
 226 with an~additional instruction for generating a~single random bit. The standard
 227 techniques of design and analysis of randomized algorithms then apply (see for
 228 example Motwani and Raghavan~\cite{motwani:randalg}).
 229
 230 \FIXME{Consult sources. Does it make more sense to generate random words at once on the RAM?}
 231
 232 \rem
 233 There is one more interesting machine: the \df{Immutable Pointer Machine} (see
 234 the description of LISP machines in \cite{benamram:pm}). It differs from the
 235 ordinary PM by the inability to modify existing memory cells. Only the contents
 236 of the registers are allowed to change. All cell modifications thus have to
 237 be performed by creating a~copy of the particular cell with some fields changed.
 238 This in turn requires the pointers to the cell to be updated, possibly triggering
 239 a~cascade of cell copies. For example, when a~node of a~binary search tree is
 240 updated, all nodes on the path from that node to the root have to be copied.
 241
 242 One of the advantages of this model is that the states of the machine are
 243 persistent --- it is possible to return to a~previously visited state by recalling
 244 the $\O(1)$ values of the registers (everything else could not have changed
 245 since that time) and ``fork'' the computations. This corresponds to the semantics
 246 of pure functional languages, e.g., Haskell.
 247
 248 Unless we are willing to accept a~logarithmic penalty in execution time and space
 249 (in fact, our emulation of the Word-RAM on the PM can be easily made immutable),
 250 the design of efficient algorithms for the immutable PM requires very different
 251 techniques. Therefore, we will concentrate on the imperative models instead
 252 and refer the interested reader to the thorough treatment of purely functional
 253 data structures in the Okasaki's monograph~\cite{okasaki:funcds}.
 254
 255 %--------------------------------------------------------------------------------
 256
 257 \section{Bucket sorting and contractions}\id{bucketsort}%
 258
 259 The Contractive Bor\o{u}vka's algorithm (\ref{contbor}) needed to contract a~given
 260 set of edges in the current graph and flatten it afterwards, all in time $\O(m)$.
 261 We have spared the technical details for this section and they will be useful
 262 in further algorithms, too.
 263
 264 As already suggested, the contractions can be performed by building an~auxiliary
 265 graph and finding its connected components, so we will take care of the flattening
 266 only.
 267
 268 \para
 269 On the RAM, it is sufficient to sort the edges lexicographically (each edge viewed
 270 as an~ordered pair of vertex identifiers with the smaller of the identifiers placed
 271 first). We can do that by a two-pass bucket sort with~$n$ buckets corresponding
 272 to the vertex identifiers.
 273
 274 However, there is a~catch in this. Suppose that we use the standard representation
 275 of graphs as adjacency lists whose heads are stored in an array indexed by vertex
 276 identifiers. When we contract and flatten the graph, the number of vertices decreases,
 277 but if we inherit the original vertex identifiers, the arrays will still have the
 278 same size. Hence we spend a~super-linear amount of time on scanning the arrays,
 279 most of the time skipping unused entries.
 280
 281 To avoid this, we just renumber the vertices after each contraction to component
 282 identifiers from the auxiliary graph and we create a~new vertex array. This way,
 283 the representation of the graph will be linear with respect to the size of the
 284 current graph.
 285
 286 \para
 287 The pointer representation of graphs does not suffer from sparsity as the vertices
 288 are always identified by pointers to per-vertex structures. Each such structure
 289 then contains all attributes associated with the vertex, including the head of the
 290 adjacency list. However, we have to find a~way how to perform bucket sorting
 291 without arrays.
 292
 293 We will keep a~list of the per-vertex structures which defines the order on~vertices.
 294 Each such structure will contain a~pointer to the head of the corresponding bucket,
 295 again stored as a~list. Putting an~edge to a~bucket can be done in constant time then,
 296 scanning all~$n$ buckets takes $\O(n+m)$ time.
 297
 298 \FIXME{Add an example of locally determined orders, e.g., tree isomorphism?}
 299
 300 %--------------------------------------------------------------------------------
 301
 302 \section{Data structures on the RAM}
 303
 304 There is a~lot of data structures designed specifically for the RAM, taking
 305 advantage of both indexing and arithmetics. In many cases, they surpass the known
 306 lower bounds for the same problem on the~PM and often achieve constant time
 307 per operation, at least when either the magnitude of the values or the size of
 308 the data structure are suitably bounded.
 309
 310 A~classical result of this type are the trees of van Emde Boas~\cite{boas:vebt},
 311 which represent a~subset of the integers $\{0,\ldots,U-1\}$, allowing insertion,
 312 deletion and order operations (minimum, maximum, successor etc.) in time $\O(\log\log U)$,
 313 regardless of the size of the subset. If we plug this structure in the Jarn\'\i{}k's
 314 algorithm (\ref{jarnik}), replacing the heap, we immediately get an~algorithm
 315 for finding the MST in integer-weighted graphs in time $\O(m\log\log w_{max})$,
 316 where $w_{max}$ is the maximum weight. We will show later that it is even
 317 possible to achieve linear time complexity for arbitrary integer weights.
 318
 319 A~real breakthrough has been made by Fredman and Willard, who introduced
 320 the Fusion trees~\cite{fw:fusion} which again perform membership and predecessor
 321 operation on a~set of integers, but this time with complexity $\O(\log_W n)$
 322 per operation on a~Word-RAM with $W$-bit words. This of course assumes that
 323 each element of the set fits in a~single word. As $W$ is at least~$\log n$,
 324 the operations take $\O(\log n/\log\log n)$ and we are able to sort integers
 325 in time~$o(n\log n)$. This was a~beginning of a~long sequence of faster and
 326 faster integer sorting algorithms, culminating with the work by Thorup and Han.
 327 They have improved the time complexity to $\O(n\log\log n)$ deterministically~\cite{han:detsort}
 328 and expected $\O(n\sqrt{\log\log n})$ for randomized algorithms~\cite{hanthor:randsort},
 329 both in linear space.
 330
 331 Despite the recent progress, the corner-stone of most RAM data structures
 332 is still the representation of data structures by integers introduced by Fredman
 333 and Willard and it will also form a~basis for the rest of this chapter.
 334
 335 \FIXME{Add more history.}
 336
 337 %--------------------------------------------------------------------------------
 338
 339 \section{Bits and vectors}
 340
 341 In this rather technical section, we will show how RAM can be used as a~vector
 342 computer to operate in parallel on multiple elements, as long as these elements
 343 fit in a~single machine word. On the first sight this might seem useless, as we
 344 cannot require large word sizes, but surprisingly often the elements are small
 345 enough relative to the size of the algorithm's input and thus also relative to
 346 the minimum possible word size. Also, as the following lemma shows, we can
 347 easily emulate constant-times longer words:
 348
 349 \lemman{Multiple-precision calculations}
 350 Given a~RAM with $W$-bit words, we can emulate all calculation and control
 351 instructions of a~RAM with word size $kW$ in time depending only on the~$k$.
 352 (This is usually called \df{multiple-precision arithmetics.})
 353
 354 \proof
 355 We split each word of the ``big'' machine to $W'=W/2$-bit blocks and store these
 356 blocks in $2k$ consecutive memory cells. Addition, subtraction, comparison,
 357 bitwise logical operations can be performed block-by-block. Shifts by a~multiple
 358 of~$W'$ are trivial, otherwise we can combine each block of the result from
 359 shifted versions of two original blocks.
 360 To multiply two numbers, we can use the elementary school algorithm using the $W'$-bit
 361 blocks as digits in base $2^{W'}$ --- the product of any two blocks fits
 362 in a~single word.
 363
 364 Division is harder, but Newton-Raphson iteration (see~\cite{ito:newrap})
 365 converges to the quotient in a~constant number of iterations, each of them
 366 involving $\O(1)$ multiple-precision additions and multiplications. A~good
 367 starting approximation can be obtained by dividing the two most-significant
 368 (non-zero) blocks of both numbers.
 369
 370 Another approach to division is using the improved elementary school algorithm as described
 371 by Knuth in~\cite{knuth:seminalg}. It uses $\O(k^2)$ steps, but the steps involve
 372 calculation of the most significant bit set in a~word. We will show below that it
 373 can be done in constant time without using division.
 374 \qed
 375
 376 \nota
 377 We will use boldface letters for vectors. The components of a~vector~${\bf x}$
 378 will be denoted as $x_1,\ldots,x_n$.
 379
 380 \defn
 381 The \df{canonical representation} of a~vector $(x_1,\ldots,x_n)$ of~$b$-bit numbers
 382 is an~integer XXXXXXXXXXXXXXX
 383
 384 \endpart