ram.tex

   1 \ifx\endpart\undefined
   2 \input macros.tex
   3 \fi
   4
   5 \chapter{Fine Details of Computation}
   6
   7 \section{Models and machines}
   8
   9 Traditionally, computer scientists use a~variety of computational models
  10 for a~formalism in which their algorithms are stated. If we were studying
  11 NP-completeness, we could safely assume that all the models are equivalent,
  12 possibly up to polynomial slowdown which is negligible. In our case, the
  13 differences between good and not-so-good algorithms are on a~much smaller
  14 scale. In this chapter, we will replace the usual ``yardsticks'' by a~micrometer,
  15 state our computation models carefully and develop a repertoire of basic
  16 data structures taking advantage of the fine details of the models.
  17
  18 We would like to keep the formalism close enough to the reality of the contemporary
  19 computers. This rules out Turing machines and similar sequentially addressed
  20 models, but even the remaining models are subtly different. For example, some of them
  21 allow indexing of arrays in constant time, while the others have no such operation
  22 and arrays have to be emulated with pointer structures, requiring $\Omega(\log n)$
  23 time to access a~single element of an~$n$-element array. It is hard to say which
  24 way is superior --- while most ``real'' computers have instructions for constant-time
  25 indexing, it seems to be physically impossible to fulfil this promise regardless of
  26 the size of memory. Indeed, at the level of logical gates, the depth of the
  27 actual indexing circuits is logarithmic.
  28
  29 In recent decades, most researchers in the area of combinatorial algorithms
  30 have been considering two computational models: the Random Access Machine and the Pointer
  31 Machine. The former one is closer to the programmer's view of a~real computer,
  32 the latter one is slightly more restricted and ``asymptotically safe.''
  33 We will follow this practice and study our algorithms in both models.
  34
  35 \para
  36 The \df{Random Access Machine (RAM)} is not a~single model, but rather a~family
  37 of closely related models, sharing the following properties.
  38 (See Cook and Reckhow \cite{cook:ram} for one of the usual formal definitions
  39 and Hagerup \cite{hagerup:wordram} for a~thorough description of the differences
  40 between the RAM variants.)
  41
  42 The \df{memory} of the model is represented by an~array of \df{memory cells}
  43 addressed by non-negative integers, each of them containing a~single non-negative integer.
  44 The \df{program} is a~sequence of \df{instructions} of two basic kinds: calculation
  45 instructions and control instructions.
  46
  47 \df{Calculation instructions} have two source arguments and one destination
  48 argument, each \df{argument} being either an~immediate constant (not available
  49 as destination), a~directly addressed memory cell (specified by its number)
  50 or an~indirectly addressed memory cell (its address is stored in a~directly
  51 addressed memory cell).
  52
  53 \df{Control instructions} include branches (to a~specific instruction in
  54 the program), conditional branches (e.g., jump if two arguments specified as
  55 in the calculation instructions are equal) and an~instruction to halt the program.
  56
  57 At the beginning of the computation, the memory contains the input data
  58 in specified memory cells and arbitrary values in all other cells.
  59 Then the program is executed one instruction at a~time. When it halts,
  60 specified memory cells are interpreted as the program's output.
  61
  62 \para\id{wordsize}%
  63 In the description of the RAM family, we have omitted several properties
  64 on~purpose, because different members of the family define them differently.
  65 These are: the size of the available integers, the time complexity of a~single
  66 instruction, the space complexity of a~single memory cell and the repertoire
  67 of operations available in calculation instructions.
  68
  69 If we impose no limits on the magnitude of the numbers and we assume that
  70 arithmetic and logical operations work on them in constant time, we get
  71 a~very powerful parallel computer --- we can emulate an~exponential number
  72 of parallel processors using arithmetics and suddenly almost everything can be
  73 computed in constant time, modulo encoding and decoding of input and output.
  74 Such models are unrealistic and there are two basic possibilities how to
  75 avoid this behavior:
  76
  77 \numlist\ndotted
  78 \:Keep unbounded numbers, but increase costs of instructions: each instruction
  79   consumes time proportional to the number of bits of the numbers it processes,
  80   including memory addresses. Similarly, space usage is measured in bits,
  81   counting not only the values, but also the addresses of the respective memory
  82   cells.
  83 \:Place a~limit on the size of the numbers ---define the \df{word size~$W$,}
  84   the number of bits available in the memory cells--- and keep the cost of
  85   of instructions and memory cells constant. The word size must not be constant,
  86   since we can address only~$2^W$ cells of memory. If the input of the algorithm
  87   is stored in~$N$ cells, we need~$W\ge\log N$ just to be able to read the input.
  88   On the other hand, we are interested in polynomial-time algorithms only, so $\Theta(\log N)$-bit
  89   numbers should be sufficient. In practice, we pick~$w$ to be the larger of
  90   $\Theta(\log N)$ and the size of integers used in the algorithm's input and output.
  91 \endlist
  92
  93 Both restrictions easily avoid the problems of unbounded parallelism. The first
  94 choice is theoretically cleaner and Cook et al.~show nice correspondences to the
  95 standard complexity classes, but the calculations of time and space complexity tend
  96 to be somewhat tedious. What more, when compared with the RAM with restricted
  97 word size, the complexities are usually exactly $\Theta(w)$ times higher.
  98 This does not hold in general (consider a~program which uses many small numbers
  99 and $\O(1)$ large ones), but it is true for the algorithms we are interested in.
 100 Therefore we will always assume that the operations have unit cost and we make
 101 sure that all numbers are limited by the available word size.
 102
 103 \para
 104 As for the choice of RAM operations, the following three instruction sets are often used:
 105
 106 \itemize\ibull
 107 \:\df{Word-RAM} --- allows the ``C-language operators'', i.e., addition,
 108   subtraction, multiplication, division, remainder, bitwise {\sc and,} {\sc or,} exclusive
 109   {\sc or} ({\sc xor}), negation ({\sc not}) and bitwise shifts ($\ll$ and~$\gg$).
 110 \:\df{${\rm AC}^0$-RAM} --- allows all operations from the class ${\rm AC}^0$, i.e.,
 111   those computable by constant-depth polynomial-size boolean circuits with unlimited
 112   fan-in and fan-out. This includes all operations of the Word-RAM except for multiplication,
 113   division and remainders, and also many other operations like computing the Hamming
 114   weight (number of bits set in a~given number).
 115 \:Both restrictions at once.
 116 \endlist
 117
 118 Thorup discusses the usual techniques employed by RAM algorithms in~\cite{thorup:aczero}
 119 and he shows that they work on both Word-RAM and ${\rm AC}^0$-RAM, but the combination
 120 of the two restrictions is too weak. On the other hand, the intersection of~${\rm AC}^0$
 121 with the instruction set of modern processors (e.g., adding some floating-point
 122 operations and multimedia instructions available on the Intel's Pentium~4~\cite{intel:pentium}) is already strong enough.
 123
 124 We will therefore use the Word-RAM instruction set, mentioning differences from the
 125 ${\rm AC}^0$-RAM where necessary.
 126
 127 \nota
 128 When speaking of the \df{RAM,} we implicitly mean the version with numbers limited
 129 by a~specified word size of $W$~bits, unit cost of operations and memory cells and the instruction
 130 set of the Word-RAM. This corresponds to the usage in recent algorithmic literature,
 131 although the authors rarely mention the details.
 132
 133 \para
 134 The \df{Pointer Machine (PM)} also does not have any well established definition. The
 135 various kinds of pointer machines are mapped by Ben-Amram in~\cite{benamram:pm},
 136 but unlike the RAM's they turn out to be equivalent up to constant slowdown.
 137 Our definition will be closely related to the \em{linking automaton} proposed
 138 by Knuth in~\cite{knuth:fundalg}, we will only adapt it to use RAM-like
 139 instructions instead of an~opaque control unit.
 140
 141 The PM works with two different types of data: \df{symbols} from a~finite alphabet
 142 and \df{pointers}. The memory of the machine consists of a~fixed amount of \df{registers}
 143 (some of them capable of storing a~single symbol, each of the others holds a~single pointer)
 144 and an~arbitrary amount of \df{cells}. The structure of all cells is the same: each of them
 145 again contains a~fixed number of fields for symbols and pointers. Registers can be addressed
 146 directly, the cells only via pointers --- either by using a~pointer stored in a~register,
 147 or in a~cell pointed to by a~register (longer chains of pointers cannot be followed in
 148 constant time).
 149
 150 We can therefore view the whole memory as a~directed graph, whose vertices
 151 correspond to the cells (the registers are stored in a~single special cell).
 152 The outgoing edges of each vertex correspond to pointer fields and they are
 153 labelled with distinct labels drawn from a~finite set. In addition to that,
 154 each vertex contains a~fixed amount of symbols. The program can directly access
 155 vertices within distance~2 from the register vertex.
 156
 157 The program is a~sequence of instructions of the following kinds:
 158
 159 \itemize\ibull
 160 \:\df{symbol instructions,} which read a~pair of symbols, apply an~arbitrary
 161   function on them and write the result to a~symbol register or field;
 162 \:\df{pointer instructions} for assignment of pointers to pointer registers/fields
 163   and for creation of new memory cells (a~pointer to the new cell is assigned
 164   immediately);
 165 \:\df{control instructions} --- similarly to the RAM; conditional jumps can decide
 166   on~arbitrary unary relations on symbols and compare pointers for equality.
 167 \endlist
 168
 169 Time and space complexity are defined in the straightforward way: all instructions
 170 have unit cost and so do all memory cells.
 171
 172 Both input and output of the machine are passed in the form of a~linked structure
 173 pointed to by a~designated register. For example, we can pass graphs back and forth
 174 without having to encode them as strings of numbers or symbols. This is important,
 175 because with the finite alphabet of the~PM, all symbolic representations of graphs
 176 require super-linear space and therefore also time.
 177
 178 \para
 179 Compared to the RAM, the PM lacks two important capabilities: indexing of arrays
 180 and arithmetic instructions. We can emulate both with poly-logarithmic slowdown,
 181 but it will turn out that they are rarely needed in graph algorithms. We are
 182 also going to prove that the RAM is strictly stronger, so we will prefer to
 183 formulate our algorithms in the PM model and use RAM only when necessary.
 184
 185 \thm
 186 Every program for the Word-RAM with word size~$W$ can be translated to a~program
 187 computing the same\foot{Given a~suitable encoding of inputs and outputs, of course.}
 188 on the~PM with $\O(W^2)$ slowdown. If the RAM program does not
 189 use multiplication, division and remainder operations, $\O(W)$~slowdown
 190 is sufficient.
 191
 192 \proofsketch
 193 Represent the memory of the RAM by a~balanced binary search tree or by a~radix
 194 trie of depth~$\O(W)$. Values are encoded as~linked lists of symbols pointed
 195 to by the nodes of the tree. Both direct and indirect accesses to the memory
 196 can therefore be done in~$\O(W)$ time. Use standard algorithms for arithmetic
 197 on big numbers: $\O(W)$ per operation except for multiplication, division and
 198 remainders which take $\O(W^2)$.\foot{We could use more efficient arithmetic
 199 algorithms, but the quadratic bound it good enough for our purposes.}
 200 \qed
 201
 202 \FIXME{Add references.}
 203
 204 \thm
 205 Every program for the PM running in polynomial time can be translated to a~program
 206 computing the same on the Word-RAM with only $\O(1)$ slowdown.
 207
 208 \proofsketch
 209 Encode each cell of the PM's memory to $\O(1)$ integers. Store the encoded cells to
 210 memory of the RAM sequentially and use memory addresses as pointers. As the symbols
 211 are finite and there is only a~polynomial number of cells allocated during execution
 212 of the program, $\O(\log N)$-bit integers suffice.
 213 \qed
 214
 215 \rem
 216 There is one more interesting machine: the \df{Immutable Pointer Machine} (see
 217 the description of LISP machines in \cite{benamram:pm}). It differs from the
 218 ordinary PM by the inability to modify existing memory cells. Only the contents
 219 of the registers are allowed to change. All cell modifications thus have to
 220 be performed by creating a~copy of the particular cell with some fields changed.
 221 This in turn requires the pointers to the cell to be updated, possibly triggering
 222 a~cascade of cell copies. For example, when a~node of a~binary search tree is
 223 updated, all nodes on the path from that node to the root have to be copied.
 224
 225 One of the advantages of this model is that the states of the machine are
 226 persistent --- it is possible to return to a~previously visited state by recalling
 227 the $\O(1)$ values of the registers (everything else could not have changed
 228 since that time) and ``fork'' the computations. This corresponds to the semantics
 229 of pure functional languages, e.g., Haskell.
 230
 231 Unless we are willing to accept a~logarithmic penalty in execution time and space
 232 (in fact, our emulation of the Word-RAM on the PM can be easily made immutable),
 233 the design of efficient algorithms for the immutable PM requires very different
 234 techniques. Therefore, we will concentrate on the imperative models instead
 235 and refer the interested reader to the thorough treatment of purely functional
 236 data structures in the Okasaki's monograph~\cite{okasaki:funcds}.
 237
 238 %--------------------------------------------------------------------------------
 239
 240 \section{Bucket sorting and contractions}\id{bucketsort}%
 241
 242 The Contractive Bor\o{u}vka's algorithm (\ref{contbor}) needed to contract a~given
 243 set of edges in the current graph and flatten it afterwards, all in time $\O(m)$.
 244 We have spared the technical details for this section and they will be useful
 245 in further algorithms, too.
 246
 247 As already suggested, the contractions can be performed by building an~auxiliary
 248 graph and finding its connected components, so we will take care of the flattening
 249 only.
 250
 251 \para
 252 On the RAM, it is sufficient to sort the edges lexicographically (each edge viewed
 253 as an~ordered pair of vertex identifiers with the smaller of the identifiers placed
 254 first). We can do that by a two-pass bucket sort with~$n$ buckets corresponding
 255 to the vertex identifiers.
 256
 257 However, there is a~catch in this. Suppose that we use the standard representation
 258 of graphs as adjacency lists whose heads are stored in an array indexed by vertex
 259 identifiers. When we contract and flatten the graph, the number of vertices decreases,
 260 but if we inherit the original vertex identifiers, the arrays will still have the
 261 same size. Hence we spend a~super-linear amount of time on scanning the arrays,
 262 most of the time skipping unused entries.
 263
 264 To avoid this, we just renumber the vertices after each contraction to component
 265 identifiers from the auxiliary graph and we create a~new vertex array. This way,
 266 the representation of the graph will be linear with respect to the size of the
 267 current graph.
 268
 269 \para
 270 The pointer representation of graphs does not suffer from sparsity as the vertices
 271 are always identified by pointers to per-vertex structures. Each such structure
 272 then contains all attributes associated with the vertex, including the head of the
 273 adjacency list. However, we have to find a~way how to perform bucket sorting
 274 without arrays.
 275
 276 We will keep a~list of the per-vertex structures which defines the order on~vertices.
 277 Each such structure will contain a~pointer to the head of the corresponding bucket,
 278 again stored as a~list. Putting an~edge to a~bucket can be done in constant time then,
 279 scanning all~$n$ buckets takes $\O(n+m)$ time.
 280
 281 \FIXME{Add an example of locally determined orders, e.g., tree isomorphism?}
 282
 283 \endpart