RAM corrections.

diff --git a/ram.tex b/ram.tex

index 9f95610a8a9074a29c6a526035cd8f487a8159f1..d178ec2a5e54bb1b9c588cef104b126309db78fb 100644 (file)
--- a/ram.tex
+++ b/ram.tex
@@ -7,7 +7,7 @@
  \section{Models and machines}
  
  Traditionally, computer scientists use a~variety of computational models
-for a~formalism in which their algorithms are stated. If we were studying
+as a~formalism in which their algorithms are stated. If we were studying
  NP-completeness, we could safely assume that all the models are equivalent,
  possibly up to polynomial slowdown which is negligible. In our case, the
  differences between good and not-so-good algorithms are on a~much smaller
@@ -17,31 +17,31 @@ data structures taking advantage of the fine details of the models.
  
  We would like to keep the formalism close enough to the reality of the contemporary
  computers. This rules out Turing machines and similar sequentially addressed
-models, but even the remaining models are subtly different. For example, some of them
-allow indexing of arrays in constant time, while the others have no such operation
-and arrays have to be emulated with pointer structures, requiring $\Omega(\log n)$
+models, but even the remaining models are subtly different from each other. For example, some of them
+allow indexing of arrays in constant time, while on the others,
+arrays have to be emulated with pointer structures, requiring $\Omega(\log n)$
  time to access a~single element of an~$n$-element array. It is hard to say which
  way is superior --- while most ``real'' computers have instructions for constant-time
  indexing, it seems to be physically impossible to fulfil this promise regardless of
-the size of memory. Indeed, at the level of logical gates, the depth of the
-actual indexing circuits is logarithmic.
+the size of memory. Indeed, at the level of logical gates inside the computer,
+the depth of the actual indexing circuits is logarithmic.
  
  In recent decades, most researchers in the area of combinatorial algorithms
  have been considering two computational models: the Random Access Machine and the Pointer
-Machine. The former one is closer to the programmer's view of a~real computer,
-the latter one is slightly more restricted and ``asymptotically safe.''
+Machine. The former is closer to the programmer's view of a~real computer,
+the latter is slightly more restricted and ``asymptotically safe.''
  We will follow this practice and study our algorithms in both models.
  
  \para
-The \df{Random Access Machine (RAM)} is not a~single model, but rather a~family
-of closely related models, sharing the following properties.
+The \df{Random Access Machine (RAM)} is not a~single coherent model, but rather a~family
+of closely related machines, sharing the following properties.
  (See Cook and Reckhow \cite{cook:ram} for one of the usual formal definitions
  and Hagerup \cite{hagerup:wordram} for a~thorough description of the differences
  between the RAM variants.)
  
-The \df{memory} of the model is represented by an~array of \df{memory cells}
+The \df{memory} of the machine is represented by an~array of \df{memory cells}
  addressed by non-negative integers, each of them containing a~single non-negative integer.
-The \df{program} is a~sequence of \df{instructions} of two basic kinds: calculation
+The \df{program} is a~finite sequence of \df{instructions} of two basic kinds: calculation
  instructions and control instructions.
  
  \df{Calculation instructions} have two source arguments and one destination
@@ -55,15 +55,15 @@ the program), conditional branches (e.g., jump if two arguments specified as
  in the calculation instructions are equal) and an~instruction to halt the program.
  
  At the beginning of the computation, the memory contains the input data
-in specified memory cells and arbitrary values in all other cells.
+in specified cells and arbitrary values in all other cells.
  Then the program is executed one instruction at a~time. When it halts,
  specified memory cells are interpreted as the program's output.
  
  \para\id{wordsize}%
-In the description of the RAM family, we have omitted several properties
+In the description of the RAM family, we have omitted several details
  on~purpose, because different members of the family define them differently.
  These are: the size of the available integers, the time complexity of a~single
-instruction, the space complexity of a~single memory cell and the repertoire
+instruction, the space complexity assigned to a~single memory cell and the set
  of operations available in calculation instructions.
  
  If we impose no limits on the magnitude of the numbers and we assume that
@@ -82,20 +82,20 @@ avoid this behavior:
    cells.
  \:Place a~limit on the size of the numbers ---define the \df{word size~$W$,}
    the number of bits available in the memory cells--- and keep the cost of
-  of instructions and memory cells constant. The word size must not be constant,
+  instructions and memory cells constant. The word size must not be constant,
    since we can address only~$2^W$ cells of memory. If the input of the algorithm
    is stored in~$N$ cells, we need~$W\ge\log N$ just to be able to read the input.
    On the other hand, we are interested in polynomial-time algorithms only, so $\Theta(\log N)$-bit
-  numbers should be sufficient. In practice, we pick~$w$ to be the larger of
+  numbers should be sufficient. In practice, we pick~$W$ to be the larger of
    $\Theta(\log N)$ and the size of integers used in the algorithm's input and output.
-  We will call an integer stored in a~single memory cell a~\df{machine word.}
+  We will call an integer which fits in a~single memory cell a~\df{machine word.}
  \endlist
  
  Both restrictions easily avoid the problems of unbounded parallelism. The first
  choice is theoretically cleaner and Cook et al.~show nice correspondences to the
  standard complexity classes, but the calculations of time and space complexity tend
  to be somewhat tedious. What is more, when compared with the RAM with restricted
-word size, the complexities are usually exactly $\Theta(w)$ times higher.
+word size, the complexities are usually exactly $\Theta(W)$ times higher.
  This does not hold in general (consider a~program which uses many small numbers
  and $\O(1)$ large ones), but it is true for the algorithms we are interested in.
  Therefore we will always assume that the operations have unit cost and we make
@@ -119,8 +119,9 @@ As for the choice of RAM operations, the following three instruction sets are of
  Thorup discusses the usual techniques employed by RAM algorithms in~\cite{thorup:aczero}
  and he shows that they work on both Word-RAM and ${\rm AC}^0$-RAM, but the combination
  of the two restrictions is too weak. On the other hand, the intersection of~${\rm AC}^0$
-with the instruction set of modern processors (e.g., adding some floating-point
-operations and multimedia instructions available on the Intel's Pentium~4~\cite{intel:pentium}) is already strong enough.
+with the instruction set of modern processors is already strong enough (e.g., when we
+add some floating-point operations and multimedia instructions available on Intel's
+Pentium~4~\cite{intel:pentium}).
  
  We will therefore use the Word-RAM instruction set, mentioning differences from the
  ${\rm AC}^0$-RAM where necessary.
@@ -152,18 +153,18 @@ and \df{pointers}. The memory of the machine consists of a~fixed amount of \df{r
  (some of them capable of storing a~single symbol, each of the others holds a~single pointer)
  and an~arbitrary amount of \df{cells}. The structure of all cells is the same: each of them
  again contains a~fixed number of fields for symbols and pointers. Registers can be addressed
-directly, the cells only via pointers --- either by using a~pointer stored in a~register,
+directly, the cells only via pointers --- by using a~pointer stored either in a~register,
  or in a~cell pointed to by a~register (longer chains of pointers cannot be followed in
  constant time).
  
  We can therefore view the whole memory as a~directed graph, whose vertices
  correspond to the cells (the registers are stored in a~single special cell).
-The outgoing edges of each vertex correspond to pointer fields and they are
+The outgoing edges of each vertex correspond to pointer fields of the cells and they are
  labelled with distinct labels drawn from a~finite set. In addition to that,
  each vertex contains a~fixed amount of symbols. The program can directly access
  vertices within distance~2 from the register vertex.
  
-The program is a~sequence of instructions of the following kinds:
+The program is a~finite sequence of instructions of the following kinds:
  
  \itemize\ibull
  \:\df{symbol instructions,} which read a~pair of symbols, apply an~arbitrary
@@ -181,8 +182,14 @@ have unit cost and so do all memory cells.
  Both input and output of the machine are passed in the form of a~linked structure
  pointed to by a~designated register. For example, we can pass graphs back and forth
  without having to encode them as strings of numbers or symbols. This is important,
-because with the finite alphabet of the~PM, all symbolic representations of graphs
-require super-linear space and therefore also time.
+because with the finite alphabet of the~PM, symbolic representations of graphs
+generally require super-linear space and therefore also time.\foot{%
+The usual representation of edges as pairs of vertex labels uses $\Theta(m\log n)$ bits
+and as a~simple counting argument shows, this is asymptotically optimal for general
+sparse graphs. On the other hand, specific families of sparse graphs can be stored
+more efficiently, e.g., by a~remarkable result of Tur\'an~\cite{turan:succinct},
+planar graphs can be encoded in~$\O(n)$ bits. Encoding of dense graphs is of
+course trivial as the adjacency matrix has only~$\Theta(n^2)$ bits.}
  
  \para
  Compared to the RAM, the PM lacks two important capabilities: indexing of arrays
@@ -192,11 +199,10 @@ also going to prove that the RAM is strictly stronger, so we will prefer to
  formulate our algorithms in the PM model and use RAM only when necessary.
  
  \thm
-Every program for the Word-RAM with word size~$W$ can be translated to a~program
-computing the same\foot{Given a~suitable encoding of inputs and outputs, of course.}
-on the~PM with $\O(W^2)$ slowdown. If the RAM program does not
-use multiplication, division and remainder operations, $\O(W)$~slowdown
-is sufficient.
+Every program for the Word-RAM with word size~$W$ can be translated to a~PM program
+computing the same with $\O(W^2)$ slowdown (given a~suitable encoding of inputs and
+outputs, of course). If the RAM program does not use multiplication, division
+and remainder operations, $\O(W)$~slowdown is sufficient.
  
  \proofsketch
  Represent the memory of the RAM by a~balanced binary search tree or by a~radix
@@ -208,7 +214,7 @@ remainders which take $\O(W^2)$.\foot{We could use more efficient arithmetic
  algorithms, but the quadratic bound is good enough for our purposes.}
  \qed
  
-\FIXME{Add references.}
+\FIXME{Add references, especially to the unbounded parallelism remark.}
  
  \thm
  Every program for the PM running in polynomial time can be translated to a~program
@@ -216,15 +222,15 @@ computing the same on the Word-RAM with only $\O(1)$ slowdown.
  
  \proofsketch
  Encode each cell of the PM's memory to $\O(1)$ integers. Store the encoded cells to
-memory of the RAM sequentially and use memory addresses as pointers. As the symbols
+the memory of the RAM sequentially and use memory addresses as pointers. As the symbols
  are finite and there is only a~polynomial number of cells allocated during execution
-of the program, $\O(\log N)$-bit integers suffice.
+of the program, $\O(\log N)$-bit integers suffice ($N$~is the size of the program's input).
  \qed
  
  \para
  There are also \df{randomized} versions of both machines. These are equipped
  with an~additional instruction for generating a~single random bit. The standard
-techniques of design and analysis of randomized algorithms then apply (see for
+techniques of design and analysis of randomized algorithms apply (see for
  example Motwani and Raghavan~\cite{motwani:randalg}).
  
  \FIXME{Consult sources. Does it make more sense to generate random words at once on the RAM?}
@@ -236,14 +242,14 @@ ordinary PM by the inability to modify existing memory cells. Only the contents
  of the registers are allowed to change. All cell modifications thus have to
  be performed by creating a~copy of the particular cell with some fields changed.
  This in turn requires the pointers to the cell to be updated, possibly triggering
-a~cascade of cell copies. For example, when a~node of a~binary search tree is
+a~cascade of further cell copies. For example, when a~node of a~binary search tree is
  updated, all nodes on the path from that node to the root have to be copied.
  
  One of the advantages of this model is that the states of the machine are
  persistent --- it is possible to return to a~previously visited state by recalling
  the $\O(1)$ values of the registers (everything else could not have changed
  since that time) and ``fork'' the computations. This corresponds to the semantics
-of pure functional languages, e.g., Haskell.
+of pure functional languages, e.g., Haskell~\cite{jones:haskell}.
  
  Unless we are willing to accept a~logarithmic penalty in execution time and space
  (in fact, our emulation of the Word-RAM on the PM can be easily made immutable),
@@ -257,12 +263,12 @@ data structures in the Okasaki's monograph~\cite{okasaki:funcds}.
  \section{Bucket sorting and contractions}\id{bucketsort}%
  
  The Contractive Bor\o{u}vka's algorithm (\ref{contbor}) needed to contract a~given
-set of edges in the current graph and flatten it afterwards, all in time $\O(m)$.
+set of edges in the current graph and flatten it afterwards, all this in time $\O(m)$.
  We have spared the technical details for this section and they will be useful
  in further algorithms, too.
  
  As already suggested, the contractions can be performed by building an~auxiliary
-graph and finding its connected components, so we will take care of the flattening
+graph and finding its connected components. Thus we will take care of the flattening
  only.
  
  \para
@@ -272,25 +278,25 @@ first). We can do that by a two-pass bucket sort with~$n$ buckets corresponding
  to the vertex identifiers.
  
  However, there is a~catch in this. Suppose that we use the standard representation
-of graphs as adjacency lists whose heads are stored in an array indexed by vertex
+of graphs by adjacency lists whose heads are stored in an array indexed by vertex
  identifiers. When we contract and flatten the graph, the number of vertices decreases,
  but if we inherit the original vertex identifiers, the arrays will still have the
-same size. Hence we spend a~super-linear amount of time on scanning the arrays,
-most of the time skipping unused entries.
+same size. Hence we spend a~super-linear amount of time on scanning the increasingly
+sparse arrays, most of the time skipping unused entries.
  
-To avoid this, we just renumber the vertices after each contraction to component
+To avoid this, we have to renumber the vertices after each contraction to component
  identifiers from the auxiliary graph and we create a~new vertex array. This way,
-the representation of the graph will be linear with respect to the size of the
+the representation of the graph will be kept linear with respect to the size of the
  current graph.
  
  \para
  The pointer representation of graphs does not suffer from sparsity as the vertices
  are always identified by pointers to per-vertex structures. Each such structure
-then contains all attributes associated with the vertex, including the head of the
+then contains all attributes associated with the vertex, including the head of its
  adjacency list. However, we have to find a~way to perform bucket sorting
  without arrays.
  
-We will keep a~list of the per-vertex structures which defines the order on~vertices.
+We will keep a~list of the per-vertex structures which defines the order of~vertices.
  Each such structure will contain a~pointer to the head of the corresponding bucket,
  again stored as a~list. Putting an~edge to a~bucket can be done in constant time then,
  scanning all~$n$ buckets takes $\O(n+m)$ time.
@@ -303,34 +309,34 @@ scanning all~$n$ buckets takes $\O(n+m)$ time.
  
  There are many data structures designed specifically for the RAM, taking
  advantage of both indexing and arithmetic. In many cases, they surpass the known
-lower bounds for the same problem on the~PM and often achieve constant time
+lower bounds for the same problem on the~PM and they often achieve constant time
  per operation, at least when either the magnitude of the values or the size of
  the data structure are suitably bounded.
  
  A~classical result of this type are the trees of van Emde Boas~\cite{boas:vebt},
  which represent a~subset of the integers $\{0,\ldots,U-1\}$, allowing insertion,
  deletion and order operations (minimum, maximum, successor etc.) in time $\O(\log\log U)$,
-regardless of the size of the subset. If we plug this structure in the Jarn\'\i{}k's
-algorithm (\ref{jarnik}), replacing the heap, we immediately get an~algorithm
+regardless of the size of the subset. If we replace the heap used in Jarn\'\i{}k's
+algorithm (\ref{jarnik}) by this structure, we immediately get an~algorithm
  for finding the MST in integer-weighted graphs in time $\O(m\log\log w_{max})$,
  where $w_{max}$ is the maximum weight. We will show later that it is even
  possible to achieve linear time complexity for arbitrary integer weights.
  
  A~real breakthrough has been made by Fredman and Willard, who introduced
  the Fusion trees~\cite{fw:fusion} which again perform membership and predecessor
-operation on a~set of integers, but this time with complexity $\O(\log_W n)$
+operation on a~set of $n$~integers, but this time with complexity $\O(\log_W n)$
  per operation on a~Word-RAM with $W$-bit words. This of course assumes that
-each element of the set fits in a~single word. As $W$ is at least~$\log n$,
-the operations take $\O(\log n/\log\log n)$ and we are able to sort integers
+each element of the set fits in a~single word. As $W$ must be at least~$\log n$,
+the operations take $\O(\log n/\log\log n)$ and we are able to sort $n$~integers
  in time~$o(n\log n)$. This was a~beginning of a~long sequence of faster and
-faster integer sorting algorithms, culminating with the work by Thorup and Han.
-They have improved the time complexity to $\O(n\log\log n)$ deterministically~\cite{han:detsort}
+faster sorting algorithms, culminating with the work by Thorup and Han.
+They have improved the time complexity of integer sorting to $\O(n\log\log n)$ deterministically~\cite{han:detsort}
  and expected $\O(n\sqrt{\log\log n})$ for randomized algorithms~\cite{hanthor:randsort},
  both in linear space.
  
  Despite the recent progress, the corner-stone of most RAM data structures
  is still the representation of data structures by integers introduced by Fredman
-and Willard and it will also form a~basis for the rest of this chapter.
+and Willard. It will also form a~basis for the rest of this chapter.
  
  \FIXME{Add more history.}
  
@@ -338,13 +344,13 @@ and Willard and it will also form a~basis for the rest of this chapter.
  
  \section{Bits and vectors}
  
-In this rather technical section, we will show how RAM can be used as a~vector
+In this rather technical section, we will show how the RAM can be used as a~vector
  computer to operate in parallel on multiple elements, as long as these elements
-fit in a~single machine word. On the first sight this might seem useless, as we
+fit in a~single machine word. At first sight this might seem useless, because we
  cannot require large word sizes, but surprisingly often the elements are small
  enough relative to the size of the algorithm's input and thus also relative to
  the minimum possible word size. Also, as the following lemma shows, we can
-easily emulate constant-times longer words:
+easily emulate slightly longer words:
  
  \lemman{Multiple-precision calculations}
  Given a~RAM with $W$-bit words, we can emulate all calculation and control
@@ -352,8 +358,8 @@ instructions of a~RAM with word size $kW$ in time depending only on the~$k$.
  (This is usually called \df{multiple-precision arithmetics.})
  
  \proof
-We split each word of the ``big'' machine to $W'=W/2$-bit blocks and store these
-blocks in $2k$ consecutive memory cells. Addition, subtraction, comparison,
+We split each word of the ``big'' machine into $W'$-bit blocks, where $W'=W/2$, and store these
+blocks in $2k$ consecutive memory cells. Addition, subtraction, comparison and
  bitwise logical operations can be performed block-by-block. Shifts by a~multiple
  of~$W'$ are trivial, otherwise we can combine each block of the result from
  shifted versions of two original blocks.
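
The block-by-block scheme from the proof of the multiple-precision lemma can be illustrated by a small sketch. It is not part of `ram.tex`; the word size `W = 16` and the helper names `split`, `join`, `add` and `shift_left` are assumptions chosen for the example. It shows only addition and left shifts, and every intermediate value fits in a single $W$-bit word.

```python
W = 16                    # assumed word size of the "small" machine (illustrative value)
HALF = W // 2             # block size W' = W/2
MASK = (1 << HALF) - 1    # mask selecting the low W' bits


def split(x, k):
    """Split a kW-bit integer into 2k blocks of W' bits, least significant first."""
    return [(x >> (i * HALF)) & MASK for i in range(2 * k)]


def join(blocks):
    """Reassemble the integer from its blocks (used here only for checking)."""
    return sum(b << (i * HALF) for i, b in enumerate(blocks))


def add(a, b):
    """Block-by-block addition modulo 2^(kW) with a carry between blocks."""
    out, carry = [], 0
    for x, y in zip(a, b):
        s = x + y + carry      # at most 2*(2^W' - 1) + 1, which is below 2^W
        out.append(s & MASK)
        carry = s >> HALF
    return out


def shift_left(a, s):
    """Left shift by s bits: each result block combines two shifted source blocks."""
    q, r = divmod(s, HALF)     # q whole blocks, then r remaining bits (r < W')
    out = []
    for i in range(len(a)):
        cur = a[i - q] if i - q >= 0 else 0            # supplies the upper part
        prev = a[i - q - 1] if i - q - 1 >= 0 else 0   # supplies the lower part
        out.append(((cur << r) & MASK) | (prev >> (HALF - r)))
    return out
```

For a shift by a multiple of $W'$ we get `r = 0`, so the second term vanishes and the blocks are merely moved, which matches the trivial case mentioned in the proof. The running time of both operations depends only on the number of blocks $2k$.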