From be43ba7df64f235d681a52bac8a578c9e5e78c72 Mon Sep 17 00:00:00 2001 From: Martin Mares Date: Sun, 3 Feb 2008 20:50:34 +0100 Subject: [PATCH] RAM correctures. --- biblio.bib | 17 ++++++++ ram.tex | 126 ++++++++++++++++++++++++++++------------------------- 2 files changed, 83 insertions(+), 60 deletions(-) diff --git a/biblio.bib b/biblio.bib index 4bcfdef..e74fa09 100644 --- a/biblio.bib +++ b/biblio.bib @@ -622,3 +622,20 @@ inproceedings{ pettie:minirand, journal={Proceedings of the 2nd Electrotechnical and Computer Science Conference, Portoroz, Slovenia}, year={1993} } + +@article{ turan:succinct, + title={{Succinct representation of graphs.}}, + author={Tur\'an, G.}, + journal={Discrete Applied Mathematics}, + volume={8}, + number={3}, + pages={289--294}, + year={1984} +} + +@book{ jones:haskell, + title={{Haskell 98 Language and Libraries: The Revised Report}}, + author={Peyton Jones, Simon}, + year={2003}, + publisher={Cambridge University Press} +} diff --git a/ram.tex b/ram.tex index 9f95610..d178ec2 100644 --- a/ram.tex +++ b/ram.tex @@ -7,7 +7,7 @@ \section{Models and machines} Traditionally, computer scientists use a~variety of computational models -for a~formalism in which their algorithms are stated. If we were studying +as a~formalism in which their algorithms are stated. If we were studying NP-completeness, we could safely assume that all the models are equivalent, possibly up to polynomial slowdown which is negligible. In our case, the differences between good and not-so-good algorithms are on a~much smaller @@ -17,31 +17,31 @@ data structures taking advantage of the fine details of the models. We would like to keep the formalism close enough to the reality of the contemporary computers. This rules out Turing machines and similar sequentially addressed -models, but even the remaining models are subtly different. 
For example, some of them -allow indexing of arrays in constant time, while the others have no such operation -and arrays have to be emulated with pointer structures, requiring $\Omega(\log n)$ +models, but even the remaining models are subtly different from each other. For example, some of them +allow indexing of arrays in constant time, while on the others, +arrays have to be emulated with pointer structures, requiring $\Omega(\log n)$ time to access a~single element of an~$n$-element array. It is hard to say which way is superior --- while most ``real'' computers have instructions for constant-time indexing, it seems to be physically impossible to fulfil this promise regardless of -the size of memory. Indeed, at the level of logical gates, the depth of the -actual indexing circuits is logarithmic. +the size of memory. Indeed, at the level of logical gates inside the computer, +the depth of the actual indexing circuits is logarithmic. In recent decades, most researchers in the area of combinatorial algorithms have been considering two computational models: the Random Access Machine and the Pointer -Machine. The former one is closer to the programmer's view of a~real computer, -the latter one is slightly more restricted and ``asymptotically safe.'' +Machine. The former is closer to the programmer's view of a~real computer, +the latter is slightly more restricted and ``asymptotically safe.'' We will follow this practice and study our algorithms in both models. \para -The \df{Random Access Machine (RAM)} is not a~single model, but rather a~family -of closely related models, sharing the following properties. +The \df{Random Access Machine (RAM)} is not a~single coherent model, but rather a~family +of closely related machines, sharing the following properties. (See Cook and Reckhow \cite{cook:ram} for one of the usual formal definitions and Hagerup \cite{hagerup:wordram} for a~thorough description of the differences between the RAM variants.) 
-The \df{memory} of the model is represented by an~array of \df{memory cells} +The \df{memory} of the machine is represented by an~array of \df{memory cells} addressed by non-negative integers, each of them containing a~single non-negative integer. -The \df{program} is a~sequence of \df{instructions} of two basic kinds: calculation +The \df{program} is a~finite sequence of \df{instructions} of two basic kinds: calculation instructions and control instructions. \df{Calculation instructions} have two source arguments and one destination @@ -55,15 +55,15 @@ the program), conditional branches (e.g., jump if two arguments specified as in the calculation instructions are equal) and an~instruction to halt the program. At the beginning of the computation, the memory contains the input data -in specified memory cells and arbitrary values in all other cells. +in specified cells and arbitrary values in all other cells. Then the program is executed one instruction at a~time. When it halts, specified memory cells are interpreted as the program's output. \para\id{wordsize}% -In the description of the RAM family, we have omitted several properties +In the description of the RAM family, we have omitted several details on~purpose, because different members of the family define them differently. These are: the size of the available integers, the time complexity of a~single -instruction, the space complexity of a~single memory cell and the repertoire +instruction, the space complexity assigned to a~single memory cell and the set of operations available in calculation instructions. If we impose no limits on the magnitude of the numbers and we assume that @@ -82,20 +82,20 @@ avoid this behavior: cells. \:Place a~limit on the size of the numbers ---define the \df{word size~$W$,} the number of bits available in the memory cells--- and keep the cost of - of instructions and memory cells constant. The word size must not be constant, + instructions and memory cells constant. 
The word size must not be constant, since we can address only~$2^W$ cells of memory. If the input of the algorithm is stored in~$N$ cells, we need~$W\ge\log N$ just to be able to read the input. On the other hand, we are interested in polynomial-time algorithms only, so $\Theta(\log N)$-bit - numbers should be sufficient. In practice, we pick~$w$ to be the larger of + numbers should be sufficient. In practice, we pick~$W$ to be the larger of $\Theta(\log N)$ and the size of integers used in the algorithm's input and output. - We will call an integer stored in a~single memory cell a~\df{machine word.} + We will call an integer which fits in a~single memory cell a~\df{machine word.} \endlist Both restrictions easily avoid the problems of unbounded parallelism. The first choice is theoretically cleaner and Cook et al.~show nice correspondences to the standard complexity classes, but the calculations of time and space complexity tend to be somewhat tedious. What more, when compared with the RAM with restricted -word size, the complexities are usually exactly $\Theta(w)$ times higher. +word size, the complexities are usually exactly $\Theta(W)$ times higher. This does not hold in general (consider a~program which uses many small numbers and $\O(1)$ large ones), but it is true for the algorithms we are interested in. Therefore we will always assume that the operations have unit cost and we make @@ -119,8 +119,9 @@ As for the choice of RAM operations, the following three instruction sets are of Thorup discusses the usual techniques employed by RAM algorithms in~\cite{thorup:aczero} and he shows that they work on both Word-RAM and ${\rm AC}^0$-RAM, but the combination of the two restrictions is too weak. On the other hand, the intersection of~${\rm AC}^0$ -with the instruction set of modern processors (e.g., adding some floating-point -operations and multimedia instructions available on the Intel's Pentium~4~\cite{intel:pentium}) is already strong enough. 
+with the instruction set of modern processors is already strong enough (e.g., when we +add some floating-point operations and multimedia instructions available on the Intel's +Pentium~4~\cite{intel:pentium}). We will therefore use the Word-RAM instruction set, mentioning differences from the ${\rm AC}^0$-RAM where necessary. @@ -152,18 +153,18 @@ and \df{pointers}. The memory of the machine consists of a~fixed amount of \df{r (some of them capable of storing a~single symbol, each of the others holds a~single pointer) and an~arbitrary amount of \df{cells}. The structure of all cells is the same: each of them again contains a~fixed number of fields for symbols and pointers. Registers can be addressed -directly, the cells only via pointers --- either by using a~pointer stored in a~register, +directly, the cells only via pointers --- by using a~pointer stored either in a~register, or in a~cell pointed to by a~register (longer chains of pointers cannot be followed in constant time). We can therefore view the whole memory as a~directed graph, whose vertices correspond to the cells (the registers are stored in a~single special cell). -The outgoing edges of each vertex correspond to pointer fields and they are +The outgoing edges of each vertex correspond to pointer fields of the cells and they are labelled with distinct labels drawn from a~finite set. In addition to that, each vertex contains a~fixed amount of symbols. The program can directly access vertices within distance~2 from the register vertex. -The program is a~sequence of instructions of the following kinds: +The program is a~finite sequence of instructions of the following kinds: \itemize\ibull \:\df{symbol instructions,} which read a~pair of symbols, apply an~arbitrary @@ -181,8 +182,14 @@ have unit cost and so do all memory cells. Both input and output of the machine are passed in the form of a~linked structure pointed to by a~designated register. 
For example, we can pass graphs back and forth without having to encode them as strings of numbers or symbols. This is important, -because with the finite alphabet of the~PM, all symbolic representations of graphs -require super-linear space and therefore also time. +because with the finite alphabet of the~PM, symbolic representations of graphs +generally require super-linear space and therefore also time.\foot{% +The usual representation of edges as pairs of vertex labels uses $\Theta(m\log n)$ bits +and as a~simple counting argument shows, this is asymptotically optimal for general +sparse graphs. On the other hand, specific families of sparse graphs can be stored +more efficiently, e.g., by a~remarkable result of Tur\'an~\cite{turan:succinct}, +planar graphs can be encoded in~$\O(n)$ bits. Encoding of dense graphs is of +course trivial as the adjacency matrix has only~$\Theta(n^2)$ bits.} \para Compared to the RAM, the PM lacks two important capabilities: indexing of arrays @@ -192,11 +199,10 @@ also going to prove that the RAM is strictly stronger, so we will prefer to formulate our algorithms in the PM model and use RAM only when necessary. \thm -Every program for the Word-RAM with word size~$W$ can be translated to a~program -computing the same\foot{Given a~suitable encoding of inputs and outputs, of course.} -on the~PM with $\O(W^2)$ slowdown. If the RAM program does not -use multiplication, division and remainder operations, $\O(W)$~slowdown -is sufficient. +Every program for the Word-RAM with word size~$W$ can be translated to a~PM program +computing the same with $\O(W^2)$ slowdown (given a~suitable encoding of inputs and +outputs, of course). If the RAM program does not use multiplication, division +and remainder operations, $\O(W)$~slowdown is sufficient. 
\proofsketch Represent the memory of the RAM by a~balanced binary search tree or by a~radix @@ -208,7 +214,7 @@ remainders which take $\O(W^2)$.\foot{We could use more efficient arithmetic algorithms, but the quadratic bound is good enough for our purposes.} \qed -\FIXME{Add references.} +\FIXME{Add references, especially to the unbounded parallelism remark.} \thm Every program for the PM running in polynomial time can be translated to a~program @@ -216,15 +222,15 @@ computing the same on the Word-RAM with only $\O(1)$ slowdown. \proofsketch Encode each cell of the PM's memory to $\O(1)$ integers. Store the encoded cells to -memory of the RAM sequentially and use memory addresses as pointers. As the symbols +the memory of the RAM sequentially and use memory addresses as pointers. As the symbols are finite and there is only a~polynomial number of cells allocated during execution -of the program, $\O(\log N)$-bit integers suffice. +of the program, $\O(\log N)$-bit integers suffice ($N$~is the size of the program's input). \qed \para There are also \df{randomized} versions of both machines. These are equipped with an~additional instruction for generating a~single random bit. The standard -techniques of design and analysis of randomized algorithms then apply (see for +techniques of design and analysis of randomized algorithms apply (see for example Motwani and Raghavan~\cite{motwani:randalg}). \FIXME{Consult sources. Does it make more sense to generate random words at once on the RAM?} @@ -236,14 +242,14 @@ ordinary PM by the inability to modify existing memory cells. Only the contents of the registers are allowed to change. All cell modifications thus have to be performed by creating a~copy of the particular cell with some fields changed. This in turn requires the pointers to the cell to be updated, possibly triggering -a~cascade of cell copies. For example, when a~node of a~binary search tree is +a~cascade of further cell copies. 
For example, when a~node of a~binary search tree is updated, all nodes on the path from that node to the root have to be copied. One of the advantages of this model is that the states of the machine are persistent --- it is possible to return to a~previously visited state by recalling the $\O(1)$ values of the registers (everything else could not have changed since that time) and ``fork'' the computations. This corresponds to the semantics -of pure functional languages, e.g., Haskell. +of pure functional languages, e.g., Haskell~\cite{jones:haskell}. Unless we are willing to accept a~logarithmic penalty in execution time and space (in fact, our emulation of the Word-RAM on the PM can be easily made immutable), @@ -257,12 +263,12 @@ data structures in the Okasaki's monograph~\cite{okasaki:funcds}. \section{Bucket sorting and contractions}\id{bucketsort}% The Contractive Bor\o{u}vka's algorithm (\ref{contbor}) needed to contract a~given -set of edges in the current graph and flatten it afterwards, all in time $\O(m)$. +set of edges in the current graph and flatten it afterwards, all this in time $\O(m)$. We have spared the technical details for this section and they will be useful in further algorithms, too. As already suggested, the contractions can be performed by building an~auxiliary -graph and finding its connected components, so we will take care of the flattening +graph and finding its connected components. Thus we will take care of the flattening only. \para @@ -272,25 +278,25 @@ first). We can do that by a two-pass bucket sort with~$n$ buckets corresponding to the vertex identifiers. However, there is a~catch in this. Suppose that we use the standard representation -of graphs as adjacency lists whose heads are stored in an array indexed by vertex +of graphs by adjacency lists whose heads are stored in an array indexed by vertex identifiers. 
When we contract and flatten the graph, the number of vertices decreases, but if we inherit the original vertex identifiers, the arrays will still have the -same size. Hence we spend a~super-linear amount of time on scanning the arrays, -most of the time skipping unused entries. +same size. Hence we spend a~super-linear amount of time on scanning the increasingly +sparse arrays, most of the time skipping unused entries. -To avoid this, we just renumber the vertices after each contraction to component +To avoid this, we have to renumber the vertices after each contraction to component identifiers from the auxiliary graph and we create a~new vertex array. This way, -the representation of the graph will be linear with respect to the size of the +the representation of the graph will be kept linear with respect to the size of the current graph. \para The pointer representation of graphs does not suffer from sparsity as the vertices are always identified by pointers to per-vertex structures. Each such structure -then contains all attributes associated with the vertex, including the head of the +then contains all attributes associated with the vertex, including the head of its adjacency list. However, we have to find a~way how to perform bucket sorting without arrays. -We will keep a~list of the per-vertex structures which defines the order on~vertices. +We will keep a~list of the per-vertex structures which defines the order of~vertices. Each such structure will contain a~pointer to the head of the corresponding bucket, again stored as a~list. Putting an~edge to a~bucket can be done in constant time then, scanning all~$n$ buckets takes $\O(n+m)$ time. @@ -303,34 +309,34 @@ scanning all~$n$ buckets takes $\O(n+m)$ time. There is a~lot of data structures designed specifically for the RAM, taking advantage of both indexing and arithmetics. 
In many cases, they surpass the known -lower bounds for the same problem on the~PM and often achieve constant time +lower bounds for the same problem on the~PM and they often achieve constant time per operation, at least when either the magnitude of the values or the size of the data structure are suitably bounded. A~classical result of this type are the trees of van Emde Boas~\cite{boas:vebt}, which represent a~subset of the integers $\{0,\ldots,U-1\}$, allowing insertion, deletion and order operations (minimum, maximum, successor etc.) in time $\O(\log\log U)$, -regardless of the size of the subset. If we plug this structure in the Jarn\'\i{}k's -algorithm (\ref{jarnik}), replacing the heap, we immediately get an~algorithm +regardless of the size of the subset. If we replace the heap used in the Jarn\'\i{}k's +algorithm (\ref{jarnik}) by this structure, we immediately get an~algorithm for finding the MST in integer-weighted graphs in time $\O(m\log\log w_{max})$, where $w_{max}$ is the maximum weight. We will show later that it is even possible to achieve linear time complexity for arbitrary integer weights. A~real breakthrough has been made by Fredman and Willard, who introduced the Fusion trees~\cite{fw:fusion} which again perform membership and predecessor -operation on a~set of integers, but this time with complexity $\O(\log_W n)$ +operation on a~set of $n$~integers, but this time with complexity $\O(\log_W n)$ per operation on a~Word-RAM with $W$-bit words. This of course assumes that -each element of the set fits in a~single word. As $W$ is at least~$\log n$, -the operations take $\O(\log n/\log\log n)$ and we are able to sort integers +each element of the set fits in a~single word. As $W$ must be at least~$\log n$, +the operations take $\O(\log n/\log\log n)$ and we are able to sort $n$~integers in time~$o(n\log n)$. This was a~beginning of a~long sequence of faster and -faster integer sorting algorithms, culminating with the work by Thorup and Han. 
-They have improved the time complexity to $\O(n\log\log n)$ deterministically~\cite{han:detsort} +faster sorting algorithms, culminating with the work by Thorup and Han. +They have improved the time complexity of integer sorting to $\O(n\log\log n)$ deterministically~\cite{han:detsort} and expected $\O(n\sqrt{\log\log n})$ for randomized algorithms~\cite{hanthor:randsort}, both in linear space. Despite the recent progress, the corner-stone of most RAM data structures is still the representation of data structures by integers introduced by Fredman -and Willard and it will also form a~basis for the rest of this chapter. +and Willard. It will also form a~basis for the rest of this chapter. \FIXME{Add more history.} @@ -338,13 +344,13 @@ and Willard and it will also form a~basis for the rest of this chapter. \section{Bits and vectors} -In this rather technical section, we will show how RAM can be used as a~vector +In this rather technical section, we will show how the RAM can be used as a~vector computer to operate in parallel on multiple elements, as long as these elements -fit in a~single machine word. On the first sight this might seem useless, as we +fit in a~single machine word. At first sight this might seem useless, because we cannot require large word sizes, but surprisingly often the elements are small enough relative to the size of the algorithm's input and thus also relative to the minimum possible word size. Also, as the following lemma shows, we can -easily emulate constant-times longer words: +easily emulate slightly longer words: \lemman{Multiple-precision calculations} Given a~RAM with $W$-bit words, we can emulate all calculation and control @@ -352,8 +358,8 @@ instructions of a~RAM with word size $kW$ in time depending only on the~$k$. (This is usually called \df{multiple-precision arithmetics.}) \proof -We split each word of the ``big'' machine to $W'=W/2$-bit blocks and store these -blocks in $2k$ consecutive memory cells. 
Addition, subtraction, comparison, +We split each word of the ``big'' machine to $W'$-bit blocks, where $W'=W/2$, and store these +blocks in $2k$ consecutive memory cells. Addition, subtraction, comparison and bitwise logical operations can be performed block-by-block. Shifts by a~multiple of~$W'$ are trivial, otherwise we can combine each block of the result from shifted versions of two original blocks. -- 2.39.2