RAM correctures.

[saga.git] / ram.tex
diff --git a/ram.tex b/ram.tex

index c798933b7a1e7224d8fda49ab2ae07e756f72d2b..d178ec2a5e54bb1b9c588cef104b126309db78fb 100644 (file)
--- a/ram.tex
+++ b/ram.tex
@@ -7,7 +7,7 @@
  \section{Models and machines}
  
  Traditionally, computer scientists use a~variety of computational models
-for a~formalism in which their algorithms are stated. If we were studying
+as a~formalism in which their algorithms are stated. If we were studying
  NP-completeness, we could safely assume that all the models are equivalent,
  possibly up to polynomial slowdown which is negligible. In our case, the
  differences between good and not-so-good algorithms are on a~much smaller
@@ -17,31 +17,31 @@ data structures taking advantage of the fine details of the models.
  
  We would like to keep the formalism close enough to the reality of the contemporary
  computers. This rules out Turing machines and similar sequentially addressed
-models, but even the remaining models are subtly different. For example, some of them
-allow indexing of arrays in constant time, while the others have no such operation
-and arrays have to be emulated with pointer structures, requiring $\Omega(\log n)$
+models, but even the remaining models are subtly different from each other. For example, some of them
+allow indexing of arrays in constant time, while on the others,
+arrays have to be emulated with pointer structures, requiring $\Omega(\log n)$
  time to access a~single element of an~$n$-element array. It is hard to say which
  way is superior --- while most ``real'' computers have instructions for constant-time
  indexing, it seems to be physically impossible to fulfil this promise regardless of
-the size of memory. Indeed, at the level of logical gates, the depth of the
-actual indexing circuits is logarithmic.
+the size of memory. Indeed, at the level of logical gates inside the computer,
+the depth of the actual indexing circuits is logarithmic.
  
  In recent decades, most researchers in the area of combinatorial algorithms
  have been considering two computational models: the Random Access Machine and the Pointer
-Machine. The former one is closer to the programmer's view of a~real computer,
-the latter one is slightly more restricted and ``asymptotically safe.''
+Machine. The former is closer to the programmer's view of a~real computer,
+the latter is slightly more restricted and ``asymptotically safe.''
  We will follow this practice and study our algorithms in both models.
  
  \para
-The \df{Random Access Machine (RAM)} is not a~single model, but rather a~family
-of closely related models, sharing the following properties.
+The \df{Random Access Machine (RAM)} is not a~single coherent model, but rather a~family
+of closely related machines, sharing the following properties.
  (See Cook and Reckhow \cite{cook:ram} for one of the usual formal definitions
  and Hagerup \cite{hagerup:wordram} for a~thorough description of the differences
  between the RAM variants.)
  
-The \df{memory} of the model is represented by an~array of \df{memory cells}
+The \df{memory} of the machine is represented by an~array of \df{memory cells}
  addressed by non-negative integers, each of them containing a~single non-negative integer.
-The \df{program} is a~sequence of \df{instructions} of two basic kinds: calculation
+The \df{program} is a~finite sequence of \df{instructions} of two basic kinds: calculation
  instructions and control instructions.
  
  \df{Calculation instructions} have two source arguments and one destination
@@ -55,15 +55,15 @@ the program), conditional branches (e.g., jump if two arguments specified as
  in the calculation instructions are equal) and an~instruction to halt the program.
  
  At the beginning of the computation, the memory contains the input data
-in specified memory cells and arbitrary values in all other cells.
+in specified cells and arbitrary values in all other cells.
  Then the program is executed one instruction at a~time. When it halts,
  specified memory cells are interpreted as the program's output.
  
  \para\id{wordsize}%
-In the description of the RAM family, we have omitted several properties
+In the description of the RAM family, we have omitted several details
  on~purpose, because different members of the family define them differently.
  These are: the size of the available integers, the time complexity of a~single
-instruction, the space complexity of a~single memory cell and the repertoire
+instruction, the space complexity assigned to a~single memory cell and the set
  of operations available in calculation instructions.
  
  If we impose no limits on the magnitude of the numbers and we assume that
@@ -82,19 +82,20 @@ avoid this behavior:
    cells.
  \:Place a~limit on the size of the numbers ---define the \df{word size~$W$,}
    the number of bits available in the memory cells--- and keep the cost of
-  of instructions and memory cells constant. The word size must not be constant,
+  instructions and memory cells constant. The word size must not be constant,
    since we can address only~$2^W$ cells of memory. If the input of the algorithm
    is stored in~$N$ cells, we need~$W\ge\log N$ just to be able to read the input.
    On the other hand, we are interested in polynomial-time algorithms only, so $\Theta(\log N)$-bit
-  numbers should be sufficient. In practice, we pick~$w$ to be the larger of
+  numbers should be sufficient. In practice, we pick~$W$ to be the larger of
    $\Theta(\log N)$ and the size of integers used in the algorithm's input and output.
+  We will call an integer which fits in a~single memory cell a~\df{machine word.}
  \endlist
  
  Both restrictions easily avoid the problems of unbounded parallelism. The first
  choice is theoretically cleaner and Cook et al.~show nice correspondences to the
  standard complexity classes, but the calculations of time and space complexity tend
  to be somewhat tedious. What more, when compared with the RAM with restricted
-word size, the complexities are usually exactly $\Theta(w)$ times higher.
+word size, the complexities are usually exactly $\Theta(W)$ times higher.
  This does not hold in general (consider a~program which uses many small numbers
  and $\O(1)$ large ones), but it is true for the algorithms we are interested in.
  Therefore we will always assume that the operations have unit cost and we make
@@ -105,8 +106,8 @@ As for the choice of RAM operations, the following three instruction sets are of
  
  \itemize\ibull
  \:\df{Word-RAM} --- allows the ``C-language operators'', i.e., addition,
-  subtraction, multiplication, division, remainder, bitwise {\sc and,} {\sc or,} exclusive
-  {\sc or} ({\sc xor}), negation ({\sc not}) and bitwise shifts ($\ll$ and~$\gg$).
+  subtraction, multiplication, division, remainder, bitwise $\band$, $\bor$, exclusive
+  $\bor$ ($\bxor$) and negation ($\bnot$), and bitwise shifts ($\shl$ and~$\shr$).
  \:\df{${\rm AC}^0$-RAM} --- allows all operations from the class ${\rm AC}^0$, i.e.,
    those computable by constant-depth polynomial-size boolean circuits with unlimited
    fan-in and fan-out. This includes all operations of the Word-RAM except for multiplication,
@@ -118,8 +119,9 @@ As for the choice of RAM operations, the following three instruction sets are of
  Thorup discusses the usual techniques employed by RAM algorithms in~\cite{thorup:aczero}
  and he shows that they work on both Word-RAM and ${\rm AC}^0$-RAM, but the combination
  of the two restrictions is too weak. On the other hand, the intersection of~${\rm AC}^0$
-with the instruction set of modern processors (e.g., adding some floating-point
-operations and multimedia instructions available on the Intel's Pentium~4~\cite{intel:pentium}) is already strong enough.
+with the instruction set of modern processors is already strong enough (e.g., when we
+add some floating-point operations and multimedia instructions available on the Intel's
+Pentium~4~\cite{intel:pentium}).
  
  We will therefore use the Word-RAM instruction set, mentioning differences from the
  ${\rm AC}^0$-RAM where necessary.
@@ -128,7 +130,15 @@ ${\rm AC}^0$-RAM where necessary.
  When speaking of the \df{RAM,} we implicitly mean the version with numbers limited
  by a~specified word size of $W$~bits, unit cost of operations and memory cells and the instruction
  set of the Word-RAM. This corresponds to the usage in recent algorithmic literature,
-although the authors rarely mention the details.
+although the authors rarely mention the details. In some cases, a~non-uniform variant
+of the Word-RAM is considered as well (e.g., in~\cite{hagerup:dd}):
+
+\defn\id{nonuniform}%
+A~Word-RAM is called \df{weakly non-uniform,} if it is equipped with $\O(1)$-time
+access to a~constant number of word-sized constants, which depend only on the word
+size. These are called \df{native constants} and they are available in fixed memory
+cells when the program starts. (By analogy with the high-level programming languages,
+these constants can be thought of as computed at ``compile time.'')
  
  \para
  The \df{Pointer Machine (PM)} also does not have any well established definition. The
@@ -143,18 +153,18 @@ and \df{pointers}. The memory of the machine consists of a~fixed amount of \df{r
  (some of them capable of storing a~single symbol, each of the others holds a~single pointer)
  and an~arbitrary amount of \df{cells}. The structure of all cells is the same: each of them
  again contains a~fixed number of fields for symbols and pointers. Registers can be addressed
-directly, the cells only via pointers --- either by using a~pointer stored in a~register,
+directly, the cells only via pointers --- by using a~pointer stored either in a~register,
  or in a~cell pointed to by a~register (longer chains of pointers cannot be followed in
  constant time).
  
  We can therefore view the whole memory as a~directed graph, whose vertices
  correspond to the cells (the registers are stored in a~single special cell).
-The outgoing edges of each vertex correspond to pointer fields and they are
+The outgoing edges of each vertex correspond to pointer fields of the cells and they are
  labelled with distinct labels drawn from a~finite set. In addition to that,
  each vertex contains a~fixed amount of symbols. The program can directly access
  vertices within distance~2 from the register vertex.
  
-The program is a~sequence of instructions of the following kinds:
+The program is a~finite sequence of instructions of the following kinds:
  
  \itemize\ibull
  \:\df{symbol instructions,} which read a~pair of symbols, apply an~arbitrary
@@ -172,8 +182,14 @@ have unit cost and so do all memory cells.
  Both input and output of the machine are passed in the form of a~linked structure
  pointed to by a~designated register. For example, we can pass graphs back and forth
  without having to encode them as strings of numbers or symbols. This is important,
-because with the finite alphabet of the~PM, all symbolic representations of graphs
-require super-linear space and therefore also time.
+because with the finite alphabet of the~PM, symbolic representations of graphs
+generally require super-linear space and therefore also time.\foot{%
+The usual representation of edges as pairs of vertex labels uses $\Theta(m\log n)$ bits
+and as a~simple counting argument shows, this is asymptotically optimal for general
+sparse graphs. On the other hand, specific families of sparse graphs can be stored
+more efficiently, e.g., by a~remarkable result of Tur\'an~\cite{turan:succinct},
+planar graphs can be encoded in~$\O(n)$ bits. Encoding of dense graphs is of
+course trivial as the adjacency matrix has only~$\Theta(n^2)$ bits.}
  
  \para
  Compared to the RAM, the PM lacks two important capabilities: indexing of arrays
@@ -183,11 +199,10 @@ also going to prove that the RAM is strictly stronger, so we will prefer to
  formulate our algorithms in the PM model and use RAM only when necessary.
  
  \thm
-Every program for the Word-RAM with word size~$W$ can be translated to a~program
-computing the same\foot{Given a~suitable encoding of inputs and outputs, of course.}
-on the~PM with $\O(W^2)$ slowdown. If the RAM program does not
-use multiplication, division and remainder operations, $\O(W)$~slowdown
-is sufficient.
+Every program for the Word-RAM with word size~$W$ can be translated to a~PM program
+computing the same with $\O(W^2)$ slowdown (given a~suitable encoding of inputs and
+outputs, of course). If the RAM program does not use multiplication, division
+and remainder operations, $\O(W)$~slowdown is sufficient.
  
  \proofsketch
  Represent the memory of the RAM by a~balanced binary search tree or by a~radix
@@ -196,10 +211,10 @@ to by the nodes of the tree. Both direct and indirect accesses to the memory
  can therefore be done in~$\O(W)$ time. Use standard algorithms for arithmetic
  on big numbers: $\O(W)$ per operation except for multiplication, division and
  remainders which take $\O(W^2)$.\foot{We could use more efficient arithmetic
-algorithms, of course, but the quadratic bound it good enough for our purposes.}
+algorithms, but the quadratic bound is good enough for our purposes.}
  \qed
  
-\FIXME{Add references.}
+\FIXME{Add references, especially to the unbounded parallelism remark.}
  
  \thm
  Every program for the PM running in polynomial time can be translated to a~program
@@ -207,11 +222,19 @@ computing the same on the Word-RAM with only $\O(1)$ slowdown.
  
  \proofsketch
  Encode each cell of the PM's memory to $\O(1)$ integers. Store the encoded cells to
-memory of the RAM sequentially and use memory addresses as pointers. As the symbols
+the memory of the RAM sequentially and use memory addresses as pointers. As the symbols
  are finite and there is only a~polynomial number of cells allocated during execution
-of the program, $\O(\log N)$-bit integers suffice.
+of the program, $\O(\log N)$-bit integers suffice ($N$~is the size of the program's input).
  \qed
  
+\para
+There are also \df{randomized} versions of both machines. These are equipped
+with an~additional instruction for generating a~single random bit. The standard
+techniques of design and analysis of randomized algorithms apply (see for
+example Motwani and Raghavan~\cite{motwani:randalg}).
+
+\FIXME{Consult sources. Does it make more sense to generate random words at once on the RAM?}
+
  \rem
  There is one more interesting machine: the \df{Immutable Pointer Machine} (see
  the description of LISP machines in \cite{benamram:pm}). It differs from the
@@ -219,22 +242,393 @@ ordinary PM by the inability to modify existing memory cells. Only the contents
  of the registers are allowed to change. All cell modifications thus have to
  be performed by creating a~copy of the particular cell with some fields changed.
  This in turn requires the pointers to the cell to be updated, possibly triggering
-a~cascade of cell copies. For example, when a~node of a~binary search tree is
+a~cascade of further cell copies. For example, when a~node of a~binary search tree is
  updated, all nodes on the path from that node to the root have to be copied.
  
  One of the advantages of this model is that the states of the machine are
  persistent --- it is possible to return to a~previously visited state by recalling
  the $\O(1)$ values of the registers (everything else could not have changed
  since that time) and ``fork'' the computations. This corresponds to the semantics
-of pure functional languages, e.g., Haskell.
+of pure functional languages, e.g., Haskell~\cite{jones:haskell}.
  
  Unless we are willing to accept a~logarithmic penalty in execution time and space
  (in fact, our emulation of the Word-RAM on the PM can be easily made immutable),
  the design of efficient algorithms for the immutable PM requires very different
  techniques. Therefore, we will concentrate on the imperative models instead
  and refer the interested reader to the thorough treatment of purely functional
-data structures in the Okasaki's book~\cite{okasaki:funcds}.
+data structures in the Okasaki's monograph~\cite{okasaki:funcds}.
+
+%--------------------------------------------------------------------------------
+
+\section{Bucket sorting and contractions}\id{bucketsort}%
+
+The Contractive Bor\o{u}vka's algorithm (\ref{contbor}) needed to contract a~given
+set of edges in the current graph and flatten it afterwards, all this in time $\O(m)$.
+We have spared the technical details for this section and they will be useful
+in further algorithms, too.
+
+As already suggested, the contractions can be performed by building an~auxiliary
+graph and finding its connected components. Thus we will take care of the flattening
+only.
+
+\para
+On the RAM, it is sufficient to sort the edges lexicographically (each edge viewed
+as an~ordered pair of vertex identifiers with the smaller of the identifiers placed
+first). We can do that by a two-pass bucket sort with~$n$ buckets corresponding
+to the vertex identifiers.
+
+However, there is a~catch in this. Suppose that we use the standard representation
+of graphs by adjacency lists whose heads are stored in an array indexed by vertex
+identifiers. When we contract and flatten the graph, the number of vertices decreases,
+but if we inherit the original vertex identifiers, the arrays will still have the
+same size. Hence we spend a~super-linear amount of time on scanning the increasingly
+sparse arrays, most of the time skipping unused entries.
+
+To avoid this, we have to renumber the vertices after each contraction to component
+identifiers from the auxiliary graph and we create a~new vertex array. This way,
+the representation of the graph will be kept linear with respect to the size of the
+current graph.
+
+\para
+The pointer representation of graphs does not suffer from sparsity as the vertices
+are always identified by pointers to per-vertex structures. Each such structure
+then contains all attributes associated with the vertex, including the head of its
+adjacency list. However, we have to find a~way how to perform bucket sorting
+without arrays.
+
+We will keep a~list of the per-vertex structures which defines the order of~vertices.
+Each such structure will contain a~pointer to the head of the corresponding bucket,
+again stored as a~list. Putting an~edge to a~bucket can be done in constant time then,
+scanning all~$n$ buckets takes $\O(n+m)$ time.
+
+\FIXME{Add an example of locally determined orders, e.g., tree isomorphism?}
+
+%--------------------------------------------------------------------------------
+
+\section{Data structures on the RAM}
+
+There is a~lot of data structures designed specifically for the RAM, taking
+advantage of both indexing and arithmetics. In many cases, they surpass the known
+lower bounds for the same problem on the~PM and they often achieve constant time
+per operation, at least when either the magnitude of the values or the size of
+the data structure are suitably bounded.
+
+A~classical result of this type are the trees of van Emde Boas~\cite{boas:vebt},
+which represent a~subset of the integers $\{0,\ldots,U-1\}$, allowing insertion,
+deletion and order operations (minimum, maximum, successor etc.) in time $\O(\log\log U)$,
+regardless of the size of the subset. If we replace the heap used in the Jarn\'\i{}k's
+algorithm (\ref{jarnik}) by this structure, we immediately get an~algorithm
+for finding the MST in integer-weighted graphs in time $\O(m\log\log w_{max})$,
+where $w_{max}$ is the maximum weight. We will show later that it is even
+possible to achieve linear time complexity for arbitrary integer weights.
+
+A~real breakthrough has been made by Fredman and Willard, who introduced
+the Fusion trees~\cite{fw:fusion} which again perform membership and predecessor
+operation on a~set of $n$~integers, but this time with complexity $\O(\log_W n)$
+per operation on a~Word-RAM with $W$-bit words. This of course assumes that
+each element of the set fits in a~single word. As $W$ must at least~$\log n$,
+the operations take $\O(\log n/\log\log n)$ and we are able to sort $n$~integers
+in time~$o(n\log n)$. This was a~beginning of a~long sequence of faster and
+faster sorting algorithms, culminating with the work by Thorup and Han.
+They have improved the time complexity of integer sorting to $\O(n\log\log n)$ deterministically~\cite{han:detsort}
+and expected $\O(n\sqrt{\log\log n})$ for randomized algorithms~\cite{hanthor:randsort},
+both in linear space.
+
+Despite the recent progress, the corner-stone of most RAM data structures
+is still the representation of data structures by integers introduced by Fredman
+and Willard. It will also form a~basis for the rest of this chapter.
+
+\FIXME{Add more history.}
+
+%--------------------------------------------------------------------------------
+
+\section{Bits and vectors}
+
+In this rather technical section, we will show how the RAM can be used as a~vector
+computer to operate in parallel on multiple elements, as long as these elements
+fit in a~single machine word. At the first sight this might seem useless, because we
+cannot require large word sizes, but surprisingly often the elements are small
+enough relative to the size of the algorithm's input and thus also relative to
+the minimum possible word size. Also, as the following lemma shows, we can
+easily emulate slightly longer words:
+
+\lemman{Multiple-precision calculations}
+Given a~RAM with $W$-bit words, we can emulate all calculation and control
+instructions of a~RAM with word size $kW$ in time depending only on the~$k$.
+(This is usually called \df{multiple-precision arithmetics.})
+
+\proof
+We split each word of the ``big'' machine to $W'$-bit blocks, where $W'=W/2$, and store these
+blocks in $2k$ consecutive memory cells. Addition, subtraction, comparison and
+bitwise logical operations can be performed block-by-block. Shifts by a~multiple
+of~$W'$ are trivial, otherwise we can combine each block of the result from
+shifted versions of two original blocks.
+To multiply two numbers, we can use the elementary school algorithm using the $W'$-bit
+blocks as digits in base $2^{W'}$ --- the product of any two blocks fits
+in a~single word.
+
+Division is harder, but Newton-Raphson iteration (see~\cite{ito:newrap})
+converges to the quotient in a~constant number of iterations, each of them
+involving $\O(1)$ multiple-precision additions and multiplications. A~good
+starting approximation can be obtained by dividing the two most-significant
+(non-zero) blocks of both numbers.
+
+Another approach to division is using the improved elementary school algorithm as described
+by Knuth in~\cite{knuth:seminalg}. It uses $\O(k^2)$ steps, but the steps involve
+calculation of the most significant bit set in a~word. We will show below that it
+can be done in constant time, but we have to be careful to avoid division instructions.
+\qed
+
+\notan{Bit strings}\id{bitnota}%
+We will work with binary representations of natural numbers by strings over the
+alphabet $\{\0,\1\}$: we will use $\(x)$ for the number~$x$ written in binary,
+$\(x)_b$ for the same padded to exactly $b$ bits by adding zeroes at the beginning
+and $x[k]$ for the value of the $k$-th bit of~$x$.
+The usual conventions for operations on strings will be utilized: When $s$
+and~$t$ are strings, we write $st$ for their concatenation and
+$s^k$ for the string~$s$ repeated $k$~times.
+When the meaning is clear from the context,
+we will use $x$ and $\(x)$ interchangeably to avoid outbreak of symbols.
+
+\defn
+The \df{bitwise encoding} of a~vector ${\bf x}=(x_0,\ldots,x_{d-1})$ of~$b$-bit numbers
+is an~integer~$x$ such that $\(x)=\0\(x_{d-1})_b\0\(x_{d-2})_b\0\ldots\0\(x_0)_b = \sum_i 2^{(b+1)i}\cdot x_i$.
+(We have interspersed the elements with \df{separator bits.})
+
+\notan{Vectors}\id{vecnota}%
+We will use boldface letters for vectors and the same letters in normal type
+for their encodings. The elements of a~vector~${\bf x}$ will be denoted as
+$x_0,\ldots,x_{d-1}$.
+
+\para
+If we wish for the whole vector to fit in a~single word, we need $(b+1)d\le W$.
+By using multiple-precision arithmetics, we can encode all vectors satisfying $bd=\O(W)$.
+We will now describe how to translate simple vector manipulations to $\O(1)$ RAM operations
+on their codes. As we are interested in asymptotic complexity only, we prefer clarity
+of the algorithms over saving instructions. Among other things, we freely use calculations
+on words of size $\O(bd)$, assuming that the Multiple-precision lemma comes to save us
+when necessary.
+
+\para
+First of all, let us observe that we can use $\band$ and $\bor$ with suitable constants
+to write zeroes or ones to an~arbitrary set of bit positions in constant time. These operations
+are usually called \df{bit masking}. Also, an~element of a~vector can be extracted or replaced
+using masking and shifts.
+
+\newdimen\slotwd
+\def\setslot#1{\setbox0=#1\slotwd=\wd0}
+\def\slot#1{\hbox to \slotwd{\hfil #1\hfil}}
+\def\[#1]{\slot{$#1$}}
+\def\9{\rack{\0}{\hss$\cdot$\hss}}
+
+\def\alik#1{%
+\medskip
+\halign{\hskip 0.15\hsize\hfil $ ##$&\hbox to 0.6\hsize{${}##$ \hss}\cr
+#1}
+\medskip
+}
+
+\algn{Operations on vectors with $d$~elements of $b$~bits each}
+
+\itemize\ibull
+
+\:$\<Replicate>(\alpha)$ --- creates a~vector $(\alpha,\ldots,\alpha)$:
+
+\alik{\<Replicate>(\alpha)=\alpha\cdot(\0^b\1)^d. \cr}
+
+\:$\<Sum>(x)$ --- calculates the sum of the elements of~${\bf x}$, assuming that
+it fits in~$b$ bits:
+
+\alik{\<Sum>(x) = x \bmod \1^{b+1}. \cr}
+
+This works because when we work modulo~$\1^{b+1}$, the number $2^{b+1}=\1\0^{b+1}$
+is congruent to~1 and thus $x = \sum_i 2^{(b+1)i}\cdot x_i \equiv \sum_i 1^i\cdot x_i \equiv \sum_i x_i$.
+As the result should fit in $b$~bits, the modulo cannot change it.
+
+If we want to avoid division, we can use double-precision multiplication instead:
+
+\setslot{\hbox{~$\0x_{d-1}$}}
+\def\.{\[]}
+\def\z{\[\0^b\1]}
+\def\dd{\slot{$\cdots$}}
+\def\vd{\slot{$\vdots$}}
+\def\rule{\noalign{\medskip\nointerlineskip}$\hrulefill$\cr\noalign{\nointerlineskip\medskip}}
+
+\alik{
+\[\0x_{d-1}] \dd \[\0x_2] \[\0x_1] \[\0x_0] \cr
+*~~ \z \dd \z\z\z \cr
+\rule
+\[x_{d-1}] \dd \[x_2] \[x_1] \[x_0] \cr
+\[x_{d-1}] \[x_{d-2}] \dd \[x_1] \[x_0] \. \cr
+\[x_{d-1}] \[x_{d-2}] \[x_{d-3}] \dd \[x_0] \. \. \cr
+\vd\vd\vd\vd\.\.\.\cr
+\[x_{d-1}] \dd \[x_2]\[x_1]\[x_0] \. \. \. \. \cr
+\rule
+\[r_{d-1}] \dd \[r_2] \[r_1] \[s_d] \dd \[s_3] \[s_2] \[s_1] \cr
+}
+
+This way, we even get the vector of all partial sums:
+$s_k=\sum_{i=0}^{k-1}x_i$, $r_k=\sum_{i=k}^{d-1}x_i$.
+
+\:$\<Cmp>(x,y)$ --- element-wise comparison of~vectors ${\bf x}$ and~${\bf y}$,
+i.e., a~vector ${\bf z}$ such that $z_i=1$ iff $x_i<y_i$.
+
+We replace the separator zeroes in~$x$ by ones and subtract~$y$. These ones
+change back to zeroes exactly at the positions where $x_i<y_i$ and they stop
+carries from propagating, so the fields do not interact with each other:
+
+\setslot{\vbox{\hbox{~$x_{d-1}$}\hbox{~$y_{d-1}$}}}
+\def\9{\rack{\0}{\hss ?\hss}}
+\alik{
+   \1 \[x_{d-1}] \1 \[x_{d-2}] \[\cdots] \1 \[x_1] \1 \[x_0]  \cr
+-~ \0 \[y_{d-1}] \0 \[y_{d-2}] \[\cdots] \0 \[y_1] \0 \[y_0]  \cr
+\rule
+   \9 \[\ldots]  \9 \[\ldots]  \[\cdots] \9 \[\ldots] \9 \[\ldots] \cr
+}
+
+It only remains to shift the separator bits to the right positions, negate them
+and zero out all other bits.
+
+\:$\<Rank>(x,\alpha)$ --- return the number of elements of~${\bf x}$ which are less than~$\alpha$,
+assuming that the result fits in~$b$ bits:
  
+\alik{
+\<Rank>(x,\alpha) = \<Sum>(\<Cmp>(x,\<Replicate>(\alpha))). \cr
+}
+
+\:$\<Insert>(x,\alpha)$ --- insert~$\alpha$ into a~sorted vector $\bf x$:
+
+Calculate $k = \<Rank>(x,\alpha)$ first, then insert~$\alpha$ as the $k$-th
+field of~$\bf x$ using masking operations.
+
+\:$\<Unpack>(\alpha)$ --- create a~vector whose elements are the bits of~$\(\alpha)_d$.
+In other words, we insert blocks~$\0^b$ between the bits of~$\alpha$. Assuming that $b\ge d$,
+we can do it as follows:
+
+\algo
+\:$x\=\<Replicate>(\alpha)$.
+\:$y\=(2^{b-1},2^{b-2},\ldots,2^0)$. \cmt{bitwise encoding of this vector}
+\:$z\=x \band y$.
+\:Return $\<Cmp>(z,y)$.
+\endalgo
+
+Let us observe that $z_i$ is either zero or equal to~$y_i$ depending on the value
+of the $i$-th bit of the number~$\alpha$. Comparing it with~$y_i$ normalizes it
+to either zero or one.
+
+\:$\<Unpack>_\varphi(\alpha)$ --- like \<Unpack>, but changes the order of the
+bits according to a~fixed permutation~$\varphi$: The $i$-th element of the
+resulting vector is equal to~$\alpha[\pi(i)]$.
+
+Implemented as above, but with mask~$y=(2^{\pi(b-1)},\ldots,2^{\pi(0)})$.
+
+\:$\<Pack>(x)$ --- the inverse of \<Unpack>: given a~vector of zeroes and ones,
+it produces a~number whose bits are the elements of the vector (in other words,
+it crosses out the $\0^b$ blocks).
+
+We interpret the~$x$ as an~encoding of a~vector with elements one bit shorter
+and sum these elements. For example, when $n=4$ and~$b=4$:
+
+\setslot{\hbox{$x_3$}}
+\def\z{\[\0]}
+\def\|{\hskip1pt\vrule height 10pt depth 4pt\hskip1pt}
+\def\.{\hphantom{\|}}
+
+\alik{
+\|\z\.\z\.\z\.\z\.\[x_3]\|\z\.\z\.\z\.\z\.\[x_2]\|\z\.\z\.\z\.\z\[x_1]\|\z\.\z\.\z\.\z\.\[x_0]\|\cr
+\|\z\.\z\.\z\.\z\|\[x_3]\.\z\.\z\.\z\|\z\.\[x_2]\.\z\.\z\|\z\.\z\[x_1]\.\z\|\z\.\z\.\z\.\[x_0]\|\cr
+}
+
+However, this ``reformatting'' does not produce a~correct encoding of a~vector,
+because the separator zeroes are missing. For this reason, the implementation
+of~\<Sum> by modulo does not work correctly (it produces $\1^b$ instead of $\0^b$).
+We use the technique based on multiplication instead, which does not need
+the separators. (Alternatively, we can observe that $\1^b$ is the only case
+affected, so we can handle it separately.)
+
+\endlist
+
+\para
+We can use the above tricks to perform interesting operations on individual
+numbers in constant time, too. Let us assume for a~while that we are
+operating on $b$-bit numbers and the word size is at least~$b^2$.
+This enables us to make use of intermediate vectors with $b$~elements
+of $b$~bits each.
+
+\algn{Integer operations in quadratic workspace}\id{lsbmsb}
+
+\itemize\ibull
+
+\:$\<Weight>(\alpha)$ --- compute the Hamming weight of~$\alpha$, i.e., the number of ones in~$\(\alpha)$.
+
+Perform \<Unpack> and then \<Sum>.
+
+\:$\<Permute>_\pi(\alpha)$ --- shuffle the bits of~$\alpha$ according
+to a~fixed permutation~$\pi$.
+
+Perform $\<Unpack>_\pi$ and \<Pack> back.
+
+\:$\<LSB>(\alpha)$ --- find the least significant bit of~$\alpha$,
+i.e., the smallest~$i$ such that $\alpha[i]=1$.
+
+By a~combination of subtraction with $\bxor$, we create a~number
+which contain ones exactly at positions below $\<LSB>(\alpha)$:
+
+\alik{
+\alpha&=                       \9\9\9\9\9\1\0\0\0\0\cr
+\alpha-1&=                     \9\9\9\9\9\0\1\1\1\1\cr
+\alpha\bxor(\alpha-1)&=                \0\9\9\9\0\1\1\1\1\1\cr
+}
+
+Then calculate the \<Weight> of the result.
+
+\:$\<MSB>(\alpha)$ --- find the most significant bit (the position
+of the highest bit set in~$\alpha$).
+
+Reverse the bits of the number first by~\<Permute>, then apply \<LSB>
+and subtract the result from~$b-1$.
+
+\endlist
+
+\rem
+As noted by Brodnik~\cite{brodnik:lsb} and others, the space requirements of
+the \<LSB> operation can be reduced to linear. We split the input to $\sqrt{b}$
+blocks of $\sqrt{b}$ bits each. Then we determine which blocks are non-zero and
+identify the lowest such block (this is a~\<LSB> of a~number whose bits
+correspond to the blocks). Finally we calculate the \<LSB> of this block. In
+both calls to \<LSB,> we have a $\sqrt{b}$-bit number in a~$b$-bit word, so we
+can use the previous algorithm. The same trick of course works for finding the
+\<MSB> as well.
+
+The following algorithm shows the details.
+
+\algn{LSB in linear workspace}
+
+\algo\id{lsb}
+\algin A~$w$-bit number~$\alpha$.
+\:$b\=\lceil\sqrt{w}\rceil$. \cmt{size of a~block}
+\:$\ell\=b$. \cmt{the number of blocks is the same}
+\:$x\=(\alpha \band (\0\1^b)^\ell) \bor (\alpha \band (\1\0^b)^\ell)$.
+\hfill\break
+\cmt{$x_i\ne 0$ iff the $i$-th block is non-zero}%
+\foot{Why is this so complicated? It is tempting to take $\alpha$ itself as a~code of this vector,
+but we unfortunately need the separator bits between elements, so we create them and
+relocate the bits we have overwritten.}
+\:$y\=\<Cmp>(0,x)$. \cmt{$y_i=1$ if the $i$-th block is non-zero, otherwise $y_0=0$}
+\:$\beta\=\<Pack>(y)$. \cmt{each block compressed to a~single bit}
+\:$p\=\<LSB>(\beta)$. \cmt{the index of the lowest non-zero block}
+\:$\gamma\=(\alpha \shr bp) \band \1^b$. \cmt{the contents of that block}
+\:$q\=\<LSB>(\gamma)$. \cmt{the lowest bit set there}
+\algout $\<LSB>(\alpha) = bp+q$.
+\endalgo
+
+\rem
+We have used a~plenty of constants which depend on the format of the vectors.
+Either we can write non-uniform programs (see \ref{nonuniform}) and use native constants,
+or we can observe that all such constants can be easily manufactured. For example,
+$(\0^b\1)^d = \1^{bd} / \1^{b+1} = (2^{bd}-1)/(2^{b+1}-1)$. The only exceptions
+are the~$w$ and~$b$ in the LSB algorithm \ref{lsb}, which we are unable to produce
+in constant time.
  
  
  \endpart