More bit tricks.

[saga.git] / ram.tex
diff --git a/ram.tex b/ram.tex

index e62452ea52ffc949c5bac9359aca5572b3499e22..831d5f0332205af0af2ce9d011568ec7c99b7eb8 100644 (file)
--- a/ram.tex
+++ b/ram.tex
@@ -88,6 +88,7 @@ avoid this behavior:
    On the other hand, we are interested in polynomial-time algorithms only, so $\Theta(\log N)$-bit
    numbers should be sufficient. In practice, we pick~$w$ to be the larger of
    $\Theta(\log N)$ and the size of integers used in the algorithm's input and output.
+  We will call an integer stored in a~single memory cell a~\df{machine word.}
  \endlist
  
  Both restrictions easily avoid the problems of unbounded parallelism. The first
@@ -105,8 +106,8 @@ As for the choice of RAM operations, the following three instruction sets are of
  
  \itemize\ibull
  \:\df{Word-RAM} --- allows the ``C-language operators'', i.e., addition,
-  subtraction, multiplication, division, remainder, bitwise {\sc and,} {\sc or,} exclusive
-  {\sc or} ({\sc xor}), negation ({\sc not}) and bitwise shifts ($\ll$ and~$\gg$).
+  subtraction, multiplication, division, remainder, bitwise $\band$, $\bor$, exclusive
+  $\bor$ ($\bxor$) and negation ($\bnot$), and bitwise shifts ($\shl$ and~$\shr$).
  \:\df{${\rm AC}^0$-RAM} --- allows all operations from the class ${\rm AC}^0$, i.e.,
    those computable by constant-depth polynomial-size boolean circuits with unlimited
    fan-in and fan-out. This includes all operations of the Word-RAM except for multiplication,
@@ -131,7 +132,7 @@ set of the Word-RAM. This corresponds to the usage in recent algorithmic literat
  although the authors rarely mention the details. In some cases, a~non-uniform variant
  of the Word-RAM is considered as well (e.g., in~\cite{hagerup:dd}):
  
-\defn
+\defn\id{nonuniform}%
  A~Word-RAM is called \df{weakly non-uniform,} if it is equipped with $\O(1)$-time
  access to a~constant number of word-sized constants, which depend only on the word
  size. These are called \df{native constants} and they are available in fixed memory
@@ -204,7 +205,7 @@ to by the nodes of the tree. Both direct and indirect accesses to the memory
  can therefore be done in~$\O(W)$ time. Use standard algorithms for arithmetic
  on big numbers: $\O(W)$ per operation except for multiplication, division and
  remainders which take $\O(W^2)$.\foot{We could use more efficient arithmetic
-algorithms, but the quadratic bound it good enough for our purposes.}
+algorithms, but the quadratic bound is good enough for our purposes.}
  \qed
  
  \FIXME{Add references.}
@@ -301,7 +302,7 @@ scanning all~$n$ buckets takes $\O(n+m)$ time.
  \section{Data structures on the RAM}
  
  There is a~lot of data structures designed specifically for the RAM, taking
-advantage of both indexing and arithmetic. In many cases, they surpass the known
+advantage of both indexing and arithmetics. In many cases, they surpass the known
  lower bounds for the same problem on the~PM and often achieve constant time
  per operation, at least when either the magnitude of the values or the size of
  the data structure are suitably bounded.
@@ -337,4 +338,272 @@ and Willard and it will also form a~basis for the rest of this chapter.
  
  \section{Bits and vectors}
  
+In this rather technical section, we will show how RAM can be used as a~vector
+computer to operate in parallel on multiple elements, as long as these elements
+fit in a~single machine word. On the first sight this might seem useless, as we
+cannot require large word sizes, but surprisingly often the elements are small
+enough relative to the size of the algorithm's input and thus also relative to
+the minimum possible word size. Also, as the following lemma shows, we can
+easily emulate constant-times longer words:
+
+\lemman{Multiple-precision calculations}
+Given a~RAM with $W$-bit words, we can emulate all calculation and control
+instructions of a~RAM with word size $kW$ in time depending only on the~$k$.
+(This is usually called \df{multiple-precision arithmetics.})
+
+\proof
+We split each word of the ``big'' machine to $W'=W/2$-bit blocks and store these
+blocks in $2k$ consecutive memory cells. Addition, subtraction, comparison,
+bitwise logical operations can be performed block-by-block. Shifts by a~multiple
+of~$W'$ are trivial, otherwise we can combine each block of the result from
+shifted versions of two original blocks.
+To multiply two numbers, we can use the elementary school algorithm using the $W'$-bit
+blocks as digits in base $2^{W'}$ --- the product of any two blocks fits
+in a~single word.
+
+Division is harder, but Newton-Raphson iteration (see~\cite{ito:newrap})
+converges to the quotient in a~constant number of iterations, each of them
+involving $\O(1)$ multiple-precision additions and multiplications. A~good
+starting approximation can be obtained by dividing the two most-significant
+(non-zero) blocks of both numbers.
+
+Another approach to division is using the improved elementary school algorithm as described
+by Knuth in~\cite{knuth:seminalg}. It uses $\O(k^2)$ steps, but the steps involve
+calculation of the most significant bit set in a~word. We will show below that it
+can be done in constant time, but we have to be careful to avoid division instructions.
+\qed
+
+\notan{Vectors}
+We will use boldface letters for vectors. The components of a~vector~${\bf x}$
+will be denoted as $x_0,\ldots,x_{d-1}$.
+
+\notan{Bit strings}\id{bitnota}%
+We will work with binary representations of natural numbers by strings over the
+alphabet $\{\0,\1\}$: we will use $\(x)$ for the number~$x$ written in binary,
+$\(x)_b$ for the same padded to exactly $b$ bits by adding zeroes at the beginning
+and $x[k]$ for the value of the $k$-th bit of~$x$.
+The usual conventions for operations on strings will be utilized: When $s$
+and~$t$ are strings, we write $st$ for their concatenation and
+$s^k$ for the string~$s$ repeated $k$~times.
+When the meaning is clear from the context,
+we will use $x$ and $\(x)$ interchangeably to avoid outbreak of symbols.
+
+\defn
+The \df{bitwise encoding} of a~vector ${\bf x}=(x_0,\ldots,x_{d-1})$ of~$b$-bit numbers
+is an~integer~$x$ such that $\(x)=\0\(x_{d-1})_b\0\(x_{d-2})_b\0\ldots\0\(x_0)_b = \sum_i 2^{(b+1)i}\cdot x_i$.
+(We have interspersed the elements with \df{separator bits.})
+
+\para
+If we wish for the whole vector to fit in a~single word, we need $(b+1)d\le W$.
+By using multiple-precision arithmetics, we can encode all vectors satisfying $bd=\O(W)$.
+We will now describe how to translate simple vector manipulations to $\O(1)$ RAM operations
+on their codes. As we are interested in asymptotic complexity only, we prefer clarity
+of the algorithms over saving instructions. Among other things, we freely use calculations
+on words of size $\O(bd)$, assuming that the Multiple-precision lemma comes to save us
+when necessary.
+
+\para
+First of all, let us observe that we can use $\band$ and $\bor$ with suitable constants
+to write zeroes or ones to an~arbitrary set of bit positions in constant time. These operations
+are usually called \df{bit masking}. Also, an~element of a~vector can be extracted or replaced
+using masking and shifts.
+
+\newdimen\slotwd
+\def\setslot#1{\setbox0=#1\slotwd=\wd0}
+\def\slot#1{\hbox to \slotwd{\hfil #1\hfil}}
+\def\[#1]{\slot{$#1$}}
+\def\9{\rack{\0}{\hss$\cdot$\hss}}
+
+\def\alik#1{%
+\medskip
+\halign{\hskip 0.15\hsize\hfil $ ##$&\hbox to 0.6\hsize{${}##$ \hss}\cr
+#1}
+\medskip
+}
+
+\algn{Operations on vectors with $d$~elements of $b$~bits each}
+
+\>$\<Replicate>(\alpha)$ --- creates a~vector $(\alpha,\ldots,\alpha)$:
+
+\alik{\<Replicate>(\alpha)=\alpha\cdot(\0^b\1)^d. \cr}
+
+\>$\<Sum>(x)$ --- calculates the sum of the elements of~${\bf x}$, assuming that
+it fits in~$b$ bits:
+
+\alik{\<Sum>(x) = x \bmod \1^{b+1}. \cr}
+
+This works because when we work modulo~$\1^{b+1}$, the number $2^{b+1}=\1\0^{b+1}$
+is congruent to~1 and thus $x = \sum_i 2^{(b+1)i}\cdot x_i \equiv \sum_i 1^i\cdot x_i \equiv \sum_i x_i$.
+As the result should fit in $b$~bits, the modulo cannot change it.
+
+If we want to avoid division, we can use double-precision multiplication instead:
+
+\setslot{\hbox{~$\0x_{d-1}$}}
+\def\.{\[]}
+\def\z{\[\0^b\1]}
+\def\dd{\slot{$\cdots$}}
+\def\vd{\slot{$\vdots$}}
+\def\rule{\noalign{\medskip\nointerlineskip}$\hrulefill$\cr\noalign{\nointerlineskip\medskip}}
+
+\alik{
+\[\0x_{d-1}] \dd \[\0x_2] \[\0x_1] \[\0x_0] \cr
+*~~ \z \dd \z\z\z \cr
+\rule
+\[x_{d-1}] \dd \[x_2] \[x_1] \[x_0] \cr
+\[x_{d-1}] \[x_{d-2}] \dd \[x_1] \[x_0] \. \cr
+\[x_{d-1}] \[x_{d-2}] \[x_{d-3}] \dd \[x_0] \. \. \cr
+\vd\vd\vd\vd\.\.\.\cr
+\[x_{d-1}] \dd \[x_2]\[x_1]\[x_0] \. \. \. \. \cr
+\rule
+\[r_{d-1}] \dd \[r_2] \[r_1] \[s_d] \dd \[s_3] \[s_2] \[s_1] \cr
+}
+
+This way, we even get the vector of all partial sums:
+$s_k=\sum_{i=0}^{k-1}x_i$, $r_k=\sum_{i=k}^{d-1}x_i$.
+
+\>$\<Cmp>(x,y)$ --- element-wise comparison of~vectors ${\bf x}$ and~${\bf y}$,
+i.e., a~vector ${\bf z}$ such that $z_i=1$ iff $x_i<y_i$:
+
+\setslot{\vbox{\hbox{~$x_{d-1}$}\hbox{~$y_{d-1}$}}}
+\alik{
+   \1 \[x_{d-1}] \1 \[x_{d-2}] \[\cdots] \1 \[x_1] \1 \[x_0]  \cr
+-~ \0 \[y_{d-1}] \0 \[y_{d-2}] \[\cdots] \0 \[y_1] \0 \[y_0]  \cr
+}
+
+We replace the separator zeroes in~$x$ by ones and subtract~$y$. These ones change back to zeroes exactly at
+the positions where $x_i<y_i$ and they stop carries from propagating, so the
+fields do not interact with each other. It only remains to shift the separator
+bits to the right positions, negate them and zero out all other bits.
+
+\>$\<Rank>(x,\alpha)$ --- return the number of elements of~${\bf x}$ which are less than~$\alpha$,
+assuming that the result fits in~$b$ bits:
+
+\alik{
+\<Sum>(\<Cmp>(x,\<Replicate>(\alpha))) \cr
+}
+
+\>$\<Insert>(x,\alpha)$ --- insert~$\alpha$ into a~sorted vector $\bf x$:
+
+Calculate $k = \<Rank>(x,\alpha)$ first, then insert~$\alpha$ as the $k$-th
+field of~$\bf x$ using masking operations.
+
+\>$\<Unpack>(\alpha)$ --- create a~vector whose components are the bits of~$\(\alpha)_b$.
+In other words, it inserts $\0^b$ between every two bits of~$\alpha$.
+
+\algo
+\:$x\=\<Replicate>(\alpha)$.
+\:$y\=(2^{b-1},2^{b-2},\ldots,2^0)$. \cmt{bitwise encoding of this vector}
+\:$z\=x \band y$.
+\:Return $\<Cmp>(z,y)$.
+\endalgo
+
+Let us observe that $z_i$ is either zero or equal to~$y_i$ depending on the value
+of the $i$-th bit of the number~$\alpha$. Comparing it with~$y_i$ normalizes it
+to either zero or one.
+
+\>$\<Unpack>_\varphi(\alpha)$ --- like \<Unpack>, but changes the order of the
+bits according to a~fixed permutation~$\varphi$: The $i$-th element of the
+resulting vector is equal to~$\alpha[\pi(i)]$.
+
+Implemented as above, but with mask~$y=(2^{\pi(b-1)},\ldots,2^{\pi(0)})$.
+
+\>$\<Pack>(x)$ --- the inverse of \<Unpack>: given a~vector of zeroes and ones,
+it produces a~number whose bits are the elements of the vector (in other words,
+it crosses out the $\0^b$ blocks).
+
+We interpret the~$x$ as an~encoding of a~vector with elements one bit shorter
+and sum these elements. For example, when $n=4$ and~$b=4$:
+
+\setslot{\hbox{$x_3$}}
+\def\z{\[\0]}
+\def\|{\hskip1pt\vrule height 10pt depth 4pt\hskip1pt}
+\def\.{\hphantom{\|}}
+
+\alik{
+\|\z\.\z\.\z\.\z\.\[x_3]\|\z\.\z\.\z\.\z\.\[x_2]\|\z\.\z\.\z\.\z\[x_1]\|\z\.\z\.\z\.\z\.\[x_0]\|\cr
+\|\z\.\z\.\z\.\z\|\[x_3]\.\z\.\z\.\z\|\z\.\[x_2]\.\z\.\z\|\z\.\z\[x_1]\.\z\|\z\.\z\.\z\.\[x_0]\|\cr
+}
+
+However, this ``reformatting'' does not produce a~correct encoding of a~vector,
+because the separator zeroes are missing. For this reason, the implementation
+of~\<Sum> by modulo does not work correctly (it produces $\1^b$ instead of $\0^b$).
+We use the technique based on multiplication instead, which does not need
+the separators. (Alternatively, we can observe that $\1^b$ is the only case
+affected, so we can handle it separately.)
+
+\para
+We can use the above tricks to perform interesting operations on individual
+numbers in constant time, too. Let us assume for a~while that we are
+operating on $b$-bit numbers and the word size is at least~$b^2$.
+This enables us to make use of intermediate vectors with $b$~elements
+of $b$~bits each.
+
+\algn{Integer operations in quadratic workspace}
+
+\>$\<Weight>(\alpha)$ --- compute the Hamming weight of~$\alpha$, i.e., the number of ones in~$\(\alpha)$.
+
+Perform \<Unpack> and then \<Sum>.
+
+\>$\<Permute>_\pi(\alpha)$ --- shuffle the bits of~$\alpha$ according
+to a~fixed permutation~$\pi$.
+
+Perform $\<Unpack>_\pi$ and \<Pack> back.
+
+\>$\<LSB>(\alpha)$ --- find the least significant bit of~$\alpha$,
+i.e., the minimum~$i$ such that $\alpha[i]=1$.
+
+By a~combination of subtraction with $\bxor$, we create a~number
+which contain ones exactly at positions below $\<LSB>(\alpha)$:
+
+\alik{
+\alpha&=                       \9\9\9\9\9\1\0\0\0\0\cr
+\alpha-1&=                     \9\9\9\9\9\0\1\1\1\1\cr
+\alpha\bxor(\alpha-1)&=                \0\9\9\9\0\1\1\1\1\1\cr
+}
+
+Then calculate the \<Weight> of the result.
+
+\>$\<MSB>(\alpha)$ --- find the most significant bit (the position
+of the highest bit set in~$\alpha$).
+
+Reverse the bits of the number first by~\<Permute>, then apply \<LSB>
+and subtract the result from~$b-1$.
+
+\rem
+As noted by Brodnik~\cite{brodnik:lsb} and others, the space requirements of
+the \<LSB> operation can be reduced to linear. We split the input to roughly
+$\sqrt{b}$ blocks of roughly $\sqrt{b}$ bits each. Then we find which blocks are
+non-zero and identify the lowest non-zero block (this is \<LSB> of a~number
+whose bits correspond to the blocks). Finally we calculate the \<LSB> of this
+block. The same trick of course works for the \<MSB> as well.
+The following algorithm shows the details.
+
+\algn{LSB in linear workspace}
+
+\algo\id{lsb}
+\algin A~$w$-bit number~$\alpha$.
+\:$b\=\lceil\sqrt{w}\rceil$. \cmt{size of a~block}
+\:$x\=(\alpha \band (\0\1^b)^b) \bor (\alpha \band (\1\0^b)^b)$.
+\hfill\break
+\cmt{$x_i\ne 0$ iff the $i$-th block is non-zero}%
+\foot{Why is this so complicated? It is tempting to take $\alpha$ itself as a~code of this vector,
+but we unfortunately need the separator bits between elements, so we create them and
+relocate the bits we have overwritten.}
+\:$y\=\<Cmp>(0,x)$. \cmt{$y_i=1$ if the $i$-th block is non-zero, otherwise $y_0=0$}
+\:$\beta\=\<Pack>(y)$. \cmt{each block compressed to a~single bit}
+\:$p\=\<LSB>(\beta)$. \cmt{the index of the lowest non-zero block}
+\:$\gamma\=(\alpha \shr bp) \band \1^b$. \cmt{the contents of that block}
+\:$q\=\<LSB>(\gamma)$. \cmt{the lowest bit set there}
+\algout $\<LSB>(\alpha) = bp+q$.
+\endalgo
+
+\rem
+We have used a~plenty of constants which depend on the format of the vectors.
+Either we can write non-uniform programs (see \ref{nonuniform}) and use native constants,
+or we can observe that all such constants can be easily manufactured. For example,
+$(\0^b\1)^d = \1^{bd} / \1^{b+1} = (2^{bd}-1)/(2^{b+1}-1)$. The only exceptions
+are the~$w$ and~$b$ in the LSB algorithm \ref{lsb}, which we are unable to produce
+in constant time.
+
+
  \endpart