X-Git-Url: http://mj.ucw.cz/gitweb/?a=blobdiff_plain;f=ram.tex;h=d39d9ba73f727dcafe19cb5c368457090c72c09f;hb=e4ece93696bd08db4ef45ed8b29e76ba6c7848c4;hp=fad390c8e3e5bebe7e915c0af04788b4dafb66a1;hpb=049b863bee2992aed6230ac377dee53c4a08cc85;p=saga.git diff --git a/ram.tex b/ram.tex index fad390c..d39d9ba 100644 --- a/ram.tex +++ b/ram.tex @@ -3,6 +3,7 @@ \fi \chapter{Fine Details of Computation} +\id{ramchap} \section{Models and machines} @@ -160,7 +161,7 @@ constant time). We can therefore view the whole memory as a~directed graph, whose vertices correspond to the cells (the registers are stored in a~single special cell). The outgoing edges of each vertex correspond to pointer fields of the cells and they are -labelled with distinct labels drawn from a~finite set. In addition to that, +labeled with distinct labels drawn from a~finite set. In addition to that, each vertex contains a~fixed amount of symbols. The program can directly access vertices within distance~2 from the register vertex. @@ -306,6 +307,7 @@ scanning all~$n$ buckets takes $\O(n+m)$ time. %-------------------------------------------------------------------------------- \section{Data structures on the RAM} +\id{ramdssect} There is a~lot of data structures designed specifically for the RAM, taking advantage of both indexing and arithmetics. In many cases, they surpass the known @@ -342,7 +344,7 @@ and Willard. It will also form a~basis for the rest of this chapter. %-------------------------------------------------------------------------------- -\section{Bits and vectors} +\section{Bits and vectors}\id{bitsect} In this rather technical section, we will show how the RAM can be used as a~vector computer to operate in parallel on multiple elements, as long as these elements @@ -429,7 +431,7 @@ replaced by a~different value in $\O(1)$ time by masking and shifts. \medskip } -\algn{Operations on vectors with $d$~elements of $b$~bits each} +\algn{Operations on vectors with $d$~elements of $b$~bits each}\id{vecops} \itemize\ibull @@ -442,7 +444,7 @@ the result fits in $b$~bits: \alik{\(x) = x \bmod \1^{b+1}. \cr} -This works because when we work modulo~$\1^{b+1}$, the number $2^{b+1}=\1\0^{b+1}$ +This is correct because when we calculate modulo~$\1^{b+1}$, the number $2^{b+1}=\1\0^{b+1}$ is congruent to~1 and thus $x = \sum_i 2^{(b+1)i}\cdot x_i \equiv \sum_i 1^i\cdot x_i \equiv \sum_i x_i$. As the result should fit in $b$~bits, the modulo makes no difference. @@ -490,20 +492,27 @@ carries from propagating, so the fields do not interact with each other: It only remains to shift the separator bits to the right positions, negate them and mask out all other bits. -\:$\(x,\alpha)$ --- return the number of elements of~${\bf x}$ which are less than~$\alpha$, +\:$\(x,\alpha)$ --- returns the number of elements of~${\bf x}$ which are less than~$\alpha$, assuming that the result fits in~$b$ bits: \alik{ \(x,\alpha) = \(\(x,\(\alpha))). \cr } -\:$\(x,\alpha)$ --- insert~$\alpha$ into a~sorted vector $\bf x$: +\:$\(x,\alpha)$ --- inserts~$\alpha$ into a~sorted vector $\bf x$: -Calculate $k = \(x,\alpha)$ first, then insert~$\alpha$ as the $k$-th +We calculate the rank of~$\alpha$ in~$x$ first, then we insert~$\alpha$ as the $k$-th field of~$\bf x$ using masking operations and shifts. -\:$\(\alpha)$ --- create a~vector whose elements are the bits of~$\(\alpha)_d$. -In other words, insert blocks~$\0^b$ between the bits of~$\alpha$. Assuming that $b\ge d$, +\algo +\:$k\=\(x,\alpha)$. +\:$\ell\=x \band \1^{(b+1)(n-k-1)}\0^{(b+1)(k+1)}$. \cmt{``left'' part of the vector} +\:$r=x \band \1^{(b+1)k}$. \cmt{``right'' part} +\:Return $(\ell\shl (b+1)) \bor (\alpha\shl ((b+1)k)) \bor r$. +\endalgo + +\:$\(\alpha)$ --- creates a~vector whose elements are the bits of~$\(\alpha)_d$. +In other words, inserts blocks~$\0^b$ between the bits of~$\alpha$. Assuming that $b\ge d$, we can do it as follows: \algo @@ -550,7 +559,7 @@ affected, so we can handle it separately.) \endlist \para -We can use the above tricks to perform interesting operations on individual +We can use the aforementioned tricks to perform interesting operations on individual numbers in constant time, too. Let us assume for a~while that we are operating on $b$-bit numbers and the word size is at least~$b^2$. This enables us to make use of intermediate vectors with $b$~elements @@ -560,16 +569,16 @@ of $b$~bits each. \itemize\ibull -\:$\(\alpha)$ --- compute the Hamming weight of~$\alpha$, i.e., the number of ones in~$\(\alpha)$. +\:$\(\alpha)$ --- computes the Hamming weight of~$\alpha$, i.e., the number of ones in~$\(\alpha)$. -Perform \ and then \. +We perform \ and then \. -\:$\_\pi(\alpha)$ --- shuffle the bits of~$\alpha$ according +\:$\_\pi(\alpha)$ --- shuffles the bits of~$\alpha$ according to a~fixed permutation~$\pi$. -Perform $\_\pi$ and \ back. +We perform $\_\pi$ and \ back. -\:$\(\alpha)$ --- find the least significant bit of~$\alpha$, +\:$\(\alpha)$ --- finds the least significant bit of~$\alpha$, i.e., the smallest~$i$ such that $\alpha[i]=1$. By a~combination of subtraction with $\bxor$, we create a~number @@ -583,7 +592,7 @@ which contains ones exactly at the position of $\(\alpha)$ and below: Then we calculate the \ of the result and subtract~1. -\:$\(\alpha)$ --- find the most significant bit of~$\alpha$ (the position +\:$\(\alpha)$ --- finds the most significant bit of~$\alpha$ (the position of the highest bit set). Reverse the bits of the number~$\alpha$ first by calling \, then apply \ @@ -627,9 +636,381 @@ relocate the bits we have overwritten.} We have used a~plenty of constants which depend on the format of the vectors. Either we can write non-uniform programs (see \ref{nonuniform}) and use native constants, or we can observe that all such constants can be easily manufactured. For example, -$(\0^b\1)^d = \1^{bd} / \1^{b+1} = (2^{bd}-1)/(2^{b+1}-1)$. The only exceptions +$(\0^b\1)^d = \1^{(b+1)d} / \1^{b+1} = (2^{(b+1)d}-1)/(2^{b+1}-1)$. The only exceptions are the~$w$ and~$b$ in the LSB algorithm \ref{lsb}, which we are unable to produce -in constant time. +in constant time. In practice we use the ``bit tricks'' as frequently called subroutines +in an~encompassing algorithm, so we usually can spend a~lot of time on the precalculation +of constants performed once during algorithm startup. + +%-------------------------------------------------------------------------------- + +\section{Q-Heaps}\id{qheaps}% + +We have shown how to perform non-trivial operations on a~set of values +in constant time, but so far only under the assumption that the number of these +values is small enough and that the values themselves are also small enough +(so that the whole set fits in $\O(1)$ machine words). Now we will show how to +lift the restriction on the magnitude of the values and still keep constant time +complexity. We will describe a~slightly simplified version of the Q-heaps developed by +Fredman and Willard in~\cite{fw:transdich}. + +The Q-heap represents a~set of at most~$k$ word-sized integers, where $k\le W^{1/4}$ +and $W$ is the word size of the machine. It will support insertion, deletion, finding +of minimum and maximum, and other operations described below, in constant time, provided that +we are willing to spend~$\O(2^{k^4})$ time on preprocessing. + +The exponential-time preprocessing may sound alarming, but a~typical application uses +Q-heaps of size $k=\log^{1/4} N$, where $N$ is the size of the algorithm's input. +This guarantees that $k\le W^{1/4}$ and $\O(2^{k^4}) = \O(N)$. Let us however +remark that the whole construction is primarily of theoretical importance +and that the huge constants involved everywhere make these heaps useless +in practical algorithms. Many of the tricks used however prove themselves +useful even in real-life implementations. + +Spending the time on reprocessing makes it possible to precompute tables for +almost arbitrary functions and then assume that they can be evaluated in +constant time: + +\lemma\id{qhprecomp}% +When~$f$ is a~function computable in polynomial time, $\O(2^{k^4})$ time is enough +to precompute a~table of the values of~$f$ for all arguments whose size is $\O(k^3)$ bits. + +\proof +There are $2^{\O(k^3)}$ possible combinations of arguments of the given size and for each of +them we spend $\poly(k)$ time on calculating the function. It remains +to observe that $2^{\O(k^3)}\cdot \poly(k) = \O(2^{k^4})$. +\qed + +\para +We will first show an~auxiliary construction based on tries and then derive +the real definition of the Q-heap from it. + +\nota +Let us introduce some notation first: +\itemize\ibull +\:$W$ --- the word size of the RAM, +\:$k = \O(W^{1/4})$ --- the limit on the size of the heap, +\:$n\le k$ --- the number of elements stored in the heap, +\:$X=\{x_1, \ldots, x_n\}$ --- the elements themselves: distinct $W$-bit numbers +indexed in a~way that $x_1 < \ldots < x_n$, +\:$g_i = \(x_i \bxor x_{i+1})$ --- the position of the most significant bit in which $x_i$ and~$x_{i+1}$ differ, +\:$R_X(x)$ --- the rank of~$x$ in~$X$, that is the number of elements of~$X$ which are less than~$x$ +(where $x$~itself need not be an~element of~$X$).\foot{We will dedicate the whole chapter \ref{rankchap} to the +study of various ranks.} +\endlist + +\defn +A~\df{trie} for a~set of strings~$S$ over a~finite alphabet~$\Sigma$ is +a~rooted tree whose vertices are the prefixes of the strings in~$S$ and there +is an~edge going from a~prefix~$\alpha$ to a~prefix~$\beta$ iff $\beta$ can be +obtained from~$\alpha$ by appending a~single symbol of the alphabet. The edge +will be labeled with the particular symbol. We will also define the~\df{letter depth} +of a~vertex as the length of the corresponding prefix. We mark the vertices +which match a~string of~$S$. + +A~\df{compressed trie} is obtained from the trie by removing the vertices of outdegree~1 +except for the root and marked vertices. +Whereever is a~directed path whose internal vertices have outdegree~1 and they carry +no mark, we replace this path by a~single edge labeled with the contatenation +of the original edge's labels. + +In both kinds of tries, we order the outgoing edges of every vertex by their labels +lexicographically. + +\obs +In both tries, the root of the tree is the empty word and for every other vertex, the +corresponding prefix is equal to the concatenation of edge labels on the path +leading from the root to that vertex. The letter depth of the vertex is equal to +the total size of these labels. All leaves correspond to strings in~$S$, but so can +some internal vertices if there are two strings in~$S$ such that one is a~prefix +of the other. + +Furthermore, the labels of all edges leaving a~common vertex are always +distinct and when we compress the trie, no two such labels have share their initial +symbols. This allows us to search in the trie efficiently: when looking for +a~string~$x$, we follow the path from the root and whenever we visit +an~internal vertex of letter depth~$d$, we test the $d$-th character of~$x$, +we follow the edge whose label starts with this character, and we check that the +rest of the label matches. + +The compressed trie is also efficient in terms of space consumption --- it has +$\O(\vert S\vert)$ vertices (this can be easily shown by induction on~$\vert S\vert$) +and all edge labels can be represented in space linear in the sum of the +lengths of the strings in~$S$. + +\defn +For our set~$X$, we define~$T$ as a~compressed trie for the set of binary +encodings of the numbers~$x_i$, padded to exactly $W$~bits, i.e., for $S = \{ \(x)_W ; x\in X \}$. + +\obs +The trie~$T$ has several interesting properties. Since all words in~$S$ have the same +length, the leaves of the trie correspond to these exact words, that is to the numbers~$x_i$. +The inorder traversal of the trie enumerates the words of~$S$ in lexicographic order +and therefore also the~$x_i$'s in the order of their values. Between each +pair of leaves $x_i$ and~$x_{i+1}$ it visits an~internal vertex whose letter depth +is exactly~$W-1-g_i$. + +\para +Let us now modify the algorithm for searching in the trie and make it compare +only the first symbols of the edges. In other words, we will test only the bits~$g_i$ +which will be called \df{guides} (as they guide us through the tree). For $x\in +X$, the modified algorithm will still return the correct leaf. For all~$x$ outside~$X$ +it will no longer fail and instead it will land on some leaf~$x_i$. At the +first sight the number~$x_i$ may seem unrelated, but we will show that it can be +used to determine the rank of~$x$ in~$X$, which will later form a~basis for all +Q-heap operations: + +\lemma\id{qhdeterm}% +The rank $R_X(x)$ is uniquely determined by a~combination of: +\itemize\ibull +\:the trie~$T$, +\:the index~$i$ of the leaf found when searching for~$x$ in~$T$, +\:the relation ($<$, $=$, $>$) between $x$ and $x_i$, +\:the bit position $b=\(x\bxor x_i)$ of the first disagreement between~$x$ and~$x_i$. +\endlist + +\proof +If $x\in X$, we detect that from $x_i=x$ and the rank is obviously~$i-1$. +Let us assume that $x\not\in X$ and imagine that we follow the same path as when +searching for~$x$, +but this time we check the full edge labels. The position~$b$ is the first position +where~$\(x)$ disagrees with a~label. Before this point, all edges not taken by +the search were leading either to subtrees containing elements all smaller than~$x$ +or all larger than~$x$ and the only values not known yet are those in the subtree +below the edge which we currently consider. Now if $x[b]=0$ (and therefore $xn$ or $x_i\ne x$, return immediately (the value is not in the heap). +\:Delete the value from~$X$: +\::$X[\varrho(i)]\=X[n]$. +\::Find $j$ such that~$\varrho(j)=n$ and set $\varrho(j)\=\varrho(i)$. +\::$n\=n-1$. +\:Update the $g_j$'s like in the previous algorithm. +\algout The updated Q-heap. +\endalgo + +\algn{Finding the $i$-th smallest element in the Q-heap}\id{qhlast}% +\algo +\algin A~Q-heap and an~index~$i$. +\:If $i<1$ or $i>n$, return {\sc undefined.} +\:Return~$x_i$. +\algout The $i$-th smallest element in the heap. +\endalgo + +\para +The heap algorithms we have just described have been built from primitives +operating in constant time, with one notable exception: the extraction +$x[B]$ of all bits of~$x$ at positions specified by the set~$B$. This cannot be done +in~$\O(1)$ time on the Word-RAM, but we can implement it with ${\rm AC}^0$ +instructions as suggested by Andersson in \cite{andersson:fusion} or even +with those ${\rm AC}^0$ instructions present on real processors (see Thorup +\cite{thorup:aczero}). On the Word-RAM, we need to make use of the fact +that the set~$B$ is not changing too much --- there are $\O(1)$ changes +per Q-heap operation. As Fredman and Willard have shown, it is possible +to maintain a~``decoder'', whose state is stored in $\O(1)$ machine words, +and which helps us to extract $x[B]$ in a~constant number of operations: + +\lemman{Extraction of bits}\id{qhxtract}% +Under the assumptions on~$k$, $W$ and the preprocessing time as in the Q-heaps,\foot{% +Actually, this is the only place where we need~$k$ to be as low as $W^{1/4}$. +In the ${\rm AC}^0$ implementation, it is enough to ensure $k\log k\le W$. +On the other hand, we need not care about the exponent because it can +be arbitrarily increased using the Q-heap trees described below.} +it is possible to maintain a~data structure for a~set~$B$ of bit positions, +which allows~$x[B]$ to be extracted in $\O(1)$ time for an~arbitrary~$x$. +When a~single element is inserted to~$B$ or deleted from~$B$, the structure +can be updated in constant time, as long as $\vert B\vert \le k$. + +\proof +See Fredman and Willard \cite{fw:transdich}. +\qed + +\para +This was the last missing bit of the mechanics of the Q-heaps. We are +therefore ready to conclude this section by the following theorem +and its consequences: + +\thmn{Q-heaps, Fredman and Willard \cite{fw:transdich}}\id{qh}% +Let $W$ and~$k$ be positive integers such that $k=\O(W^{1/4})$. Let~$Q$ +be a~Q-heap of at most $k$-elements of $W$~bits each. Then the Q-heap +operations \ref{qhfirst} to \ref{qhlast} on~$Q$ (insertion, deletion, +search for a~given value and search for the $i$-th smallest element) +run in constant time on a~Word-RAM with word size~$W$, after spending +time $\O(2^{k^4})$ on the same RAM on precomputing of tables. + +\proof +Every operation on the Q-heap can be performed in a~constant number of +vector operations and calculations of ranks. The ranks are computed +in $\O(1)$ steps involving again $\O(1)$ vector operations, binary +logarithms and bit extraction. All these can be calculated in constant +time using the results of section \ref{bitsect} and Lemma \ref{qhxtract}. +\qed + +\rem +We can also use the Q-heaps as building blocks of more complex structures +like Atomic heaps and AF-heaps (see once again \cite{fw:transdich}). We will +show a~simpler, but useful construction, sometimes called the \df{Q-heap tree.} +Suppose we have a~Q-heap of capacity~$k$ and a~parameter $d\in{\bb N}^+$. We +can build a~balanced $k$-ary tree of depth~$d$ such that its leaves contain +a~given set and every internal vertex keeps the minimum value in the subtree +rooted in it, together with a~Q-heap containing the values in all its sons. +This allows minimum to be extracted in constant time (it is placed in the root) +and when any element is changed, it is sufficient to recalculate the values +from the path from this element to the root, which takes $\O(d)$ Q-heap +operations. + +\corn{Q-heap trees}\id{qhtree}% +For every positive integer~$r$ and $\delta>0$ there exists a~data structure +capable of maintaining the minimum of a~set of at most~$r$ word-sized numbers +under insertions and deletions. Each operation takes $\O(1)$ time on a~Word-RAM +with word size $W=\Omega(r^{\delta})$, after spending time +$\O(2^{r^\delta})$ on precomputing of tables. + +\proof +Choose $\delta' \le \delta$ such that $r^{\delta'} = \O(W^{1/4})$. Build +a~Q-heap tree of depth $d=\lceil \delta/\delta'\rceil$ containing Q-heaps of +size $k=r^{\delta'}$. \qed + +\rem\id{qhtreerem}% +When we have an~algorithm with input of size~$N$, the word size is at least~$\log N$ +and we can spend time $\O(N)$ on preprocessing, so we can choose $r=\log N$ and +$\delta=1$ in the above corollary and get a~heap of size $\log N$ working in +constant time per operation. \endpart