+\notan{Bit strings}\id{bitnota}%
+We will work with binary representations of natural numbers by strings over the
+alphabet $\{\0,\1\}$: we will use $\(x)$ for the number~$x$ written in binary,
+$\(x)_b$ for the same padded to exactly $b$ bits by adding leading zeroes,
+and $x[k]$ for the value of the $k$-th bit of~$x$ (with a~numbering of bits such that $2^k[k]=1$).
+The usual conventions for operations on strings will be used: when $s$
+and~$t$ are strings, we write $st$ for their concatenation and
+$s^k$ for the string~$s$ repeated $k$~times.
+When the meaning is clear from the context,
+we will use $x$ and $\(x)$ interchangeably to avoid an~outbreak of symbols.
+
+\defn
+The \df{bitwise encoding} of a~vector ${\bf x}=(x_0,\ldots,x_{d-1})$ of~$b$-bit numbers
+is an~integer~$x$ such that $\(x)=\(x_{d-1})_b\0\(x_{d-2})_b\0\ldots\0\(x_0)_b$, i.e.,
+$x = \sum_i 2^{(b+1)i}\cdot x_i$. (We have interspersed the elements with \df{separator bits}.)
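The bitwise encoding can be sketched in a few lines of Python, whose arbitrary-precision integers stand in for the multiple-precision words of the text (the helper names `encode` and `decode` are ours, not the text's):

```python
def encode(xs, b):
    """Encode a vector of b-bit numbers as sum_i 2^((b+1)i) * x_i."""
    assert all(0 <= x < 1 << b for x in xs)
    return sum(x << ((b + 1) * i) for i, x in enumerate(xs))

def decode(code, b, d):
    """Recover the d elements by shifting and masking each (b+1)-bit field."""
    mask = (1 << b) - 1          # low b bits of a field; the separator bit stays 0
    return [(code >> ((b + 1) * i)) & mask for i in range(d)]
```

For example, `encode([3, 10, 7], 4)` places the elements in fields of $b+1=5$ bits, and `decode` inverts it.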
+
+\notan{Vectors}\id{vecnota}%
+We will use boldface letters for vectors and the same letters in normal type
+for the encodings of these vectors. The elements of a~vector~${\bf x}$ will be written as
+$x_0,\ldots,x_{d-1}$.
+
+\para
+If we want to fit the whole vector in a~single word, the parameters $b$ and~$d$ must satisfy
+the condition $(b+1)d\le W$.
+By using multiple-precision arithmetic, we can encode all vectors satisfying $bd=\O(W)$.
+We will now describe how to translate simple vector manipulations to sequences of $\O(1)$ RAM operations
+on their codes. As we are interested in asymptotic complexity only, we will prefer clarity
+of the algorithms over saving instructions. Among other things, we will freely use calculations
+on words of size $\O(bd)$, assuming that the Multiple-precision lemma comes to save us
+when necessary.
+
+\para
+First of all, let us observe that we can use $\band$ and $\bor$ with suitable constants
+to write zeroes or ones to an~arbitrary set of bit positions at once. These operations
+are usually called \df{bit masking}. Also, any element of a~vector can be extracted or
+replaced by a~different value in $\O(1)$ time by masking and shifts.
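Extraction and replacement by masking and shifts can be sketched as follows (a minimal Python illustration; the function names are ours):

```python
def get_elem(code, i, b):
    """Extract element i by shifting its field down and masking."""
    return (code >> ((b + 1) * i)) & ((1 << b) - 1)

def set_elem(code, i, b, v):
    """Replace element i: AND the field away, then OR the new value in."""
    field = ((1 << b) - 1) << ((b + 1) * i)
    return (code & ~field) | (v << ((b + 1) * i))
```

Both are a constant number of word operations, as claimed.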
+
+\newdimen\slotwd
+\def\setslot#1{\setbox0=#1\slotwd=\wd0}
+\def\slot#1{\hbox to \slotwd{\hfil #1\hfil}}
+\def\[#1]{\slot{$#1$}}
+\def\9{\rack{\0}{\hss$\cdot$\hss}}
+
+\def\alik#1{%
+\medskip
+\halign{\hskip 0.15\hsize\hfil $ ##$&\hbox to 0.6\hsize{${}##$ \hss}\cr
+#1}
+\medskip
+}
+
+\algn{Operations on vectors with $d$~elements of $b$~bits each}\id{vecops}
+
+\itemize\ibull
+
+\:$\<Replicate>(\alpha)$ --- Create a~vector $(\alpha,\ldots,\alpha)$:
+
+\alik{\<Replicate>(\alpha)=\alpha\cdot(\0^b\1)^d. \cr}
+
+\:$\<Sum>(x)$ --- Calculate the sum of the elements of~${\bf x}$, assuming that
+the result fits in $b$~bits:
+
+\alik{\<Sum>(x) = x \bmod \1^{b+1}. \cr}
+
+This is correct because when we calculate modulo~$\1^{b+1}$, the number $2^{b+1}=\1\0^{b+1}$
+is congruent to~1 and thus $x = \sum_i 2^{(b+1)i}\cdot x_i \equiv \sum_i 1^i\cdot x_i \equiv \sum_i x_i$.
+As the result should fit in $b$~bits, the modulo makes no difference.
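A Python sketch of \<Replicate> and the modulo implementation of \<Sum> (helper names ours; big ints play the role of multiple-precision words):

```python
def replicate(alpha, b, d):
    """Replicate(alpha) = alpha * (0^b 1)^d."""
    return alpha * sum(1 << ((b + 1) * i) for i in range(d))

def vec_sum(code, b):
    """Sum(x) = x mod 1^(b+1): every 2^((b+1)i) is congruent to 1."""
    return code % ((1 << (b + 1)) - 1)
```

For instance, summing the replicated vector $(3,3,3,3,3)$ with $b=4$ yields $15$, which indeed fits in $b$~bits.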
+
+If we want to avoid division, we can use double-precision multiplication instead:
+
+\setslot{\hbox{~$\0x_{d-1}$}}
+\def\.{\[]}
+\def\z{\[\0^b\1]}
+\def\dd{\slot{$\cdots$}}
+\def\vd{\slot{$\vdots$}}
+\def\rule{\noalign{\medskip\nointerlineskip}$\hrulefill$\cr\noalign{\nointerlineskip\medskip}}
+
+\alik{
+\[\0x_{d-1}] \dd \[\0x_2] \[\0x_1] \[\0x_0] \cr
+*~~ \z \dd \z\z\z \cr
+\rule
+\[x_{d-1}] \dd \[x_2] \[x_1] \[x_0] \cr
+\[x_{d-1}] \[x_{d-2}] \dd \[x_1] \[x_0] \. \cr
+\[x_{d-1}] \[x_{d-2}] \[x_{d-3}] \dd \[x_0] \. \. \cr
+\vd\vd\vd\vd\.\.\.\cr
+\[x_{d-1}] \dd \[x_2]\[x_1]\[x_0] \. \. \. \. \cr
+\rule
+\[r_{d-1}] \dd \[r_2] \[r_1] \[s_d] \dd \[s_3] \[s_2] \[s_1] \cr
+}
+
+This way, we also get the vector of all partial sums:
+$s_k=\sum_{i=0}^{k-1}x_i$, $r_k=\sum_{i=k}^{d-1}x_i$.
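The multiplication scheme above can be checked with a short Python sketch (assuming every partial sum fits in $b$~bits, so no carries leak between fields; helper name ours):

```python
def partial_sums(xs, b):
    """All prefix sums s_k and suffix sums r_k at once: multiplying the
    code by (0^b 1)^d smears each x_i across all higher fields."""
    d = len(xs)
    code = sum(x << ((b + 1) * i) for i, x in enumerate(xs))
    prod = code * sum(1 << ((b + 1) * j) for j in range(d))   # double precision
    def field(t):
        return (prod >> ((b + 1) * t)) & ((1 << b) - 1)
    s = [field(k - 1) for k in range(1, d + 1)]      # s_k = x_0 + ... + x_{k-1}
    r = [field(d + k - 1) for k in range(1, d)]      # r_k = x_k + ... + x_{d-1}
    return s, r
```

Field $k-1$ of the product holds $s_k$ and field $d-1+k$ holds $r_k$, exactly as in the picture.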
+
+\:$\<Cmp>(x,y)$ --- Compare vectors ${\bf x}$ and~${\bf y}$ element-wise,
+i.e., make a~vector~${\bf z}$ such that $z_i=1$ if $x_i<y_i$ and $z_i=0$ otherwise.
+
+We replace the separator zeroes in~$x$ by ones and subtract~$y$. These ones
+change back to zeroes exactly at the positions where $x_i<y_i$ and they stop
+carries from propagating, so the fields do not interact with each other:
+
+\setslot{\vbox{\hbox{~$x_{d-1}$}\hbox{~$y_{d-1}$}}}
+\def\9{\rack{\0}{\hss ?\hss}}
+\alik{
+ \1 \[x_{d-1}] \1 \[x_{d-2}] \[\cdots] \1 \[x_1] \1 \[x_0] \cr
+-~ \0 \[y_{d-1}] \0 \[y_{d-2}] \[\cdots] \0 \[y_1] \0 \[y_0] \cr
+\rule
+ \9 \[\ldots] \9 \[\ldots] \[\cdots] \9 \[\ldots] \9 \[\ldots] \cr
+}
+
+It only remains to shift the separator bits to the right positions, negate them
+and mask out all other bits.
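The whole \<Cmp> recipe fits in a few lines of Python (helper name ours; the result is the code of the vector~$\bf z$ with $z_i\in\{0,1\}$):

```python
def vec_cmp(x_code, y_code, b, d):
    """z_i = 1 iff x_i < y_i: set the separators of x, subtract y, then
    shift, negate and mask the surviving separator bits."""
    ones = sum(1 << ((b + 1) * i) for i in range(d))
    sep = ones << b                       # the separator-bit positions
    diff = (x_code | sep) - y_code        # a borrow clears a separator iff x_i < y_i
    return (~diff >> b) & ones
```

Comparing $(3,5,2)$ with $(4,5,1)$ element-wise gives $(1,0,0)$, i.e., only $x_0<y_0$.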
+
+\:$\<Rank>(x,\alpha)$ --- Return the number of elements of~${\bf x}$ which are less than~$\alpha$,
+assuming that the result fits in~$b$ bits:
+
+\alik{
+\<Rank>(x,\alpha) = \<Sum>(\<Cmp>(x,\<Replicate>(\alpha))). \cr
+}
+
+\:$\<Insert>(x,\alpha)$ --- Insert~$\alpha$ into a~sorted vector $\bf x$:
+
+We calculate the rank of~$\alpha$ in~$x$ first, then we insert~$\alpha$ into the particular
+field of~$\bf x$ using masking operations and shifts.
+
+\algo
+\:$k\=\<Rank>(x,\alpha)$.
+\:$\ell\=x \band \1^{(b+1)(d-k)}\0^{(b+1)k}$. \cmt{``left'' part of the vector}
+\:$r\=x \band \1^{(b+1)k}$. \cmt{``right'' part}
+\:Return $(\ell\shl (b+1)) \bor (\alpha\shl ((b+1)k)) \bor r$.
+\endalgo
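\<Rank> and \<Insert> compose the previous primitives; a Python sketch under the same conventions (helper names ours):

```python
def rank(x_code, alpha, b, d):
    """Rank(x, alpha) = Sum(Cmp(x, Replicate(alpha)))."""
    ones = sum(1 << ((b + 1) * i) for i in range(d))
    diff = (x_code | (ones << b)) - alpha * ones   # compare with Replicate(alpha)
    z = (~diff >> b) & ones                        # z_i = 1 iff x_i < alpha
    return z % ((1 << (b + 1)) - 1)                # Sum of the 0/1 fields

def insert(x_code, alpha, b, d):
    """Insert alpha into a sorted vector: split at the rank, shift the
    upper part one field up, and drop alpha into the gap."""
    k = rank(x_code, alpha, b, d)
    low = (1 << ((b + 1) * k)) - 1                 # mask of the k lowest fields
    return ((x_code & ~low) << (b + 1)) | (alpha << ((b + 1) * k)) | (x_code & low)
```

Inserting 5 into the sorted vector $(1,4,7)$ produces $(1,4,5,7)$.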
+
+\:$\<Unpack>(\alpha)$ --- Create a~vector whose elements are the bits of~$\(\alpha)_d$.
+In other words, insert blocks~$\0^b$ between the bits of~$\alpha$. Assuming that $b\ge d$,
+we can do it as follows:
+
+\algo
+\:$x\=\<Replicate>(\alpha)$.
+\:$y\=(2^{d-1},2^{d-2},\ldots,2^0)$. \cmt{bitwise encoding of this vector}
+\:$z\=x \band y$.
+\:Return $\<Cmp>(z,y) \bxor (\0^b\1)^d$.
+\endalgo
+
+Let us observe that $z_i$ is either zero or equal to~$y_i$ depending on the value
+of the $i$-th bit of the number~$\alpha$. Comparing it with~$y_i$ normalizes it
+to either zero or one, but in the opposite way than we need, so we flip the bits
+by an~additional $\bxor$.
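A Python sketch of \<Unpack> along these lines (helper name ours; we follow the convention that field~$i$ receives the $(d-1-i)$-th bit of~$\alpha$, matching the mask $y$ above):

```python
def unpack(alpha, b, d):
    """Spread the d bits of alpha into d fields of b bits each."""
    assert b >= d and 0 <= alpha < 1 << d
    ones = sum(1 << ((b + 1) * i) for i in range(d))
    x = alpha * ones                                   # Replicate(alpha)
    y = sum(1 << ((b + 1) * i + (d - 1 - i)) for i in range(d))
    z = x & y                                          # z_i is 0 or 2^(d-1-i)
    cmp_zy = (~((z | (ones << b)) - y) >> b) & ones    # Cmp(z, y): 1 where z_i < y_i
    return cmp_zy ^ ones                               # flip: 1 where the bit is set
```

Unpacking $\alpha = \1\0\1$ with $b=4$, $d=3$ yields the vector $(1,0,1)$ (most significant bit in field~0).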
+
+\:$\<Unpack>_\pi(\alpha)$ --- Like \<Unpack>, but change the order of the
+bits according to a~fixed permutation~$\pi$: The $i$-th element of the
+resulting vector is equal to~$\alpha[\pi(i)]$.
+
+Implemented as above, but with a~mask $y=(2^{\pi(d-1)},\ldots,2^{\pi(0)})$.
+
+\:$\<Pack>(x)$ --- The inverse of \<Unpack>: given a~vector of zeroes and ones,
+produce a~number whose bits are the elements of the vector (in other words,
+it crosses out the $\0^b$ blocks).
+
+We interpret the~$x$ as an~encoding of a~vector with elements one bit shorter
+and we sum these elements. For example, when $d=4$ and~$b=4$:
+
+\setslot{\hbox{$x_3$}}
+\def\z{\[\0]}
+\def\|{\hskip1pt\vrule height 10pt depth 4pt\hskip1pt}
+\def\.{\hphantom{\|}}
+
+\alik{
+\|\z\.\z\.\z\.\z\.\[x_3]\|\z\.\z\.\z\.\z\.\[x_2]\|\z\.\z\.\z\.\z\.\[x_1]\|\z\.\z\.\z\.\z\.\[x_0]\|\cr
+\|\z\.\z\.\z\.\z\|\[x_3]\.\z\.\z\.\z\|\z\.\[x_2]\.\z\.\z\|\z\.\z\.\[x_1]\.\z\|\z\.\z\.\z\.\[x_0]\|\cr
+}
+
+However, this ``reformatting'' does not produce a~correct encoding of a~vector,
+because the separator zeroes are missing. For this reason, the implementation
+of~\<Sum> using modulo does not work correctly (it produces $\0^b$ instead of $\1^b$).
+We therefore use the technique based on multiplication instead, which does not need
+the separators. (Alternatively, we can observe that $\1^b$ is the only case
+affected, so we can handle it separately.)
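A Python sketch of \<Pack> by the multiplication technique (helper name ours; with the convention that the element of field~$i$ becomes bit~$i$ of the result):

```python
def pack(x_code, b, d):
    """Reinterpret x as fields of width b (one bit shorter, no separators):
    field i then holds z_i * 2^i, and the multiplication trick sums all
    fields, so field d-1 of the product is the packed number sum_i z_i 2^i."""
    assert d <= b
    prod = x_code * sum(1 << (b * j) for j in range(d))
    return (prod >> (b * (d - 1))) & ((1 << b) - 1)
```

The total lands in field $d-1$ of the product and fits in $b$~bits because $d\le b$, so no separators are needed.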
+
+\endlist
+
+\paran{Scalar operations}%
+We can use the aforementioned tricks to perform interesting operations on individual
+numbers in constant time, too. Let us assume for a~while that we are
+operating on $b$-bit numbers and the word size is at least~$b^2$.
+This enables us to make use of intermediate vectors with $b$~elements
+of $b$~bits each.
+
+\algn{Integer operations in quadratic workspace}\id{lsbmsb}
+
+\itemize\ibull
+
+\:$\<Weight>(\alpha)$ --- Compute the Hamming weight of~$\alpha$, i.e., the number of ones in~$\(\alpha)$.
+
+We perform \<Unpack> and then \<Sum>.
+
+\:$\<Permute>_\pi(\alpha)$ --- Shuffle the bits of~$\alpha$ according
+to a~fixed permutation~$\pi$.
+
+We perform $\<Unpack>_\pi$ and \<Pack> back.
+
+\:$\<LSB>(\alpha)$ --- Find the least significant bit of~$\alpha$,
+i.e., the smallest~$i$ such that $\alpha[i]=1$.
+
+By a~combination of subtraction with $\bxor$, we create a~number
+that contains ones exactly at the position of $\<LSB>(\alpha)$ and below:
+
+\alik{
+\alpha&= \9\9\9\9\9\1\0\0\0\0\cr
+\alpha-1&= \9\9\9\9\9\0\1\1\1\1\cr
+\alpha\bxor(\alpha-1)&= \0\9\9\9\0\1\1\1\1\1\cr
+}
+
+Then we calculate the \<Weight> of the result and subtract~1.
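The $\alpha\bxor(\alpha-1)$ trick in Python (helper name ours; Python's popcount of the binary string stands in for the \<Unpack>-and-\<Sum> implementation of \<Weight>):

```python
def lsb_small(alpha):
    """LSB(alpha) = Weight(alpha xor (alpha - 1)) - 1, for alpha != 0."""
    assert alpha > 0
    low = alpha ^ (alpha - 1)          # ones exactly at the LSB and below
    return bin(low).count("1") - 1
```

For example, `lsb_small(0b101000)` returns 3.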
+
+\:$\<MSB>(\alpha)$ --- Find the most significant bit of~$\alpha$ (the position
+of the highest bit set).
+
+Reverse the bits of the number~$\alpha$ first by calling \<Permute>, then apply \<LSB>
+and subtract the result from~$b-1$.
+
+\endlist
+
+\paran{LSB and MSB}%
+As noted by Brodnik~\cite{brodnik:lsb} and others, the space requirements of
+the \<LSB> operation can be lowered to linear. We split the $w$-bit input to $\sqrt{w}$
+blocks of $\sqrt{w}$ bits each. Then we determine which blocks are non-zero and
+identify the lowest such block (this is a~\<LSB> of a~number whose bits
+correspond to the blocks). Finally we calculate the \<LSB> of this block. In
+both calls to \<LSB>, we have a $\sqrt{w}$-bit number in a~$w$-bit word, so we
+can use the previous algorithm. The same trick of course applies to finding the
+\<MSB>, too.
+
+The following algorithm shows the details.
+
+\algn{LSB in linear workspace}
+
+\algo\id{lsb}
+\algin A~$w$-bit number~$\alpha$.
+\:$b\=\lceil\sqrt{w}\,\rceil$. \cmt{the size of a~block}
+\:$\ell\=b$. \cmt{the number of blocks is the same}
+\:$x\=(\alpha \band (\0\1^b)^\ell) \bor ((\alpha \band (\1\0^b)^\ell) \shr 1)$.
+\hfill\break
+\cmt{encoding of a~vector~${\bf x}$ such that $x_i\ne 0$ iff the $i$-th block is non-zero}%
+\foot{Why is this so complicated? It is tempting to take $\alpha$ itself as a~code of this vector,
+but we must not forget the separator bits between elements, so we create them and
+relocate the bits we have overwritten.}
+\:$y\=\<Cmp>(0,x)$. \cmt{$y_i=1$ if the $i$-th block is non-zero, otherwise $y_i=0$}
+\:$\beta\=\<Pack>(y)$. \cmt{each block compressed to a~single bit}
+\:$p\=\<LSB>(\beta)$. \cmt{the index of the lowest non-zero block}
+\:$\gamma\=(\alpha \shr (b+1)p) \band \1^{b+1}$. \cmt{the contents of that block}
+\:$q\=\<LSB>(\gamma)$. \cmt{the lowest bit set there}
+\algout $\<LSB>(\alpha) = (b+1)p+q$.
+\endalgo
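A Python sketch of the two-level algorithm for $w=64$ (our choices: blocks of $b+1=9$ bits so that the separator-relocation step keeps each block inside a single field, and a 512-entry precomputed table standing in for the recursive $\O(\sqrt w)$-bit \<LSB> calls):

```python
# Precomputed LSB table for 9-bit values (the "previous algorithm" would
# compute these with quadratic workspace; a table plays its role here).
SMALL = [0] * 512
for v in range(1, 512):
    SMALL[v] = bin(v ^ (v - 1)).count("1") - 1

W, B, L = 64, 8, 8            # word size, data bits per block, number of blocks

def lsb64(alpha):
    assert 0 < alpha < 1 << W
    ones = sum(1 << ((B + 1) * i) for i in range(L))
    m1 = ones * ((1 << B) - 1)            # (0 1^8)^8: the data bits of each block
    m2 = ones << B                        # (1 0^8)^8: the would-be separator bits
    # create the separators and relocate the overwritten bits one position down
    x = (alpha & m1) | ((alpha & m2) >> 1)
    # Cmp(0, x): y_i = 1 iff block i is non-zero
    y = (~(m2 - x) >> B) & ones
    # Pack: reinterpret as 8-bit fields and sum them via multiplication
    beta = ((y * sum(1 << (B * j) for j in range(L))) >> (B * (L - 1))) & ((1 << B) - 1)
    p = SMALL[beta]                                    # the lowest non-zero block
    gamma = (alpha >> ((B + 1) * p)) & ((1 << (B + 1)) - 1)
    return (B + 1) * p + SMALL[gamma]
```

The eight 9-bit blocks cover 72 bits, comfortably holding a 64-bit word.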
+
+\paran{Constants}%
+We have used plenty of constants that depend on the format of the vectors.
+Either we can write non-uniform programs (see \ref{nonuniform}) and use native constants,
+or we can observe that all such constants can be easily manufactured. For example,
+$(\0^b\1)^d = \1^{(b+1)d} / \1^{b+1} = (2^{(b+1)d}-1)/(2^{b+1}-1) = ((1 \shl ((b+1)d))-1) / ((2\shl b) - 1)$. The only exceptions
+are the~$w$ and~$b$ in the LSB algorithm \ref{lsb}, which we are unable to produce
+in constant time. In practice we use the ``bit tricks'' as frequently called subroutines
+in an~encompassing algorithm, so we usually can afford spending a~lot of time on the precalculation
+of constants performed once during algorithm startup.
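The manufacture of the $(\0^b\1)^d$ constant, verified in Python (helper name ours):

```python
def ones_pattern(b, d):
    """(0^b 1)^d manufactured as (2^((b+1)d) - 1) / (2^(b+1) - 1)."""
    return ((1 << ((b + 1) * d)) - 1) // ((2 << b) - 1)
```

The division is exact, being the sum of a geometric series with ratio $2^{b+1}$.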
+
+\paran{History}%
+The history of combining arithmetic and logical operations to obtain fast programs for various
+interesting functions is blurred. Many of the bit tricks we have described were
+discovered independently by numerous people in the early ages of digital computers.
+Since then, they have become a~part of the computer science folklore. Probably the
+earliest documented occurrence is in a~1972 memo of the MIT Artificial Intelligence
+Lab \cite{hakmem}. However, until the work of Fredman and Willard nobody seemed to
+realize the full consequences.
+
+%--------------------------------------------------------------------------------
+
+\section{Q-Heaps}\id{qheaps}%
+
+We have shown how to perform non-trivial operations on a~set of values
+in constant time, but so far only under the assumption that the number of these
+values is small enough and that the values themselves are also small enough
+(so that the whole set fits in $\O(1)$ machine words). Now we will show how to
+lift the restriction on the magnitude of the values and still keep constant time
+complexity. We will describe a~slightly simplified version of the Q-heaps developed by
+Fredman and Willard in~\cite{fw:transdich}.
+
+The Q-heap represents a~set of at most~$k$ word-sized integers, where $k\le W^{1/4}$
+and $W$ is the word size of the machine. It will support insertion, deletion, finding
+of minimum and maximum, and other operations described below, in constant time, provided that
+we are willing to spend~$\O(2^{k^4})$ time on preprocessing.
+
+The exponential-time preprocessing may sound alarming, but a~typical application uses
+Q-heaps of size $k=\log^{1/4} N$, where $N$ is the size of the algorithm's input.
+This guarantees that $k\le W^{1/4}$ and $\O(2^{k^4}) = \O(N)$. Let us however
+remark that the whole construction is primarily of theoretical importance ---
+the huge multiplicative constants hidden in the~$\O$ make these heaps useless
+in practical algorithms. Despite this, many of the tricks we develop have proven
+themselves useful even in real-life data structures.
+
+Spending so much time on preprocessing makes it possible to precompute tables of
+almost arbitrary functions and then assume that the functions can be evaluated in
+constant time:
+
+\lemma\id{qhprecomp}%
+When~$f$ is a~function computable in polynomial time, $\O(2^{k^4})$ time is enough
+to precompute a~table of the values of~$f$ for all arguments whose size is $\O(k^3)$ bits.
+
+\proof
+There are $2^{\O(k^3)}$ possible combinations of arguments of the given size and for each of
+them we spend $\poly(k)$ time on calculating the function. It remains
+to observe that $2^{\O(k^3)}\cdot \poly(k) = \O(2^{k^4})$.
+\qed
+
+\paran{Tries and ranks}%
+We will first develop an~auxiliary construction based on tries and then derive
+the real definition of the Q-heap from it.
+
+\nota
+\itemize\ibull
+\:$W$ --- the word size of the RAM,
+\:$k = \O(W^{1/4})$ --- the limit on the size of the heap,
+\:$n\le k$ --- the number of elements stored in the heap,
+\:$X=\{x_1, \ldots, x_n\}$ --- the elements themselves: distinct $W$-bit numbers
+indexed in a~way that $x_1 < \ldots < x_n$,
+\:$g_i = \<MSB>(x_i \bxor x_{i+1})$ --- the position of the most significant bit in which $x_i$ and~$x_{i+1}$ differ ($1\le i<n$),
+\:$R_X(x)$ --- the rank of~$x$ in~$X$, that is the number of elements of~$X$ which are less than~$x$
+(where $x$~itself need not be an~element of~$X$).\foot{We will dedicate the whole Chapter~\ref{rankchap} to the
+study of various ranks.}
+\endlist
+
+\defn
+A~\df{trie} for a~set of strings~$S$ over a~finite alphabet~$\Sigma$ is
+a~rooted tree whose vertices are the prefixes of the strings in~$S$ and there
+is an~edge going from a~prefix~$\alpha$ to a~prefix~$\beta$ iff $\beta$ can be
+obtained from~$\alpha$ by appending a~single symbol of the alphabet. The edge
+will be labeled with that particular symbol. We will also define the~\df{letter depth}
+of a~vertex as the length of the corresponding prefix. We mark the vertices
+which match a~string of~$S$.
+
+A~\df{compressed trie} is obtained by removing the vertices of outdegree~1
+except for the root and the marked vertices.
+Wherever there is a~directed path whose internal vertices have outdegree~1 and they carry
+no mark, we replace this path by a~single edge labeled with the concatenation
+of the original edges' labels.
+
+In both kinds of tries, we order the outgoing edges of every vertex by their labels
+lexicographically.
+
+\obs
+In both tries, the root of the tree is the empty word. Generally, the prefix
+in a~vertex is equal to the concatenation of edge labels on the path
+leading from the root to that vertex. The letter depth of the vertex is equal to
+the total size of these labels. All leaves correspond to strings in~$S$, but so can
+some internal vertices if there are two strings in~$S$ such that one is a~prefix
+of the other.
+
+Furthermore, the labels of all edges leaving a~common vertex are always
+distinct and when we compress the trie, no two such labels share their initial
+symbols. This allows us to search in the trie efficiently: when looking for
+a~string~$x$, we follow the path from the root and whenever we visit
+an~internal vertex of letter depth~$d$, we test the $d$-th character of~$x$,
+we follow the edge whose label starts with this character, and we check that the
+rest of the label matches.
+
+The compressed trie is also efficient in terms of space consumption --- it has
+$\O(\vert S\vert)$ vertices (this can be easily shown by induction on~$\vert S\vert$)
+and all edge labels can be represented in space linear in the sum of the
+lengths of the strings in~$S$.
+
+\defn
+For our set~$X$, we define~$T$ as a~compressed trie for the set of binary
+encodings of the numbers~$x_i$, padded to exactly $W$~bits, i.e., for $S = \{ \(x)_W \mid x\in X \}$.
+
+\float{\valign{\vfil#\vfil\cr
+\hbox{\epsfbox{pic/qheap.eps}}\cr
+\noalign{\qquad\quad}
+ \halign{#\hfil&\quad#\hfil\cr
+ $x_1 = \0\0\0\0\1\1$ & $g_1=3$ \cr
+ $x_2 = \0\0\1\0\1\0$ & $g_2=4$ \cr
+ $x_3 = \0\1\0\0\0\1$ & $g_3=2$ \cr
+ $x_4 = \0\1\0\1\0\1$ & $g_4=5$ \cr
+ $x_5 = \1\0\0\0\0\0$ & $g_5=0$ \cr
+ $x_6 = \1\0\0\0\0\1$ \cr
+ }\cr
+}}{Six numbers stored in a~compressed trie}
+
+\obs
+The trie~$T$ has several interesting properties. Since all words in~$S$ have the same
+length, the leaves of the trie correspond to these exact words, that is to the numbers~$x_i$.
+The depth-first traversal of the trie enumerates the words of~$S$ in lexicographic order
+and therefore also the~$x_i$'s in the order of their values. Between each
+pair of leaves $x_i$ and~$x_{i+1}$ it visits an~internal vertex whose letter depth
+is exactly~$W-1-g_i$.
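The guides in the figure can be checked straight from the definition $g_i=\<MSB>(x_i\bxor x_{i+1})$, with bit positions counted from the least significant bit of the $W=6$ bit strings:

```python
# The six numbers from the figure, in increasing order.
xs = [0b000011, 0b001010, 0b010001, 0b010101, 0b100000, 0b100001]
# g_i = MSB(x_i xor x_{i+1}); bit_length() - 1 is the MSB position.
guides = [(u ^ v).bit_length() - 1 for u, v in zip(xs, xs[1:])]
```

This reproduces $g_1,\ldots,g_5 = 3, 4, 2, 5, 0$ as listed next to the picture.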
+
+\para
+Let us now modify the algorithm for searching in the trie and make it compare
+only the first symbols of the edges. In other words, we will test only the bits~$g_i$
+which will be called \df{guides} (as they guide us through the tree). For $x\in
+X$, the modified algorithm will still return the correct leaf. For all~$x$ outside~$X$
+it will no longer fail and instead it will land on some leaf~$x_i$. At the
+first sight the number~$x_i$ may seem unrelated, but we will show that it can be
+used to determine the rank of~$x$ in~$X$, which will later form a~basis for all
+Q-heap operations:
+
+\lemma\id{qhdeterm}%
+The rank $R_X(x)$ is uniquely determined by a~combination of:
+\itemize\ibull
+\:the trie~$T$,
+\:the index~$i$ of the leaf found when searching for~$x$ in~$T$,
+\:the relation ($<$, $=$, $>$) between $x$ and $x_i$,
+\:the bit position $b=\<MSB>(x\bxor x_i)$ of the first disagreement between~$x$ and~$x_i$.
+\endlist
+
+\proof
+If $x\in X$, we detect that from $x_i=x$ and the rank is obviously~$i-1$.
+Let us assume that $x\not\in X$ and imagine that we follow the same path as when
+searching for~$x$,
+but this time we check the full edge labels. The position~$b$ is the first position
+where~$\(x)$ disagrees with a~label. Before this point, all edges not taken by
+the search were leading either to subtrees containing elements all smaller than~$x$
+or all larger than~$x$ and the only values not known yet are those in the subtree
+below the edge that we currently consider. Now if $x[b]=0$ (and therefore $x<x_i$),
+all values in that subtree have $x_j[b]=1$ and thus they are larger than~$x$. In the other
+case, $x[b]=1$ and $x_j[b]=0$, so they are smaller.
+\qed
+
+\paran{A~better representation}%
+The preceding lemma shows that the rank can be computed in polynomial time, but
+unfortunately the variables on which it depends are too large for a~table to
+be efficiently precomputed. We will carefully choose an~equivalent representation
+of the trie which is compact enough.
+
+\lemma\id{citree}%
+The compressed trie is uniquely determined by the order of the guides~$g_1,\ldots,g_{n-1}$.
+
+\proof
+We already know that the letter depths of the trie vertices are exactly
+the numbers~$W-1-g_i$. The root of the trie must have the smallest of these
+letter depths, i.e., it must correspond to the highest numbered bit. Let
+us call this bit~$g_i$. This implies that the values $x_1,\ldots,x_i$
+must lie in the left subtree of the root and $x_{i+1},\ldots,x_n$ in its
+right subtree. Both subtrees can be then constructed recursively.\foot{This
+construction is also known as the \df{cartesian tree} for the sequence
+$g_1,\ldots,g_{n-1}$ and it is useful in many other algorithms as it can be
+built in $\O(n)$ time. A~nice application on the Lowest Common Ancestor
+and Range Minimum problems has been described by Bender et al.~in \cite{bender:lca}.}
+\qed
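The recursive construction from the proof can be sketched in Python (helper name ours; leaves are returned as their indices, internal vertices as pairs of subtrees):

```python
def trie_shape(guides, lo, hi):
    """Rebuild the shape of the compressed trie over leaves lo..hi from the
    guides alone: the largest guide in the range (i.e. the smallest letter
    depth) marks the root's split; the subtrees follow recursively."""
    if lo == hi:
        return lo                                       # the single leaf x_lo
    i = max(range(lo, hi), key=lambda j: guides[j])     # split between x_i, x_{i+1}
    return (trie_shape(guides, lo, i), trie_shape(guides, i + 1, hi))
```

On the guides $3,4,2,5,0$ of the figure this yields the shape $(((x_1,x_2),(x_3,x_4)),(x_5,x_6))$, matching the picture. Within any range, the maximum guide occurs exactly once, since the corresponding internal vertex is visited once between its two subtrees.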
+
+\para
+Unfortunately, the vector of the $g_i$'s is also too long (it has $k\log W$ bits
+and we have no upper bound on~$W$ in terms of~$k$), so we will compress it even
+further:
+
+\nota\id{qhnota}%
+\itemize\ibull
+\:$B = \{g_1,\ldots,g_{n-1}\}$ --- the set of bit positions of all the guides, stored as a~sorted array,
+\:$G : \{1,\ldots,n-1\} \rightarrow \{1,\ldots,n-1\}$ --- a~function mapping
+the guides to their bit positions in~$B$: $g_i = B[G(i)]$,
+\:$x[B]$ --- a~bit string containing the bits of~$x$ originally located
+at the positions given by~$B$, i.e., the concatenation of bits $x[B[1]],
+x[B[2]],\ldots, x[B[\vert B\vert]]$.
+\endlist
+
+\obs\id{qhsetb}%
+The set~$B$ has $\O(k\log W)=\O(W)$ bits, so it can be stored in a~constant number
+of machine words in the form of a~sorted vector. The function~$G$ can be also stored as a~vector
+of $\O(k\log k)$ bits. We can change a~single~$g_i$ in constant time using
+vector operations: First we delete the original value of~$g_i$ from~$B$ if it
+is not used anywhere else. Then we add the new value to~$B$ if it was not
+there yet and we write its position in~$B$ to~$G(i)$. Whenever we insert
+or delete a~value in~$B$, the values at the higher positions shift one position
+up or down and we have to update the pointers in~$G$. This can be fortunately
+accomplished by adding or subtracting a~result of vector comparison.
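The pointer update by "adding a result of vector comparison" can be sketched in Python (helper name ours; $G$ is stored as a vector code, one pointer per $(b+1)$-bit field, and we assume $pos\ge1$):

```python
def bump_pointers(g_code, pos, b, d):
    """After inserting a value at position pos of B, add 1 to every field of
    G that is >= pos, by adding the vector Cmp(Replicate(pos-1), G)."""
    ones = sum(1 << ((b + 1) * i) for i in range(d))
    rep = (pos - 1) * ones                               # Replicate(pos-1)
    z = (~((rep | (ones << b)) - g_code) >> b) & ones    # z_i = 1 iff pos-1 < G_i
    return g_code + z                                    # field-wise +1 where z_i = 1
```

Deletion from~$B$ is symmetric, subtracting the comparison result instead.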
+
+In this representation, we can reformulate our lemma on ranks as follows:
+
+\lemma\id{qhrank}%
+The rank $R_X(x)$ can be computed in constant time from:
+\itemize\ibull
+\:the function~$G$,
+\:the values $x_1,\ldots,x_n$,
+\:the bit string~$x[B]$,
+\:$x$ itself.
+\endlist
+
+\proof
+Let us prove that all ingredients of Lemma~\ref{qhdeterm} are either small
+enough or computable in constant time.
+
+We know that the shape of the trie~$T$ is uniquely determined by the order of the $g_i$'s
+and therefore by the function~$G$ since the array~$B$ is sorted. The shape of
+the trie together with the bits in $x[B]$ determine the leaf~$x_i$ found when searching
+for~$x$ using only the guides. This can be computed in polynomial time and it
+depends on $\O(k\log k)$ bits of input, so according to Lemma~\ref{qhprecomp}
+we can look it up in a~precomputed table.
+
+The relation between $x$ and~$x_i$ can be obtained directly as we know the~$x_i$.
+The bit position of the first disagreement can be calculated in constant time
+using the Brodnik's LSB/MSB algorithm (\ref{lsb}).
+
+All these ingredients can be stored in $\O(k\log k)$ bits, so we may assume
+that the rank can be looked up in constant time as well.
+\qed
+
+\para
+We would like to store the set~$X$ as a~sorted array together
+with the corresponding trie, which will allow us to determine the position
+for a~newly inserted element in constant time. However, the set is too large
+to fit in a~vector and we cannot perform insertion on an~ordinary array in
+constant time. This can be worked around by keeping the set in an~unsorted
+array and storing a~separate vector containing the permutation that sorts the array.
+We can then insert a~new element at an~arbitrary place in the array and just
+update the permutation to reflect the correct order.
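The unsorted-array-plus-permutation idea in a small Python sketch (class name ours; a plain list stands in for the bit-vector representation of the permutation):

```python
class PermutedArray:
    """Keep the elements unsorted and a separate permutation that sorts
    them, so an insertion writes at the end of the array and only the
    permutation shifts."""
    def __init__(self):
        self.X, self.perm = [], []        # X[perm[0]] < X[perm[1]] < ...

    def insert(self, x, rank):
        """Insert x, whose rank among the stored elements is known."""
        self.X.append(x)                  # arbitrary place: the end
        self.perm.insert(rank, len(self.X) - 1)

    def sorted_elems(self):
        return [self.X[j] for j in self.perm]
```

Only the permutation is reshuffled, which in the Q-heap is a constant number of vector operations.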
+
+\paran{The Q-heap}%
+We are now ready for the real definition of the Q-heap and for the description
+of the basic operations on it.
+
+\defn
+A~\df{Q-heap} consists of:
+\itemize\ibull
+\:$k$, $n$ --- the capacity of the heap and the current number of elements (word-sized integers),
+\:$X$ --- the set of word-sized elements stored in the heap (an~array of words in an~arbitrary order),
+\:$\varrho$ --- a~permutation on~$\{1,\ldots,n\}$ such that $X[\varrho(1)] < \ldots < X[\varrho(n)]$
+(a~vector of $\O(n\log k)$ bits; we will write $x_i$ for $X[\varrho(i)]$),
+\:$B$ --- a~set of ``interesting'' bit positions
+(a~sorted vector of~$\O(n\log W)$ bits),
+\:$G$ --- the function that maps the guides to the bit positions in~$B$
+(a~vector of~$\O(n\log k)$ bits),
+\:precomputed tables of various functions.
+\endlist
+
+\algn{Search in the Q-heap}\id{qhfirst}%
+\algo
+\algin A~Q-heap and an~integer~$x$ to search for.
+\:$i\=R_X(x)+1$, using Lemma~\ref{qhrank} to calculate the rank.
+\:If $i\le n$ return $x_i$, otherwise return {\sc undefined.}
+\algout The smallest element of the heap which is greater or equal to~$x$.
+\endalgo
+
+\algn{Insertion to the Q-heap}
+\algo
+\algin A~Q-heap and an~integer~$x$ to insert.
+\:$i\=R_X(x)+1$, using Lemma~\ref{qhrank} to calculate the rank.
+\:If $x=x_i$, return immediately (the value is already present).
+\:Insert the new value to~$X$:
+\::$n\=n+1$.
+\::$X[n]\=x$.
+\::Insert~$n$ at the $i$-th position in the permutation~$\varrho$.
+\:Update the $g_j$'s:
+\::Move all~$g_j$ for $j\ge i$ one position up. \hfil\break
+ This translates to insertion in the vector representing~$G$.
+\::Recalculate $g_{i-1}$ and~$g_i$ according to the definition.
+ \hfil\break Update~$B$ and~$G$ as described in~\ref{qhsetb}.
+\algout The updated Q-heap.
+\endalgo
+
+\algn{Deletion from the Q-heap}
+\algo
+\algin A~Q-heap and an~integer~$x$ to be deleted from it.
+\:$i\=R_X(x)+1$, using Lemma~\ref{qhrank} to calculate the rank.
+\:If $i>n$ or $x_i\ne x$, return immediately (the value is not in the heap).
+\:Delete the value from~$X$:
+\::$X[\varrho(i)]\=X[n]$.
+\::Find $j$ such that~$\varrho(j)=n$ and set $\varrho(j)\=\varrho(i)$.
+\::$n\=n-1$.
+\:Update the $g_j$'s like in the previous algorithm.
+\algout The updated Q-heap.
+\endalgo
+
+\algn{Finding the $i$-th smallest element in the Q-heap}\id{qhlast}%
+\algo
+\algin A~Q-heap and an~index~$i$.
+\:If $i<1$ or $i>n$, return {\sc undefined.}
+\:Return~$x_i$.
+\algout The $i$-th smallest element in the heap.
+\endalgo
+
+\paran{Extraction}%
+The heap algorithms we have just described have been built from primitives
+operating in constant time, with one notable exception: the extraction
+$x[B]$ of all bits of~$x$ at positions specified by the set~$B$. This cannot be done
+in~$\O(1)$ time on the Word-RAM, but we can implement it with ${\rm AC}^0$
+instructions as suggested by Andersson in \cite{andersson:fusion} or even
+with those ${\rm AC}^0$ instructions present on real processors (see Thorup
+\cite{thorup:aczero}). On the Word-RAM, we need to make use of the fact
+that the set~$B$ is not changing too much --- there are $\O(1)$ changes
+per Q-heap operation. As Fredman and Willard have shown, it is possible
+to maintain a~``decoder'', whose state is stored in $\O(1)$ machine words
+and which helps us to extract $x[B]$ in a~constant number of operations:
+
+\lemman{Extraction of bits}\id{qhxtract}%
+Under the assumptions on~$k$, $W$ and the preprocessing time as in the Q-heaps,\foot{%
+Actually, this is the only place where we need~$k$ to be as low as $W^{1/4}$.
+In the ${\rm AC}^0$ implementation, it is enough to ensure $k\log k\le W$.
+On the other hand, we need not care about the exponent because it can
+be increased arbitrarily using the Q-heap trees described below.}
+it is possible to maintain a~data structure for a~set~$B$ of bit positions,
+which allows~$x[B]$ to be extracted in $\O(1)$ time for an~arbitrary~$x$.
+When a~single element is inserted to~$B$ or deleted from~$B$, the structure
+can be updated in constant time, as long as $\vert B\vert \le k$.
+
+\proof
+See Fredman and Willard \cite{fw:transdich}.
+\qed
+
+\para
+This was the last missing bit of the mechanics of the Q-heaps. We are
+therefore ready to conclude this section by the following theorem
+and its consequences:
+
+\thmn{Q-heaps, Fredman and Willard \cite{fw:transdich}}\id{qh}%
+Let $W$ and~$k$ be positive integers such that $k=\O(W^{1/4})$. Let~$Q$
+be a~Q-heap of at most $k$~elements of $W$~bits each. Then the Q-heap
+operations \ref{qhfirst} to \ref{qhlast} on~$Q$ (insertion, deletion,
+search for a~given value and search for the $i$-th smallest element)
+run in constant time on a~Word-RAM with word size~$W$, after spending
+time $\O(2^{k^4})$ on the same RAM on precomputing of tables.
+
+\proof
+Every operation on the Q-heap can be performed in a~constant number of
+vector operations and calculations of ranks. The ranks are computed
+in $\O(1)$ steps involving again $\O(1)$ vector operations, binary
+logarithms and bit extraction. All these can be calculated in constant
+time using the results of section \ref{bitsect} and Lemma \ref{qhxtract}.
+\qed
+
+\paran{Combining Q-heaps}%
+We can also use the Q-heaps as building blocks of more complex structures
+like Atomic heaps and AF-heaps (see once again \cite{fw:transdich}). We will
+show a~simpler, but often sufficient construction, sometimes called the \df{Q-heap tree.}
+Suppose we have a~Q-heap of capacity~$k$ and a~parameter $d\in{\bb N}^+$. We
+can build a~balanced $k$-ary tree of depth~$d$ such that its leaves contain
+a~given set and every internal vertex keeps the minimum value in the subtree
+rooted in it, together with a~Q-heap containing the values in all its sons.
+This allows minimum to be extracted in constant time (it is placed in the root)
+and when any element is changed, it is sufficient to recalculate the values
+from the path from this element to the root, which takes $\O(d)$ Q-heap
+operations.
+
+\corn{Q-heap trees}\id{qhtree}%
+For every positive integer~$r$ and $\delta>0$ there exists a~data structure
+capable of maintaining the minimum of a~set of at most~$r$ word-sized numbers
+under insertions and deletions. Each operation takes $\O(1)$ time on a~Word-RAM
+with word size $W=\Omega(r^{\delta})$, after spending time
+$\O(2^{r^\delta})$ on precomputing of tables.