4 A very common need is sorting data. Therefore libUCW contains few
5 routines to accomplish that task. They are much more universal than
6 qsort(), since they allow you to sort structures indexed by a macro,
7 sort data externally, if they do not fit into memory, merge data with
8 the same keys and sort data of variable length.
10 All routines described below are <<generic:,generic algorithms>>.
12 - <<array-simple,Simple array sorting>>
13 * <<mandatory-simple,Mandatory macros>>
14 * <<optional-simple,Optional macros>>
15 * <<example-simple,Example>>
16 - <<array,Huge array sorting>>
17 * <<mandatory-array,Mandatory macros>>
18 * <<optional-array,Optional macros>>
19 - <<external,External sorting>>
20 * <<basic-external,Basic macros>>
21 * <<callback-external,Callbacks>>
22 * <<integer-external,Integer sorting>>
23 * <<hash-external,Hashing>>
24 * <<merge-external,Merging>>
25 * <<input-external,Input>>
26 * <<output-external,Output>>
27 * <<other-external,Other switches>>
28 * <<function-external,Generated function>>
34 If you want to sort some data in memory and you aren't too picky about
35 setting how, you just use the routine defined in
36 `sorter/array-simple.h`. It is an optimised hybrid
37 quick-sort/insert-sort algorithm (quick-sort is used to split the
38 input into small parts, each is then sorted by insert-sort). It is
39 more than 2 times faster than stdlib's qsort(), mostly because of
42 You need to define few macros and include the header. You get a
43 sorting function in return. It will be called
44 <<fun__GENERIC_LINK_|ASORT_PREFIX|sort|,`ASORT_PREFIX(sort)`>>.
49 - `ASORT_PREFIX(name)` -- The identifier generating macro.
50 - `ASORT_KEY_TYPE` -- Data type of a single array entry key.
55 - `ASORT_ELT(i)` -- Indexing macro. Returns the key of the
56 corresponding entry. If not provided, usual array with sequential
58 - `ASORT_LT(x,y)` -- Comparing macro. If not provided, compares by the
60 - `ASORT_SWAP(i,j)` -- Swap elements with indices `i` and `j`. If not
61 provided, it assumes `ASORT_ELT` is l-value and it just swaps keys.
62 - `ASORT_TRESHOLD` -- Sequences of at most this amount of elements are
63 sorted by quick-sort, smaller are sorted by insert-sort. Defaults to
64 `8` (result of experimentation).
65 - `ASORT_EXTRA_ARGS` -- Pass some extra arguments to the function.
66 They are visible from all the macros. Must start with a comma.
68 !!ucw/sorter/array-simple.h ASORT_PREFIX
74 Let's sort an array of integers, in the usual way.
76 #define ASORT_PREFIX(X) intarr_##X
77 #define ASORT_TYPE int
78 #include <ucw/sorter/array-simple.h>
80 This generates an intarr_sort(int *array, uns array_size) function that
81 can be used the obvious way.
83 A more complicated example could be sorting a structure, where items
84 with odd indices are stored in one array, even in another. Each item
85 could be a structure containing a string and an integer. We would like
86 to sort them by the strings.
93 #include <string.h> // Because of strcmp
94 #define ASORT_PREFIX(X) complicated_##X
95 #define ASORT_TYPE struct elem
96 #define ASORT_ELT(i) ((i % 2 ? even_array : odd_array)[i / 2])
97 #define ASORT_LT(x, y) (strcmp((x).string, (y).string) < 0)
98 #define ASORT_EXTRA_ARGS , struct elem *odd_array, struct elem *even_array
99 #include <ucw/sorter/sorter/array-simple.h>
101 Now we got a complicated_sort(uns array_size, struct elem *odd_array,
102 struct *even_array) function to perform our sorting.
108 This one is very similar to the simple array sorter, but it is
109 optimised for huge arrays. It is used mostly by the
110 <<external,external sorter>> machinery described below, but you can
113 It is in the `sorter/array.h` header.
115 It differs in few details:
116 - It supports only continuous arrays, no indexing macro can be
118 - It is able to sort in parallel on SMP systems. It assumes all
119 callbacks you provide are thread-safe.
120 - If you provide a monotone hash function (if `hash(x) < hash(y)`, then
121 `x < y`, but `x` and `y` may differ when `hash(x) == hash(y)`), it
122 will use it to gain some more speed by radix-sort.
128 - `ASORT_PREFIX(x)` -- The identifier generating macro.
129 - `ASORT_KEY_TYPE` -- Type of elements in the array.
135 - `ASORT_LT(x,y)` -- Comparing macro. Uses the `<` operator if not
137 - `ASORT_HASH(x)` -- A monotone hash function (or macro). Should
139 - `ASORT_LONG_HASH(x)` -- Like `ASORT_HASH(x)`, but returns 64-bit
140 number instead of 32-bit.
141 - `ASORT_TRESHOLD` -- How small should a chunk of data be to be sorted
142 by insert-sort? Defaults to `8` elements.
143 - `ASORT_RADIX_BITS` -- How many bits of the hash function should be
144 used at once for radix-sort? The default is guessed from your
147 !!ucw/sorter/array.h ASORT_PREFIX
153 If you have too much data to fit into memory, you need to employ
154 external sorting. This external sorter operates on
155 <<fastbuf:,fastbufs>> containing sequences of items. Each item
156 consists of a key, optionally followed by data. Both the keys and data
157 may be of variable length, but the keys must be represented by
158 fixed-size type in memory. The length of data must be computable from
159 the key. Data are just copied verbatim, unless you use the merging
160 mode, in which data with the same keys get merged together.
162 All callbacks must be thread safe.
164 The sorter resides in the `sorter/sorter.h` header file.
170 You need to provide some basic macros. Some of them are optional.
172 - `SORT_PREFIX(x)` -- Identifier generating macro. This one is
174 - `SORT_KEY` -- Data structure holding the key of item in memory. The
175 representation on disk may be different. Either this one or
176 `SORT_KEY_REGULAR` must be provided.
177 - `SORT_KEY_REGULAR` -- You may use this instead of `SORT_KEY`, when
178 the keys have the same representation both in memory and on disk.
179 Then the sorter uses <<fastbuf:bread()>> and <<fastbuf:bwrite()>> to
180 load and store them. It also assumes the keys are not very long.
181 - `SORT_KEY_SIZE(key)` -- Returns the real size of the key. The sorter
182 can use this to save space and truncate the key to the given number
183 of bytes, when the keys have variable lengths. If the keys have
184 fixed sizes, there is no need for this macro.
185 - `SORT_DATA_SIZE(key)` -- Returns the amount of data following this
186 key. If you do not provide this one, the sorter assumes there are
187 only keys and no data.
189 [[callback-external]]
193 Furthermore, you need to provide these callback functions (make sure
194 they are thread safe):
196 - `int SORT_PREFIX(compare)(SORT_KEY *a, SORT_KEY *b)` -- Comparing
197 function. It should act like strcmp(). Mandatory unless provided by
198 <<integer-external,integer sorting>>.
199 - `int SORT_PREFIX(read_key)(struct fastbuf *f, SORT_KEY *k)` --
200 Should read a key from the provided <<fastbuf:,fastbuf>> @f and
201 store it into @k. Returns nonzero when ok and zero when an `EOF` was
202 met. Mandatory unless `SORT_KEY_REGULAR` is defined.
203 - `void SORT_PREFIX(write_key)(struct fastbuf *f, SORT_KEY *k)` --
204 Should store key @k into @f. Mandatory unless `SORT_KEY_REGULAR` is
211 If you sort by an integer value (either computed or available from
212 the key), you can use this to save yourself some functions. It also
213 activates the <<hash-external,hashing>> automatically.
215 - `SORT_INT(key)` -- This macro returns the integer to sort by. When
216 you provide it, the compare function is automatically provided for
217 you and the sorting function gets another parameter specifying the
218 range of the integers. The better the range fits, the faster the
220 - `SORT_INT64(key)` -- The same, but with 64-bit integers.
226 If you have a monotone hash function for your keys, you may speed the
227 sorting up by providing it. Monotone hashing function must satisfy if
228 `hash(x) < hash(y)`, then `x < y`. It should be approximately
229 uniformly distributed.
231 When you want to use it, define `SORT_HASH_BITS` and set it to the
232 number of significant bits the hashing function provides. Then provide
233 a callback function `uns SORT_PREFIX(hash)(SORT_KEY *key)`.
236 Merging items with identical keys
237 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
239 The sorter is able to merge items with the same keys (the compare
240 function returns `0` for them). To use it, define `SORT_UNIFY` macro
241 and provide these functions:
243 - `void SORT_PREFIX(write_merged)(struct fastbuf \*dest, SORT_KEY
244 \*\*keys, void \*\*data, uns n, void *buf)`
245 -- This function takes @n records in memory and writes a single
246 record into the @dest <<fastbuf:,fastbuf>>. The @keys and @data are
247 just the records. The @buf parameter points to a workspace memory.
248 It is guaranteed to hold at last the sum of `SUM_UNIFY_WORKSPACE()`
249 macro over all the keys. The function is allowed to modify all its
251 - `void SORT_PREFIX(copy_merged)(SORT_KEY \*\*keys, struct fastbuf
252 \*\*data, uns n, struct fastbuf \*dest)`
253 -- This one is similar to the above one, but the data are still in
254 the <<fastbuf:,fastbufs>> @data and no workspace is provided. This
255 is only used when `SORT_DATA_SIZE` or `SORT_UNIFY_WORKSPACE` is
257 - `SORT_UNIFY_WORKSPACE(key)` -- Returns the amount of workspace
258 needed when merging this record. Defaults to `0`.
264 To tell the sorter where is the input, you specify one of these
267 - `SORT_INPUT_FILE` -- The function takes a filename.
268 - `SORT_INPUT_FB` -- The input is a seekable fastbuf stream.
269 - `SORT_INPUT_PIPE` -- The input is a non-seekable fastbuf stream.
270 - `SORT_INPUT_PRESORT` -- The input is a custom presorter. In this
271 case, you need to write a presorting function `int
272 SORT_PREFIX(presort)(struct fastbuf *dest, void *buf, size_t
273 bufsize)`. The function gets a buffer @buf of size @buf_size to
274 presort in and is supposed to write presorted bunch of data into the
275 @dest buffer. Should return `1` on success or `0` on `EOF` (all it
276 could was already written, no more data). In this case, you can
277 safely pass NULL as the input parameter. The function may be used to
278 generate the data on the fly. The function does not have to be
279 thread safe (it can access global variables).
281 If you define `SORT_DELETE_INPUT` and it evaluates to true (nonzero),
282 the input files are deleted as soon as possible.
288 You can configure the output in a similar way. Define one of macros:
290 - `SORT_OUTPUT_FILE` -- The function takes a filename.
291 - `SORT_OUTPUT_FB` -- The function should be provided with NULL and
292 the fastbuf with data is returned.
293 - `SORT_THIS_FB` -- A fastbuf is provided to the function and it
294 writes into it. It can already contain some data.
300 You may define the `SORT_UNIQUE` macro if all keys are distinct. It is
301 checked in debug mode.
303 [[function-external]]
304 The generated function
305 ~~~~~~~~~~~~~~~~~~~~~~
307 A `SORT_PREFIX(sort)()` function is generated after you include the
308 `sorter/sorter.h` header. It has up to three parameters:
310 - Input. It is either a string (a filename) if you use
311 `SORT_INPUT_FILE` or a fastbuf (otherwise). It should be set to NULL
312 if you use the `SORT_INPUT_PRESORT` input.
313 - Output. It is either a string (a filename) if you defined the
314 `SORT_OUTPUT_FILE` or a fastbuf. It must be NULL if you defined
316 - Integer range. The maximum value of integers that are used in the
317 <<integer-external,integer sorting>>. This parameter is here only
318 if you defined `SORT_INT` or `SORT_INT64`.
320 The function returns a fastbuf you can read the data from.