Martin Mares [Sun, 7 Dec 2003 14:23:58 +0000 (14:23 +0000)]
Improved and cleaned up the bucket library. The original "single operation
pending per process" invariant was no longer feasible (and it caused several
problems in Shepherd).
Reading and writing of buckets now uses dynamically allocated fastbufs and
there can be any number of readers at any time, but only a single writer
(otherwise a deadlock would occur). Read streams are seekable, write streams
at least btell()-able.
Also removed the omnipresent global variables for the start of the current
bucket etc.; each part (Find, Slurp, Create, Shakedown, ...) has its own state
variables.
Martin Mares [Sat, 22 Nov 2003 18:21:22 +0000 (18:21 +0000)]
Added very simple functions for emulating a fastbuf stream over a static
buffer. The struct fastbuf is allocated statically as well to make everything
as simple and as fast as possible.
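For illustration only, a stream over a caller-supplied static buffer might look like the sketch below. The struct and function names (membuf_sketch, membuf_open_read, membuf_getc, membuf_read) are hypothetical and are not the functions added by this commit; the real struct fastbuf has a different layout.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical read-only stream over a caller-supplied buffer.
 * Nothing is allocated dynamically: the caller owns both the
 * buffer and the stream structure. */
struct membuf_sketch {
  const unsigned char *pos;   /* next byte to be read */
  const unsigned char *end;   /* one past the last valid byte */
};

/* Attach the stream to an existing buffer of len bytes. */
static void membuf_open_read(struct membuf_sketch *mb, const void *buf, size_t len)
{
  mb->pos = buf;
  mb->end = mb->pos + len;
}

/* Return the next byte, or -1 once the buffer is exhausted. */
static int membuf_getc(struct membuf_sketch *mb)
{
  return (mb->pos < mb->end) ? *mb->pos++ : -1;
}

/* Copy up to len bytes to dst and return the number actually copied. */
static size_t membuf_read(struct membuf_sketch *mb, void *dst, size_t len)
{
  size_t avail = (size_t) (mb->end - mb->pos);
  if (len > avail)
    len = avail;
  memcpy(dst, mb->pos, len);
  mb->pos += len;
  return len;
}
```

Because the stream never refills, the end-of-data check is a single pointer comparison, which is presumably what keeps such an emulation simple and fast.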
Robert Spalek [Mon, 17 Nov 2003 13:09:44 +0000 (13:09 +0000)]
1. db/catalog.gz ---> db/catalog
+ it is not sent to oook and feedback-cat via pipes, but it is read by them as a file
+ it is read in 2 passes and the URLs are identified in the 1st pass (catalog.c)
2. URL fingerprinting always uses cf/url-equiv, even in the indexer
Martin Mares [Sat, 15 Nov 2003 10:41:41 +0000 (10:41 +0000)]
A better function for hashing integers (the old multiplier was completely
bogus as it didn't fit in a 32-bit integer) and also a new function
for hashing pointers.
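A minimal sketch of multiplicative hashing with a multiplier that does fit in 32 bits, plus a pointer hash built on top of it. The function names and the constant (the widely used golden-ratio multiplier) are assumptions chosen for illustration; the commit does not say which multiplier it switched to.

```c
#include <stdint.h>

/* Assumed multiplier: 2654435769 = 0x9E3779B9, which is below 2^32.
 * A constant wider than 32 bits would be silently truncated (or
 * rejected) when used in 32-bit arithmetic. */
#define HASH_MULT_32_SKETCH 2654435769u

/* Multiplicative hash of a 32-bit integer; the product wraps modulo 2^32. */
static inline uint32_t hash_u32_sketch(uint32_t x)
{
  return x * HASH_MULT_32_SKETCH;
}

/* Hash a pointer by folding its bits to 32 and reusing the integer hash.
 * On 32-bit targets the upper half is zero, so the fold is a no-op. */
static inline uint32_t hash_pointer_sketch(const void *p)
{
  uint64_t v = (uint64_t) (uintptr_t) p;
  uint32_t folded = (uint32_t) v ^ (uint32_t) (v >> 32);
  return hash_u32_sketch(folded);
}
```

To index a table of 2^k buckets, one would normally take the top k bits of the result (hash >> (32 - k)), since the high bits of a multiplicative hash are the best mixed.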
Robert Spalek [Thu, 13 Nov 2003 10:43:07 +0000 (10:43 +0000)]
I decided to turn off cf/url-equiv for indexing. However, after the indexer
is run on regular sherlock5, we cannot manually delete this file for the indexer
and restore it for gatherd. So I am creating a new parameter that controls
loading of this prefix table.
Robert Spalek [Mon, 3 Nov 2003 14:35:50 +0000 (14:35 +0000)]
- giant class flag moved from attributes to card-notes
- merger only marks documents by giant flag and
the penalization is done in chewer
- added new weight attribute to cards: Wp means weight after penalization
added penalization notes in the form .Pg-50 (giant class penalized by -50)
- chewer.c: card_write_start() does NOT write struct card_attr and it needs
to be done manually later
- chewer.c: weight records are sorted chronologically, I like it more :-)
Martin Mares [Sat, 11 Oct 2003 10:19:40 +0000 (10:19 +0000)]
Several improvements to the unicode library:
o All tables are now const.
o Redefined the categories:
- now using _U_* instead of _C_*
- introduced _U_LETTER modified with either _U_UPPER or _U_LOWER
or none (titlecase letters, letter modifiers etc.)
o Added the ligature expansions and _U_LIGATURE.
o Minor cleanups.
Martin Mares [Wed, 17 Sep 2003 12:36:44 +0000 (12:36 +0000)]
Replaced enums by #define's in definitions of word, meta and string types.
It's less elegant, but it gives a chance to detect whether a specific type
exists or not.
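The trade-off can be sketched as follows: a macro can be tested with #ifdef, whereas an enumerator is invisible to the preprocessor. All names here (WT_SKETCH_*, HAVE_WT_SKETCH_SMALL, word_type_name_sketch) are hypothetical and do not come from the actual type definitions.

```c
/* Hypothetical type codes defined as macros rather than enumerators. */
#define WT_SKETCH_TEXT 1
#define WT_SKETCH_EMPH 2
/* WT_SKETCH_SMALL is deliberately left undefined in this configuration. */

/* The preprocessor can detect whether a given type exists. */
#ifdef WT_SKETCH_SMALL
#define HAVE_WT_SKETCH_SMALL 1
#else
#define HAVE_WT_SKETCH_SMALL 0
#endif

/* Code handling an optional type is compiled only when the type exists. */
static const char *word_type_name_sketch(int type)
{
  switch (type) {
    case WT_SKETCH_TEXT: return "text";
    case WT_SKETCH_EMPH: return "emph";
#ifdef WT_SKETCH_SMALL
    case WT_SKETCH_SMALL: return "small";
#endif
    default: return "unknown";
  }
}
```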
Martin Mares [Mon, 30 Jun 2003 11:18:57 +0000 (11:18 +0000)]
Several changes mixed into one commit (sorry, the CVS didn't work for a long time):
o Changed index format ID.
o MAX_COMPLEX_LEN went with the rest of complexes.
o Introduced data types and handling macros for context bucket IDs.
o Returned fp_hash() to its original definition -- the previous "fix" was
dead wrong: I confused indexing of bytes with indexing of words.
Also, fp_hash() has to be monotonic wrt. fpsort's order, which the
new one wasn't.
Robert Spalek [Fri, 27 Jun 2003 12:27:49 +0000 (12:27 +0000)]
added tools for stealing translation tables from recode
sanity checks:
- the extracted iso-8859-{1,2} tables are identical to the tables imported
by MJ
- the cp1250 table is quite different from the existing win-1250 table, but I do
not know which one is right
Martin Mares [Wed, 18 Jun 2003 13:07:10 +0000 (13:07 +0000)]
Added a very simple generic implementation of binomial heaps. Their main
virtue is that they are fully dynamic, needing no upper bounds on the
number of items nor frequent reallocations. Their main disadvantage is
the need for 13 bytes per node.
I have implemented only those heap operations I'll use in the gatherer;
I'll add more later.
Martin Mares [Wed, 18 Jun 2003 10:11:12 +0000 (10:11 +0000)]
Minor changes to the RB-tree code:
o Static declarations are much more common, so replace TREE_STATIC
by TREE_GLOBAL (which does just the opposite).
o <xxx> is reserved for system includes, use "xxx" instead.
o Use TREE_TRACE instead of TRACE to avoid collisions with other tracing macros.
Martin Mares [Sun, 15 Jun 2003 20:20:51 +0000 (20:20 +0000)]
Added a straightforward implementation of circular linked lists. They are
a small bit less efficient than our lists.h lists (testing against zero is
faster than testing against list head), but they are nicer and they save
one pointer per list head which makes them better for hash tables etc.
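As a rough sketch of the idea, a circular doubly linked list can use a plain node as its head, so the head costs two pointers and the end of a traversal is detected by comparing against the head instead of against zero. The names below (ring_node, ring_init, ring_add_tail, ...) are hypothetical and are not the identifiers introduced by this commit.

```c
#include <stddef.h>

/* Hypothetical node type; the list head is just another node. */
struct ring_node {
  struct ring_node *next, *prev;
};

/* An empty list is a head that points to itself in both directions. */
static void ring_init(struct ring_node *head)
{
  head->next = head->prev = head;
}

static int ring_empty(const struct ring_node *head)
{
  return head->next == head;
}

/* Insert n just before the head, i.e. at the tail of the list. */
static void ring_add_tail(struct ring_node *head, struct ring_node *n)
{
  n->prev = head->prev;
  n->next = head;
  head->prev->next = n;
  head->prev = n;
}

/* Unlink n; no special cases are needed for the first or last item. */
static void ring_remove(struct ring_node *n)
{
  n->prev->next = n->next;
  n->next->prev = n->prev;
}

/* Traversal stops when it comes back to the head, not at a null pointer. */
static size_t ring_length(const struct ring_node *head)
{
  size_t len = 0;
  for (const struct ring_node *n = head->next; n != head; n = n->next)
    len++;
  return len;
}
```

The comparison against the head in the loop condition is the small extra cost mentioned above; in exchange, insertion and removal have no edge cases and an array of list heads (as in a hash table) stays compact.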