Martin Mares [Wed, 17 Sep 2003 12:36:44 +0000 (12:36 +0000)]
Replaced enums by #define's in definitions of word, meta and string types.
It's less elegant, but it gives a chance to detect whether a specific type
exists or not.
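
A minimal sketch of the detection this enables: the preprocessor can test a
macro with #ifdef, whereas an enum constant is invisible to it. The names
WT_TEXT, WT_EMPH and WT_LINK are hypothetical and not taken from the real
headers.

```c
/* Hypothetical word type definitions; the real names and values may differ. */
#define WT_TEXT 1
#define WT_EMPH 2
/* WT_LINK is deliberately left undefined in this configuration. */

/* Returns 1 if this build defines a link word type, 0 otherwise. */
int have_link_words(void)
{
#ifdef WT_LINK
  return 1;
#else
  return 0;
#endif
}

/* Returns 1 if this build defines an emphasized word type, 0 otherwise. */
int have_emph_words(void)
{
#ifdef WT_EMPH
  return 1;
#else
  return 0;
#endif
}
```

With an enum, both functions would need some other mechanism, since #ifdef
cannot see enumerators.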
Martin Mares [Mon, 30 Jun 2003 11:18:57 +0000 (11:18 +0000)]
Several changes mixed into one commit (sorry, the CVS didn't work for a long time):
o Changed index format ID.
o MAX_COMPLEX_LEN went with the rest of complexes.
o Introduced data types and handling macros for context bucket ID's.
o Returned fp_hash() to its original definition -- the previous "fix" was
dead wrong: I confused indexing of bytes with indexing of words.
Also, fp_hash() has to be monotonic wrt. fpsort's order, which the
new one wasn't.
Robert Spalek [Fri, 27 Jun 2003 12:27:49 +0000 (12:27 +0000)]
added tools for stealing translation tables from recode
sanity checks:
- after extraction, the iso-8859-{1,2} tables are identical to the tables
imported by MJ
- the cp1250 table is quite different from the existing win-1250 table, but I do
not know which one is right
Martin Mares [Wed, 18 Jun 2003 13:07:10 +0000 (13:07 +0000)]
Added a very simple generic implementation of binomial heaps. Their main
virtue is that they are fully dynamic, needing no upper bounds on the
number of items nor frequent reallocations. Their main disadvantage is
that they need 13 bytes per node.
I have implemented only those heap operations I'll use in the gatherer;
I'll add more later.
Martin Mares [Wed, 18 Jun 2003 10:11:12 +0000 (10:11 +0000)]
Minor changes to the RB-tree code:
o Static declarations are much more common, so replace TREE_STATIC
by TREE_GLOBAL (which does just the opposite).
o <xxx> is reserved for system includes, use "xxx" instead.
o Use TREE_TRACE instead of TRACE to avoid collisions with other tracing macros.
Martin Mares [Sun, 15 Jun 2003 20:20:51 +0000 (20:20 +0000)]
Added a straightforward implementation of circular linked lists. They are
a small bit less efficient than our lists.h lists (testing against zero is
faster than testing against list head), but they are nicer and they save
one pointer per list head which makes them better for hash tables etc.
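
A minimal sketch of such a circular list, assuming a doubly linked layout in
which the head is just another node. The names cnode and clist_* are
hypothetical, not the ones used in the tree.

```c
#include <stddef.h>

/* Hypothetical node type; the list head is an ordinary node (two pointers). */
struct cnode {
  struct cnode *next, *prev;
};

/* An empty list is a head pointing to itself. */
void clist_init(struct cnode *head)
{
  head->next = head->prev = head;
}

/* Insert n at the tail, i.e. just before the head. */
void clist_add_tail(struct cnode *head, struct cnode *n)
{
  n->next = head;
  n->prev = head->prev;
  head->prev->next = n;
  head->prev = n;
}

/* Unlink n; the first and last nodes need no special cases. */
void clist_remove(struct cnode *n)
{
  n->prev->next = n->next;
  n->next->prev = n->prev;
}

/* Count the nodes; the walk ends on comparison with the head, not with NULL. */
size_t clist_length(struct cnode *head)
{
  size_t len = 0;
  for (struct cnode *p = head->next; p != head; p = p->next)
    len++;
  return len;
}
```

The loop in clist_length() shows the cost mentioned above: termination is
a comparison against the list head rather than a test against zero.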
Martin Mares [Wed, 11 Jun 2003 13:50:09 +0000 (13:50 +0000)]
Functions working with tagged characters moved from index.h to a new
header file tagged-text.h. This also revealed a couple of unintentional
indirect includes.
Martin Mares [Wed, 11 Jun 2003 13:26:04 +0000 (13:26 +0000)]
Split URL fingerprinting inside indexer from the other fingerprints.
URL fingerprints will include server equivalence mappings and other
such hacks (for now the "www." hack), the other fingerprints (used
e.g. for hashing of strings in the index) won't.
Martin Mares [Wed, 11 Jun 2003 13:03:30 +0000 (13:03 +0000)]
Oops, the hash function for fingerprints was terribly biased. There should
be XOR, not OR. Also, the shifts are meaningless, because the fingerprint
hash is believed to be very well distributed.
Beware, this means that the current mainline is incompatible with string
indices generated by v2.4! For now, I'm not increasing the index version,
because word matching still works with old indices and I want to profile it.
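
A small sketch of why OR biases such a hash, assuming a fingerprint split
into two 32-bit halves. The struct layout and function names are hypothetical
and this is not the actual fp_hash(); it only illustrates the bias argument.

```c
#include <stdint.h>

/* Hypothetical fingerprint layout: two 32-bit halves. */
struct fp {
  uint32_t lo, hi;
};

/* Biased: a result bit is set whenever either input bit is set, so for
 * uniformly random inputs each bit is 1 with probability 3/4. */
uint32_t fp_hash_or(const struct fp *f)
{
  return f->lo | f->hi;
}

/* Unbiased: for uniformly random inputs each result bit is 1 with
 * probability 1/2. */
uint32_t fp_hash_xor(const struct fp *f)
{
  return f->lo ^ f->hi;
}

/* Count the set bits in x (clears the lowest set bit on every iteration). */
unsigned popcount32(uint32_t x)
{
  unsigned n = 0;
  while (x) {
    x &= x - 1;
    n++;
  }
  return n;
}
```

Comparing popcount32() of both variants over many fingerprints would show
the OR version drifting towards all-ones values.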
Martin Mares [Wed, 4 Jun 2003 19:31:10 +0000 (19:31 +0000)]
Make sherlockd calculate per-filetype number of matched documents, including
those failing the FILETYPE filter. This breaks the nice abstraction of hiding
all filtering under EXTENDED_ATTRS, but it will allow us to get rid of lots
of STATS queries.
Martin Mares [Fri, 11 Apr 2003 17:04:02 +0000 (17:04 +0000)]
Added a new PURE attribute which means "this function can read global variables,
but it doesn't have any side effects" as opposed to CONST which promises
that no global variables will be touched.
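
A sketch of the distinction, assuming the macros map to the GCC function
attributes "pure" and "const" (a GCC/Clang extension). The macro definitions
and both example functions are assumptions made for illustration.

```c
/* Assumed definitions; the real macros may be spelled differently. */
#define PURE __attribute__((pure))
#define CONST __attribute__((const))

/* Global state that a PURE function may read. */
unsigned table_size = 1024;

/* CONST: the result depends on the arguments only. */
unsigned square(unsigned x) CONST;

/* PURE: reads the global table_size, but has no side effects. */
unsigned bucket_of(unsigned hash) PURE;

unsigned square(unsigned x)
{
  return x * x;
}

unsigned bucket_of(unsigned hash)
{
  return hash % table_size;
}
```

The compiler may merge repeated calls to square(5) unconditionally, while
calls to bucket_of() can only be merged when nothing that might change
table_size happens between them.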
Martin Mares [Mon, 24 Mar 2003 18:24:37 +0000 (18:24 +0000)]
More changes to the custom attribute mechanism:
o Introduced the concept of extended attributes which consist of the
custom attributes and some internally defined attributes handled
in the same way.
o FILETYPE and LANG are now extended attributes and they don't depend
on the customization module. This is probably much cleaner as it
reduces the overlap between custom parts and generic parts.
o No problems with linking liblang because of LANG attribute.
o idxdump got simplified.
Martin Mares [Wed, 5 Mar 2003 18:09:05 +0000 (18:09 +0000)]
Fixed a bug in the bucket shakedown code: it crashed with a mysterious error
message ("Unexpected EOF") whenever it met a bucket larger than the shakedown
buffer, even a deleted one. This was the cause of the latest crash on
sherlock5: shakedown on a corrupted database patched the corruption by
a large deleted bucket and when I ran it again, it crashed again due to this
bucket.
Now we are able to cope with deleted buckets of any size and when we
encounter an oversized non-deleted bucket, we bail out with a proper
error message.
Martin Mares [Fri, 28 Feb 2003 16:51:14 +0000 (16:51 +0000)]
Put back the work-around for objects generated by an old version of the
gatherer, because some of them still haven't expired from the db.
At least the work-around is cleaner this time.
Martin Mares [Fri, 28 Feb 2003 14:21:24 +0000 (14:21 +0000)]
Changed processing of configuration files.
run/cf is no longer a symlink to ../cf, I've replaced it by make
rules which generate the configuration files in run/cf by preprocessing
those in cf according to CONFIG_xxx switches in config.mk (in the same
way as we already do in mkdist).
I'd like to migrate many settings local to the Centrum configs to the
main CVS without having to update several separate copies of the config.
*** CAVEAT *** After updating to this version, you need to either
make distclean or
rm run/cf
mkdir run/cf
manually _before_ running make, else you could lose your config files.
Martin Mares [Wed, 5 Feb 2003 18:15:24 +0000 (18:15 +0000)]
Tried to use libm for calculating logarithmic frequencies of words,
but ran into problems with function name collisions. Damn C's flat
namespace!
Renamed our log to log_msg, but keep the original name as a macro
expanding to the new one. Also renamed log2 (which is currently not used
anywhere) to fls (find last set, akin to ffs).
Introduced lib/math.h which is a wrapper around <math.h> handling
name collisions by clever macro tricks.
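
A minimal sketch of a "find last set" helper as described in the rename
above. It assumes the ffs convention (1-based bit index, 0 for a zero
argument); the real fls may count from zero instead. The name fls_sketch is
hypothetical, chosen to avoid clashing with any system fls.

```c
/* Returns the 1-based position of the highest set bit of x, or 0 if x == 0.
 * The ffs-like convention is an assumption, not taken from the source. */
unsigned fls_sketch(unsigned x)
{
  unsigned pos = 0;
  while (x) {
    pos++;
    x >>= 1;
  }
  return pos;
}
```

Under this convention, fls_sketch(x) - 1 equals the integer part of the
base-2 logarithm for any non-zero x, which is what the old name log2
suggested.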
Martin Mares [Mon, 27 Jan 2003 13:49:10 +0000 (13:49 +0000)]
Added another version of bgets() which doesn't die on too long lines and
reports an error instead.
I wrote it originally for the new http.c, but the required http.c changes were
going to be too extensive, so I postponed them. The function is currently
unused, but probably worth saving for the future.
Also optimized the existing bgets functions a bit.
Martin Mares [Wed, 22 Jan 2003 21:24:52 +0000 (21:24 +0000)]
Added obj_write_nocheck which writes the object as quickly as possible,
avoiding checks for strange chars which are probably useful only in the
gatherer anyway.
Martin Mares [Wed, 22 Jan 2003 18:07:24 +0000 (18:07 +0000)]
Replaced various attempts to speed up use of obj_add_attr() by simple
internal caching: odes->cached_attr points to the last attribute added
and it's guaranteed to be the last in its chain.
Removed oattr->last_same, the gain isn't worth the extra complexity
involved.
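
A rough sketch of the caching idea, assuming attributes are kept in a list of
chains (one chain per attribute name). The struct layouts and the function
name obj_add_attr_sketch are assumptions; only the cached_attr invariant comes
from the description above.

```c
#include <stdlib.h>

/* Assumed layout: attributes with different names are linked via next,
 * values of the same name are chained via same. */
struct oattr {
  struct oattr *next;
  struct oattr *same;
  char attr;
  const char *val;
};

struct odes {
  struct oattr *attrs;
  struct oattr *cached_attr;  /* last attribute added, last in its chain */
};

/* Append value v of attribute x; returns the new node or NULL on failure. */
struct oattr *obj_add_attr_sketch(struct odes *o, char x, const char *v)
{
  struct oattr *a = calloc(1, sizeof(*a));
  if (!a)
    return NULL;
  a->attr = x;
  a->val = v;

  struct oattr *b = o->cached_attr;
  if (!b || b->attr != x) {
    /* Cache miss: look up the chain for x, remembering where to link. */
    struct oattr **pp = &o->attrs;
    for (b = o->attrs; b && b->attr != x; b = b->next)
      pp = &b->next;
    if (!b) {
      /* First value of this attribute: start a new chain. */
      *pp = a;
      o->cached_attr = a;
      return a;
    }
    /* Walk to the end of the existing chain. */
    while (b->same)
      b = b->same;
  }
  /* Here b is the last node of the chain for x. */
  b->same = a;
  o->cached_attr = a;
  return a;
}
```

Runs of values for the same attribute hit the cache and are appended in
constant time; only a switch to another attribute pays for the lookup.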
Martin Mares [Wed, 22 Jan 2003 15:51:03 +0000 (15:51 +0000)]
The $(LIBxxx) mechanism proved useful, so I'm switching to it for all other
libraries to simplify the Makefiles a bit. Unfortunately, this introduces
ugly ordering constraints on includes in top-level Makefile, but they can
be lived with.
Martin Mares [Wed, 22 Jan 2003 11:23:19 +0000 (11:23 +0000)]
More configuration enhancements:
o gatherer, indexer and search server can be left out, which can be useful
when using Sherlock for indexing databases, because unusual custom.h
with standard word types missing makes many gatherer modules uncompilable.
o searching by document age is optional, you can switch it off to save
index space.
o indexing of file types is now partially supported by the default configuration,
because I'm going to use the bottom 5 bits of the file_type (which were
used only for images) for storing language code of text documents and
it certainly isn't a centrum-specific thing. On the other hand, I'd like
to keep the exact meaning of file type codes application specific, so the
actual matching of file types is left in the customization header. Again,
you can switch this off to save index space.
Martin Mares [Sun, 5 Jan 2003 11:32:02 +0000 (11:32 +0000)]
When killing dots at the end of a host name, remove _all_ of them, not just
the last one. Without this, url_canonicalize could still change names already
believed to be canonical, which causes havoc in gatherd.
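
A minimal sketch of the fixed behaviour, assuming the host name is a mutable
NUL-terminated string. The helper strip_trailing_dots is hypothetical and not
part of url_canonicalize itself.

```c
#include <string.h>

/* Removes every trailing '.' from host in place and returns the new length.
 * Running it again on the result changes nothing. */
size_t strip_trailing_dots(char *host)
{
  size_t len = strlen(host);
  while (len > 0 && host[len - 1] == '.')
    host[--len] = '\0';
  return len;
}
```

Stripping only one dot would turn "host.." into "host.", which a second
canonicalization pass would change again; the loop avoids that.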