Martin Mares [Wed, 11 Jun 2003 13:50:09 +0000 (13:50 +0000)]
Functions working with tagged characters moved from index.h to a new
header file tagged-text.h. This also revealed a couple of unintentional
indirect includes.
Martin Mares [Wed, 11 Jun 2003 13:26:04 +0000 (13:26 +0000)]
Split URL fingerprinting inside indexer from the other fingerprints.
URL fingerprints will include server equivalence mappings and other
such hacks (for now the "www." hack), the other fingerprints (used
e.g. for hashing of strings in the index) won't.
Martin Mares [Wed, 11 Jun 2003 13:03:30 +0000 (13:03 +0000)]
Oops, the hash function for fingerprints was terribly biased. There should
be XOR, not OR. Also, the shifts are meaningless, because the fingerprint
hash is believed to be very well distributed.
Beware, this means that the current mainline is incompatible with string
indices generated by v2.4! For now, I'm not increasing the index version,
because word matching still works with old indices and I want to profile it.
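To illustrate the bias, here is a minimal sketch. The 12-byte fingerprint layout, the struct name and both function names are assumptions made for this example, not the actual Sherlock code. OR-ing the words can only set bits, so the results crowd towards all-ones, while XOR of well-distributed words stays well distributed without any shifting.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FP_BYTES 12                     /* assumed fingerprint size */
#define FP_WORDS (FP_BYTES / 4)

struct fp { uint8_t hash[FP_BYTES]; };  /* hypothetical layout */

/* Biased variant: a result bit is 0 only if it is 0 in all three words. */
static uint32_t fp_hash_or(const struct fp *f)
{
  uint32_t w[FP_WORDS];
  assert(f != NULL);
  memcpy(w, f->hash, sizeof(w));        /* copy out to avoid unaligned reads */
  return w[0] | w[1] | w[2];
}

/* Fixed variant: XOR keeps every result bit equally likely to be 0 or 1. */
static uint32_t fp_hash_xor(const struct fp *f)
{
  uint32_t w[FP_WORDS];
  assert(f != NULL);
  memcpy(w, f->hash, sizeof(w));
  return w[0] ^ w[1] ^ w[2];
}
```

With uniformly random input words, a given bit of the OR result is set with probability 7/8, versus 1/2 for XOR.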
Martin Mares [Wed, 4 Jun 2003 19:31:10 +0000 (19:31 +0000)]
Make sherlockd calculate per-filetype number of matched documents, including
those failing the FILETYPE filter. This breaks the nice abstraction of hiding
all filtering under EXTENDED_ATTRS, but it will allow us to get rid of lots
of STATS queries.
Martin Mares [Fri, 11 Apr 2003 17:04:02 +0000 (17:04 +0000)]
Added a new PURE attribute which means "this function can read global variables,
but it doesn't have any side effects" as opposed to CONST which promises
that no global variables will be touched.
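A minimal sketch of the distinction, assuming the macros map onto the GCC function attributes of the same meaning (the macro definitions and the example functions below are illustrative, not copied from the tree):

```c
#include <assert.h>

#if defined(__GNUC__)
#define PURE  __attribute__((pure))     /* may read globals, no side effects */
#define CONST __attribute__((const))    /* depends on its arguments only */
#else
#define PURE
#define CONST
#endif

static unsigned int table_size = 64;    /* global state read by bucket_of() */

static unsigned int bucket_of(unsigned int hash) PURE;
static unsigned int square(unsigned int x) CONST;

/* PURE: the result depends on the argument and on the global table_size. */
static unsigned int bucket_of(unsigned int hash)
{
  return hash % table_size;
}

/* CONST: the result depends on the argument alone. */
static unsigned int square(unsigned int x)
{
  return x * x;
}

/* Neither PURE nor CONST: it modifies global state. */
static void set_table_size(unsigned int n)
{
  assert(n > 0);
  table_size = n;
}
```

The compiler may merge repeated calls to square(x) freely, but calls to bucket_of(h) only as long as no store to global memory can intervene.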
Martin Mares [Mon, 24 Mar 2003 18:24:37 +0000 (18:24 +0000)]
More changes to the custom attribute mechanism:
o Introduced the concept of extended attributes which consist of the
custom attributes and some internally defined attributes handled
in the same way.
o FILETYPE and LANG are now extended attributes and they don't depend
on the customization module. This is probably much cleaner as it
reduces the overlap between custom parts and generic parts.
o No problems with linking liblang because of LANG attribute.
o idxdump got simplified.
Martin Mares [Wed, 5 Mar 2003 18:09:05 +0000 (18:09 +0000)]
Fixed a bug in the bucket shakedown code: it crashed with a mysterious error
message ("Unexpected EOF") whenever it met a bucket larger than the shakedown
buffer, deleted buckets included. That was the cause of the latest crash on
sherlock5: a shakedown on a corrupted database patched the corruption with
a large deleted bucket, and when I ran it again, it crashed on that very
bucket.
Now we are able to cope with deleted buckets of any size and when we
encounter an oversized non-deleted bucket, we bail out with a proper
error message.
Martin Mares [Fri, 28 Feb 2003 16:51:14 +0000 (16:51 +0000)]
Put back the work-around for objects generated by an old version of the
gatherer, because some of them still haven't expired from the db.
At least the work-around is cleaner this time.
Martin Mares [Fri, 28 Feb 2003 14:21:24 +0000 (14:21 +0000)]
Changed processing of configuration files.
run/cf is no longer a symlink to ../cf, I've replaced it by make
rules which generate the configuration files in run/cf by preprocessing
those in cf according to CONFIG_xxx switches in config.mk (in the same
way as we already do in mkdist).
I'd like to migrate many settings local to the Centrum configs to the
main CVS without having to update several separate copies of the config.
*** CAVEAT *** After updating to this version, you need to either
make distclean or
rm run/cf
mkdir run/cf
manually _before_ running make, else you could lose your config files.
Martin Mares [Wed, 5 Feb 2003 18:15:24 +0000 (18:15 +0000)]
Tried to use libm for calculating logarithmic frequencies of words,
but ran into problems with function name collisions. Damn C's flat
namespace!
Renamed our log to log_msg, but kept the original name as a macro
expanding to the new one. Also renamed log2 (which is currently not used
anywhere) to fls (find last set, akin to ffs).
Introduced lib/math.h which is a wrapper around <math.h> handling
name collisions by clever macro tricks.
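For readers unfamiliar with the term, a "find last set" function can be sketched as follows. The name fls_sketch and the 1-based return convention (borrowed from ffs, with 0 meaning no bit is set) are assumptions for this example; the real fls in the tree may use a different convention.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the 1-based position of the highest set bit, or 0 if x == 0. */
static int fls_sketch(uint32_t x)
{
  int pos = 0;
  while (x) {           /* shift right until no bits remain */
    pos++;
    x >>= 1;
  }
  assert(pos >= 0 && pos <= 32);
  return pos;
}
```

For x > 0, fls_sketch(x) - 1 is the integer part of the base-2 logarithm of x.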
Martin Mares [Mon, 27 Jan 2003 13:49:10 +0000 (13:49 +0000)]
Added another version of bgets() which doesn't die on overlong lines and
reports an error instead.
I wrote it originally for new http.c, but the required http.c changes were
going to be too extensive, so I postponed the changes and this function
is currently unused, but probably worth saving for the future.
Also optimized the existing bgets functions a bit.
Martin Mares [Wed, 22 Jan 2003 21:24:52 +0000 (21:24 +0000)]
Added obj_write_nocheck which writes the object as quickly as possible,
avoiding checks for strange chars which are probably useful only in the
gatherer anyway.
Martin Mares [Wed, 22 Jan 2003 18:07:24 +0000 (18:07 +0000)]
Replaced various attempts to speed up use of obj_add_attr() by simple
internal caching: odes->cached_attr points to the last attribute added
and it's guaranteed to be the last in its chain.
Removed oattr->last_same, the gain isn't worth the extra complexity
involved.
Martin Mares [Wed, 22 Jan 2003 15:51:03 +0000 (15:51 +0000)]
The $(LIBxxx) mechanism proved useful, so I'm switching to it for all other
libraries to simplify the Makefiles a bit. Unfortunately, this introduces
ugly ordering constraints on includes in top-level Makefile, but they can
be lived with.
Martin Mares [Wed, 22 Jan 2003 11:23:19 +0000 (11:23 +0000)]
More configuration enhancements:
o gatherer, indexer and search server can be left out, which can be useful
when using Sherlock for indexing databases, because an unusual custom.h
with standard word types missing makes many gatherer modules uncompilable.
o searching by document age is optional, you can switch it off to save
index space.
o indexing of file types is now partially supported by the default configuration,
because I'm going to use the bottom 5 bits of the file_type (which were
used only for images) for storing language code of text documents and
it certainly isn't a centrum-specific thing. On the other hand, I'd like
to keep the exact meaning of file type codes application specific, so the
actual matching of file types is left in the customization header. Again,
you can switch this off to save index space.
Martin Mares [Sun, 5 Jan 2003 11:32:02 +0000 (11:32 +0000)]
When killing dots at the end of a host name, remove _all_ of them, not just
the last one. Without this, url_canonicalize could still change names already
believed to be canonical, which caused havoc in gatherd.
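The fix boils down to a loop instead of a single test. A minimal sketch, with a hypothetical helper name and operating on a bare host name rather than on the URL structures the real code uses:

```c
#include <assert.h>
#include <string.h>

/* Strips every trailing dot from host, in place. */
static void kill_trailing_dots(char *host)
{
  size_t len;
  assert(host != NULL);
  len = strlen(host);
  while (len > 0 && host[len - 1] == '.')
    host[--len] = 0;
}
```

Calling it a second time on its own output changes nothing, which is the property the canonicalizer needs.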
Martin Mares [Sun, 27 Oct 2002 20:06:47 +0000 (20:06 +0000)]
Worked around problems with "www.xyz.cz" and "xyz.cz" being considered identical
in the gatherer and different in the indexer by adding a hack to calculation
of fingerprints (we cannot afford calling filters for each fingerprint,
one reason being speed, another that filters are unavailable in the search
server). Closes bug #302.
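The log message does not spell out the hack. One plausible form, shown here purely as an assumption, is to skip a leading "www." in the host name before the fingerprint is computed; the helper name is hypothetical and the host is assumed to be lowercased already.

```c
#include <assert.h>
#include <string.h>

/* Returns host without a leading "www.", or host itself if there is none. */
static const char *skip_www(const char *host)
{
  assert(host != NULL);
  if (!strncmp(host, "www.", 4) && host[4])   /* keep a bare "www." intact */
    return host + 4;
  return host;
}
```

Under this sketch, "www.xyz.cz" and "xyz.cz" feed the same string to the fingerprint function and therefore get the same fingerprint.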
Martin Mares [Sun, 27 Oct 2002 13:05:14 +0000 (13:05 +0000)]
Several bug fixes in the logger:
o No more hard limits on log name length.
o If an error occurs during log switching, don't try it again.
o Writes to log files are really atomic, we no longer rely on the stdio
buffer being large enough (which it isn't).
o Log entries are scanned for control characters which are then mapped to 0x7f.
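The control character mapping from the last item can be sketched like this. The function name is hypothetical, and treating every byte below 0x20 as a control character is an assumption; the entry is taken to be a single line without its terminating newline.

```c
#include <assert.h>

/* Replaces each control character in the log entry by 0x7f, in place. */
static void sanitize_log_entry(char *entry)
{
  unsigned char *p;
  assert(entry != 0);
  for (p = (unsigned char *) entry; *p; p++)
    if (*p < 0x20)
      *p = 0x7f;
}
```

This keeps an embedded newline or escape sequence in logged data from forging extra log lines or confusing a terminal.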
Martin Mares [Fri, 11 Oct 2002 11:32:36 +0000 (11:32 +0000)]
Added support for shared libraries (CONFIG_SHARED switch in config.mk).
The makefiles are now able to build both static and shared libraries,
objects for shared libraries get suffix '.oo'. Use $(LS) in all references
to libraries, it expands to `.so' if CONFIG_SHARED, to `.a' otherwise.
Martin Mares [Sun, 6 Oct 2002 15:45:07 +0000 (15:45 +0000)]
When bopen() is called with buffer size 0, it switches to bopen_mm().
The plan is to make use of mmapping configurable by the buffer sizes
in cf/sherlock.
Martin Mares [Tue, 24 Sep 2002 21:38:21 +0000 (21:38 +0000)]
After a lot of benchmarking, replaced the old super-smart bbcopy()
by a much simpler solution based on the bdirect interface and inlined
the fast path. Surprisingly, the new version is faster under real load
(the explanation is very simple: we use very large buffers for the
indexer and hence the bbcopy optimizations triggered rarely) and it also
works on all fastbuf streams, not only file-based ones.
Martin Mares [Mon, 23 Sep 2002 12:10:25 +0000 (12:10 +0000)]
Adapted the bucket code to new fastbufs. Stream positions are now
always relative to bucket start (originally they were relative to file
start, which was completely unusable, and hence unused :-) ).
Martin Mares [Mon, 23 Sep 2002 12:07:15 +0000 (12:07 +0000)]
Major cleanup of fastbufs:
o Split generic fastbuf from low-level routines. `struct fastbuf'
no longer contains low-level data like `fd' or `is_temp_file'.
o Introduced safe type casting macros to avoid programming errors.
o `struct fastbuf' is no longer freed by the high-level code.
o Documented behaviour of bflush() between reads and writes.
o Redefined semantics of fastbuf->pos: it now corresponds to `bstop'
instead of `buffer', hence it always coincides with real file
position, making `fdpos' unnecessary.
Martin Mares [Thu, 19 Sep 2002 18:26:37 +0000 (18:26 +0000)]
Robert> ugh, at first glance I don't understand any of it, it's probably too clever :)
So I decided to learn how to use POD and write POD documentation
for the module (in the usual Perl fashion, it's a part of the module).
Use perldoc or pod2${format} to view or convert it.