Martin Mares [Wed, 22 Jan 2003 18:07:24 +0000 (18:07 +0000)]
Replaced various attempts to speed up use of obj_add_attr() by simple
internal caching: odes->cached_attr points to the last attribute added
and it's guaranteed to be the last in its chain.
Removed oattr->last_same, the gain isn't worth the extra complexity
involved.
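A minimal sketch of the caching idea, for illustration only. The structure
and function names (sk_attr, sk_obj, sk_add_attr) are hypothetical
simplifications, not the real odes/oattr definitions; only the cached_attr
invariant is taken from the description above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the real object structures. */
struct sk_attr {
  struct sk_attr *next;   /* first value of the next attribute name */
  struct sk_attr *same;   /* next value with the same name */
  int name;
  const char *val;
};

struct sk_obj {
  struct sk_attr *attrs;        /* head of the attribute list */
  struct sk_attr *cached_attr;  /* last value added; always the tail of its chain */
};

/* Append a value (the string is not copied). Returns NULL if allocation fails. */
struct sk_attr *sk_add_attr(struct sk_obj *o, int name, const char *val)
{
  assert(o != NULL);
  struct sk_attr *b = calloc(1, sizeof(*b));
  if (!b)
    return NULL;
  b->name = name;
  b->val = val;

  struct sk_attr *a = o->cached_attr;
  if (a && a->name == name) {
    /* Fast path: the cached node is the tail of the right chain. */
    a->same = b;
  } else if (!o->attrs) {
    /* Empty object: the new node becomes the list head. */
    o->attrs = b;
  } else {
    /* Slow path: find the chain for this name, or the end of the list. */
    for (a = o->attrs; a->name != name && a->next; a = a->next)
      ;
    if (a->name == name) {
      while (a->same)
        a = a->same;
      a->same = b;
    } else {
      a->next = b;
    }
  }
  /* The new node is the last one in its chain, so the invariant holds. */
  o->cached_attr = b;
  return b;
}
```

Repeated additions of the same attribute hit the fast path and need no list
walk, which is presumably the common case the cache is meant for.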
Martin Mares [Wed, 22 Jan 2003 15:51:03 +0000 (15:51 +0000)]
The $(LIBxxx) mechanism proved useful, so I'm switching to it for all other
libraries to simplify the Makefiles a bit. Unfortunately, this introduces
ugly ordering constraints on includes in top-level Makefile, but they can
be lived with.
Martin Mares [Wed, 22 Jan 2003 11:23:19 +0000 (11:23 +0000)]
More configuration enhancements:
o gatherer, indexer and search server can be left out, which can be useful
when using Sherlock for indexing databases, because unusual custom.h
with standard word types missing makes many gatherer modules uncompilable.
o searching by document age is optional, you can switch it off to save
index space.
o indexing of file types is now partially supported by the default configuration,
because I'm going to use the bottom 5 bits of the file_type (which were
used only for images) for storing language code of text documents and
it certainly isn't a centrum-specific thing. On the other hand, I'd like
to keep the exact meaning of file type codes application specific, so the
actual matching of file types is left in the customization header. Again,
you can switch this off to save index space.
Martin Mares [Sun, 5 Jan 2003 11:32:02 +0000 (11:32 +0000)]
When killing dots at the end of host name, remove _all_ of them, not just
the last one. Without this, url_canonicalize could still change names that
were already believed to be canonical, which causes havoc in gatherd.
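A sketch of the dot-stripping step, assuming a NUL-terminated host name
edited in place. The helper sk_strip_trailing_dots is hypothetical and is
not the real url_canonicalize code; it only shows why removing every
trailing dot makes a second pass a no-op.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: remove every trailing dot from a host name, in place. */
void sk_strip_trailing_dots(char *host)
{
  assert(host != NULL);
  size_t len = strlen(host);
  /* Loop instead of a single check, so "host.." loses both dots. */
  while (len > 0 && host[len - 1] == '.')
    host[--len] = '\0';
}
```

With a single check, "www.example.cz.." would become "www.example.cz." on
the first call and change again on the second; with the loop, the second
call leaves the name untouched.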
Martin Mares [Sun, 27 Oct 2002 20:06:47 +0000 (20:06 +0000)]
Worked around problems with "www.xyz.cz" and "xyz.cz" being considered identical
in the gatherer and different in the indexer by adding a hack to calculation
of fingerprints (we cannot afford calling filters for each fingerprint,
one reason being speed, another that filters are unavailable in the search
server). Closes bug #302.
Martin Mares [Sun, 27 Oct 2002 13:05:14 +0000 (13:05 +0000)]
Several bug fixes in the logger:
o No more hard limits on log name length.
o If an error occurs during log switching, don't try it again.
o Writes to log files are really atomic, we no longer rely on the stdio
buffer being large enough (which it isn't).
o Log entries are scanned for control characters which are then mapped to 0x7f.
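A sketch of the control-character mapping from the last item, for
illustration only. The function name sk_sanitize_log_entry is hypothetical,
and treating every byte below 0x20 as a control character is an assumption;
the real logger may classify characters differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: replace control characters in a log entry by 0x7f.
 * Assumption: "control character" means any byte below 0x20. */
void sk_sanitize_log_entry(char *buf, size_t len)
{
  assert(buf != NULL || len == 0);
  for (size_t i = 0; i < len; i++) {
    unsigned char c = (unsigned char) buf[i];
    if (c < 0x20)
      buf[i] = 0x7f;
  }
}
```

Mapping in place keeps the entry length unchanged, so an embedded newline
can no longer split one entry into two lines of the log.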
Martin Mares [Fri, 11 Oct 2002 11:32:36 +0000 (11:32 +0000)]
Added support for shared libraries (CONFIG_SHARED switch in config.mk).
The makefiles are now able to build both static and shared libraries,
objects for shared libraries get suffix '.oo'. Use $(LS) in all references
to libraries, it expands to `.so' if CONFIG_SHARED, to `.a' otherwise.
Martin Mares [Sun, 6 Oct 2002 15:45:07 +0000 (15:45 +0000)]
When bopen() is called with buffer size 0, it switches to bopen_mm().
The plan is to make use of mmapping configurable by the buffer sizes
in cf/sherlock.
Martin Mares [Tue, 24 Sep 2002 21:38:21 +0000 (21:38 +0000)]
After a lot of benchmarking replaced the old super-smart bbcopy()
by a much simpler solution based on the bdirect interface and inlined
the fast path. Surprisingly, the new version is faster under real load
(the explanation is very simple: we use very large buffers for the
indexer and hence the bbcopy optimizations triggered rarely) and it also
works on all fastbuf streams, not only file-based ones.
Martin Mares [Mon, 23 Sep 2002 12:10:25 +0000 (12:10 +0000)]
Adapted the bucket code to new fastbufs. Stream positions are now
always relative to bucket start (originally they were relative to file
start, which was completely unusable, and hence unused :-) ).
Martin Mares [Mon, 23 Sep 2002 12:07:15 +0000 (12:07 +0000)]
Major cleanup of fastbufs:
o Split generic fastbuf from low-level routines. `struct fastbuf'
no longer contains low-level data like `fd' or `is_temp_file'.
o Introduced safe type casting macros to avoid programming errors.
o `struct fastbuf' is no longer freed by the high-level code.
o Documented behaviour of bflush() between reads and writes.
o Redefined semantics of fastbuf->pos: it now corresponds to `bstop'
instead of `buffer', hence it always coincides with real file
position, making `fdpos' unnecessary.
Martin Mares [Thu, 19 Sep 2002 18:26:37 +0000 (18:26 +0000)]
Robert> phew, at first glance I don't understand anything, it's probably too clever :)
So I decided to learn how to use POD and write a POD documentation
for the module (in usual Perl fashion, it's a part of the module).
Use perldoc or pod2${format} to view or convert it.
Martin Mares [Mon, 2 Sep 2002 19:38:09 +0000 (19:38 +0000)]
Added a simple Perl module for connecting to search server and parsing
its results to Perl data structures, converting nested structures and
multiple-valued attributes to arrays.
Also includes the print_tree function, which was originally written
as a simple debugging dumper for the parsed query results, but in fact
it's able to dump any complex Perl data structure as long as it's
acyclic.
More to come, including an example (a very simple front-end for the
free version and maybe some more debugging tools).
Martin Mares [Tue, 20 Aug 2002 18:30:54 +0000 (18:30 +0000)]
Added license notices to all library files which are not specific
to Sherlock (and are often shared with other projects) -- they will
be distributed according to the LGPL.
Martin Mares [Tue, 20 Aug 2002 18:14:50 +0000 (18:14 +0000)]
Finally found the cause of make remaking unnecessary files after complete
building from scratch: when it compiles anything by combining several
pattern rules (which is exactly what we use), it deletes some of the intermediate
files. The fix is to specify all of these files as ".SECONDARY", but beware,
these special targets don't understand patterns, so we have to list the
intermediates explicitly. Uff.
Martin Mares [Fri, 12 Jul 2002 02:19:23 +0000 (02:19 +0000)]
WORD_TYPES_HIDDEN shouldn't be considered META by default.
WT_LINK shouldn't be considered accent-less. This might cause sherlockd
to fail to find matches in link texts from non-accented documents to
accented ones, but I think that it's more acceptable than producing
false matches. Unfortunately, we have no way to describe the accentedness
of a part of document text.
Martin Mares [Sat, 6 Jul 2002 03:29:41 +0000 (03:29 +0000)]
Increase line buffer sizes to 4096 bytes. Current gatherd really can
produce such long lines under several circumstances, need to examine
how that is possible.
Martin Mares [Fri, 5 Jul 2002 03:23:13 +0000 (03:23 +0000)]
When an inconsistency is encountered while shaking down the bucket
file, recover all data prior to the inconsistency by marking the
space between the read and write pointers as deleted buckets (need to
use more of them if the space is too large).
Martin Mares [Sun, 23 Jun 2002 20:32:19 +0000 (20:32 +0000)]
Implemented merging of catalog attributes to the index. Just place the
catalog dump to db/catalog.gz (e.g., by running utils/fetch-cat.sh)
and run the indexer.
Unfortunately, we've just filled up all the available word types :-(