Robert Spalek [Mon, 14 Jun 2004 09:58:37 +0000 (09:58 +0000)]
debugged, now it is fully functional:
- fixed a lot of typos (especially operator precedence in C and variable name
mismatches)
- bit-format errors (forgotten additive constants or negations)
- do not use hash_rec[0]
- wrong entries in the collision linked lists must NOT appear at the beginning
==> saved time when verifying and solved some strange cases
- changed constants determining the maximum prolong-factor
- added a simple test-tool
Robert Spalek [Mon, 19 Apr 2004 16:41:45 +0000 (16:41 +0000)]
- 0x08 (BACKSPACE) is a blank character and it is accepted as an ASCII-character
- 0x7f is also accepted as an ASCII-character
- both gather/content.c and gather/charset.c now use the same function
Cblank() to test it
Martin Mares [Sun, 18 Apr 2004 13:39:32 +0000 (13:39 +0000)]
Changed locking rules. Scans and appends can peacefully co-exist now.
Should solve the problems with shep-reap waiting for bucket file transmission
to finish.
Martin Mares [Sat, 10 Apr 2004 20:36:01 +0000 (20:36 +0000)]
Multi-part objects (with header and body separated by an empty line and terminated
either by EOF or by a NUL byte) are very common, so let's introduce a special
function for reading them.
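
A minimal sketch of such a splitter, working on an in-memory buffer rather than
on the library's own input streams. The struct and function names are
hypothetical and only illustrate the layout described above (header, empty
line, body, then EOF or a NUL byte):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: all pointers point into the caller's buffer. */
struct multipart {
  const char *header;
  size_t header_len;
  const char *body;
  size_t body_len;
  size_t consumed;		/* bytes used, including the terminating NUL if any */
};

static void split_multipart(const char *buf, size_t len, struct multipart *out)
{
  /* The object ends at the first NUL byte or at the end of the input (EOF). */
  const char *nul = memchr(buf, 0, len);
  size_t obj_len = nul ? (size_t)(nul - buf) : len;

  /* Default: no empty line found, so everything is header and the body is empty. */
  out->header = buf;
  out->header_len = obj_len;
  out->body = buf + obj_len;
  out->body_len = 0;
  out->consumed = nul ? obj_len + 1 : obj_len;

  /* An empty line is a '\n' at the very start or directly after another '\n'. */
  for (size_t i = 0; i < obj_len; i++)
    if (buf[i] == '\n' && (i == 0 || buf[i-1] == '\n'))
      {
	out->header_len = i;
	out->body = buf + i + 1;
	out->body_len = obj_len - i - 1;
	break;
      }
}
```

Calling it repeatedly with buf advanced by consumed walks through a sequence of
NUL-terminated objects.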
Martin Mares [Thu, 8 Apr 2004 22:18:19 +0000 (22:18 +0000)]
More enhancements to the main loop library: Export all lists for easy inspection
(reading only) by the callers. When a process exits, construct a nice tombstone
string for it.
Martin Mares [Wed, 7 Apr 2004 22:03:30 +0000 (22:03 +0000)]
Added a universal main loop with timers, file descriptor polling and process
watching. Inspired by the glib main loop, but this one has a much nicer
interface.
It will be used in the Shepherd master and if it turns out to be useful,
I'll convert the other programs to use it some day.
Martin Mares [Sun, 14 Mar 2004 12:58:40 +0000 (12:58 +0000)]
Our regex functions are now able to interface to old-style BSD re_match(),
to POSIX regexec() and to libpcre. Currently it's switched to the BSD mode
as before; I'll look at it more in the evening.
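
For illustration, a sketch of what a thin wrapper over the POSIX back-end could
look like. Only regcomp(), regexec() and regfree() are real API here; the
helper name and its return convention are assumptions, not the library's
actual interface:

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 = match, 0 = no match, -1 = pattern does not compile. */
static int rx_match_posix(const char *pattern, const char *subject)
{
  regex_t rx;

  /* REG_NOSUB: we only want a yes/no answer, no sub-match offsets. */
  if (regcomp(&rx, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;
  int res = regexec(&rx, subject, 0, NULL, 0);
  regfree(&rx);
  return res == 0;
}
```

Note that regexec() searches for a match anywhere in the subject, so a caller
who wants whole-string matching has to anchor the pattern with ^ and $.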
Martin Mares [Tue, 2 Mar 2004 15:38:20 +0000 (15:38 +0000)]
When we try to create a temporary file and it already exists (which can happen
if a program with the same PID has crashed at some time in the past), don't
panic; overwrite the file instead. This should be safe since we're using our
own tmp directory that nobody else can access.
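
A sketch of the fallback, assuming the file is first created exclusively with
O_EXCL (the commit does not say how the library detects the existing file, and
the function name is hypothetical):

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: returns a file descriptor, or -1 with errno set. */
static int open_tmp_sketch(const char *path)
{
  int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
  if (fd < 0 && errno == EEXIST)
    {
      /* Stale file left behind by a crashed process with the same PID:
       * truncate and reuse it instead of giving up. */
      fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    }
  return fd;
}
```

This is only reasonable in a private directory; in a shared one such as /tmp,
silently reusing an existing path would be unsafe.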
Martin Mares [Sat, 28 Feb 2004 10:49:48 +0000 (10:49 +0000)]
Hopefully finally sorted out the "http://www.xyz.cz?param" mess. The true
semantics turned out to be "http://www.xyz.cz/?param" and most web servers
really require "GET /?param".
I've changed the normalization rules to add the leading slash if needed
which also solves the relative URL problem I mentioned in the comments.
However, this means that the SEMANTICS OF NORMALIZED URLS HAVE CHANGED
and gatherer databases with URLs in the "http://www.xyz.cz?param" form
are now INVALID. I'm going to delete all such URLs from our gatherer now.
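
The rule can be sketched on plain strings as follows. This is not the real
normalizer (which works on parsed URLs); the function name is hypothetical and
the sketch only handles the case where the query directly follows the host:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copies url to out, inserting "/" between the host and
 * a query that follows it directly. Returns 0 on success, -1 on error. */
static int add_root_slash(const char *url, char *out, size_t out_size)
{
  const char *host = strstr(url, "://");
  if (!host)
    return -1;
  host += 3;

  /* The host part ends at the first '/' or '?' (or at the end of the string). */
  const char *rest = host + strcspn(host, "/?");
  const char *slash = (*rest == '?') ? "/" : "";

  int n = snprintf(out, out_size, "%.*s%s%s", (int)(rest - url), url, slash, rest);
  return (n < 0 || (size_t) n >= out_size) ? -1 : 0;
}
```

With this, "http://www.xyz.cz?param" becomes "http://www.xyz.cz/?param", while
URLs that already have a path are copied unchanged.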
Martin Mares [Tue, 24 Feb 2004 18:36:23 +0000 (18:36 +0000)]
Blank lines are considered separators, not terminators of buckets.
Hence extraneous blank lines between buckets and trailing blank lines
after the last bucket are all ignored.
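
To illustrate the separator semantics, here is a sketch that counts buckets in
a line-oriented text, assuming a bucket is simply a maximal run of non-empty
lines. The function is hypothetical and not part of the bucket tools:

```c
#include <stddef.h>

/* Hypothetical helper: counts maximal runs of non-empty lines in text. */
static size_t count_buckets(const char *text)
{
  size_t count = 0;
  int in_bucket = 0;		/* inside a run of non-empty lines */
  int at_line_start = 1;

  for (const char *p = text; *p; p++)
    if (*p == '\n')
      {
	if (at_line_start)	/* blank line: closes the current bucket, if any */
	  in_bucket = 0;
	at_line_start = 1;
      }
    else
      {
	if (!in_bucket)		/* first character of a new bucket */
	  {
	    in_bucket = 1;
	    count++;
	  }
	at_line_start = 0;
      }
  return count;
}
```

Because blank lines only separate, any number of them in a row, at the start or
at the end, yields the same count.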
Martin Mares [Thu, 22 Jan 2004 11:21:36 +0000 (11:21 +0000)]
Use int instead of pid_t. At first glance, this looks like a step backward,
but since we use the variable for printing with a "%d" format-string anyway
and there is no way to get the right format string for pid_t, it's better
this way.
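
A small sketch of the pattern, with hypothetical helper names: the PID is
converted to int once, where it is obtained, and every later "%d" then matches
the argument type exactly:

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: formats a PID that is already stored as an int. */
static int format_pid(char *buf, size_t size, int pid)
{
  return snprintf(buf, size, "%d", pid);
}

/* The only place where pid_t appears is the conversion right after getpid(). */
static int format_own_pid(char *buf, size_t size)
{
  int pid = (int) getpid();
  return format_pid(buf, size, pid);
}
```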
Martin Mares [Sun, 11 Jan 2004 19:03:21 +0000 (19:03 +0000)]
Rewrote the shakedown of the bucket file.
o Replaced read and write buffers by a single shared buffer.
This should be somewhat faster (with the same amount of memory invested
in buffers).
o If ShakeSecurity is set to 2, shaking down should be reliable under
all circumstances, including server reboots and broken bucket files.
Buckettool -F still needs to be run after a failed shakedown and
oid's need to be synchronized with the outside world, but no buckets
will be lost (only some of them may be duplicated).
o The callback function (`the kibitz') is now allowed not only to decide
which buckets will be kept, but also to alter contents of the buckets
provided that it won't enlarge the bucket.
I tried to be very careful and tested the new routine thoroughly, but since
it's a pretty critical place, I would be very happy if somebody checks it
independently.
Martin Mares [Sat, 10 Jan 2004 13:41:09 +0000 (13:41 +0000)]
When pre-sorting a regular file, use lib/arraysort.h on an array of items
instead of the default merge-sort type algorithm working with linked lists.
This is much faster -- e.g., the sorting in shep-export on the current
Sherlock3 database now takes 54 sec instead of 669 :-)
However, to accomplish this I had to change two invariants:
(1) SORT_REGULAR now means not only that the input has regular structure,
but also that each item is reasonably small (i.e., we can use
sorting by exchanging in place).
(2) If SORT_PRESORT is enabled, the comparison function can be called
with both keys equal. This trips ASSERTs in various places which
originally helped a lot during debugging, so I decided to add
a SORT_UNIQUE switch which in DEBUG mode causes the sorter to
ensure that all keys are distinct, so we can remove the ASSERTs.
As both the Shepherd and the Indexer now rely heavily on sorting, it might
be worth a try to optimize the sorter even further, maybe by utilizing
polyphase sorting or something like that, since the run sizes often seem to
be distributed unevenly.
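
The consequence of invariant (2) for callers can be sketched with the standard
qsort() standing in for lib/arraysort.h (whose interface is not shown here).
The item layout and names are hypothetical; the point is that the comparison
function must return 0 for equal keys instead of asserting that they differ:

```c
#include <stdlib.h>

/* Hypothetical small fixed-size item, cheap to exchange in place. */
struct item {
  unsigned key;
  unsigned payload;
};

/* Must tolerate equal keys: an in-place array sort may compare an item
 * with an equal one, so the function simply returns 0 in that case. */
static int item_cmp(const void *a, const void *b)
{
  const struct item *x = a, *y = b;
  return (x->key > y->key) - (x->key < y->key);
}

static void presort_items(struct item *items, size_t n)
{
  qsort(items, n, sizeof(*items), item_cmp);
}
```

Writing the comparison as a difference of two tests avoids the overflow that a
plain subtraction of keys could cause.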
Martin Mares [Sun, 7 Dec 2003 14:23:58 +0000 (14:23 +0000)]
Improved and cleaned up the bucket library. The original "single operation
pending per process" invariant was no longer feasible (and it caused several
problems in Shepherd).
Reading and writing of buckets now uses dynamically allocated fastbufs and
there can be any number of readers at any time, but only a single writer
(otherwise a deadlock would occur). Read streams are seekable, write streams
at least btell()-able.
Also removed the omnipresent global variables for start of current bucket
etc., each part (Find, Slurp, Create, Shakedown, ...) has its own state
variables.