lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Marvin Humphrey (JIRA)" <>
Subject [jira] Closed: (LUCENE-851) Pruning
Date Fri, 24 Sep 2010 01:19:33 GMT


Marvin Humphrey closed LUCENE-851.

    Resolution: Duplicate

Yes, LUCENE-2482 introduces the index sorter from Nutch that I referred to in
this issue. The termination mechanism is slightly different
(TimeLimitedCollector vs. X hits per segment), but it's the sorter that really

I'm closing as Duplicate. Thanks for digging this up!

> Pruning
> -------
>                 Key: LUCENE-851
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index, Search
>            Reporter: Marvin Humphrey
>            Priority: Minor
> Greets,
> A thread on java-dev a couple of months ago drew my attention to a technique used by
Nutch for cutting down the number of hits that have to be processed:  if you have an algorithm
for ordering documents by importance, and you sort them so that the lowest document numbers
have the highest rank, then most of your high-scoring hits are going to occur early on in
the hit-collection process.  Say you're looking for the top 100 matches -- the odds are pretty
good that after you've found 1000 hits, you've gotten most of the good stuff.  It may not
be necessary to score the other e.g. 5,000,000 hits.
> To pull this off in Nutch, they run the index through a post process whereby documents
are re-ordered by page score using the IndexSorter class.  Unfortunately, post-processing
does not live happily with incremental indexing.  
> However, if we ensure that document numbers are ordered according to our criteria within
each segment, that's almost as good.
> Say we're looking for 100 hits, as before; what we do is collect a maximum of 1000 hits
per segment.  If we are dealing with an index made up of 25 segments, that's 25,000 hits max
we'll have to process fully -- the rest we can skip over.  That's not as quick as only processing
1000 hits then stopping in a fully optimized index, but it's a lot better than churning through
all 5,000,000 hits.
> A lot of those hits from the smallest segments will be garbage; we'll get most of our
good hits from a few large segments most of the time.  But that's fine -- the cost to process
any one segment is small.
> Writing a low-level scoring loop which implements pruning per segment is straightforward.
 KinoSearch's version (in C) is below.
> To control the amount of pruning, we need a high-level Searcher.setPruneFactor API, which
sets a multiplier; the number of hits-per-segment which must be processed is determined by
multiplying the number of hits you need by pruneFactor.  Here's code from KS for deriving
>     # process prune_factor if supplied
>     my $seg_starts;
>     my $hits_per_seg = 2**31 - 1;
>     if ( defined $self->{prune_factor} and defined $args{num_wanted} ) {
>         my $prune_count = $self->{prune_factor} * $args{num_wanted};
>         if ( $prune_count < $hits_per_seg ) {    # don't exceed I32_MAX
>             $hits_per_seg = $prune_count;
>             $seg_starts   = $reader->get_seg_starts;
>         }
>     }
> What I have not yet written is the index-time mechanism for sorting documents.  
> In Nutch, they use the norms from a known indexed, non-tokenized field ("site").  However,
in Lucene and KS, we can't count on any existing fields.  Document boost isn't stored directly,
either.  The obvious answer is to start storing it, which would suffice for Nutch-like uses.
 However, it may make sense to to avoid coupling document ordering to boost in order to influence
pruning without affecting scores.
> The sort ordering information needs a permanent home in the index, since it will be needed
whenever segment merging occurs.  The fixed-width per-document storage in Lucene's .fdx file
seems like a good place.  If we use one float per document, we can simply put it before or
after the 64-bit file pointer and seek into the file after multiplying the doc num by 12 rather
than 8.  
> During indexing, we'd keep the ordering info in an array; after all documents for a segment
have been added, we create an array of sorted document numbers.  When flushing the postings,
their document numbers get remapped using the sorted array.  Then we rewrite the .fdx file
(and also the .tvx file), moving the file pointers (and ordering info) to remapped locations.
 The fact that the .fdt file is now "out of order" is isn't a problem -- optimizing sequential
access to that file isn't important.
> This issue is closely tied to LUCENE-843, "Improve how IndexWriter uses RAM to buffer
added documents", and LUCENE-847, "Factor merge policy out of IndexWriter".  Michael McCandless,
Steven Parks, Ning Li, anybody else... comments?  Suggestions?
> Marvin Humphrey
> Rectangular Research
> --------------------------------------------------------------------
> void
> Scorer_collect(Scorer *self, HitCollector *hc, u32_t start, u32_t end,
>                u32_t hits_per_seg, VArray *seg_starts)
> {
>     u32_t seg_num          = 0;
>     u32_t doc_num_thresh   = 0;
>     u32_t hits_this_seg    = 0;
>     u32_t hits_thresh      = hits_per_seg;
>     /* get to first doc */
>     if ( !Scorer_Skip_To(self, start) )
>         return;
>     /* execute scoring loop */
>     do {
>         u32_t doc_num = Scorer_Doc(self);
>         Tally *tally;
>         if (hits_this_seg >= hits_thresh || doc_num >= doc_num_thresh) {
>             if (doc_num >= end) {
>                 /* bail out of loop if we've hit the user-spec'd end */
>                 return;
>             }
>             else if (seg_starts == NULL || seg_starts->size == 0) {
>                 /* must supply seg_starts to enable pruning */
>                 hits_thresh    = U32_MAX;
>                 doc_num_thresh = end;
>             }
>             else if (seg_num == seg_starts->size) {
>                 /* we've finished the last segment */
>                 return;
>             }
>             else {
>                 /* get start of upcoming segment */
>                 Int *this_start = (Int*)VA_Fetch(seg_starts, seg_num);
>                 Int *next_start = (Int*)VA_Fetch(seg_starts, seg_num + 1);
>                 u32_t this_seg_start = this_start->value;
>                 seg_num++;
>                 /* skip docs as appropriate if we're pruning */
>                 if (doc_num < this_seg_start) {
>                     if ( Scorer_Skip_To(self, this_seg_start) )
>                         doc_num = Scorer_Doc(self);
>                     else
>                         return;
>                 }
>                 /* set the last doc_num we'll score this upcoming segment */
>                 doc_num_thresh = next_start == NULL
>                     ? end  /* last segment */
>                     : next_start->value;
>             }
>             /* start over counting hits for the new segment */
>             hits_this_seg = 0;
>         }
>         /* this doc is in range, so collect it */
>         tally = Scorer_Tally(self);
>         hc->collect(hc, doc_num, tally->score);
>         hits_this_seg++;
>     } while (Scorer_Next(self));
> }

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message