lucene-solr-user mailing list archives

From Ryan McKinley <ryan...@gmail.com>
Subject Re: help refactoring from 3.x to 4.x
Date Mon, 23 Aug 2010 14:51:04 GMT
On Mon, Aug 23, 2010 at 7:00 AM, Michael McCandless
<lucene@mikemccandless.com> wrote:
> Spooky that you see incorrect results!  The code looks correct.  What
> are the specifics on when it produces an invalid result?

Figured this out -- the above code is not invalid. However, I tried
versions that moved the utf8ToString() call to the end, and the BytesRef
reuse made those inaccurate.

no need to get spooked here -- user error.
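
A minimal sketch of that reuse problem. ReusableRef and ReusingEnum are
hypothetical stand-ins written for this illustration (they are not
Lucene's BytesRef or TermsEnum); the only assumption carried over is that
the enumerator hands back the same object on every call to next().

```java
import java.nio.charset.StandardCharsets;

class ReuseDemo {
  // Hypothetical stand-in for a reused byte buffer; NOT Lucene's BytesRef.
  static final class ReusableRef {
    byte[] bytes = new byte[0];
    int length = 0;

    void set(String s) {
      bytes = s.getBytes(StandardCharsets.UTF_8);
      length = bytes.length;
    }

    String utf8ToString() {
      return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }
  }

  // Hypothetical enumerator that returns the SAME ReusableRef every time.
  static final class ReusingEnum {
    private final String[] terms;
    private int pos = 0;
    private final ReusableRef shared = new ReusableRef();

    ReusingEnum(String... terms) {
      this.terms = terms;
    }

    ReusableRef next() {
      if (pos >= terms.length) {
        return null;
      }
      shared.set(terms[pos++]);
      return shared;
    }
  }

  // Buggy variant: keeps the reference and converts after the loop,
  // by which time the shared object holds the last term.
  static String firstTermLate(String... terms) {
    ReusingEnum te = new ReusingEnum(terms);
    ReusableRef first = null;
    for (ReusableRef r = te.next(); r != null; r = te.next()) {
      if (first == null) {
        first = r;
      }
    }
    return first == null ? null : first.utf8ToString();
  }

  // Safe variant: converts (copies) while the value is still current.
  static String firstTermEager(String... terms) {
    ReusingEnum te = new ReusingEnum(terms);
    String first = null;
    for (ReusableRef r = te.next(); r != null; r = te.next()) {
      if (first == null) {
        first = r.utf8ToString();
      }
    }
    return first;
  }

  public static void main(String[] args) {
    System.out.println(firstTermLate("10", "20", "30"));  // prints 30
    System.out.println(firstTermEager("10", "20", "30")); // prints 10
  }
}
```

The late conversion reports the last term as the "first" one, which is
the kind of wrong answer I was seeing.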

>
> Also spooky that you see it running slower -- how much slower?

much slower -- this component took ~30-100ms in 3.x and 30-45 sec in 4.x

> Did you rebuild the index in 4.x (if not, you are using the preflex
> codec)?  And is the index otherwise identical?

I have tried both:
 - 3.x index loaded into 4.x, then run optimize
 - rebuild 3.x index in 4.x

these have the same performance. (bad)

Re the preflex codec: how could I tell?  Do I need to do anything explicit?



>
> You could improve perf by not using SolrIndexSearcher.numDocs?  Ie you
> don't need the count; you just need to know if it's > 0.  So you could
> make your own loop that breaks out on the first docID in common.  You
> could also stick w/ BytesRef the whole time (only do .utf8ToString()
> in the end on the first/last), though this is presumably a net/nets
> tiny cost.
>

Ah yes -- this helps a lot!

The following code gets similar performance to the 3.x version.  I
kept the 'utf8ToString' in the loop since the alternative was to copy
the bytes out anyway to avoid the BytesRef reuse problem.

  public static FirstLastMatchingTerm read(final SolrIndexSearcher searcher,
      final String field, final DocSet docs) throws IOException
  {
    FirstLastMatchingTerm firstLast = new FirstLastMatchingTerm();
    if( docs.size() > 0 ) {
      IndexReader reader = searcher.getReader();

      DocsEnum denum = null;
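      Terms terms = MultiFields.getTerms(reader, field);
      if( terms == null ) {
        // the field has no terms in this index, so nothing can match
        return firstLast;
      }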
      TermsEnum te = terms.iterator();
      BytesRef bytes = te.next();
      while( bytes != null ) {
        denum = terms.docs(null, bytes, denum);
        if( denum != null ) {
          // find if any doc matches our result set
          while( denum.nextDoc() != DocsEnum.NO_MORE_DOCS ) {
            if( docs.exists( denum.docID() ) ) {
              String v = bytes.utf8ToString();
              if( v.length() > 0 ) {
                firstLast.last = v;
                if( firstLast.first == null ) {
                  firstLast.first = v;
                }
                break;
              }
            }
          }
        }
        bytes = te.next();
      }
    }
    return firstLast;
  }
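
The early-exit idea Mike suggested, shown in isolation. This is only a
sketch: java.util.BitSet stands in for the DocSet and an int[] stands in
for one term's postings, both assumptions made so the example runs
without Lucene or Solr on the classpath. The visited counter is there
only to show how much work each approach does.

```java
import java.util.BitSet;

class EarlyExitDemo {
  // Number of postings examined by the most recent call (illustration only).
  static int visited = 0;

  // Counting approach: walks every posting even though only "> 0" matters.
  static int countMatches(int[] postings, BitSet docs) {
    visited = 0;
    int count = 0;
    for (int doc : postings) {
      visited++;
      if (docs.get(doc)) {
        count++;
      }
    }
    return count;
  }

  // Early-exit approach: stops at the first docID in common.
  static boolean anyMatch(int[] postings, BitSet docs) {
    visited = 0;
    for (int doc : postings) {
      visited++;
      if (docs.get(doc)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    BitSet docs = new BitSet();
    docs.set(2);
    docs.set(7);
    int[] postings = {1, 2, 5, 7, 9};

    System.out.println(countMatches(postings, docs) > 0); // prints true
    System.out.println(visited);                          // prints 5
    System.out.println(anyMatch(postings, docs));         // prints true
    System.out.println(visited);                          // prints 2
  }
}
```

Both answer the same yes/no question; the second one just stops as soon
as it can.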




> But, we should still dig down on why numDocs is slower in 4.x; that's
> unexpected; Yonik any ideas?  I'm not familiar with this part of
> Solr...
>
> Mike
>
> On Mon, Aug 23, 2010 at 2:38 AM, Ryan McKinley <ryantxu@gmail.com> wrote:
>> I have a function that works well in 3.x, but when I tried to
>> re-implement in 4.x it runs very very slow (~20ms vs 45s on an index w
>> ~100K items).
>>
>> Big picture, I am trying to calculate a bounding box for items that
>> match the query.  To calculate this, I have two fields bboxNS, and
>> bboxEW that get filled with the min and max values for that doc.  To
>> get the bounding box, I just need the first matching term in the index
>> and the last matching term.
>>
>> In 3.x the code looked like this:
>>
>> public class FirstLastMatchingTerm
>> {
>>  String first = null;
>>  String last = null;
>>
>>  public static FirstLastMatchingTerm read(SolrIndexSearcher searcher,
>> String field, DocSet docs) throws IOException
>>  {
>>    FirstLastMatchingTerm firstLast = new FirstLastMatchingTerm();
>>    if( docs.size() > 0 ) {
>>      IndexReader reader = searcher.getReader();
>>      TermEnum te = reader.terms(new Term(field,""));
>>      do {
>>        Term t = te.term();
>>        if( null == t || !t.field().equals(field) ) {
>>          break;
>>        }
>>
>>        if( searcher.numDocs(new TermQuery(t), docs) > 0 ) {
>>          firstLast.last = t.text();
>>          if( firstLast.first == null ) {
>>            firstLast.first = firstLast.last;
>>          }
>>        }
>>      }
>>      while( te.next() );
>>    }
>>    return firstLast;
>>  }
>> }
>>
>>
>> In 4.x, I tried:
>>
>> public class FirstLastMatchingTerm
>> {
>>  String first = null;
>>  String last = null;
>>
>>  public static FirstLastMatchingTerm read(SolrIndexSearcher searcher,
>> String field, DocSet docs) throws IOException
>>  {
>>    FirstLastMatchingTerm firstLast = new FirstLastMatchingTerm();
>>    if( docs.size() > 0 ) {
>>      IndexReader reader = searcher.getReader();
>>
>>      Terms terms = MultiFields.getTerms(reader, field);
>>      TermsEnum te = terms.iterator();
>>      BytesRef term = te.next();
>>      while( term != null ) {
>>        if( searcher.numDocs(new TermQuery(new Term(field,term)), docs) > 0 ) {
>>          firstLast.last = term.utf8ToString();
>>          if( firstLast.first == null ) {
>>            firstLast.first = firstLast.last;
>>          }
>>        }
>>        term = te.next();
>>      }
>>    }
>>    return firstLast;
>>  }
>> }
>>
>> but the results are slow (and incorrect).  I tried some variations of
>> using ReaderUtil.Gather(), but the real hit seems to come from
>>  if( searcher.numDocs(new TermQuery(new Term(field,term)), docs) > 0 )
>>
>> Any ideas?  I'm not tied to the approach or indexing strategy, so if
>> anyone has other suggestions that would be great.  Looking at it
>> again, it seems crazy that you have to run a query for each term, but
>> in 3.x
>>
>> thanks
>> ryan
>>
>
