lucene-dev mailing list archives

From "Chris Russell (JIRA)" <j...@apache.org>
Subject [jira] [Created] (LUCENE-5637) Scaling scale function
Date Thu, 01 May 2014 20:28:14 GMT
Chris Russell created LUCENE-5637:
-------------------------------------

             Summary: Scaling scale function
                 Key: LUCENE-5637
                 URL: https://issues.apache.org/jira/browse/LUCENE-5637
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Chris Russell
            Priority: Minor
             Fix For: 4.8


The existing scale() function examines the scores of all documents in the index in order to
calculate its scale constant.  This does not perform well in Solr on very large indexes or
with costly scoring mechanisms such as geo distance.

I have developed a patch that allows the scale function to score only the documents that match
the given filters, which improves its performance.
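
As a rough illustration of the idea (not the patch itself), the sketch below computes the
min/max used for scaling either over every document or only over documents marked in a
filter.  The class name, the plain float array of raw values, and the use of java.util.BitSet
as the filter are all assumptions made for the example; none of them come from the patch.

```java
import java.util.BitSet;

// Illustrative sketch only: a float array stands in for per-document function values,
// and a java.util.BitSet stands in for the filter. These are assumptions, not Lucene APIs.
final class ScaleSketch {

    /** Returns {min, max} of raw values, restricted to docs set in `matching` (all docs if null). */
    static float[] minMax(float[] raw, BitSet matching) {
        float min = Float.POSITIVE_INFINITY;
        float max = Float.NEGATIVE_INFINITY;
        for (int doc = 0; doc < raw.length; doc++) {
            if (matching != null && !matching.get(doc)) {
                continue; // skip documents excluded by the filter: their values are never read
            }
            min = Math.min(min, raw[doc]);
            max = Math.max(max, raw[doc]);
        }
        return new float[] {min, max};
    }

    /** Linearly maps `value` from [min, max] onto [targetMin, targetMax]. */
    static float scale(float value, float min, float max, float targetMin, float targetMax) {
        if (max == min) {
            return targetMin; // degenerate range: avoid division by zero
        }
        return (value - min) / (max - min) * (targetMax - targetMin) + targetMin;
    }

    public static void main(String[] args) {
        float[] raw = {1f, 5f, 9f, 100f};
        BitSet matching = new BitSet();
        matching.set(0, 3); // only docs 0, 1 and 2 match the filter

        float[] all = minMax(raw, null);
        float[] filtered = minMax(raw, matching);
        System.out.println("whole index: " + scale(5f, all[0], all[1], 0f, 90f));
        System.out.println("filtered:    " + scale(5f, filtered[0], filtered[1], 0f, 90f));
    }
}
```

In this sketch the filtered variant never touches the value of a non-matching document, which
is where the saving would come from when each value is expensive to compute.  Note that the
min/max, and therefore the scaled values, can differ between the two variants.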

For test queries involving two scale operations, one scaling the result of keyword scoring
and the other scaling the result of geo distance scoring, on an index with ~2 million
documents, query time improved from ~400 ms with vanilla scale to ~190 ms with the new scale.
A similar query using no scaling ran in ~90 ms.  (Each enhanced scale function added to the
query appeared to add about 50 ms of processing.)

e.g. scaled query - q = scale(keywords, 0, 90) and scale(geo, 0, 10)
e.g. unscaled query - q = keywords and geo

In both cases fq includes keywords and geo.

In order to accomplish this goal I had to introduce several changes:
1) In the IndexSearcher.search method, where scorers are created and then used to score on
a per-AtomicReaderContext basis, I had to make it so that all scorers are created before
any scoring is done.  This gives the scale function an opportunity to observe
the entire index before being asked to score something.
2) Introduced a new property on the Bits interface that indicates whether or not the bits
provide constant-time access.  The reason is given in 3).
3) FilterSet used to return null when asked for its bits because it did not have any; it had
an iterator.  This was an issue when trying to make scale score only documents
matching the filter.  Thus a new Bits implementation was added (LazyIteratorBackedBits) that
can expose an iterator as a Bits implementation.  It advances the iterator on demand when
asked about a document and uses an OpenBitSet to keep track of what it has advanced beyond.
Thus once the iterator is exhausted it provides constant-time answers like any other Bits.
4) Introduced a function on the ValueSource interface to allow a Bits to be passed in for
filtering purposes.
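
The behaviour described in 2) and 3) can be sketched roughly as follows.  This is only an
approximation under stated assumptions, not the patch's code: java.util.BitSet stands in for
OpenBitSet, a PrimitiveIterator.OfInt yielding ascending doc IDs stands in for the filter's
iterator, and the method name isConstantTime() is a hypothetical name for the new Bits property.

```java
import java.util.BitSet;
import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

// Illustrative sketch only. BitSet, PrimitiveIterator.OfInt and isConstantTime() are
// stand-ins chosen for this example; they are not the types or names used in the patch.
final class LazyIteratorBackedBitsSketch {
    private final PrimitiveIterator.OfInt docs; // must yield doc IDs in ascending order
    private final BitSet seen = new BitSet();   // doc IDs consumed from the iterator so far
    private final int length;
    private int advancedTo = -1;                // highest doc ID consumed so far
    private boolean exhausted = false;

    LazyIteratorBackedBitsSketch(PrimitiveIterator.OfInt ascendingDocs, int length) {
        this.docs = ascendingDocs;
        this.length = length;
    }

    /** Reports whether `index` is in the set, advancing the iterator only as far as needed. */
    boolean get(int index) {
        while (!exhausted && advancedTo < index) {
            if (docs.hasNext()) {
                advancedTo = docs.nextInt();
                seen.set(advancedTo); // remember everything we advance past
            } else {
                exhausted = true;
            }
        }
        return seen.get(index);
    }

    int length() {
        return length;
    }

    /** Hypothetical property: true once every answer is a plain bit lookup. */
    boolean isConstantTime() {
        return exhausted;
    }

    public static void main(String[] args) {
        LazyIteratorBackedBitsSketch bits =
                new LazyIteratorBackedBitsSketch(IntStream.of(2, 5, 8).iterator(), 10);
        System.out.println(bits.get(5));            // advances the iterator up to doc 5
        System.out.println(bits.get(3));            // answered from the recorded bits
        System.out.println(bits.isConstantTime());  // iterator not yet exhausted
        System.out.println(bits.get(9));            // drains the iterator
        System.out.println(bits.isConstantTime());
    }
}
```

Because every doc ID the iterator passes is recorded, documents can be queried in any order,
including ones behind the iterator's current position.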

This was originally developed against Solr 4.2, but I have ported it to Solr 4.8.  There is
one failing unit test, related to code that has been added in the interim: AnalyzingInfixSuggesterTest.testRandomNRT.
I have not been able to figure out why this test fails.  All other tests pass.

In relation to implementation detail 1) above, the introduction of LeafCollectors in trunk
has caused something of an issue.  It no longer seems to be possible to create multiple scorers
without immediately scoring on that LeafCollector.  This may be related to the encapsulation
of the Collector.setNextReader() method, which was very useful for this purpose.
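
The ordering requirement behind implementation detail 1) can be pictured with a small sketch:
every segment is visited once to gather the scaling constants before any document is scored.
Everything here is hypothetical and for illustration only: plain float arrays stand in for
segments, and none of the names correspond to IndexSearcher, Collector or LeafCollector APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: each float[] is a stand-in for the raw values of one segment.
final class TwoPhaseScoringSketch {

    /** Scales every value to [targetMin, targetMax] using the min/max over ALL segments. */
    static List<float[]> scoreAll(List<float[]> segments, float targetMin, float targetMax) {
        // Phase 1: observe every segment before scoring anything.
        float min = Float.POSITIVE_INFINITY;
        float max = Float.NEGATIVE_INFINITY;
        for (float[] segment : segments) {
            for (float value : segment) {
                min = Math.min(min, value);
                max = Math.max(max, value);
            }
        }

        // Phase 2: only now score, segment by segment, with the global constants.
        List<float[]> scored = new ArrayList<>();
        for (float[] segment : segments) {
            float[] out = new float[segment.length];
            for (int i = 0; i < segment.length; i++) {
                out[i] = (max == min)
                        ? targetMin
                        : (segment[i] - min) / (max - min) * (targetMax - targetMin) + targetMin;
            }
            scored.add(out);
        }
        return scored;
    }

    public static void main(String[] args) {
        List<float[]> segments = new ArrayList<>();
        segments.add(new float[] {1f, 5f});
        segments.add(new float[] {9f, 3f});
        for (float[] segment : scoreAll(segments, 0f, 10f)) {
            System.out.println(java.util.Arrays.toString(segment));
        }
    }
}
```

If phase 1 and phase 2 were interleaved per segment, the first segment in this example would
be scaled with only its own min/max (1 and 5) rather than the global ones (1 and 9).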



--
This message was sent by Atlassian JIRA
(v6.2#6252)