lucene-dev mailing list archives

From "Chris Russell (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-5637) Scaling scale function
Date Thu, 01 May 2014 20:58:18 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-5637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Russell updated LUCENE-5637:
----------------------------------

    Description: 
The existing scale() function examines the scores of all documents in the index in order to
calculate its scale constant.  This performs poorly in Solr on very large indexes or
with costly scoring mechanisms such as geo distance.

I have developed a patch that allows the scale function to only score documents that match
the given filters, thus improving performance of the scale function.  
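
As a rough illustration of the idea (not the code from the attached patch), the sketch below computes the minimum and maximum only over documents accepted by a filter and then maps those documents linearly onto the target range.  The class name FilteredScale, the float[] of raw scores, and the use of java.util.BitSet as the filter are all assumptions made for the example; the patch itself works with Lucene's ValueSource and Bits.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical illustration only; not the patch's code and not Solr's scale implementation. */
final class FilteredScale {
    private FilteredScale() {}

    /**
     * Maps the raw values of documents accepted by the filter onto [minTarget, maxTarget].
     * Documents outside the filter are never read and are reported as NaN.
     */
    static float[] scale(float[] raw, BitSet filter, float minTarget, float maxTarget) {
        float min = Float.POSITIVE_INFINITY;
        float max = Float.NEGATIVE_INFINITY;

        // Pass 1: find min and max over matching documents only.
        for (int doc = filter.nextSetBit(0); doc >= 0 && doc < raw.length;
             doc = filter.nextSetBit(doc + 1)) {
            min = Math.min(min, raw[doc]);
            max = Math.max(max, raw[doc]);
        }

        // If every matching document has the same value, map them all to minTarget.
        float range = max - min;
        float factor = range == 0f ? 0f : (maxTarget - minTarget) / range;

        // Pass 2: scale matching documents; leave the rest as NaN.
        float[] out = new float[raw.length];
        Arrays.fill(out, Float.NaN);
        for (int doc = filter.nextSetBit(0); doc >= 0 && doc < raw.length;
             doc = filter.nextSetBit(doc + 1)) {
            out[doc] = (raw[doc] - min) * factor + minTarget;
        }
        return out;
    }
}
```

In this sketch an expensive raw value (a geo distance, say) is only looked at for documents inside the filter, which is where the saving would come from.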

In test queries on an index with ~2 million documents, using two scale operations (one scaling
keyword scores, the other scaling geo distance scores), query time improved from ~400 ms with
the vanilla scale to ~190 ms with the new scale.  A similar query with no scaling ran in ~90 ms,
so each enhanced scale function added to the query appeared to add about 50 ms of processing.
e.g. scaled query - q = scale(keywords, 0, 90) and scale(geo, 0, 10)
e.g. unscaled query - q = keywords and geo
In both cases fq includes keywords and geo.

To accomplish this I had to introduce four changes:
1) In the IndexSearcher.search method, where scorers are created and then used to score on
a per-AtomicReaderContext basis, all scorers are now created before any scoring is done.
This gives the scale function an opportunity to observe the entire index before being asked
to score anything.
2) Added a new property to the Bits interface that indicates whether or not the bits
provide constant-time access.  The reason is given in 3).
3) FilterSet used to return null when asked for its bits because it had only an iterator,
not bits.  This was a problem when trying to make scale score only the documents matching
the filter.  A new Bits implementation (LazyIteratorBackedBits) was therefore added that
exposes an iterator as a Bits.  It advances the iterator on demand when asked about a
document and uses an OpenBitSet to keep track of what it has advanced beyond.  Once the
iterator is exhausted it provides constant-time answers like any other Bits.
4) Added a function to the ValueSource interface that allows a Bits to be passed in for
filtering purposes.
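
The behaviour described in 2) and 3) can be sketched roughly as follows.  This is a guess at the shape of such a class, not the patch's LazyIteratorBackedBits: java.util.BitSet stands in for OpenBitSet, a PrimitiveIterator.OfInt yielding ascending doc ids stands in for Lucene's iterator, and the method name isConstantTime() is hypothetical because the message does not name the new Bits property.

```java
import java.util.BitSet;
import java.util.PrimitiveIterator;

/** Hypothetical sketch: random-access view over an ascending doc id iterator. */
final class LazyIteratorBackedBitsSketch {
    private final PrimitiveIterator.OfInt docs; // must yield doc ids in ascending order
    private final BitSet seen = new BitSet();   // doc ids consumed from the iterator so far
    private final int length;
    private int position = -1;                  // highest doc id consumed so far

    LazyIteratorBackedBitsSketch(PrimitiveIterator.OfInt ascendingDocs, int length) {
        this.docs = ascendingDocs;
        this.length = length;
    }

    /** Advances the iterator only as far as needed to answer for this index. */
    boolean get(int index) {
        while (position < index && docs.hasNext()) {
            position = docs.nextInt();
            seen.set(position);
        }
        // Everything up to 'position' is recorded in 'seen', so this lookup is final.
        return seen.get(index);
    }

    /** Assumed name for the new property: true once no further iteration can be needed. */
    boolean isConstantTime() {
        return !docs.hasNext();
    }

    int length() {
        return length;
    }
}
```

A caller in this sketch could check isConstantTime() to decide whether random access is cheap yet, or whether a lookup may still trigger iteration.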

This was originally developed against Solr 4.2 but I have ported it to Solr 4.8.  One unit
test fails, AnalyzingInfixSuggesterTest.testRandomNRT, which relates to code added in the
interim.  I have not been able to figure out why it fails.  All other tests pass.

In relation to implementation detail 1) above, the introduction of LeafCollectors in trunk
(LUCENE-5527) has caused something of an issue.  It no longer seems possible to create
multiple scorers without immediately scoring on that LeafCollector.  This may be related to
the encapsulation of the Collector.setNextReader() method, which was very useful for this purpose.
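
To show why detail 1) depends on the ordering, here is a simplified sketch in which every per-segment scorer is created, and so gets to observe its segment, before any document is scored.  None of these types are Lucene's: GlobalRange, SegmentScorer and the float[] segments are hypothetical stand-ins, and the scores are normalized to [0, 1] for brevity.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of "create all scorers first, then score". */
final class TwoPhaseSearchSketch {
    private TwoPhaseSearchSketch() {}

    /** Shared min/max that must cover every segment before scores are meaningful. */
    static final class GlobalRange {
        float min = Float.POSITIVE_INFINITY;
        float max = Float.NEGATIVE_INFINITY;

        void observe(float value) {
            min = Math.min(min, value);
            max = Math.max(max, value);
        }
    }

    /** Per-segment scorer; it reports its segment's values to the shared range on creation. */
    static final class SegmentScorer {
        private final float[] raw;
        private final GlobalRange range;

        SegmentScorer(float[] raw, GlobalRange range) {
            this.raw = raw;
            this.range = range;
            for (float value : raw) {
                range.observe(value);
            }
        }

        float score(int doc) {
            float span = range.max - range.min;
            return span == 0f ? 0f : (raw[doc] - range.min) / span;
        }
    }

    static List<float[]> search(List<float[]> segments) {
        GlobalRange range = new GlobalRange();

        // Phase 1: create every scorer so the range has seen all segments.
        List<SegmentScorer> scorers = new ArrayList<>();
        for (float[] segment : segments) {
            scorers.add(new SegmentScorer(segment, range));
        }

        // Phase 2: only now score, segment by segment.
        List<float[]> results = new ArrayList<>();
        for (int i = 0; i < segments.size(); i++) {
            float[] scores = new float[segments.get(i).length];
            for (int doc = 0; doc < scores.length; doc++) {
                scores[doc] = scorers.get(i).score(doc);
            }
            results.add(scores);
        }
        return results;
    }
}
```

If the two phases were interleaved in this sketch, the first segment would be scaled against its own maximum rather than the global one, which is the kind of problem that creating all scorers up front avoids.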


> Scaling scale function
> ----------------------
>
>                 Key: LUCENE-5637
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5637
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Chris Russell
>            Priority: Minor
>              Labels: patch, performance
>             Fix For: 4.8
>
>         Attachments: Lucene-5637.patch



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

