lucene-dev mailing list archives

From Tomás Fernández Löbbe (JIRA) <j...@apache.org>
Subject [jira] [Comment Edited] (SOLR-6216) Better faceting for multiple intervals on DV fields
Date Wed, 02 Jul 2014 22:35:24 GMT

    [ https://issues.apache.org/jira/browse/SOLR-6216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048153#comment-14048153 ]

Tomás Fernández Löbbe edited comment on SOLR-6216 at 7/2/14 10:35 PM:
----------------------------------------------------------------------

I did some very basic performance testing to compare interval faceting with facet queries:
Dataset: Geonames.org dataset (added 4 times to make it 33M docs)
Query set: 4960 boolean queries using terms from the dataset
Updates: 1 document updated every second
Commits: autoSoftCommit every second
HW: MacBook Pro, Core i7 2.7 GHz, 8 GB of RAM, spinning disk (5400 RPM)
All times are in milliseconds.
I repeated the test with different numbers of intervals (on the “population” field of the
Geonames dataset).
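
For reference, the two request styles being compared could look roughly like the parameters built below. This is only a sketch and is not taken from the patch: the `facet.interval` and `f.<field>.facet.interval.set` names follow the interval faceting parameters as documented in the Solr Reference Guide and may differ from what the attached patch used at this point. The class, the method names and the interval bounds are hypothetical, and the values are left unencoded for readability.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Solr or of the SOLR-6216 patch.
class FacetRequestSketch {

    // Facet query style: one facet.query parameter per interval,
    // each written as an inclusive range query on the field.
    static List<String> facetQueryParams(String field, long[][] intervals) {
        List<String> params = new ArrayList<>();
        for (long[] iv : intervals) {
            params.add("facet.query=" + field + ":[" + iv[0] + " TO " + iv[1] + "]");
        }
        return params;
    }

    // Interval faceting style: one facet.interval parameter naming the field,
    // plus one per-field facet.interval.set parameter per (inclusive) interval.
    static List<String> facetIntervalParams(String field, long[][] intervals) {
        List<String> params = new ArrayList<>();
        params.add("facet.interval=" + field);
        for (long[] iv : intervals) {
            params.add("f." + field + ".facet.interval.set=[" + iv[0] + "," + iv[1] + "]");
        }
        return params;
    }

    public static void main(String[] args) {
        // Made-up interval bounds; the bounds used in the test are not given above.
        long[][] intervals = {{0, 1000}, {1001, 100000}};
        System.out.println(String.join("&", facetQueryParams("population", intervals)));
        System.out.println(String.join("&", facetIntervalParams("population", intervals)));
    }
}
```

With N intervals, the first style sends N independent queries (each of which can hit or miss the cache), while the second names the field once and lists N interval sets for it.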

|| || Num Intervals || 1 || 2 || 3 || 4 || 5 || 10 ||
| Min | Intervals | 25 | 23 | 26 | 23 | 24 | 26 |
| | Facet Query | 2 | 2 | 3 | 4 | 4 | 6 |
| Max | Intervals | 1885 | 2254 | 2508 | 2800 | 2749 | 3031 |
| | Facet Query | 2199 | 2414 | 3957 | 2766 | 1869 | 5975 |
| Average | Intervals | 181 | 177 | 191 | 183 | 148 | 174 |
| | Facet Query | 156 | 277 | 359 | 299 | 216 | 408 |
| P10 | Intervals | 53 | 54 | 54 | 54 | 54 | 56 |
| | Facet Query | 26 | 30 | 33 | 31 | 29 | 35 |
| P50 | Intervals | 96 | 95 | 98 | 97 | 88 | 96 |
| | Facet Query | 54 | 211 | 293 | 188 | 58 | 74 |
| P90 | Intervals | 453 | 940 | 467 | 458 | 350 | 438 |
| | Facet Query | 432 | 656 | 794 | 749 | 660 | 1066 |
| P99 | Intervals | 809 | 884 | 968 | 877 | 857 | 897 |
| | Facet Query | 867 | 1041 | 1354 | 1219 | 1116 | 1784 |

For the same method, there is some variation between the tests with different numbers of
intervals that I don’t understand very well. I restarted Jetty before each test (index
files are probably cached by the OS between tests, though).

In general, what I see for the Intervals impl is an average that is similar to or lower than
facet query, a p10 and p50 that are similar or higher (these are probably the cases where
the facet queries hit the cache), and a lower p90 and p99. The lower p90/p99 is probably
because of facet queries missing the cache.

“Max” varies a lot. I don’t think it’s a very representative number; I just left
it for completeness. Min is very similar across the different numbers of intervals for each
method, and it’s clear that in the best case (all cache hits), facet query is much faster
than intervals.
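
The summary rows in the table above (Min, Max, Average, P10 to P99) can be derived from the raw per-query times along the following lines. This is a sketch under assumptions: the comment does not say how the percentiles were computed, so the nearest-rank method is used here, and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not the tooling used for the numbers above.
class LatencySummarySketch {

    // Nearest-rank percentile over an ascending-sorted array:
    // rank = ceil(p / 100 * n), computed with integer arithmetic.
    static long percentile(long[] sortedMillis, int p) {
        if (sortedMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p < 1 || p > 100) {
            throw new IllegalArgumentException("p must be in [1, 100]");
        }
        int rank = (int) (((long) p * sortedMillis.length + 99) / 100);
        return sortedMillis[rank - 1];
    }

    // Arithmetic mean of the samples.
    static double average(long[] millis) {
        long sum = 0;
        for (long m : millis) {
            sum += m;
        }
        return (double) sum / millis.length;
    }

    public static void main(String[] args) {
        // Made-up query times in milliseconds.
        long[] times = {40, 10, 100, 30, 20, 90, 50, 80, 60, 70};
        Arrays.sort(times);
        System.out.println("min=" + times[0] + " max=" + times[times.length - 1]
                + " avg=" + average(times)
                + " p10=" + percentile(times, 10)
                + " p50=" + percentile(times, 50)
                + " p90=" + percentile(times, 90)
                + " p99=" + percentile(times, 99));
    }
}
```

Because the average is pulled up by a few very slow queries, a method with a higher p50 can still end up with a lower average when its p90/p99 are lower, which matches the pattern described above.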

I also did a quick test on an internal collection with around 100M docs in a single shard.
I ran around 6000 queries with around 40 intervals each, and for this test I got:

| Min | Intervals | 122 |
| | Facet Query | 124 |
| Max | Intervals | 6626 |
| | Facet Query | 61009 |
| Average | Intervals | 238 |
| | Facet Query | 620 |
| P10 | Intervals | 155 |
| | Facet Query | 151 |
| P50 | Intervals | 201 |
| | Facet Query | 202 |
| P90 | Intervals | 324 |
| | Facet Query | 461 |
| P99 | Intervals | 836 |
| | Facet Query | 23662 |
 
This collection has updates and soft commits.
I don’t have numbers for distributed tests, but from what I could see, the results were even
better on wide collections, I assume because of the lower p90/p99.



> Better faceting for multiple intervals on DV fields
> ---------------------------------------------------
>
>                 Key: SOLR-6216
>                 URL: https://issues.apache.org/jira/browse/SOLR-6216
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Tomás Fernández Löbbe
>         Attachments: SOLR-6216.patch, SOLR-6216.patch, SOLR-6216.patch, SOLR-6216.patch, SOLR-6216.patch
>
>
> There are two ways to facet on value ranges in Solr right now: “Range Faceting”
> and “Query Faceting” (doing range queries). They both end up doing something similar:
> {code:java}
> // count the documents in "docs" that also match the range query
> searcher.numDocs(rangeQ, docs);
> {code}
> The good thing about this implementation is that it can benefit from caching. The bad
> thing is that it may be slow with cold caches, and that there will be a query for each of
> the ranges.
> A different implementation would be one that works similarly to regular field faceting,
> using doc values and checking the ranges against each value of the matching documents. This
> implementation would sometimes be faster than Range Faceting / Query Faceting, especially in
> cases where caches are not very effective, such as with a high update rate or where ranges
> change frequently.
> Functionally, the result should be exactly the same as the one obtained by doing a facet
> query for every interval.
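
The idea in the description can be illustrated with a small self-contained sketch. It does not use Solr or Lucene classes and is not the code in the attached patch: a plain `long[]` indexed by document id stands in for a single-valued numeric doc values field, an `int[]` of document ids stands in for the matching documents, and all names are hypothetical.

```java
// Hypothetical illustration; not the SOLR-6216 implementation.
class IntervalCountSketch {

    // Doc values style: a single pass over the matching documents, checking
    // each document's value against every interval (bounds are inclusive).
    static int[] countByDocValues(long[] values, int[] matchingDocs, long[][] intervals) {
        int[] counts = new int[intervals.length];
        for (int doc : matchingDocs) {
            long v = values[doc];
            for (int i = 0; i < intervals.length; i++) {
                if (v >= intervals[i][0] && v <= intervals[i][1]) {
                    counts[i]++;
                }
            }
        }
        return counts;
    }

    // Stand-in for the facet query style: one independent counting pass per
    // interval, as if a separate range query were intersected with the matches.
    static int[] countByPerIntervalQuery(long[] values, int[] matchingDocs, long[][] intervals) {
        int[] counts = new int[intervals.length];
        for (int i = 0; i < intervals.length; i++) {
            for (int doc : matchingDocs) {
                if (values[doc] >= intervals[i][0] && values[doc] <= intervals[i][1]) {
                    counts[i]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] population = {5, 50, 500, 5000, 50};   // made-up values, indexed by doc id
        int[] matchingDocs = {0, 1, 3, 4};            // doc 2 does not match the main query
        long[][] intervals = {{0, 49}, {50, 10000}};
        int[] a = countByDocValues(population, matchingDocs, intervals);
        int[] b = countByPerIntervalQuery(population, matchingDocs, intervals);
        System.out.println(java.util.Arrays.toString(a) + " " + java.util.Arrays.toString(b));
    }
}
```

Both methods return the same counts, which is the functional equivalence stated above. The difference in a real index is in cost: the per-interval approach can reuse cached results when they exist, while the single-pass approach does the same work regardless of cache state.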



--
This message was sent by Atlassian JIRA
(v6.2#6252)
