lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Joel Bernstein (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SOLR-10096) Parallel SQL not returning the correct results for count distinct aggregation
Date Fri, 03 Feb 2017 21:00:53 GMT

    [ https://issues.apache.org/jira/browse/SOLR-10096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15852119#comment-15852119
] 

Joel Bernstein edited comment on SOLR-10096 at 2/3/17 9:00 PM:
---------------------------------------------------------------

About the performance issue you were seeing. The SQL query is likely in map_reduce mode which
is the default for the SQL interface. If you switch to aggregationMode=facet you'll likely
see comparable performance to the expression you reference.

You can improve your map_reduce throughput by adding workers so it will parallelize the distinct
operation.


was (Author: joel.bernstein):
About the performance issue you were seeing. The SQL query is likely in map_reduce mode which
is the default or the SQL interface. If you switch to aggregationMode=facet you'll likely
see comparable performance to expression you reference.

You can improve your map_reduce throughput by adding workers so it parallelize the select
distinct.

> Parallel SQL not returning the correct results for count distinct aggregation
> -----------------------------------------------------------------------------
>
>                 Key: SOLR-10096
>                 URL: https://issues.apache.org/jira/browse/SOLR-10096
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Parallel SQL
>    Affects Versions: 6.3
>            Reporter: Timothy Potter
>
> While trying to compare the results of this perf exp. blogged here:
> https://www.periscopedata.com/blog/use-subqueries-to-count-distinct-50x-faster.html
> I tried this with 6.3 and it only returns the dashboard_id's:
> {code}
> select dashboard_id, count(distinct user_id) as ct from time_on_site_logs group by dashboard_id
> {code}
> Results: (where is ct?)
> {code}
> {"result-set":{"docs":[
> {"dashboard_id":0},
> {"dashboard_id":2},
> {"dashboard_id":5},
> {"dashboard_id":6},
> {"dashboard_id":8},
> {"dashboard_id":10},
> {"dashboard_id":12},
> {"dashboard_id":13},
> {"dashboard_id":14},
> {"dashboard_id":15},
> ...
> {code}
> So I dropped the alias for {{count(distinct user_id)}} and got the wrong results (i.e.
it's not applying the distinct for user_id's):
> {code}
> {"result-set":{"docs":[
> {"count(*)":8288,"dashboard_id":0},
> {"count(*)":7392,"dashboard_id":2},
> {"count(*)":23800,"dashboard_id":5},
> {"count(*)":25032,"dashboard_id":6},
> {"count(*)":8960,"dashboard_id":8},
> {"count(*)":7840,"dashboard_id":10},
> {"count(*)":17192,"dashboard_id":12},
> {code}
> So I'm guessing this isn't a supported syntax yet?
> Also, probably a different issue, but I then tried this query:
> {code}
> select distinct dashboard_id, user_id from time_on_site_logs
> {code}
> and it took ~70 secs (m3.xlarge), which is very surprising because this streaming expression
(which basically does the initial query above):
> {code}
> select(rollup(facet(time_on_site_logs, q="*:*", buckets="dashboard_id,user_id", bucketSizeLimit=2000,
bucketSorts="dashboard_id asc",count(*)), over="dashboard_id", count(*)), dashboard_id,count(*)
as unique_users)
> {code}
> Returns in ~1.5 secs ... that's a huge difference in performance.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message