lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-6349) LocalParams for enabling/disabling individual stats
Date Wed, 25 Feb 2015 23:22:05 GMT

     [ https://issues.apache.org/jira/browse/SOLR-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-6349:
---------------------------
    Attachment: make-data-and-queries.pl

SOLR-6349


I did a bit of crude benchmarking this morning with the following two use cases in mind:
* user currently asks for stats on fields, cares about all 8 of the stats
* user currently asks for stats on fields, only cares about 4 of the 8

The attached script shows my methodology -- it generates a CSV file with 10 million docs, plus
2 bash files that use curl to hit Solr with 300 *:* query URLs using a randomly selected stats.field.
The sequence of stats.field requests is identical between the 2 bash files, but in one of them the URLs
include localparams to only compute min/max/mean/stddev for the field.
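
As a rough illustration of that setup (not the attached Perl script itself), here is a minimal Java sketch that builds two URL lists from the same seeded random field sequence, one plain and one with the four-stat localparams prefix. The base URL, the field names, and the seed are assumptions made up for the example; only the localparams syntax follows the issue description.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper; not part of the attached make-data-and-queries.pl.
final class StatsQueryUrls {
    // Assumed endpoint and field names, chosen only for illustration.
    static final String BASE = "http://localhost:8983/solr/collection1/select";
    static final String[] FIELDS = {"field_a", "field_b", "field_c"};
    // Localparams prefix limiting the request to min/max/mean/stddev.
    static final String FOUR_STATS = "{!min=true max=true mean=true stddev=true}";

    static String enc(String s) {
        // URLEncoder.encode(String, Charset) requires Java 10 or later.
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    static String url(String statsFieldValue) {
        return BASE + "?q=" + enc("*:*") + "&stats=true&stats.field=" + enc(statsFieldValue);
    }

    // Same seed => same field sequence, so both variants request identical fields.
    static List<String> build(int count, long seed, boolean fourStatsOnly) {
        Random rnd = new Random(seed);
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            String field = FIELDS[rnd.nextInt(FIELDS.length)];
            urls.add(url(fourStatsOnly ? FOUR_STATS + field : field));
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> oldUrls = build(300, 42L, false); // all stats
        List<String> newUrls = build(300, 42L, true);  // four stats only
        System.out.println(oldUrls.get(0));
        System.out.println(newUrls.get(0));
    }
}
```

Seeding both lists identically keeps the comparison fair: any timing difference then comes from the localparams, not from which fields happened to be picked.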

Here are the results...

{noformat}
NOW     BASELINE: 126.008 seconds (ie: all stats ... queries-old.sh)

PATCH  ALL STATS: 133.571 seconds (6% slower ... queries-old.sh)
PATCH FOUR STATS: 130.515 seconds (3% slower ... queries-new.sh)
{noformat}

So not only has asking for all stats on a field gotten slower with this patch, but even asking
for only 4 of the 8 possible numeric stats on a field is still slower than the existing code
when all of them are returned.

A key thing to note here is that this is the total wall clock time from the perspective of
the client, including reading the response from Solr.  Not only are we (in theory) doing
only 1/2 as much math per request in the "FOUR STATS" situation, the XML response
of each query is only ~3/4ths the size of the original responses.  This should mean a lot less
time both in processing the results and in writing/reading the data over the wire ... and
yet instead of seeing some perf improvements, we see performance suffer.

I suspect a key factor here goes back to one of the concerns i mentioned earlier...

{quote}
{code}
if (statsField.calculateStat(X)) { 
  X = calculateX() 
}
{code}
...pattern you mentioned in so much code - that's one of the reasons i abandoned my last patch
(and before i abandoned it, i was focusing on trying to ensure that it was at least always
a comparison with a final boolean in the hopes that the JVM could optimize the if away)
{quote}

...the cumulative overhead of those method calls for every possible stat is probably
counteracting any gains made by reducing the stats that are computed.
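
To make the "final boolean" idea concrete, here is a minimal Java sketch, assuming a simplified accumulator with only three stats. The class, enum, and method names are hypothetical and are not taken from any of the attached patches; the point is only that the requested stats are resolved once into final fields, so the per-value loop tests plain booleans rather than calling something like calculateStat(X) for every stat on every value. Whether the JIT actually eliminates those branches would still need to be measured.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical illustration only; names are not from the Solr patch.
final class SelectiveStats {
    enum Stat { MIN, MAX, SUM }

    // Resolved once per request so the hot loop checks final fields
    // instead of doing a lookup call for every stat on every value.
    private final boolean wantMin;
    private final boolean wantMax;
    private final boolean wantSum;

    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;
    private double sum = 0.0;

    SelectiveStats(Set<Stat> requested) {
        this.wantMin = requested.contains(Stat.MIN);
        this.wantMax = requested.contains(Stat.MAX);
        this.wantSum = requested.contains(Stat.SUM);
    }

    void accumulate(double value) {
        if (wantMin && value < min) min = value;
        if (wantMax && value > max) max = value;
        if (wantSum) sum += value;
    }

    // A null return means the stat was not requested.
    Double min() { return wantMin ? Double.valueOf(min) : null; }
    Double max() { return wantMax ? Double.valueOf(max) : null; }
    Double sum() { return wantSum ? Double.valueOf(sum) : null; }

    public static void main(String[] args) {
        SelectiveStats stats = new SelectiveStats(EnumSet.of(Stat.MIN, Stat.SUM));
        for (double v : new double[] {3.0, 1.0, 2.0}) {
            stats.accumulate(v);
        }
        System.out.println("min=" + stats.min() + " max=" + stats.max() + " sum=" + stats.sum());
    }
}
```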

----

My next step is to focus on fixing the current patch code so the few remaining nocommit assertions
in the test start passing (see earlier comments re "min='false'") -- but once the behavior
is locked down and solid i think we really need to re-assess and re-factor the code to see
some perf gains before there's any point in moving towards adding this feature.

(NOTE: if anyone spots any flaws in my little mini-benchmark, please speak up -- i would be
very happy to be wrong)



> LocalParams for enabling/disabling individual stats
> ---------------------------------------------------
>
>                 Key: SOLR-6349
>                 URL: https://issues.apache.org/jira/browse/SOLR-6349
>             Project: Solr
>          Issue Type: Sub-task
>            Reporter: Hoss Man
>         Attachments: SOLR-6349-tflobbe.patch, SOLR-6349-tflobbe.patch, SOLR-6349-tflobbe.patch,
SOLR-6349-xu.patch, SOLR-6349-xu.patch, SOLR-6349-xu.patch, SOLR-6349-xu.patch, SOLR-6349.patch,
SOLR-6349.patch, SOLR-6349.patch, SOLR-6349.patch, SOLR-6349___bad_idea_broken.patch, make-data-and-queries.pl
>
>
> Stats component currently computes all stats (except for one) every time because they
are relatively cheap, and in some cases dependent on each other for distrib computation --
but if we start layering stats on other things it becomes unnecessarily expensive to compute
all the stats when they just want the "sum" (and it will definitely become excessively verbose
in the responses).  
> The plan here is to use local params to make this configurable.  All of the existing
stat options could be modeled as a simple boolean param, but future params (like percentiles)
might take in a more complex param value...
> Example:
> {noformat}
> stats.field={!min=true max=true percentiles='99,99.999'}price
> stats.field={!mean=true}weight
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

