lucene-dev mailing list archives

From "Joel Bernstein (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-8962) Add sort Streaming Expression
Date Thu, 28 Apr 2016 01:40:13 GMT

    [ https://issues.apache.org/jira/browse/SOLR-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261369#comment-15261369
] 

Joel Bernstein commented on SOLR-8962:
--------------------------------------

I've been thinking a little bit about this ticket. One of the nice things it provides is
the ability to re-sort a set following a join. So we could do innerJoin->sort->rollup, which
is a key use case. We can also do innerJoin->sort->innerJoin, which is another key use case.

I did a quick test to see how many random strings could be sorted per second. I used the Random
class to pick random longs and turned the longs into Strings for the test set.

I was seeing sort times of 1 second for 1.5 million random strings, using Collections.sort().
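
A minimal sketch of the kind of test described above, for anyone who wants to reproduce it.
The class and method names (SortTiming, randomStrings, timeSortMillis) and the fixed seed are
assumptions for illustration, not the code that was actually run. A single timed run like this
ignores JIT warm-up, so the numbers will vary by machine and JVM.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class SortTiming {
    // Build n strings from random longs, as described in the message.
    // The seed parameter is an assumption, added only to make runs repeatable.
    static List<String> randomStrings(int n, long seed) {
        Random random = new Random(seed);
        List<String> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            out.add(Long.toString(random.nextLong()));
        }
        return out;
    }

    // Sort the list in place and return the elapsed wall-clock time in milliseconds.
    static long timeSortMillis(List<String> strings) {
        long start = System.nanoTime();
        Collections.sort(strings);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        List<String> strings = randomStrings(1_500_000, 42L);
        long millis = timeSortMillis(strings);
        System.out.println("Sorted " + strings.size() + " strings in " + millis + " ms");
    }
}
```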


So with 50 workers, each sorting on a single thread, that translates to roughly 75 million records per second.

With a fork/join merge sort we should be able to scale nearly linearly until we hit the number
of processors on the server. This is because sorting has tight memory locality, so it
shouldn't saturate the memory bus. So with 8 threads we can expect to sort close to 12 million
records per second on each worker (8 x 1.5 million). Now we're talking some big numbers. With 50 workers we'd
be sorting 600,000,000 records per second.
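
To make the fork/join idea concrete, here is a rough sketch of a merge sort built on the JDK's
ForkJoinPool and RecursiveAction. It is not taken from the patch. The class name
ForkJoinMergeSort, the THRESHOLD value and the parallelSort helper are all assumptions for
illustration. Each task splits its range in half, sorts the halves in parallel, then merges them
through a shared scratch array. Small ranges fall back to a sequential Arrays.sort.

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class ForkJoinMergeSort {
    // Ranges at or below this size are sorted sequentially (hypothetical, untuned value).
    static final int THRESHOLD = 8_192;

    static class SortTask extends RecursiveAction {
        private final String[] data;
        private final String[] scratch;
        private final int lo; // inclusive
        private final int hi; // exclusive

        SortTask(String[] data, String[] scratch, int lo, int hi) {
            this.data = data;
            this.scratch = scratch;
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        protected void compute() {
            if (hi - lo <= THRESHOLD) {
                Arrays.sort(data, lo, hi);
                return;
            }
            int mid = (lo + hi) >>> 1;
            // Sort both halves in parallel and wait for both to finish.
            invokeAll(new SortTask(data, scratch, lo, mid),
                      new SortTask(data, scratch, mid, hi));
            merge(mid);
        }

        // Merge the sorted halves [lo, mid) and [mid, hi) back into data.
        private void merge(int mid) {
            System.arraycopy(data, lo, scratch, lo, hi - lo);
            int i = lo, j = mid, k = lo;
            while (i < mid && j < hi) {
                data[k++] = scratch[i].compareTo(scratch[j]) <= 0 ? scratch[i++] : scratch[j++];
            }
            while (i < mid) data[k++] = scratch[i++];
            while (j < hi) data[k++] = scratch[j++];
        }
    }

    // Sort data in place using a pool with the given number of threads.
    static void parallelSort(String[] data, int threads) {
        ForkJoinPool pool = new ForkJoinPool(threads);
        try {
            pool.invoke(new SortTask(data, new String[data.length], 0, data.length));
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling ForkJoinMergeSort.parallelSort(strings, 8) would correspond to the 8-thread case above.
The JDK's own Arrays.parallelSort is also built on fork/join and may be the simpler option in
practice.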

What's nice about fork/join is that it gives us two levels of parallelism. We get the first
level of parallelism by having multiple workers, and then we get the second level by threading.
I see some very fast operations following joins in the future.



> Add sort Streaming Expression
> -----------------------------
>
>                 Key: SOLR-8962
>                 URL: https://issues.apache.org/jira/browse/SOLR-8962
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Joel Bernstein
>            Assignee: Dennis Gove
>            Priority: Critical
>             Fix For: master, 6.1
>
>         Attachments: SOLR-8962.patch, SOLR-8962.patch
>
>
> The sort Streaming Expression does an in-memory sort of the Tuples returned by its underlying
> stream. This is intended to be used for sorting sets gathered during local graph traversals.
> This will make it easy to gather sets during a traversal and use all of the sort-based set
> operations (merge, innerJoin, outerJoin, reduce, complement, intersect).
> This will be particularly useful with the gatherNodes expression (SOLR-8925). Sample
> syntax:
> {code}
> intersect(
>        sort(gatherNodes(...), "fieldA asc"),
>        sort(gatherNodes(...), "fieldA asc"),
>        on)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

