lucene-dev mailing list archives

From "Joel Bernstein (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-7082) Streaming Aggregation for SolrCloud
Date Fri, 06 Feb 2015 15:22:34 GMT

     [ https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joel Bernstein updated SOLR-7082:
---------------------------------
    Description: 
This issue provides a general purpose streaming aggregation framework for SolrCloud. An overview
of how it works can be found at this link:

http://heliosearch.org/streaming-aggregation-for-solrcloud/

This functionality allows SolrCloud users to perform operations that were typically done
using map/reduce or a parallel computing platform.

Here is a brief explanation of how the framework works:

There is a new Solrj io package:

*org.apache.solr.client.solrj.io*

Key classes:

*Tuple* abstracts a document in a search result as a Map of key/value pairs.
*TupleStream* is the base class for all of the streams. It abstracts search results as a stream of Tuples.
*SolrStream* connects to a single Solr instance. Call the read() method to iterate over the Tuples.
*CloudSolrStream* connects to a SolrCloud collection and merges the results based on the sort param. The merge takes place in CloudSolrStream itself.
*Decorator Streams* wrap other streams to gather *Metrics* on streams and *transform* the streams. Examples include the MetricStream, RollupStream, GroupByStream, UniqueStream, MergeJoinStream, HashJoinStream, MergeStream, and FilterStream. A usage sketch follows below.
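
To make the read loop concrete, here is a minimal sketch of iterating a CloudSolrStream. It assumes a constructor of the form CloudSolrStream(zkHost, collection, params) and an EOF marker Tuple; the exact signatures may differ from the attached patch.

{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.io.CloudSolrStream;
import org.apache.solr.client.solrj.io.Tuple;

public class CloudStreamSketch {
  public static void main(String[] args) throws Exception {
    Map<String, String> params = new HashMap<>();
    params.put("q", "*:*");
    params.put("fl", "id,a_s,a_i");
    params.put("sort", "a_i asc"); // this sort drives the merge inside CloudSolrStream

    // Connects to each shard of the collection via ZooKeeper and
    // merges the per-shard streams on the sort order.
    CloudSolrStream stream = new CloudSolrStream("localhost:9983", "collection1", params);
    try {
      stream.open();
      while (true) {
        Tuple tuple = stream.read();
        if (tuple.EOF) break; // an EOF marker tuple ends the stream
        System.out.println(tuple.getString("id"));
      }
    } finally {
      stream.close();
    }
  }
}
{code}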

*Going Parallel and "Worker Collections"*

The io package also contains the *ParallelStream*, which wraps a TupleStream and sends it
to N worker nodes. The workers are chosen from a SolrCloud collection. These "Worker Collections"
don't have to hold any data; they can be used purely to execute TupleStreams. A sketch follows below.
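
For illustration, here is a hedged sketch of running a UniqueStream across four workers. The constructor shapes (ParallelStream taking a zkHost, worker collection, wrapped stream, worker count, and a Tuple Comparator) are assumptions and may not match the patch exactly.

{code:java}
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.io.CloudSolrStream;
import org.apache.solr.client.solrj.io.ParallelStream;
import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.UniqueStream;

public class ParallelSketch {
  public static void main(String[] args) throws Exception {
    Map<String, String> params = new HashMap<>();
    params.put("q", "*:*");
    params.put("fl", "id,a_s");
    params.put("sort", "a_s asc");
    params.put("partitionKeys", "a_s"); // equal keys shuffle to the same worker (see below)

    // The underlying search, executed by each worker against the data collection.
    CloudSolrStream search = new CloudSolrStream("localhost:9983", "collection1", params);

    // De-duplicate on a_s; correctness relies on the sort order of the stream.
    Comparator<Tuple> comp = (a, b) -> a.getString("a_s").compareTo(b.getString("a_s"));
    UniqueStream unique = new UniqueStream(search, comp);

    // Fan the wrapped stream out to 4 workers drawn from "workerCollection",
    // then merge the partial results back in sort order.
    ParallelStream parallel = new ParallelStream("localhost:9983", "workerCollection", unique, 4, comp);
    try {
      parallel.open();
      for (Tuple t = parallel.read(); !t.EOF; t = parallel.read()) {
        System.out.println(t.getString("a_s"));
      }
    } finally {
      parallel.close();
    }
  }
}
{code}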

*The StreamHandler*

The Worker nodes have a new RequestHandler called the *StreamHandler*. The ParallelStream
serializes a TupleStream, before it is opened, and sends it to the StreamHandler on the Worker
Nodes.

The StreamHandler on each Worker node deserializes the TupleStream, opens the stream, iterates
over the tuples and streams them back to the ParallelStream. The ParallelStream performs the final
merge of Metrics and can be wrapped by other Streams to handle the final merged TupleStream.
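
Purely to illustrate the serialize/ship/deserialize round trip: the wire format used by the patch is not specified here, so standard Java serialization is an assumption in the sketch below.

{code:java}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;

// Illustrative only: round-trips a Serializable object, standing in for a
// TupleStream travelling from the ParallelStream to a worker's StreamHandler.
public class StreamWireSketch {

  // ParallelStream side: serialize the (not yet opened) stream into a request param.
  public static String serialize(Serializable stream) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(stream);
    }
    return Base64.getEncoder().encodeToString(bytes.toByteArray());
  }

  // StreamHandler side: decode and deserialize, then open and iterate the stream.
  public static Object deserialize(String encoded) throws IOException, ClassNotFoundException {
    byte[] bytes = Base64.getDecoder().decode(encoded);
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
      return in.readObject();
    }
  }
}
{code}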

*Sorting and Partitioning search results (Shuffling)*

Each Worker node receives 1/N of the document results. A "partitionKeys" parameter
can be included with each TupleStream to ensure that Tuples with the same partitionKeys
are shuffled to the same Worker. The actual partitioning is done with a filter query using
the HashQParserPlugin. The DocSets from the HashQParserPlugin can be cached in the filter
cache, which provides extremely high performance hash partitioning. An example follows below.
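
As a sketch of the request each worker effectively issues, the snippet below builds the params for one of four workers. The {!hash workers=... worker=...} local-param syntax is an assumption about how the HashQParserPlugin is invoked; treat it as illustrative.

{code:java}
import org.apache.solr.common.params.ModifiableSolrParams;

public class HashPartitionSketch {
  public static void main(String[] args) {
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("q", "*:*");
    params.set("fl", "id,year_i");
    params.set("sort", "year_i asc");
    // Tuples with the same year_i hash to the same worker.
    params.set("partitionKeys", "year_i");
    // Worker 0 of 4 keeps only its 1/4 slice; the DocSet produced by the
    // hash filter can be reused from the filter cache on later requests.
    params.set("fq", "{!hash workers=4 worker=0}");
    System.out.println(params);
  }
}
{code}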

Many of the stream transformations rely on the sort order of the TupleStreams. The search
results can be sorted by specific keys, and the "/export" handler can be used to sort entire
result sets efficiently (see the sketch below).
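
For example, a SolrStream can be pointed at the "/export" handler via the qt parameter; as above, the exact constructor signature is an assumption.

{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.io.SolrStream;
import org.apache.solr.client.solrj.io.Tuple;

public class ExportSortSketch {
  public static void main(String[] args) throws Exception {
    Map<String, String> params = new HashMap<>();
    params.put("q", "*:*");
    params.put("qt", "/export");   // route the request through the /export handler
    params.put("fl", "id,a_i");    // /export streams docValues-backed fields
    params.put("sort", "a_i asc"); // the entire result set comes back in this order

    SolrStream stream = new SolrStream("http://localhost:8983/solr/collection1", params);
    try {
      stream.open();
      for (Tuple t = stream.read(); !t.EOF; t = stream.read()) {
        System.out.println(t.getString("id"));
      }
    } finally {
      stream.close();
    }
  }
}
{code}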

> Streaming Aggregation for SolrCloud
> -----------------------------------
>
>                 Key: SOLR-7082
>                 URL: https://issues.apache.org/jira/browse/SOLR-7082
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>            Reporter: Joel Bernstein
>             Fix For: Trunk
>
>         Attachments: SOLR-7082.patch
>



