lucene-dev mailing list archives

From "Yonik Seeley (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SOLR-13399) compositeId support for shard splitting
Date Wed, 10 Jul 2019 18:44:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882321#comment-16882321 ]

Yonik Seeley edited comment on SOLR-13399 at 7/10/19 6:43 PM:
--------------------------------------------------------------

Here's a draft patch (no tests yet) for feedback.
This adds a parameter "splitByPrefix" to SPLITSHARD.  When the overseer sees this parameter,
it sends an additional SPLIT request with the "getRanges" parameter set.  This causes SPLIT
(SplitOp.java) to calculate the ranges based on the prefix field "id_prefix" and return the
recommended split string in the response in the "ranges" parameter.  SPLITSHARD in the overseer
then proceeds as if that ranges string had been passed in by the user.
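
For illustration only, here is a rough sketch of how a split point could be picked from per-prefix document counts. This is not the code in SOLR-13399.patch: the class and method names are hypothetical, the input is assumed to be the doc count of each prefix bucket listed in hash order, and the result is a bucket index rather than the hash-range string that SPLIT returns in "ranges".

```java
class SplitPointSketch {
    /**
     * Returns how many leading buckets go into the first sub-shard, chosen so
     * that the two sides differ as little as possible in document count.
     * Buckets are never cut in half, and each side keeps at least one bucket.
     * (Hypothetical helper, not part of Solr.)
     */
    static int chooseSplit(long[] docCounts) {
        if (docCounts.length < 2) {
            throw new IllegalArgumentException("need at least two buckets");
        }
        long total = 0;
        for (long c : docCounts) {
            total += c;
        }
        long left = 0;
        int best = 1;
        long bestDiff = Long.MAX_VALUE;
        // Try every boundary between adjacent buckets.
        for (int i = 0; i < docCounts.length - 1; i++) {
            left += docCounts[i];
            // |left - right| where right = total - left
            long diff = Math.abs(total - 2 * left);
            if (diff < bestDiff) {
                bestDiff = diff;
                best = i + 1;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Assumed doc counts for five prefixes, in hash order.
        long[] counts = {100, 300, 50, 150, 200};
        System.out.println("first sub-shard gets " + chooseSplit(counts) + " buckets");
    }
}
```

Splitting only on bucket boundaries matches the goal in the issue description of not splitting within a hash bucket unless necessary.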

"id_prefix" is currently populated via a copyField in the schema:
{code}
  <!-- needed for splitByPrefix -->
  <field name="id_prefix" type="composite_id_prefix" indexed="true" stored="false"/>
  <copyField source="id" dest="id_prefix"/>
  <fieldtype name="composite_id_prefix" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.PatternTokenizerFactory" pattern=".*!" group="0"/>
    </analyzer>
  </fieldtype>
{code}
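
To see what that pattern would put into "id_prefix", the sketch below applies the same regex with plain java.util.regex. It only approximates the tokenizer (it assumes group="0" means "the whole first match", and the sample ids and class name are made up), but it shows that the greedy ".*!" keeps everything up to the last "!" and that ids without a "!" produce no token.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CompositeIdPrefixSketch {
    // Same regex as the tokenizer's pattern attribute in the schema snippet.
    private static final Pattern PREFIX = Pattern.compile(".*!");

    /** Returns the whole first match (group 0), or empty if the id has no "!". */
    static Optional<String> prefixOf(String id) {
        Matcher m = PREFIX.matcher(id);
        return m.find() ? Optional.of(m.group(0)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical ids: one-level prefix, two-level prefix, no prefix.
        for (String id : new String[] {"tenantA!doc1", "a!b!doc7", "doc42"}) {
            System.out.println(id + " -> " + prefixOf(id).orElse("(no token)"));
        }
    }
}
```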

The prefix field is currently always "id_prefix" (convention / implicit).  Not sure if it
adds value to make it configurable via a "field" parameter on the SPLITSHARD command.



was (Author: yseeley@gmail.com):
Here's a draft patch (no tests yet) for feedback.
This adds a parameter "splitByPrefix" to SPLITSHARD.  When the overseer sees this parameter,
it sends an additional SPLIT request with the "getRanges" parameter set.  This causes SPLIT
(SplitOp.java) to calculate the ranges based on the prefix field "id_prefix" and return the
recommended split string in the response in the "ranges" parameter.  SPLITSHARD in the overseer
then proceeds as if that ranges string had been passed in by the user.

"id_prefix" is currently populated via a copyField in the schema:
{code}
    <!-- needed for splitByPrefix -->
  <field name="id_prefix" type="composite_id_prefix" indexed="true" stored="false"/>
  <copyField source="id" dest="id_prefix"/>
  <fieldtype name="composite_id_prefix" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.PatternTokenizerFactory" pattern=".*!" group="0"/>
    </analyzer>
  </fieldtype>
{code}

The field "id_prefix" is currently hard-coded.  Perhaps this should be made configurable via
a "field" parameter on the SPLITSHARD command?


> compositeId support for shard splitting
> ---------------------------------------
>
>                 Key: SOLR-13399
>                 URL: https://issues.apache.org/jira/browse/SOLR-13399
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Yonik Seeley
>            Priority: Major
>         Attachments: SOLR-13399.patch
>
>
> Shard splitting does not currently have a way to automatically take into account the
actual distribution (number of documents) in each hash bucket created by using compositeId
hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* command that would
look at the number of docs sharing each compositeId prefix and use that to create roughly
equal sized buckets by document count rather than just assuming an equal distribution across
the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash buckets unless
necessary (since that leads to larger query fanout). Perhaps this warrants a parameter that
would control how much of a size mismatch is tolerable before resorting to splitting within
a bucket. *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index the prefix
in a different field.  Iterating over the terms for this field would quickly give us the
number of docs in each (i.e., Lucene keeps track of the doc count for each term already).
Perhaps the implementation could be a flag on the *id* field... something like *indexPrefixes* and poly-fields
that would cause the indexing to be automatically done and alleviate having to pass in an
additional field during indexing and during the call to *SPLITSHARD*.  This whole part is
an optimization though and could be split off into its own issue if desired.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

