lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dmitry Kan <>
Subject Re: hot shard concept
Date Thu, 01 Nov 2012 16:29:33 GMT
Hi Shawn,

Too much technical details are often better, than too little, to my taste
of course :-)
You approach to sharding is apparently hashing based. And that's why you
need to maintain doc did values that come from MySQL in a separate storage
and decide on split point. That's totally legit. And I must say, it is
quite an interesting approach (makes me thing to try something like this
when time permits).

There is another way of looking at sharding. That is the time based
sharding. If your documents (like in our case) are attributed to some point
in time, they belong to their "time" shards. In this case the hot shard is
always the youngest of all your shards in the cluster and slides in time to
the future. Same way as you do when choosing new split point every night
and reindexing, in time based sharding, documents that became too old to be
in the hot shard (sort if they are not hot anymore, that is, not searched
for as often), they can join the cold shards. Thus the cold shards in this
scheme are always growing as hot shard will be pushing more and more
documents into them, while the hot shard remains slim enough to sustain
heavy update and search rate (hot documents are searched for more times,
than the cold ones).

My original question was, basically, what technique to use in order to
physically move hot shard documents to cold shards. Because we want to
decouple the document processing backend (our docs are not stored in any
DB, they are real textual documents) and because our cold shards are very
big (~60GB easily), we wanted to find an optimal solution for the task that
wouldn't require reindexing and  coupling the document processing backend
with SOLR too tight. And this seems to be possible by using low-level
Lucene index partitioning (see some links in my first post).


On Wed, Oct 31, 2012 at 7:50 AM, Shawn Heisey <> wrote:

> On 10/30/2012 5:05 AM, Dmitry Kan wrote:
>> Hi Shawn,
>> Thanks for sharing your story. Let me get it right:
>> How do you keep the incremental shard slim enough over time, do you
>> periodically redistribute the documents from it onto cold shards? If yes,
>> how technically you do it: the Lucene low-level way or Solr / SolrJ way?
> Warning: This email fits nicely into the tl;dr category.  I'm including
> entirely too much information because I'm not sure which bits you're really
> interested in.
> My database and Solr index have two fields that contain unique values.
>  Solr's unique key is what we call the tag_id (alphanumeric), but each
> document also has a MySQL autoincrement field called did, for document id,
> or possibly delete id, which is a tlong in the Solr schema.  The MySQL
> primary key is did.  I divvy up documents among the six cold shards by a
> mod on the crc32 hash (MySQL function) of the did field, my cold shards are
> numbered 0 through 5.  That crc32 hash is not indexed or stored in Solr,
> but now that I think about it, perhaps I should add it to the Solr-specific
> database view.
> The did field is also where I look for my "split point" which marks the
> line between hot and cold.  Values less than or equal to the split point
> are in cold shards, values greater than the split point go in the hot shard.
> Once an hour, my SolrJ build system gets MAX(did) from the database and
> stores it in a JRobin RRD.  Every night, I consult those values and do
> document counts against the database to pick a new split point.  Then I
> index documents between the old split point and the new split point into
> the cold shards, and if that succeeds, I delete the same DID range from the
> hot shard.  I wrote all the code that does this using the SolrJ API,
> storing persistent values in a MySQL database table.  I'm not aware of any
> shortcuts I could use.
> Additional note: Full reindexes are accomplished with the dataimport
> handler, using the following SQL query.  For the hot shard, I pass in a
> modVal of 0,1,2,3,4,5 so that it gets all of the documents in the did range:
>         SELECT * FROM ${dataimporter.request.**dataView}
>         WHERE (
>           (
>             did &gt; ${dataimporter.request.minDid}
>             AND did &lt;= ${dataimporter.request.maxDid}
>           )
>           ${dataimporter.request.**extraWhere}
>         ) AND (crc32(did) % ${dataimporter.request.**numShards})
>           IN (${dataimporter.request.**modVal})
> Back when we first started with Solr 1.4.0, the build system was written
> in Perl (LWP::Simple) and did everything but deletes with the dataimport
> handler.  Deletes were done by query using xml and the /update handler.
> Thanks,
> Shawn


Dmitry Kan

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message