lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Solr: separating index and storage
Date Thu, 06 Jun 2013 19:43:12 GMT
bq: I am anticipating that this growth will slow down because there
will be repetitions

This will be true for your indexed data, but NOT for your stored data.
Each stored field is stored as-is per document. It'll be compressed, so
won't take up the entire 250M, but it'll still be stored.
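
To make the indexed-versus-stored distinction concrete, here is a toy sketch in Python. It is not how Lucene works internally: the vocabulary, document sizes, and the `simulate` function are all made up for illustration, and it only tracks the set of unique terms. Real postings lists still grow with every document, so the index is not free either. The point is just that repetition caps the term dictionary while stored data keeps growing per document.

```python
import random

def simulate(num_docs, vocab_size=1000, words_per_doc=50, seed=0):
    """Toy model: track unique indexed terms vs. raw stored bytes.

    All parameters are illustrative assumptions, not measurements.
    """
    rng = random.Random(seed)
    vocab = [f"term{i}" for i in range(vocab_size)]
    unique_terms = set()   # stands in for the index's term dictionary
    stored_bytes = 0       # stands in for uncompressed stored field data
    history = []
    for _ in range(num_docs):
        words = [rng.choice(vocab) for _ in range(words_per_doc)]
        unique_terms.update(words)                            # repeats add nothing
        stored_bytes += len(" ".join(words).encode("utf-8"))  # every doc adds bytes
        history.append((len(unique_terms), stored_bytes))
    return history

history = simulate(num_docs=200)
terms_at_100, bytes_at_100 = history[99]
terms_at_200, bytes_at_200 = history[199]
print("unique terms:", terms_at_100, "->", terms_at_200)
print("stored bytes:", bytes_at_100, "->", bytes_at_200)
```

In this model the unique-term count flattens once the vocabulary has been seen, while the stored byte count roughly doubles between document 100 and document 200.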

FWIW,
Erick

On Thu, Jun 6, 2013 at 8:02 AM, Sourajit Basak <sourajit.basac@gmail.com> wrote:
> Each day the index grows by ~250 MB; however I am anticipating that this
> growth will slow down because there will be repetitions (just a guess). It's
> not the order of growth but the limitation of our infrastructure. Basically a
> budgetary constraint :-)
>
> Apparently there seems to be no problem other than disk space. So we will go
> ahead with the idea of stored fields.
>
>
>
>
> On Thu, Jun 6, 2013 at 5:03 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>
>> By and large, stored fields are pretty irrelevant for resource
>> consumption _except_ for disk space consumed. Sharded systems work
>> fine; the stored data is stored in the index files (*.fdt and *.fdx)
>> in each segment on each shard.
>>
>> But you haven't told us anything about your data. How much are
>> you talking about here? 100s of G? Terabytes? Other than disk
>> space, you may well be anticipating problems that don't exist...
>>
>> Now, when _returning_ documents the fields must be read, so
>> there is some resource consumption there which you can
>> mitigate with lazy field loading. But this is usually just a few docs
>> so often isn't a problem.
>>
>> Best
>> Erick
>>
>> On Thu, Jun 6, 2013 at 3:34 AM, Sourajit Basak <sourajit.basac@gmail.com> wrote:
>> > Absolutely. Solr will return the reference along with the docs/results; those
>> > references may be used to look-up the actual stuff. Such use cases aren't
>> > hard to solve.
>> >
>> > If the use case demands returning the actual stuff alongside the results,
>> > it becomes non-trivial, especially during high loads.
>> >
>> > To avoid this and do a quick implementation I can judiciously create stored
>> > fields and see how it performs. I will need to figure out what happens if
>> > the volume growth of stored fields is high, how much disk I/O there is, and
>> > what happens to the stored fields if we shard the index.
>> >
>> > Best,
>> > Sourajit
>> >
>> >
>> >
>> >
>> > On Tue, Jun 4, 2013 at 5:31 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>> >
>> >> You have to index something with your Solr documents that
>> >> has meaning in _your_ system so you can find the
>> >> original record. You don't search this field, you just
>> >> return it with the search results and then use it to get
>> >> the original document.
>> >>
>> >> If you're storing the original in a DB, this can be the PK.
>> >> If on a file system, the path, etc.
>> >>
>> >> Essentially, since the association is specific to your environment
>> >> you need to handle it explicitly...
>> >>
>> >> Best
>> >> Erick
>> >>
>> >> On Mon, Jun 3, 2013 at 11:56 AM, Sourajit Basak
>> >> <sourajit.basac@gmail.com> wrote:
>> >> > Consider the following use case.
>> >> >
>> >> > Certain words are extracted from a document and indexed. The exact sentence
>> >> > containing the word cannot be stored alongside the extracted word because
>> >> > of the volume at which the documents grow. How can the index and, let's call
>> >> > them, doc servers be separated?
>> >> >
>> >> > An option is to store the sentences in MongoDB or an RDBMS. But there seems
>> >> > to be a schema-level design issue. Assuming 'word' to be a multivalued
>> >> > field, how do we associate with it a reference to the corresponding entry in
>> >> > the doc server?
>> >> >
>> >> > We may create (word_1, ref_1) tuples. Is there any other built-in feature?
>> >> >
>> >> > Is there any related project which separates index & doc servers?
>> >> >
>> >> > Thanks,
>> >> > Sourajit
>> >>
>>
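
The approach suggested in the earliest reply of this thread (index a key that has meaning in your own system, return it with the search results, then use it to fetch the original record) can be sketched roughly as follows. This is only an illustration in plain Python: `doc_store`, `index_docs`, `search`, and `fetch_originals` are hypothetical stand-ins, with a dict playing the role of the DB or file system and a list comprehension playing the role of a Solr query. No real Solr client API is used or implied.

```python
# Hypothetical external "doc server": reference key -> original sentence.
# In practice this could be a DB table keyed by PK, or files keyed by path.
doc_store = {
    "ref_1": "The quick brown fox jumps over the lazy dog.",
    "ref_2": "A fox was seen near the river.",
}

# Hypothetical indexed documents: searchable words plus a returned-only key.
index_docs = [
    {"id": "1", "word": ["fox", "dog"], "ref": "ref_1"},
    {"id": "2", "word": ["fox", "river"], "ref": "ref_2"},
]

def search(word):
    """Stand-in for a search request: match on 'word', return id and ref only."""
    return [
        {"id": d["id"], "ref": d["ref"]}
        for d in index_docs
        if word in d["word"]
    ]

def fetch_originals(hits):
    """Resolve each hit's reference key against the external store."""
    return [doc_store[h["ref"]] for h in hits]

hits = search("fox")
originals = fetch_originals(hits)
for hit, text in zip(hits, originals):
    print(hit["ref"], "->", text)
```

The second lookup is the cost discussed in the thread: it is cheap when only a few results are shown, and it is the step that stored fields would avoid at the price of disk space.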
