lucene-java-user mailing list archives

From Cristian Lorenzetto <>
Subject Re: docid is just a signed int32
Date Sun, 21 Aug 2016 00:28:39 GMT
In my opinion this study doesn't tell us anything new. Obviously, if you try to retrieve all the data in the store with a single query, performance will be poor. Lucene is fantastic, but it is not magic: the laws of physics still apply. Queries are designed to retrieve a small part of a big store, not the whole store. Besides, I think the time would be just as bad even without sorting the documents; with a persisted sorted linked list I don't see any relevant delay. Honestly, I also don't understand the GC memory argument against the Lucene algorithms: the amount of memory used is not proportional to the datastore size, otherwise Lucene would not be scalable. For me the problem to analyze is a different one: considering how big data has grown in recent years, the typical maximum size of the databases we know, and whether or not Lucene can scale up by sharding across dynamically defined sets of indexes, we can evaluate whether this refactoring makes sense or not.

Sent from iPad

> On 19 Aug 2016, at 05:50, Erick Erickson <> wrote:
> OK, I'm a little out of my league here, but I'll plow on anyway....
>
> bq: There are use cases out there where >2^31 does make sense in a single index
>
> OK, let's put some definition to this and define the use case specifically rather than be vague. For instance, I've just run an experiment where I had 200M docs (very small docs) in a single shard and tried to sort all of them by a date field. Performance was on the order of 5 seconds; extrapolating linearly, 3B docs would take around 75 seconds. Does the use case involve sorting? Faceting? If so, performance will probably be poor.
>
> This would be huge surgery, I believe, and there hasn't been a compelling use case for it in the search world. Unless and until that case is made, I suspect this idea will meet with a lot of resistance.
>
> That said, I do understand that this is somewhat akin to "nobody will ever need more than 64K of RAM", meaning that some limits are assumed and eventually become outmoded. But given Java's issues with memory and GC, I suspect it'll be really hard to justify the work this would take.
>
> Erick
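
A rough sketch of the kind of sorted search Erick describes, assuming the documents were indexed with a numeric doc-values field named "date"; the field name, index path, and setup are assumptions, not details from the thread:

    import java.io.IOException;
    import java.nio.file.Path;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    // Sort every document in a (hypothetical) index by a long "date" field.
    static TopDocs sortAllByDate(Path indexDir) throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(indexDir))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Sort byDate = new Sort(new SortField("date", SortField.Type.LONG));
            // Only 10 hits come back, but the sort still visits the doc values
            // of every matching document, which is why it gets slow at 200M+ docs.
            return searcher.search(new MatchAllDocsQuery(), 10, byDate);
        }
    }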
>> On Thu, Aug 18, 2016 at 6:31 PM, Trejkaz <> wrote:
>>> On Thu, Aug 18, 2016 at 11:55 PM, Adrien Grand <> wrote:
>>> No, IndexWriter enforces that the number of documents cannot go over IndexWriter.MAX_DOCS (which is a bit less than 2^31), and BaseCompositeReader computes the number of documents in a long variable and ensures it is less than 2^31, so you cannot have indexes that contain more than 2^31 documents.
>>>
>>> Larger collections should be written to multiple shards and use TopDocs.merge to merge results.
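
For context, a minimal sketch of the multi-shard pattern Adrien is referring to; the per-shard searchers and the query are assumed to exist already, only TopDocs.merge itself comes from the thread:

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    // Collect the top 10 hits from each shard, then merge them into one global top 10.
    static TopDocs searchAllShards(Query query, IndexSearcher[] shardSearchers) throws IOException {
        TopDocs[] shardHits = new TopDocs[shardSearchers.length];
        for (int i = 0; i < shardSearchers.length; i++) {
            shardHits[i] = shardSearchers[i].search(query, 10);
        }
        return TopDocs.merge(10, shardHits);
    }

Note that the merged result is still a TopDocs whose ScoreDoc entries carry a plain int doc ID (paired with a shardIndex), which is exactly the limitation Trejkaz raises next.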
>> But hang on:
>>
>> * TopDocs#merge still returns a TopDocs.
>> * TopDocs still uses an array of ScoreDoc.
>> * ScoreDoc still uses an int doc ID.
>>
>> Looks like you're still screwed.
>>
>> I wish IndexReader would use long IDs too, because one IndexReader can span multiple shards as well - it doesn't make much sense to me that this is restricted, although "it's hard to fix in a backwards-compatible way" is certainly a good reason. :D
>>
>> TX
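
A tiny illustration of the overflow hiding behind those int fields: rebasing a per-shard doc ID into a single global ID (the kind of long-based scheme Trejkaz wishes for) silently wraps once the totals pass Integer.MAX_VALUE. The numbers below are hypothetical:

    public class DocIdOverflow {
        public static void main(String[] args) {
            int docBase  = 2_000_000_000;       // hypothetical: docs in all earlier shards combined
            int localDoc = 500_000_000;         // hypothetical: doc ID within the current shard
            int globalDoc = docBase + localDoc;          // wraps to -1794967296
            long safeGlobal = (long) docBase + localDoc; // 2500000000 with a long-based API
            System.out.println(globalDoc + " vs " + safeGlobal);
        }
    }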