lucene-java-user mailing list archives

From Cristian Lorenzetto <cristian.lorenze...@gmail.com>
Subject Re: docid is just a signed int32
Date Sun, 21 Aug 2016 17:35:49 GMT
Maybe with TopDocs.merge you can run the same query on multiple indexes;
with MultiReader you can also perform join operations across different
indexes.
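
To make the merge idea concrete, here is a minimal sketch in plain Java of
what merging per-shard results amounts to. It does not use Lucene itself:
ShardMergeSketch, Hit and mergeTopN are hypothetical names invented for this
illustration, and the simple sort stands in for whatever TopDocs.merge does
internally. The point is only that a hit is identified by the pair
(shard index, shard-local doc ID), so two shards may reuse the same int doc ID.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not Lucene's TopDocs.merge implementation.
final class ShardMergeSketch {

    /** One hit: the shard it came from, its shard-local doc ID, and its score. */
    static final class Hit {
        final int shardIndex;
        final int doc;
        final float score;

        Hit(int shardIndex, int doc, float score) {
            this.shardIndex = shardIndex;
            this.doc = doc;
            this.score = score;
        }
    }

    /**
     * Collects the hits of every shard and returns the topN best by score.
     * Ties are broken by shard index, then by shard-local doc ID.
     */
    static List<Hit> mergeTopN(int topN, List<List<Hit>> shardHits) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : shardHits) {
            all.addAll(hits);
        }
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed()
                .thenComparingInt(h -> h.shardIndex)
                .thenComparingInt(h -> h.doc));
        return new ArrayList<>(all.subList(0, Math.min(topN, all.size())));
    }
}
```

Each shard stays below the per-index limit on its own, and the caller has to
remember which shard a hit came from before loading the document.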

2016-08-21 19:31 GMT+02:00 Cristian Lorenzetto <
cristian.lorenzetto@gmail.com>:

> I'm looking over TopDocs.merge.
>
> What is the difference between using multiple IndexSearchers and then
> TopDocs.merge, and using a MultiReader?
>
> 2016-08-21 2:28 GMT+02:00 Cristian Lorenzetto <
> cristian.lorenzetto@gmail.com>:
>
>> In my opinion this study doesn't tell us anything new. Obviously, if you
>> try to retrieve all the stored data in a single query, performance will
>> not be good. Lucene is fantastic, but it is not magic: the laws of
>> physics still apply to Lucene. A query is designed to retrieve a small
>> part of a big store, not the whole store. In addition, I think the time
>> would be poor even if you did not sort the documents. Using a persisted
>> sorted linked list, I don't see significant delays. Sincerely, I also
>> don't understand the GC memory limit argument with Lucene's algorithms.
>> The amount of memory used is not proportional to the datastore size;
>> otherwise Lucene would not be scalable. For me the question to analyze is
>> a different one: considering the trend of big data growth in recent
>> years, the typical maximum size of the databases we know, and whether or
>> not sharding in Lucene can be scaled up with arrays defined dynamically
>> or not, we can evaluate whether this refactoring makes sense or not.
>>
>> Sent from iPad
>>
>> > On 19 Aug 2016, at 05:50, Erick Erickson <erickerickson@gmail.com> wrote:
>> >
>> > OK, I'm a little out of my league here, but I'll plow on anyway....
>> >
>> > bq: There are use cases out there where >2^31 does make sense in a single index
>> >
>> > Ok, let's put some definition to this and define the use-case
>> > specifically rather than be vague. I've just run an experiment for
>> > instance where I had 200M docs in a single shard (very small docs) and
>> > tried to sort by a date on all of them. Performance on the order of
>> > 5 seconds. 3B is what, 75 seconds? Does the use-case involve sorting?
>> > Faceting? If so the performance will probably be poor.
>> >
>> > This would be huge surgery I believe, and there hasn't been a
>> > compelling use-case in the search world for it. Unless and until that
>> > case is made I suspect this idea will meet with a lot of resistance.
>> >
>> > That said, I do understand that this is somewhat akin to "Nobody will
>> > ever need more than 64K of ram", meaning that some limits are assumed
>> > and eventually become outmoded. But given Java's issues with memory
>> > and GC I suspect that it'll be really hard to justify the work this
>> > would take.
>> >
>> > FWIW,
>> > Erick
>> >
>> >
>> >> On Thu, Aug 18, 2016 at 6:31 PM, Trejkaz <trejkaz@trypticon.org> wrote:
>> >>> On Thu, Aug 18, 2016 at 11:55 PM, Adrien Grand <jpountz@gmail.com> wrote:
>> >>> No, IndexWriter enforces that the number of documents cannot go over
>> >>> IndexWriter.MAX_DOCS (which is a bit less than 2^31) and
>> >>> BaseCompositeReader computes the number of documents in a long
>> >>> variable and ensures it is less than 2^31, so you cannot have indexes
>> >>> that contain more than 2^31 documents.
>> >>>
>> >>> Larger collections should be written to multiple shards and use
>> >>> TopDocs.merge to merge results.
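
A rough sketch of the check described above, in plain Java and without any
Lucene dependency: per-reader document counts are summed in a long so the sum
cannot silently overflow, and the total is rejected once it passes the limit.
DocBaseSketch and docBases are hypothetical names, and the exact limit used
here is an assumption, since the message only says "a bit less than 2^31".

```java
// Hypothetical illustration only; not the code of BaseCompositeReader.
final class DocBaseSketch {

    // Assumed value: the thread only states "a bit less than 2^31".
    static final int MAX_DOCS = Integer.MAX_VALUE - 128;

    /**
     * Returns the starting global doc ID of each sub-reader, plus the total
     * document count as the last element. Throws if the total exceeds MAX_DOCS.
     */
    static int[] docBases(int[] maxDocs) {
        int[] starts = new int[maxDocs.length + 1];
        long total = 0; // long, so adding int counts cannot wrap around
        for (int i = 0; i < maxDocs.length; i++) {
            starts[i] = (int) total; // safe: total was checked on the previous pass
            total += maxDocs[i];
            if (total > MAX_DOCS) {
                throw new IllegalArgumentException(
                        "Too many documents: " + total + " > " + MAX_DOCS);
            }
        }
        starts[maxDocs.length] = (int) total;
        return starts;
    }
}
```

A global doc ID in such a composite view is the sub-reader's start plus the
local doc ID, which is why the combined count has to fit in an int as well.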
>> >>
>> >> But hang on:
>> >> * TopDocs#merge still returns a TopDocs.
>> >> * TopDocs still uses an array of ScoreDoc.
>> >> * ScoreDoc still uses an int doc ID.
>> >>
>> >> Looks like you're still screwed.
>> >>
>> >> I wish IndexReader would use long IDs too, because one IndexReader can
>> >> be across multiple shards too - it doesn't make much sense to me that
>> >> this is restricted, although "it's hard to fix in a
>> >> backwards-compatible way" is certainly a good reason. :D
>> >>
>> >> TX
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >
>> >
>>
>
>
