lucene-java-user mailing list archives

From Jerven Tjalling Bolleman <jerven.bolle...@sib.swiss>
Subject Re: docid is just a signed int32
Date Thu, 06 Apr 2017 13:58:37 GMT
Hi All,

I too would like to have doc ids that are larger than int32. Not today,
but in 4 years that would be very nice ;) Already we are splitting some
indexes that would be nicer kept together (mostly so that more Lucene
code could be used instead of our own).

On the other hand, we are not the default use case of Lucene. We index
once a month and then have a frozen index. After "freezing" the index we
use the Lucene doc ids to link the search results to our document
storage. We could use a stored field value instead, but for now using
the internal Lucene id was a nice optimization.
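
For illustration, the two linking styles look roughly like this (a
minimal sketch; the index path and the "externalKey" field are
invented):

import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class DocIdLinkSketch {
  public static void main(String[] args) throws Exception {
    try (DirectoryReader reader = DirectoryReader.open(
             FSDirectory.open(Paths.get("/data/frozen-index")))) {
      IndexSearcher searcher = new IndexSearcher(reader);
      TopDocs hits = searcher.search(new MatchAllDocsQuery(), 10);
      for (ScoreDoc hit : hits.scoreDocs) {
        // Option A: the internal docid itself is the link key. Free,
        // but only safe while the index is frozen: docids are
        // reassigned when segments merge.
        int internalId = hit.doc;

        // Option B: one stored-field lookup per hit. Stable across
        // index rebuilds, at the cost of a stored-field read.
        Document doc = searcher.doc(hit.doc);
        String externalKey = doc.get("externalKey");  // hypothetical field

        System.out.println(internalId + " -> " + externalKey);
      }
    }
  }
}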

The closest we come to this maximum document count is in an index of how
our (uniprot.org) database links to other databases. These links are
stored as very small documents and we have 892,236,174 of them. We can
split this into lots of smaller indexes without too much of a hassle. On
the other hand, it would be even nicer to merge them all into one larger
index, which would have 1.5 billion documents, as that would allow us to
use the Lucene document joining logic. For now we have our own
cross-index joining logic, which is optimized but not optimal.
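
The document joining logic referred to here would be Lucene's query-time
join (JoinUtil, in the join module). A minimal sketch of how it could
look on a merged index; all field names are invented:

import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.join.JoinUtil;
import org.apache.lucene.search.join.ScoreMode;

public class JoinSketch {
  // Select "from" documents with fromQuery, collect the values of the
  // from-side join field, and build a query matching "to" documents
  // whose join field contains one of those values.
  static TopDocs joinOnLinks(IndexSearcher searcher) throws IOException {
    Query fromQuery = new TermQuery(new Term("database", "pdb"));
    Query joinQuery = JoinUtil.createJoinQuery(
        "fromAccession",   // join key on the "from" side (invented)
        false,             // the join field is single-valued
        "accession",       // join key on the "to" side (invented)
        fromQuery,
        searcher,          // searcher over the merged index
        ScoreMode.None);   // don't propagate scores across the join
    return searcher.search(joinQuery, 10);
  }
}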

We get into this problem because we somewhat abuse Lucene, making it act
as more than just a text retrieval engine. We actually have a number of
custom query objects that allow users to integrate certain compute
results into a Lucene search.
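
We can only guess at what those custom query objects do internally, but
the general idea of folding an externally computed id set into a Lucene
search can be sketched with a plain filter clause; the field name and
the external compute step are hypothetical:

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.util.BytesRef;

public class ComputeResultQuery {
  // Combine a user query with ids produced by some external
  // computation, applied as a non-scoring filter clause.
  static Query withComputedIds(Query userQuery,
                               Collection<String> computedIds) {
    List<BytesRef> terms = new ArrayList<>();
    for (String id : computedIds) {
      terms.add(new BytesRef(id));
    }
    return new BooleanQuery.Builder()
        .add(userQuery, BooleanClause.Occur.MUST)
        .add(new TermInSetQuery("accession", terms),  // invented field
             BooleanClause.Occur.FILTER)
        .build();
  }
}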

Now I understand that splitting indexes into shards is a completely
reasonable direction. On the other hand, we get more than acceptable
search performance on 800-million-document indexes and see no reason why
that would not hold for one 5 times the size, especially considering
this performance is achieved today on machines with 32GB of RAM (18GB
heap) and 8 cores. In other words, for us it would be far cheaper to buy
bigger machines than to re-architect. I expect that with improvements in
the JVM and GC it would make sense to run 1 or 2 Solr/Elasticsearch
nodes on one large machine instead of the 5 to 10 we hear about in some
deployments.

Some of the decisions behind what we built we would not make today if
starting from scratch. But considering that we started using Lucene 10
years ago and are current with the latest release, the decision to
continue with our madness makes sense, and would remain viable for
another 10 years if we had 64 bits for a docid.

Again, not something for now, but something that would be interesting in
the Java 10 time frame.

Regards,
Jerven

P.S. thank you very much for building a great search library and ecosystem.

P.P.S. If you want to see the madness in action, visit uniprot.org.


On 08/18/2016 05:43 PM, Greg Bowyer wrote:
> What are you trying to index that has more than 3 billion documents per
> shard / index and cannot be split as Adrien suggests?
>
>
>
> On Thu, Aug 18, 2016, at 07:35 AM, Cristian Lorenzetto wrote:
>> Maybe Lucene has a max size of 2^31 because result sets are Java
>> arrays, whose length is an int.
>> A suggestion for a possible future change is to not use Java arrays
>> but an Iterator. An Iterator is a more scalable ADT that does not
>> consume memory just to return documents.
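
As an aside, Lucene's Collector API already offers a streaming path that
never materializes an array of hits; here is a minimal sketch against
the Lucene 6-era API, with the per-hit processing left as a placeholder:

import java.io.IOException;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.SimpleCollector;

// Receives matching docids one at a time; memory use stays constant
// regardless of how many documents match.
final class StreamingCollector extends SimpleCollector {
  private int docBase;

  @Override
  protected void doSetNextReader(LeafReaderContext context) {
    docBase = context.docBase;  // map per-segment ids to index-wide ids
  }

  @Override
  public void collect(int doc) throws IOException {
    int globalDoc = docBase + doc;
    // process(globalDoc) -- placeholder for whatever consumes the hit
  }

  @Override
  public boolean needsScores() {
    return false;  // we only need docids, so scoring can be skipped
  }
}

// Usage: searcher.search(query, new StreamingCollector());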
>>
>>
>> 2016-08-18 16:03 GMT+02:00 Glen Newton <glen.newton@gmail.com>:
>>
>>> Or maybe it is time Lucene re-examined this limit.
>>>
>>> There are use cases out there where >2^31 does make sense in a single index
>>> (huge number of tiny docs).
>>>
>>> Also, I think the underlying hardware and the JDK have advanced to
>>> make this more defensible.
>>>
>>> Constructively,
>>> Glen
>>>
>>>
>>> On Thu, Aug 18, 2016 at 9:55 AM, Adrien Grand <jpountz@gmail.com> wrote:
>>>
>>>> No, IndexWriter enforces that the number of documents cannot go over
>>>> IndexWriter.MAX_DOCS (which is a bit less than 2^31) and
>>>> BaseCompositeReader computes the number of documents in a long
>>>> variable and ensures it is less than 2^31, so you cannot have
>>>> indexes that contain more than 2^31 documents.
>>>>
>>>> Larger collections should be written to multiple shards and use
>>>> TopDocs.merge to merge results.
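
A minimal sketch of the sharded pattern Adrien describes, assuming one
IndexSearcher per shard and a hypothetical top-k parameter:

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class ShardedSearch {
  // Run the query against each shard independently, then merge the
  // per-shard top hits into a single global top-k result list.
  // ScoreDoc.shardIndex on the merged hits identifies the source shard.
  static TopDocs searchAllShards(Query query, IndexSearcher[] shards,
                                 int k) throws IOException {
    TopDocs[] perShard = new TopDocs[shards.length];
    for (int i = 0; i < shards.length; i++) {
      perShard[i] = shards[i].search(query, k);
    }
    return TopDocs.merge(k, perShard);
  }
}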
>>>>
>>>> On Thu, Aug 18, 2016 at 15:38, Cristian Lorenzetto
>>>> <cristian.lorenzetto@gmail.com> wrote:
>>>>
>>>>> docid is a signed int32, so it is not so big, but really docid
>>>>> seems to be not an unmodifiable primary key but a temporary id for
>>>>> the view related to a specific search.
>>>>>
>>>>> So a repository can contain more than 2^31 documents.
>>>>>
>>>>> Is my deduction correct? Is there a maximum size for a Lucene index?
>>>>>
>>>>
>>>
>

-- 
-------------------------------------------------------------------
Jerven Bolleman                        Jerven.Bolleman@sib.swiss
SIB Swiss Institute of Bioinformatics  Tel: +41 (0)22 379 58 85
CMU, rue Michel Servet 1               Fax: +41 (0)22 379 58 58
1211 Geneve 4,
Switzerland     www.sib.swiss - www.uniprot.org
Follow us at https://twitter.com/#!/uniprot
-------------------------------------------------------------------


