lucene-java-user mailing list archives

From Valentin Popov <valentin...@gmail.com>
Subject Re: 500 millions document for loop.
Date Sat, 14 Nov 2015 12:51:03 GMT
Thank you very much! 


> On 14 Nov 2015, at 15:49, Uwe Schindler <uwe@thetaphi.de> wrote:
> 
> Hi,
> 
> This code is buggy! The collect() call of the collector does not get a document ID
> relative to the top-level IndexSearcher; it only gets a document ID relative to the
> reader reported in setNextReader (which is an atomic reader responsible for a single
> Lucene index segment).
> 
> In setNextReader, save a reference to the "current" reader, and use this "current"
> reader to get the stored fields:
> 
> 		indexSearcher.search(query, queryFilter, new Collector() {
> 			AtomicReader current; 
> 
> 			@Override
> 			public void setScorer(Scorer arg0) throws IOException { }
> 
> 			@Override
> 			public void setNextReader(AtomicReaderContext ctx) throws IOException { 
> 				current = ctx.reader();
> 			}
> 
> 			@Override
> 			public void collect(int docID) throws IOException {
> 				Document doc = current.document(docID, loadFields);
> 				found.found(doc);
> 			}
> 
> 			@Override
> 			public boolean acceptsDocsOutOfOrder() {
> 				return true;
> 			}
> 		});
> 
> Otherwise you get wrong document ids reported!!!
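[Editor's note: to make the doc-ID arithmetic behind Uwe's fix concrete: the IDs passed to collect() start at 0 within each segment, and the top-level ID is that segment-relative ID plus the segment's docBase (a field on AtomicReaderContext). A minimal sketch with purely illustrative numbers, not Lucene API calls:]

```java
// Sketch of the doc-ID mapping behind the bug: each segment numbers its
// documents from 0, and the composite (top-level) reader offsets them by
// the segment's docBase. The numbers here are purely illustrative.
public class DocIdMapping {

    // Top-level doc ID = segment docBase + segment-relative doc ID.
    static int topLevelDocId(int docBase, int segmentDocId) {
        return docBase + segmentDocId;
    }

    public static void main(String[] args) {
        // Suppose some later segment starts at docBase 1_000_000.
        int segmentDocId = 42;                          // what collect(int) receives
        int topLevel = topLevelDocId(1_000_000, segmentDocId);
        System.out.println(topLevel);                   // prints 1000042
        // Calling indexSearcher.doc(42, ...) here would load a document from the
        // FIRST segment instead; either resolve fields through the per-segment
        // reader (as in Uwe's code) or add docBase yourself.
    }
}
```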
> 
> Uwe
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
> 
>> -----Original Message-----
>> From: Valentin Popov [mailto:valentin.po@gmail.com]
>> Sent: Saturday, November 14, 2015 1:04 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: 500 millions document for loop.
>> 
>> Hi, Uwe.
>> 
>> Thanks for your advice.
>> 
>> After implementing your suggestion, our calculation time dropped from ~20
>> days to ~3.5 hours.
>> 
>> /**
>>  * DocumentFound - callback function invoked for each document.
>>  */
>> public void iterate(SearchOptions options, final DocumentFound found,
>> 		final Set<String> loadFields) throws Exception {
>> 	Query query = options.getQuery();
>> 	Filter queryFilter = options.getQueryFilter();
>> 	final IndexSearcher indexSearcher = new VolumeSearcher(options)
>> 			.newIndexSearcher(Executors.newSingleThreadExecutor());
>> 
>> 	indexSearcher.search(query, queryFilter, new Collector() {
>> 
>> 		@Override
>> 		public void setScorer(Scorer arg0) throws IOException { }
>> 
>> 		@Override
>> 		public void setNextReader(AtomicReaderContext arg0) throws IOException { }
>> 
>> 		@Override
>> 		public void collect(int docID) throws IOException {
>> 			Document doc = indexSearcher.doc(docID, loadFields);
>> 			found.found(doc);
>> 		}
>> 
>> 		@Override
>> 		public boolean acceptsDocsOutOfOrder() {
>> 			return true;
>> 		}
>> 	});
>> }
>> 
>> 
>>> On 12 Nov 2015, at 21:15, Uwe Schindler <uwe@thetaphi.de> wrote:
>>> 
>>> Hi,
>>> 
>>>>> The big question is: Do you need the results paged at all?
>>>> 
>>>> Yup, because if we return all results, we get OME.
>>> 
>>> You get the OME because the paging collector cannot handle that, so this is
>>> an XY problem. Would it not be better if your application just got the results
>>> as a stream and processed them one after another? If that is the case (and
>>> most statistics need it like that), you are much better off NOT USING
>>> TOPDOCS! Your requirement is diametrically opposed to getting top-scoring
>>> documents: you want ALL results as a sequence.
>>> 
>>>>> Do you need them sorted?
>>>> 
>>>> Nope.
>>> 
>>> OK, so unsorted streaming is the right approach.
>>> 
>>>>> If not, the easiest approach is to use a custom Collector that does no
>>>> sorting and just consumes the results.
>>>> 
>>>> The main bottleneck, as I see it, comes from fetching the next page, which
>>>> takes ~2-4 seconds.
>>> 
>>> This is because, when paging, the collector has to re-execute the whole
>>> query and sort all results again, just with a larger window. So if you have
>>> result pages of 50000 results and you want the second page, it will
>>> internally sort 100000 results, because the first page needs to be calculated,
>>> too. As you go forward through the results, the window gets larger and
>>> larger, until it finally collects all results.
>>> 
>>> So getting the results as a stream by implementing the Collector API is the
>>> right way to do this.
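[Editor's note: Uwe's cost argument above can be sketched numerically. This is plain arithmetic, not Lucene API: serving page N forces the collector to re-sort N × pageSize hits, so visiting every page in turn does quadratically more work than a streaming Collector, which touches each hit exactly once.]

```java
// Models the cumulative work of deep paging vs. streaming over the same hits.
// Page N re-sorts N * pageSize hits, so serving pages 1..P in turn costs
// pageSize * (1 + 2 + ... + P) sorted hits in total.
public class PagingCost {

    static long cumulativeSorted(int pages, int pageSize) {
        long total = 0;
        for (int page = 1; page <= pages; page++) {
            total += (long) page * pageSize;  // page N re-sorts N * pageSize hits
        }
        return total;
    }

    public static void main(String[] args) {
        int pageSize = 50_000;
        int pages = 10;                       // i.e. 500_000 total hits
        System.out.println("paged:    " + cumulativeSorted(pages, pageSize)); // 2750000
        System.out.println("streamed: " + (long) pages * pageSize);           // 500000
    }
}
```

With 500 million hits (10,000 pages of 50k), the paged total is ~5000× the streamed total, which matches the ~20 days vs. ~3.5 hours reported above in rough order of magnitude.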
>>> 
>>>>> 
>>>>> Uwe
>>>>> 
>>>>> -----
>>>>> Uwe Schindler
>>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>>>> http://www.thetaphi.de
>>>>> eMail: uwe@thetaphi.de
>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Valentin Popov [mailto:valentin.po@gmail.com]
>>>>>> Sent: Thursday, November 12, 2015 6:48 PM
>>>>>> To: java-user@lucene.apache.org
>>>>>> Subject: Re: 500 millions document for loop.
>>>>>> 
>>>>>> Toke, thanks!
>>>>>> 
>>>>>> We will look at this solution, looks like this is that what we need.
>>>>>> 
>>>>>> 
>>>>>>> On 12 Nov 2015, at 20:42, Toke Eskildsen <te@statsbiblioteket.dk> wrote:
>>>>>>> 
>>>>>>> Valentin Popov <valentin.po@gmail.com> wrote:
>>>>>>> 
>>>>>>>> We have ~10 indexes with 500M documents; each document has an
>>>>>>>> «archive date» and a «to» address. One of our tasks is to calculate
>>>>>>>> statistics of «to» for the last year. Right now we search
>>>>>>>> archive_date:(current_date - 1 year) and paginate the results at 50k
>>>>>>>> records per page. The bottleneck of that approach is that pagination
>>>>>>>> takes too long: even on a powerful server it takes ~20 days to execute.
>>>>>>> 
>>>>>>> Lucene does not like deep page requests due to the way the internal
>>>>>>> priority queue works. Solr has CursorMark, which should be fairly
>>>>>>> simple to emulate in your Lucene handling code:
>>>>>>> 
>>>>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>>>>>>> 
>>>>>>> - Toke Eskildsen
>>>>>>> 
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>> 
>>>>>> 
>>>>>> Regards,
>>>>>> Valentin Popov
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> Best regards,
>>>> Valentin Popov
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> Best regards,
>> Valentin Popov
>> 
>> 
>> 
>> 
>> 
>> 
> 
> 
> 


Best regards,
Valentin Popov







