lucene-java-user mailing list archives

From Valentin Popov <valentin...@gmail.com>
Subject Re: 500 millions document for loop.
Date Sat, 14 Nov 2015 12:04:25 GMT
Hi, Uwe. 

Thanks for your advice.

After implementing your suggestion, our calculation time dropped from ~20 days to 3.5 hours.


import java.io.IOException;
import java.util.Set;
import java.util.concurrent.Executors;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Scorer;

/**
 * DocumentFound is the callback invoked for each matching document.
 * Results are streamed unsorted and unpaged.
 */
public void iterate(SearchOptions options, final DocumentFound found,
        final Set<String> loadFields) throws Exception {
    Query query = options.getQuery();
    Filter queryFilter = options.getQueryFilter();
    final IndexSearcher indexSearcher =
            new VolumeSearcher(options).newIndexSearcher(Executors.newSingleThreadExecutor());

    indexSearcher.search(query, queryFilter, new Collector() {

        private int docBase;

        @Override
        public void setScorer(Scorer scorer) throws IOException {
            // scores are not needed
        }

        @Override
        public void setNextReader(AtomicReaderContext context) throws IOException {
            // collect() gets segment-relative doc IDs; remember the segment base
            docBase = context.docBase;
        }

        @Override
        public void collect(int docID) throws IOException {
            // load only the requested stored fields and hand the document to the callback
            Document doc = indexSearcher.doc(docBase + docID, loadFields);
            found.found(doc);
        }

        @Override
        public boolean acceptsDocsOutOfOrder() {
            // order does not matter for streaming consumption
            return true;
        }
    });
}
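
For illustration, a minimal sketch of how iterate() could feed the «to» statistics discussed below. The shape of DocumentFound and of options is assumed here (a one-method callback taking a Document, and a SearchOptions already holding the archive_date range query); neither is spelled out in the thread. It also needs java.util.Map, java.util.HashMap and java.util.Collections in addition to the imports above.

// Usage sketch under the assumptions above: count how often each "to"
// address occurs, loading only the "to" stored field per document.
final Map<String, Long> toCounts = new HashMap<String, Long>();
iterate(options, new DocumentFound() {
    @Override
    public void found(Document doc) {
        for (String to : doc.getValues("to")) {
            Long current = toCounts.get(to);
            toCounts.put(to, current == null ? 1L : current + 1);
        }
    }
}, Collections.singleton("to"));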


> On 12 Nov 2015, at 21:15, Uwe Schindler <uwe@thetaphi.de> wrote:
> 
> Hi,
> 
>>> The big question is: Do you need the results paged at all?
>> 
>> Yup, because if we return all results, we get OME.
> 
> You get the OME because the paging collector cannot handle that, so this is an XY problem.
> Would it not be better if your application just got the results as a stream and processed
> them one after the other? If that is the case (and most statistics jobs need it like that),
> you are much better off NOT using TopDocs! Your requirement is diametrically opposed to
> getting top-scoring documents! You want to get ALL results as a sequence.
> 
>>> Do you need them sorted?
>> 
>> Nope.
> 
> OK, so unsorted streaming is the right approach.
> 
>>> If not, the easiest approach is to use a custom Collector that does no
>>> sorting and just consumes the results.
>> 
>> The main bottleneck, as I see it, comes from the next-page search, which takes ~2-4
>> seconds.
> 
> This is because when paging, the collector has to re-execute the whole query and sort
> all results again, just with a larger window. So if you have result pages of 50,000 results
> and you want the second page, it will internally sort 100,000 results, because the first
> page has to be calculated, too. As you go forward through the results, the window gets
> larger and larger, until it finally collects all results.
> 
> So getting the results as a stream by implementing the Collector API is the right way
> to do this.
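
For illustration only (a reconstruction, not code from the thread): the paging pattern described above boils down to asking for an ever larger top-N, reusing query, queryFilter, indexSearcher and loadFields from the snippet at the top, so fetching page k makes Lucene collect and sort k * 50,000 hits again.

// Reconstruction of the slow deep-paging pattern (assumed, for comparison only).
int pageSize = 50000;
for (int page = 0; ; page++) {
    int window = (page + 1) * pageSize;                     // window grows with every page
    TopDocs topDocs = indexSearcher.search(query, queryFilter, window);
    if (topDocs.scoreDocs.length <= page * pageSize) {
        break;                                              // no new hits on this page
    }
    for (int i = page * pageSize; i < topDocs.scoreDocs.length; i++) {
        Document doc = indexSearcher.doc(topDocs.scoreDocs[i].doc, loadFields);
        // ... process doc ...
    }
}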
> 
>>> 
>>> Uwe
>>> 
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: uwe@thetaphi.de
>>> 
>>>> -----Original Message-----
>>>> From: Valentin Popov [mailto:valentin.po@gmail.com]
>>>> Sent: Thursday, November 12, 2015 6:48 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: 500 millions document for loop.
>>>> 
>>>> Toke, thanks!
>>>> 
>>>> We will look at this solution, looks like this is that what we need.
>>>> 
>>>> 
>>>>> On 12 Nov 2015, at 20:42, Toke Eskildsen <te@statsbiblioteket.dk> wrote:
>>>>> 
>>>>> Valentin Popov <valentin.po@gmail.com> wrote:
>>>>> 
>>>>>> We have ~10 indexes for 500M documents; each document
>>>>>> has an «archive date» and a «to» address. One of our tasks is to
>>>>>> calculate statistics of «to» for the last year. Right now we
>>>>>> search archive_date:(current_date - 1 year) and paginate the
>>>>>> results at 50k records per page. The bottleneck of that approach is
>>>>>> that pagination takes too long; even on a powerful server it takes
>>>>>> ~20 days to execute, which is far too long.
>>>>> 
>>>>> Lucene does not like deep page requests due to the way the internal
>>>>> priority queue works. Solr has CursorMark, which should be fairly simple
>>>>> to emulate in your Lucene handling code:
>>>>> 
>>>>> http://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>>>>> 
>>>>> - Toke Eskildsen
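
For reference, the cursor-style iteration described above can be approximated in plain Lucene 4.x with IndexSearcher.searchAfter(); a rough sketch (not code from the thread), again reusing query, queryFilter, indexSearcher and loadFields from the snippet at the top:

// Cursor-style paging sketch: each request resumes after the last hit of the
// previous page instead of re-collecting all earlier pages into the queue.
ScoreDoc cursor = null;
while (true) {
    TopDocs page = indexSearcher.searchAfter(cursor, query, queryFilter, 50000);
    if (page.scoreDocs.length == 0) {
        break;                                              // no more results
    }
    for (ScoreDoc hit : page.scoreDocs) {
        Document doc = indexSearcher.doc(hit.doc, loadFields);
        // ... process doc ...
    }
    cursor = page.scoreDocs[page.scoreDocs.length - 1];     // resume point for the next page
}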
>>>>> 
>>>>> 
>>>> 
>>>> Regards,
>>>> Valentin Popov
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> Best regards,
>> Valentin Popov
>> 
>> 
>> 
>> 
>> 
>> 
> 
> 


Best regards,
Valentin Popov






---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

