lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: paging vs streaming. spawn from (Processing a lot of results in Solr)
Date Sun, 28 Jul 2013 14:41:09 GMT
Shawn had an interesting idea on another thread. It depends
on having what is basically an identity field (which I can see how
to do manually, but I don't see how to make it work as a new field
type in a distributed environment). And it's brilliantly simple: just
a range query, identity:{NNNN TO *]&sort=identity asc.

Then just keep replacing the NNNN with the max value you
got in the previous page of results. Of course this doesn't work if you
need to rank the results, but as a way to process all documents
in a corpus it seems to work. It's certainly not a general solution
to deep paging, but for the limited data dump case...
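
A minimal sketch of that loop, in Python, just to make the idea concrete.
Nothing here talks to Solr: fetch_page is a hypothetical stand-in for a
request with q=identity:{last TO *]&sort=identity asc&rows=N, and
make_fake_fetch is an in-memory imitation of it. The field name "identity"
is the one from the query above; everything else is an assumption.

```python
from typing import Callable, Dict, Iterator, List, Optional

Doc = Dict[str, int]
# Hypothetical page fetcher: (last identity seen or None, rows) -> docs sorted by identity.
FetchPage = Callable[[Optional[int], int], List[Doc]]


def dump_all(fetch_page: FetchPage, rows: int = 100) -> Iterator[Doc]:
    """Walk the whole corpus by restarting the range just above the last identity seen."""
    last_id: Optional[int] = None
    while True:
        page = fetch_page(last_id, rows)
        if not page:
            return  # an empty page means nothing is left above last_id
        yield from page
        # The page is sorted ascending, so the last doc carries the max identity.
        last_id = page[-1]["identity"]


def make_fake_fetch(corpus: List[Doc]) -> FetchPage:
    """In-memory imitation of identity:{last TO *] sorted by identity asc."""
    ordered = sorted(corpus, key=lambda d: d["identity"])

    def fetch_page(last_id: Optional[int], rows: int) -> List[Doc]:
        # The "{" in the range query is exclusive, hence the strict ">".
        matching = [d for d in ordered if last_id is None or d["identity"] > last_id]
        return matching[:rows]

    return fetch_page
```

Each request only ever asks for the first rows hits of its range, so start
stays at 0 no matter how far into the corpus the dump has progressed.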

You could keep from processing the same doc twice (let's
say a doc gets updated and the identity field gets bumped)
by getting the min and max identity values at the start of the
dump and only querying within that range.
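
A rough Python sketch of that bounded variant, under the assumption that an
update bumps a doc's identity above the current max. The helper names
(window_query, bounded_dump) and the in-memory dict standing in for the
index are made up for illustration; only the range-query syntax comes from
the query above.

```python
from typing import Dict, List, Optional


def window_query(last_id: Optional[int], min_id: int, max_id: int) -> str:
    """Build the range clause, capped at the max identity captured at the start."""
    if last_id is None:
        return f"identity:[{min_id} TO {max_id}]"  # first page: inclusive on both ends
    return f"identity:{{{last_id} TO {max_id}]"    # later pages: exclusive lower bound


def bounded_dump(index: Dict[str, int], rows: int,
                 bump_after_first_page: Optional[str] = None) -> List[str]:
    """Dump doc keys from an in-memory stand-in for the index (key -> identity).

    bump_after_first_page simulates a doc being updated mid-dump: its
    identity is moved above every existing value after the first page.
    """
    # Snapshot the window once, before reading anything.
    min_id, max_id = min(index.values()), max(index.values())
    seen: List[str] = []
    last_id: Optional[int] = None
    while True:
        page = sorted(
            (ident, key) for key, ident in index.items()
            if (last_id is None or ident > last_id) and min_id <= ident <= max_id
        )[:rows]
        if not page:
            return seen
        seen.extend(key for _, key in page)
        last_id = page[-1][0]
        if bump_after_first_page is not None:
            index[bump_after_first_page] = max(index.values()) + 1
            bump_after_first_page = None
```

In this sketch a doc that was already dumped and then bumped lands above
the captured max, so it is not seen a second time. A doc bumped before the
dump reaches it also lands outside the window, so it would have to be
picked up by a later pass.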

But life is complicated. Siiiigggh. Doesn't work for M/R jobs
that compose an index from pieces either.

FWIW,
Erick


On Sun, Jul 28, 2013 at 1:28 AM, Mikhail Khludnev
<mkhludnev@griddynamics.com> wrote:
> On Sun, Jul 28, 2013 at 1:25 AM, Yonik Seeley <yonik@lucidworks.com> wrote:
>
>>
>> Which part is problematic... the creation of the DocList (the search),
>>
> Literally, DocList is a copy of TopDocs. Creating TopDocs is not a search,
> but ranking.
> And ranking costs log(rows+start), in addition to the numFound that the
> search takes.
> Interestingly, we still pay that log() even if we ask to collect docs
> as-is with _docid_
>
>
>> or its memory requirements (an int per doc)?
>>
> TopXxxCollector, as well as the XxxComparators, allocates the same [rows+start]
>
> It's clear that once we have deep paging, we need heaps sized by rows
> only (without start).
> That's fairly OK if we use Solr as a site navigation engine, but it's
> 'sub-optimal' for data analytics use cases, where we need something like
> SELECT * FROM ... in an RDBMS. In that case any memory allocation on an
> index with billions of docs is a bummer. That's why I'm asking about
> removing the heap-based collector/comparator.
>
>
>> -Yonik
>> http://lucidworks.com
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
>  <mkhludnev@griddynamics.com>
