lucene-solr-user mailing list archives

From Joel Bernstein <joels...@gmail.com>
Subject Re: retrieving large number of docs
Date Wed, 03 Jun 2015 18:33:52 GMT
Erick makes a great point: if they are in the same JVM, try the cross-core
join first. It might be fast enough for you.
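A hedged sketch of what that join could look like (the core and field names here are assumptions, not from this thread): Solr's join query parser can filter the second core by documents matched in the first, and faceting then runs over the full joined result set rather than a paged ID list.

```
http://localhost:8983/solr/core2/select
    ?q={!join from=id to=id fromIndex=core1}your_first_core_query
    &facet=true
    &facet.field=title
```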

A custom solution would be to build a custom query or post filter that
works with your specific scenario. For example, if the doc IDs are integers,
you could build a fast PostFilter using data structures best suited to
integer filters.
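This is not Solr's actual PostFilter API, just a runnable, framework-free sketch of why integer IDs help (the class and method names are invented for illustration): a real PostFilter would do the membership test inside its DelegatingCollector's collect() callback, and with integer IDs that test can be a constant-time bit check instead of a string hash lookup.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical stand-in for the per-document check a custom PostFilter
// would perform while collecting results.
public class IntIdFilterSketch {

    // Keep only candidate ids present in the allow-list. The BitSet makes
    // each membership test a single bit read; building it is a one-time,
    // per-request cost.
    static int[] filter(int[] allowedIds, int[] candidates) {
        BitSet allowed = new BitSet();
        for (int id : allowedIds) {
            allowed.set(id);
        }
        return Arrays.stream(candidates)
                     .filter(allowed::get)   // constant-time bit test
                     .toArray();
    }

    public static void main(String[] args) {
        // e.g. ~10k ids from the first core, matched against the second
        int[] kept = filter(new int[]{3, 17, 42, 99},
                            new int[]{1, 17, 42, 50});
        System.out.println(Arrays.toString(kept)); // [17, 42]
    }
}
```

Note the memory trade-off: a BitSet is sized by the maximum id value, so it suits dense integer id spaces; for sparse ids an int hash set serves the same role.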

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Jun 3, 2015 at 2:23 PM, Robust Links <peyman@robustlinks.com> wrote:

> what would be a custom solution?
>
>
> On Wed, Jun 3, 2015 at 1:58 PM, Joel Bernstein <joelsolr@gmail.com> wrote:
>
> > You may have to do something custom to meet your needs.
> >
> > 10,000 doc IDs is not huge, but your latency requirements are pretty low.
> >
> > Are your doc IDs by any chance integers? This can make custom
> > PostFilters run much faster.
> >
> > You should also be aware of the Streaming API in Solr 5.1, which will
> > give you fast Map/Reduce approaches (
> > http://joelsolr.blogspot.com/2015/04/the-streaming-api-solrjio-basics.html
> > ).
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Wed, Jun 3, 2015 at 1:46 PM, Robust Links <peyman@robustlinks.com>
> > wrote:
> >
> > > Hey Joel
> > >
> > > see below
> > >
> > > On Wed, Jun 3, 2015 at 1:43 PM, Joel Bernstein <joelsolr@gmail.com>
> > > wrote:
> > >
> > > > A few questions for you:
> > > >
> > > > How large can the list of filtering ID's be?
> > > >
> > >
> > > >> 10k
> > >
> > >
> > > >
> > > > What's your expectation on latency?
> > > >
> > >
> > > 10 < latency < 100
> > >
> > >
> > > >
> > > > What version of Solr are you using?
> > > >
> > >
> > > 5.0.0
> > >
> > >
> > > >
> > > > SolrCloud or not?
> > > >
> > >
> > > not
> > >
> > >
> > >
> > > >
> > > > Joel Bernstein
> > > > http://joelsolr.blogspot.com/
> > > >
> > > > On Wed, Jun 3, 2015 at 1:23 PM, Robust Links <peyman@robustlinks.com>
> > > > wrote:
> > > >
> > > > > Hi
> > > > >
> > > > > I have a set of document IDs from one core and I want to query
> > > > > another core using the IDs retrieved from the first core... the
> > > > > constraint is that the size of the doc ID set can be very large. I
> > > > > want to:
> > > > >
> > > > > 1) retrieve these docs from the 2nd index
> > > > > 2) facet on the results
> > > > >
> > > > > I can think of 3 solutions:
> > > > >
> > > > > 1) boolean query
> > > > > 2) terms fq
> > > > > 3) use a DB rather than Solr
> > > > >
> > > > > I am trying to keep latencies down so prefer to not use (3). The
> > > > > problem with (1) is that maxBooleanClauses is hardwired and I am
> > > > > not sure when I will hit the exception. Option (2) seems to also
> > > > > hit limits, so if I do
> > > > >
> > > > > select?fl=*&q=*:*&facet=true&facet.field=title&fq={!terms
> > > > > f=id}<LONG_LIST_OF_IDS>
> > > > >
> > > > > Solr just goes blank. I have tried adding cost=200 to try to run
> > > > > the query first (fq={!terms f=id cost=200}) but still no good.
> > > > > Paging on doc IDs could be a solution, but the problem then is that
> > > > > the faceting results correspond to the paged IDs and not the
> > > > > global set.
> > > > >
> > > > > My filter cache spec is as follows
> > > > >
> > > > >   <filterCache class="solr.FastLRUCache"
> > > > >                  size="1000000"
> > > > >                  initialSize="1000000"
> > > > >                  autowarmCount="100000"/>
> > > > >
> > > > >
> > > > > What would be the best way for me to solve this problem?
> > > > >
> > > > > thank you
> > > > >
> > > >
> > >
> >
>
