lucene-java-user mailing list archives

From Aleksey <bitterc...@gmail.com>
Subject Re: Optimizing NRT search
Date Sat, 04 May 2013 02:14:51 GMT
Yes, GC gets pretty bad even with only 8G of RAM. I also tried using a RAM
disk with SimpleFSDirectory, which performs well and keeps the Java heap
small, but with this many indices it ends up keeping hundreds of thousands
of files open. This is not really a Lucene question, but could that cause
problems down the line? I haven't yet run it that way for an extended
period of time.
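
For reference, here is roughly the setup I mean, as a minimal sketch
against the Lucene 4.x API; only the SimpleFSDirectory-on-a-RAM-disk part
is the real setup, the mount point and naming scheme are made up:

import java.io.File;
import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.SimpleFSDirectory;

public class RamDiskIndexes {
  // Open one of the small indices off the RAM disk. The path below is
  // illustrative only.
  static DirectoryReader openIndex(String indexId) throws IOException {
    File path = new File("/mnt/ramdisk/indices/" + indexId);
    // The Java heap stays small because the index bytes live on the RAM
    // disk, not on the heap, but every open segment file costs a file
    // handle, so thousands of small indices add up fast.
    return DirectoryReader.open(new SimpleFSDirectory(path));
  }
}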


I'm using an individual reader for each index, because I don't really need
to search across them, so there's no need for MultiReader.
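
In other words, something like the first method below, per index; the
MultiReader variant is only there to show what I'm not doing (a sketch,
Lucene 4.x API):

import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;

public class SearcherSetup {
  // What I do now: one reader and searcher per index.
  static IndexSearcher forOneIndex(Directory dir) throws IOException {
    return new IndexSearcher(DirectoryReader.open(dir));
  }

  // What I don't need: MultiReader presenting several indices as one
  // logical index, for searches that must span all of them.
  static IndexSearcher acrossIndexes(IndexReader... readers) throws IOException {
    return new IndexSearcher(new MultiReader(readers));
  }
}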

I was actually going to ask about filters in general. I'm unclear on how
they work. They look very similar to queries, but some sources say they are
used to narrow down search results, while others say they limit the search
space up front, which sounds like the opposite order of operations.
Also, this ticket https://issues.apache.org/jira/browse/LUCENE-3212
confuses me a little, as it says the filtered reader "hides filtered
documents by returning them in getDeletedDocs()". Why "deleted" as opposed
to "filtered"? Are the docs really deleted when a filter is applied?
So for what kind of scenario will filters give the best performance over
queries? How about "recycled" docs: say an application lets you move docs
to the "trash" and restore them, so that main searches run over a smaller
set. Is that a good use? (A sketch of what I mean is below.)
And what about the docs that are filtered out? Can searches/sorting be done
over those, or would I need a second, negated filter?
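
To make the trash idea concrete, here's roughly what I have in mind, as a
sketch against the Lucene 4.x filter API (the "trashed" field and its
values are invented for illustration):

import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class TrashFilters {
  // One filter restricting searches to live docs, and its complement
  // for searching the trash. CachingWrapperFilter computes the bitset
  // once per segment and reuses it across searches.
  static final Filter LIVE = new CachingWrapperFilter(
      new QueryWrapperFilter(new TermQuery(new Term("trashed", "false"))));
  static final Filter TRASH = new CachingWrapperFilter(
      new QueryWrapperFilter(new TermQuery(new Term("trashed", "true"))));

  static TopDocs searchLive(IndexSearcher s, Query q) throws IOException {
    // The filter only restricts which docs the query may match; filter
    // matches are not scored.
    return s.search(q, LIVE, 10);
  }

  static TopDocs searchTrash(IndexSearcher s, Query q) throws IOException {
    return s.search(q, TRASH, 10);
  }
}

If that's the right picture, then by a "negated" filter I just mean the
second filter on the opposite value, rather than negating the first one at
query time.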

Aleksey


On Sat, Apr 27, 2013 at 5:02 AM, Michael McCandless <lucene@mikemccandless.com> wrote:

> On Fri, Apr 26, 2013 at 5:04 PM, Aleksey <bittercold@gmail.com> wrote:
> > Thanks for the response, Mike. Yes, I've come upon your blog before;
> > it's very helpful.
> >
> > I tried bigger batches; it seems the highest throughput I can get is
> > roughly 250 docs a second. From your blog, you updated your index at
> > about 1 MB per second with 1 KB documents, which is 1000 docs/s, but
> > you had a 24-core machine, while my laptop has 2 cores (and an SSD).
> > So does that mean the performance I'm seeing is actually better than
> > back in 2011? (By the way, I'm using RAMDirectory rather than MMap,
> > but MMap seems similar.)
>
> Be careful with RAMDir ... it's very GC heavy as the index gets larger
> since it breaks each file into 1K byte[]s.  It's best for smallish
> indices.
>
> Your tests are all with one thread?  (My tests were using multiple
> threads on the 24 core machine).  So on a laptop with one thread, 250
> docs/sec where each doc is 1-2 KB seems reasonable.
>
> Still it's odd you don't see larger gains from batching up the changes
> between reopens.
>
> > Interesting thing is that NRTCachingDirectory is about 2x faster when
> > I'm updating one document at a time, but batches of 250 take about
> > 1 second for both.
> > I have not tried tuning any components yet because I don't yet
> > understand what exactly all the knobs do.
>
> Well if you're using RAMDir then NRTCachingDir really should not be
> helping much at all!
>
> > Actually, perhaps I should describe my overall use case to see if I
> > should be using Lucene in this way at all.
> > My searches never need to be over the entire data set, only over a
> > tiny portion at a time, so I was prototyping a solution that acts kind
> > of like a cache. The search fleet holds lots of small Directory
> > instances that can be quickly loaded up when necessary and evicted
> > when not in use. Each one is 200-200K docs in size. Updates also
> > happen to individual directories, and they are typically in the tens
> > of docs rather than hundreds or thousands.
> > I know that having lots of separate directories and searchers is an
> > overhead, but if I had everything in one, then I suppose it would be
> > harder to load and evict portions of it. So am I structuring my
> > application in a reasonable way, or is there a better way to go
> > about it?
>
> This approach should work.  You use MultiReader to search across them?
>
> You could also use a single reader + filter, or a single reader and
> periodically delete the docs to be evicted.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
