lucene-solr-user mailing list archives

From Luis Cappa Banda <luisca...@gmail.com>
Subject Re: Performance question on Spatial Search
Date Tue, 30 Jul 2013 20:44:21 GMT
Hey, David,

I've been reading the thread, and I think it is one of the most educational
threads I've read on the Solr mailing list. Just out of curiosity: internally,
does Solr treat a query like "field:*" the same as "field:[* TO *]"? I would
expect both to return the same numFound, but I would like to know how Solr
handles each one internally.

Best regards,

- Luis Cappa
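
A minimal sketch of the workaround David suggests in the quoted message further down: flag at index time whether 'pp' is populated, then filter on that flag instead of using pp:*. The field name has_pp and the helper names are hypothetical (the thread only says "a new boolean field"), and the polygon string is a placeholder; this only builds documents and request parameters and does not talk to Solr.

```python
from urllib.parse import urlencode

# Hypothetical field name; the thread does not name the boolean field.
HAS_PP_FIELD = "has_pp"


def with_presence_flag(doc, source_field="pp", flag_field=HAS_PP_FIELD):
    """Return a copy of doc with a boolean recording whether source_field is populated."""
    flagged = dict(doc)
    value = doc.get(source_field)
    flagged[flag_field] = value is not None and value != ""
    return flagged


def build_params(polygon_wkt, flag_field=HAS_PP_FIELD):
    """Build (name, value) request parameters, swapping +pp:* for a boolean check."""
    spatial = (
        f'gp:"Intersects({polygon_wkt})" '
        f'AND NOT pp:"Intersects({polygon_wkt})" '
        f"AND {flag_field}:true"
    )
    return [("q", "*:*"), ("fq", spatial)]


if __name__ == "__main__":
    polygon = "POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0, 47.0 30.0))"
    print(with_presence_flag({"vid": "86XXX73", "pp": "48.0 28.0"}))
    print(urlencode(build_params(polygon)))
```

The flag has to be computed when documents are fed, so existing documents need to be reindexed before the filter matches anything.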


2013/7/30 Smiley, David W. <dsmiley@mitre.org>

> Steve,
> The FieldCache and DocValues are irrelevant to this problem.  Solr's
> FilterCache is, and Lucene has no counterpart.  Perhaps it would be cool
> if Solr could look for expensive field:* usages when parsing its queries
> and re-write them to use the FilterCache.  That's quite doable, I think.
> I just created an issue for it:
> https://issues.apache.org/jira/browse/SOLR-5093    but don't expect me to
> work on it anytime soon ;-)
>
>
> ~ David
>
> On 7/30/13 2:02 PM, "Steven Bower" <sbower@alcyon.net> wrote:
>
> >I am curious why the field:* walks the entire terms list.. could this be
> >discovered from a field cache / docvalues?
> >
> >steve
> >
> >
> >On Tue, Jul 30, 2013 at 2:00 PM, Steven Bower <sbower@alcyon.net> wrote:
> >
> >> Until I get the data re-fed: there is another field (a date field) that
> >> is populated exactly when the geo field is populated and absent when it
> >> is not. I tried field:* on that field and query times came down to 2.5s.
> >> Just removing that filter brings the query down to 30ms, so I'm very
> >> hopeful that with just a boolean I'll be down in the sub-100ms range.
> >>
> >> steve
> >>
> >>
> >> On Tue, Jul 30, 2013 at 12:02 PM, Steven Bower <sbower@alcyon.net>
> >>wrote:
> >>
> >>> Will give the boolean thing a shot... makes sense...
> >>>
> >>>
> >>> On Tue, Jul 30, 2013 at 11:53 AM, Smiley, David W.
> >>><dsmiley@mitre.org>wrote:
> >>>
> >>>> I see the problem: it's +pp:*. It may look innocent but it's a
> >>>> performance killer.  What you're telling Lucene to do is iterate over
> >>>> *every* term in this index to find all documents that have this data.
> >>>> Most fields are pretty slow to do that.  Lucene/Solr does not have some
> >>>> kind of cache for this. Instead, you should index a new boolean field
> >>>> indicating whether or not 'pp' is populated and then do a simple true
> >>>> check against that field.  Another approach you could do right now
> >>>> without reindexing is to simplify the last 2 clauses of your 3-clause
> >>>> boolean query by using the "IsDisjointTo" predicate.  But unfortunately
> >>>> Lucene doesn't have a generic filter cache capability and so this
> >>>> predicate has no place to cache the whole-world query it does internally
> >>>> (each and every time it's used), so it will be slower than the boolean
> >>>> field I suggested you add.
> >>>>
> >>>>
> >>>> Never mind on LatLonType; it doesn't support JTS/Polygons.  There is
> >>>> something close called SpatialPointVectorFieldType that could be
> >>>> modified trivially but it doesn't support it now.
> >>>>
> >>>> ~ David
> >>>>
> >>>> On 7/30/13 11:32 AM, "Steven Bower" <sbower@alcyon.net> wrote:
> >>>>
> >>>> >#1 Here is my query:
> >>>> >
> >>>> >sort=vid asc
> >>>> >start=0
> >>>> >rows=1000
> >>>> >defType=edismax
> >>>> >q=*:*
> >>>> >fq=recordType:"xxx"
> >>>> >fq=vt:"X12B" AND
> >>>> >fq=(cls:"3" OR cls:"8")
> >>>> >fq=dt:[2013-05-08T00:00:00.00Z TO 2013-07-08T00:00:00.00Z]
> >>>> >fq=(vid:86XXX73 OR vid:86XXX20 OR vid:89XXX60 OR vid:89XXX72 OR vid:89XXX48
> >>>> >OR vid:89XXX31 OR vid:89XXX28 OR vid:89XXX67 OR vid:90XXX76 OR vid:90XXX33
> >>>> >OR vid:90XXX47 OR vid:90XXX97 OR vid:90XXX69 OR vid:90XXX31 OR vid:90XXX44
> >>>> >OR vid:91XXX82 OR vid:91XXX08 OR vid:91XXX32 OR vid:91XXX13 OR vid:91XXX87
> >>>> >OR vid:91XXX82 OR vid:91XXX48 OR vid:91XXX34 OR vid:91XXX31 OR vid:91XXX94
> >>>> >OR vid:91XXX29 OR vid:91XXX31 OR vid:91XXX43 OR vid:91XXX55 OR vid:91XXX67
> >>>> >OR vid:91XXX15 OR vid:91XXX59 OR vid:92XXX95 OR vid:92XXX24 OR vid:92XXX13
> >>>> >OR vid:92XXX07 OR vid:92XXX92 OR vid:92XXX22 OR vid:92XXX25 OR vid:92XXX99
> >>>> >OR vid:92XXX53 OR vid:92XXX55 OR vid:92XXX27 OR vid:92XXX65 OR vid:92XXX41
> >>>> >OR vid:92XXX89 OR vid:92XXX11 OR vid:93XXX45 OR vid:93XXX05 OR vid:93XXX98
> >>>> >OR vid:93XXX70 OR vid:93XXX24 OR vid:93XXX39 OR vid:93XXX69 OR vid:93XXX28
> >>>> >OR vid:93XXX79 OR vid:93XXX66 OR vid:94XXX13 OR vid:94XXX16 OR vid:94XXX10
> >>>> >OR vid:94XXX37 OR vid:94XXX69 OR vid:94XXX29 OR vid:94XXX70 OR vid:94XXX58
> >>>> >OR vid:94XXX08 OR vid:94XXX64 OR vid:94XXX32 OR vid:94XXX44 OR vid:94XXX56
> >>>> >OR vid:95XXX59 OR vid:95XXX72 OR vid:95XXX14 OR vid:95XXX08 OR vid:96XXX10
> >>>> >OR vid:96XXX54 )
> >>>> >fq=gp:"Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0, 47.0 30.0)))"
> >>>> >AND NOT pp:"Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0, 47.0 30.0)))"
> >>>> >AND +pp:*
> >>>> >
> >>>> >Basically I'm looking for a set of records by "vid", then checking
> >>>> >whether its gp is in one polygon and its pp is not in another (and it
> >>>> >has a pp). Essentially I'm looking to see if a record moved between two
> >>>> >polygons (gp=current, pp=prev) during a time period.
> >>>> >
> >>>> >#2 Yes on JTS (unless my query above shows I don't need it); however,
> >>>> >this is only an initial use case and I suspect we'll need more complex
> >>>> >stuff in the future.
> >>>> >
> >>>> >#3 The data is distributed globally, but generally along fixed paths,
> >>>> >and then clusters around certain areas. For example, the polygon above
> >>>> >has about 11k points (with no date filtering). So some areas will be
> >>>> >very dense and most areas not; the majority of searches will be around
> >>>> >the dense areas.
> >>>> >
> >>>> >#4 It's very likely to be less than 1M results (with filters). Is there
> >>>> >any functionality loss with LatLonType fields?
> >>>> >
> >>>> >Thanks,
> >>>> >
> >>>> >steve
> >>>> >
> >>>> >
> >>>> >On Tue, Jul 30, 2013 at 10:49 AM, David Smiley (@MITRE.org) <
> >>>> >DSMILEY@mitre.org> wrote:
> >>>> >
> >>>> >> Steve,
> >>>> >> (1)  Can you give a specific example of how you are specifying the
> >>>> >> spatial query?  I'm looking to ensure you are not using "IsWithin",
> >>>> >> which is not meant for point data.  If your query shape is a circle or
> >>>> >> the bounding box of a circle, you should use the geofilt query parser,
> >>>> >> otherwise use the quirky syntax that allows you to specify the spatial
> >>>> >> predicate with "Intersects".
> >>>> >> (2) Do you actually need JTS?  i.e. are you using Polygons, etc.
> >>>> >> (3) How "dense" would you estimate the data is at the 50m resolution
> >>>> >> you've configured the data?  If it's very dense then I'll tell you how
> >>>> >> to raise the "prefix grid scan level" to a # closer to max-levels.
> >>>> >> (4) Do all of your searches find less than a million points,
> >>>> >> considering all filters?  If so then it's worth comparing the results
> >>>> >> with LatLonType.
> >>>> >>
> >>>> >> ~ David Smiley
> >>>> >>
> >>>> >>
> >>>> >> Steven Bower wrote
> >>>> >> > @Erick it is a lot of hw, but basically I'm trying to create a "best
> >>>> >> > case scenario" to take HW out of the question. Will try increasing
> >>>> >> > heap size tomorrow. I haven't seen it get close to the max heap size
> >>>> >> > yet, but it's worth trying...
> >>>> >> >
> >>>> >> > Note that these queries look something like:
> >>>> >> >
> >>>> >> > q=*:*
> >>>> >> > fq=[date range]
> >>>> >> > fq=geo query
> >>>> >> >
> >>>> >> > On the fq for the geo query I've added {!cache=false} to prevent it
> >>>> >> > from ending up in the filter cache. Once it's in the filter cache,
> >>>> >> > queries come back in 10-20ms. For my use case I need the first unique
> >>>> >> > geo search query to come back in a more reasonable time, so I am
> >>>> >> > currently ignoring the cache.
> >>>> >> >
> >>>> >> > @Bill will look into that. I'm not certain it will support the
> >>>> >> > particular queries that are being executed, but I'll investigate.
> >>>> >> >
> >>>> >> > steve
> >>>> >> >
> >>>> >> >
> >>>> >> > On Mon, Jul 29, 2013 at 6:25 PM, Erick Erickson <erickerickson@> wrote:
> >>>> >> >
> >>>> >> >> This is very strange. I'd expect slow queries on
> >>>> >> >> the first few queries while these caches were
> >>>> >> >> warmed, but after that I'd expect things to
> >>>> >> >> be quite fast.
> >>>> >> >>
> >>>> >> >> For a 12G index and 256G RAM, you have on the
> >>>> >> >> surface a LOT of hardware to throw at this problem.
> >>>> >> >> You can _try_ giving the JVM, say, 18G but that
> >>>> >> >> really shouldn't be a big issue, your index files
> >>>> >> >> should be MMaped.
> >>>> >> >>
> >>>> >> >> Let's try the crude thing first and give the JVM
> >>>> >> >> more memory.
> >>>> >> >>
> >>>> >> >> FWIW
> >>>> >> >> Erick
> >>>> >> >>
> >>>> >> >> On Mon, Jul 29, 2013 at 4:45 PM, Steven Bower <smb-apache@> wrote:
> >>>> >> >> > I've been doing some performance analysis of a spatial search use
> >>>> >> >> > case I'm implementing in Solr 4.3.0. Basically I'm seeing search
> >>>> >> >> > times a lot higher than I'd like them to be and I'm hoping people
> >>>> >> >> > may have some suggestions for how to optimize further.
> >>>> >> >> >
> >>>> >> >> > Here are the specs of what I'm doing now:
> >>>> >> >> >
> >>>> >> >> > Machine:
> >>>> >> >> > - 16 cores @ 2.8ghz
> >>>> >> >> > - 256gb RAM
> >>>> >> >> > - 1TB (RAID 1+0 on 10 SSD)
> >>>> >> >> >
> >>>> >> >> > Content:
> >>>> >> >> > - 45M docs (not very big, only a few fields with no large textual
> >>>> >> >> >   content)
> >>>> >> >> > - 1 geo field (using config below)
> >>>> >> >> > - index is 12gb
> >>>> >> >> > - 1 shard
> >>>> >> >> > - Using MMapDirectory
> >>>> >> >> >
> >>>> >> >> > Field config:
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> > <fieldType name="geo"
> >>>> >> >> >   class="solr.SpatialRecursivePrefixTreeFieldType"
> >>>> >> >> >   distErrPct="0.025" maxDistErr="0.00045"
> >>>> >> >> >   spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
> >>>> >> >> >   units="degrees"/>
> >>>> >> >> >
> >>>> >> >> > <field name="geopoint" indexed="true" multiValued="false"
> >>>> >> >> >   required="false" stored="true" type="geo"/>
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> > What I've figured out so far:
> >>>> >> >> >
> >>>> >> >> > - Most of my time (98%) is being spent in
> >>>> >> >> >   java.nio.Bits.copyToByteArray(long,Object,long,long), which is
> >>>> >> >> >   being driven by
> >>>> >> >> >   BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock(),
> >>>> >> >> >   which from what I gather is basically reading terms from the .tim
> >>>> >> >> >   file in blocks
> >>>> >> >> >
> >>>> >> >> > - I moved from Java 1.6 to 1.7 based upon what I read here:
> >>>> >> >> >   http://blog.vlad1.com/2011/10/05/looking-at-java-nio-buffer-performance/
> >>>> >> >> >   and it definitely had some positive impact (I haven't been able to
> >>>> >> >> >   measure this independently yet)
> >>>> >> >> >
> >>>> >> >> > - I changed maxDistErr from 0.000009 (which is 1m precision per doc)
> >>>> >> >> >   to 0.00045 (50m precision)
> >>>> >> >> >
> >>>> >> >> > - It looks to me that the .tim files are being memory mapped fully
> >>>> >> >> >   (i.e. they show up in pmap output); the virtual size of the JVM is
> >>>> >> >> >   ~18gb (heap is 6gb)
> >>>> >> >> >
> >>>> >> >> > - I've optimized the index but this doesn't have a dramatic impact
> >>>> >> >> >   on performance
> >>>> >> >> >
> >>>> >> >> > Changing the precision and the JVM upgrade yielded a drop from ~18s
> >>>> >> >> > avg query time to ~9s avg query time. This is fantastic but I want
> >>>> >> >> > to get this down into the 1-2 second range.
> >>>> >> >> >
> >>>> >> >> > At this point it seems that I am bottlenecked on copying memory out
> >>>> >> >> > of the mapped .tim file, which leads me to think that the only
> >>>> >> >> > solution to my problem would be to read less data or somehow read it
> >>>> >> >> > more efficiently.
> >>>> >> >> >
> >>>> >> >> > If anyone has any suggestions of where to go with this I'd love to
> >>>> >> >> > know.
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> > thanks,
> >>>> >> >> >
> >>>> >> >> > steve
> >>>> >> >>
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >> -----
> >>>> >>  Author:
> >>>> >>
> http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
> >>>> >> --
> >>>> >>
> >>>>
> >>>>
> >>>
> >>
>
>
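
The maxDistErr figures quoted in the original message (0.000009 degrees for about 1m, 0.00045 degrees for about 50m) follow from the length of one degree of arc on the Earth's surface. A small worked sketch, assuming a spherical Earth with a mean radius of 6,371 km; that radius and the function names are assumptions introduced here, not something stated in the thread.

```python
import math

# Assumed mean Earth radius in meters (spherical approximation).
EARTH_RADIUS_M = 6_371_000.0


def degrees_to_meters(deg):
    """Arc length in meters covered by `deg` degrees along a great circle."""
    return math.radians(deg) * EARTH_RADIUS_M


def meters_to_degrees(meters):
    """Inverse of degrees_to_meters, in degrees of arc."""
    return math.degrees(meters / EARTH_RADIUS_M)


if __name__ == "__main__":
    for max_dist_err in (0.000009, 0.00045):
        print(max_dist_err, "degrees is about",
              round(degrees_to_meters(max_dist_err), 2), "m")
```

This holds for degrees of latitude; a degree of longitude covers less ground away from the equator, so the effective precision in meters varies with location.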


-- 
- Luis Cappa
