lucene-solr-user mailing list archives

From "Smiley, David W." <dsmi...@mitre.org>
Subject Re: Performance question on Spatial Search
Date Tue, 30 Jul 2013 15:53:59 GMT
I see the problem: it's +pp:*. It may look innocent but it's a
performance killer.  What you're telling Lucene to do is iterate over
*every* term in this index to find all documents that have this data.
Most fields are pretty slow to do that.  Lucene/Solr does not have some
kind of cache for this. Instead, you should index a new boolean field
indicating whether or not 'pp' is populated and then do a simple true check
against that field.  Another approach you could do right now without
reindexing is to simplify the last 2 clauses of your 3-clause boolean
query by using the "IsDisjointTo" predicate.  But unfortunately Lucene
doesn't have a generic filter cache capability, so this predicate has
no place to cache the whole-world query it does internally (each and every
time it's used), and it will be slower than the boolean field I suggested
you add.
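A minimal sketch of the boolean-field approach (the field name "has_pp" is
hypothetical; you would set the flag at index time whenever 'pp' is
populated):

```xml
<!-- schema.xml: hypothetical flag, set at index time when 'pp' has a value -->
<field name="has_pp" type="boolean" indexed="true" stored="false"/>
```

Then fq=has_pp:true replaces +pp:* with a cheap single-term lookup.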


Never mind on LatLonType; it doesn't support JTS/Polygons.  There is
something close called SpatialPointVectorFieldType that could be modified
trivially, but it doesn't support it now.
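To illustrate the IsDisjointTo rewrite suggested above, the last two clauses
of the filter quoted below would collapse into one (a sketch only; exact
escaping may vary with your client):

```text
fq=gp:"Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0, 47.0 30.0)))"
   AND pp:"IsDisjointTo(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0, 47.0 30.0)))"
```

IsDisjointTo only matches documents where pp is populated, so the separate
+pp:* clause becomes unnecessary.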

~ David

On 7/30/13 11:32 AM, "Steven Bower" <sbower@alcyon.net> wrote:

>#1 Here is my query:
>
>sort=vid asc
>start=0
>rows=1000
>defType=edismax
>q=*:*
>fq=recordType:"xxx"
>fq=vt:"X12B" AND
>fq=(cls:"3" OR cls:"8")
>fq=dt:[2013-05-08T00:00:00.00Z TO 2013-07-08T00:00:00.00Z]
>fq=(vid:86XXX73 OR vid:86XXX20 OR vid:89XXX60 OR vid:89XXX72 OR
>vid:89XXX48
>OR vid:89XXX31 OR vid:89XXX28 OR vid:89XXX67 OR vid:90XXX76 OR vid:90XXX33
>OR vid:90XXX47 OR vid:90XXX97 OR vid:90XXX69 OR vid:90XXX31 OR vid:90XXX44
>OR vid:91XXX82 OR vid:91XXX08 OR vid:91XXX32 OR vid:91XXX13 OR vid:91XXX87
>OR vid:91XXX82 OR vid:91XXX48 OR vid:91XXX34 OR vid:91XXX31 OR vid:91XXX94
>OR vid:91XXX29 OR vid:91XXX31 OR vid:91XXX43 OR vid:91XXX55 OR vid:91XXX67
>OR vid:91XXX15 OR vid:91XXX59 OR vid:92XXX95 OR vid:92XXX24 OR vid:92XXX13
>OR vid:92XXX07 OR vid:92XXX92 OR vid:92XXX22 OR vid:92XXX25 OR vid:92XXX99
>OR vid:92XXX53 OR vid:92XXX55 OR vid:92XXX27 OR vid:92XXX65 OR vid:92XXX41
>OR vid:92XXX89 OR vid:92XXX11 OR vid:93XXX45 OR vid:93XXX05 OR vid:93XXX98
>OR vid:93XXX70 OR vid:93XXX24 OR vid:93XXX39 OR vid:93XXX69 OR vid:93XXX28
>OR vid:93XXX79 OR vid:93XXX66 OR vid:94XXX13 OR vid:94XXX16 OR vid:94XXX10
>OR vid:94XXX37 OR vid:94XXX69 OR vid:94XXX29 OR vid:94XXX70 OR vid:94XXX58
>OR vid:94XXX08 OR vid:94XXX64 OR vid:94XXX32 OR vid:94XXX44 OR vid:94XXX56
>OR vid:95XXX59 OR vid:95XXX72 OR vid:95XXX14 OR vid:95XXX08 OR vid:96XXX10
>OR vid:96XXX54 )
>fq=gp:"Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0,
>47.0
>30.0)))" AND NOT pp:"Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0,
>52.0 30.0, 47.0 30.0)))" AND +pp:*
>
>Basically looking for a set of records by "vid", then checking whether its
>gp is in one polygon and its pp is not in another (and it has a pp)...
>essentially looking to see if a record moved between two polygons
>(gp=current, pp=prev) during a time period.
>
>#2 Yes on JTS (unless, from my query above, I don't); however, this is only
>an initial use case and I suspect we'll need more complex stuff in the future
>
>#3 The data is distributed globally, but along generally fixed paths and
>then clustering around certain areas... for example, the polygon above has
>about 11k points (with no date filtering). So basically some areas will be
>very dense and most areas not; the majority of searches will be around the
>dense areas
>
>#4 It's very likely to be less than 1M results (with filters) .. is there
>any functionality loss with LatLonType fields?
>
>Thanks,
>
>steve
>
>
>On Tue, Jul 30, 2013 at 10:49 AM, David Smiley (@MITRE.org) <
>DSMILEY@mitre.org> wrote:
>
>> Steve,
>> (1)  Can you give a specific example of how you are specifying the
>>spatial
>> query?  I'm looking to ensure you are not using "IsWithin", which is not
>> meant for point data.  If your query shape is a circle or the bounding
>>box
>> of a circle, you should use the geofilt query parser, otherwise use the
>> quirky syntax that allows you to specify the spatial predicate with
>> "Intersects".
>> (2) Do you actually need JTS?  i.e. are you using Polygons, etc.
>> (3) How "dense" would you estimate the data is at the 50m resolution
>>you've
>>configured the data?  If it's very dense then I'll tell you how to raise
>> the
>> "prefix grid scan level" to a # closer to max-levels.
>> (4) Do all of your searches find less than a million points, considering
>> all
>> filters?  If so then it's worth comparing the results with LatLonType.
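>> The "prefix grid scan level" from (3) is a fieldType attribute; a sketch
>> against the config quoted below (the value 7 is illustrative, to be set
>> close to the field's max levels):

```xml
<fieldType name="geo" class="solr.SpatialRecursivePrefixTreeFieldType"
           distErrPct="0.025" maxDistErr="0.00045" units="degrees"
           prefixGridScanLevel="7"
           spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"/>
```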
>>
>> ~ David Smiley
>>
>>
>> Steven Bower wrote
>> > @Erick it is a lot of hw, but basically trying to create a "best case
>> > scenario" to take HW out of the question. Will try increasing heap
>>size
>> > tomorrow.. I haven't seen it get close to the max heap size yet.. but
>> it's
>> > worth trying...
>> >
>> > Note that these queries look something like:
>> >
>> > q=*:*
>> > fq=[date range]
>> > fq=geo query
>> >
>> > on the fq for the geo query i've added {!cache=false} to prevent it
>>from
>> > ending up in the filter cache.. once it's in filter cache queries come
>> > back
>> > in 10-20ms. For my use case i need the first unique geo search query
>>to
>> > come back in a more reasonable time so I am currently ignoring the
>>cache.
>> >
>> > @Bill will look into that, I'm not certain it will support the
>>particular
>> > queries that are being executed but I'll investigate..
>> >
>> > steve
>> >
>> >
>> > On Mon, Jul 29, 2013 at 6:25 PM, Erick Erickson &lt;
>>
>> > erickerickson@
>>
>> > &gt;wrote:
>> >
>> >> This is very strange. I'd expect slow queries on
>> >> the first few queries while these caches were
>> >> warmed, but after that I'd expect things to
>> >> be quite fast.
>> >>
>> >> For a 12G index and 256G RAM, you have on the
>> >> surface a LOT of hardware to throw at this problem.
>> >> You can _try_ giving the JVM, say, 18G but that
>> >> really shouldn't be a big issue, your index files
>> >> should be MMaped.
>> >>
>> >> Let's try the crude thing first and give the JVM
>> >> more memory.
>> >>
>> >> FWIW
>> >> Erick
>> >>
>> >> On Mon, Jul 29, 2013 at 4:45 PM, Steven Bower &lt;
>>
>> > smb-apache@
>>
>> > &gt;
>> >> wrote:
>> >> > I've been doing some performance analysis of a spacial search use
>>case
>> >> I'm
>> >> > implementing in Solr 4.3.0. Basically I'm seeing search times a lot
>> >> higher
>> >> > than I'd like them to be and I'm hoping people may have some
>> >> suggestions
>> >> > for how to optimize further.
>> >> >
>> >> > Here are the specs of what I'm doing now:
>> >> >
>> >> > Machine:
>> >> > - 16 cores @ 2.8ghz
>> >> > - 256gb RAM
>> >> > - 1TB (RAID 1+0 on 10 SSD)
>> >> >
>> >> > Content:
>> >> > - 45M docs (not very big only a few fields with no large textual
>> >> content)
>> >> > - 1 geo field (using config below)
>> >> > - index is 12gb
>> >> > - 1 shard
>> >> > - Using MMapDirectory
>> >> >
>> >> > Field config:
>> >> >
>> >> >
>> > <fieldType name="geo" class="solr.SpatialRecursivePrefixTreeFieldType"
>> >>
>> >  > distErrPct="0.025" maxDistErr="0.00045"
>> >> >
>> >>
>> 
>>spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
>> >> > units="degrees"/>
>> >> >
>> >> >
>> > <field  name="geopoint" indexed="true" multiValued="false"
>> >>
>> >  > required="false" stored="true" type="geo"/>
>> >> >
>> >> >
>> >> > What I've figured out so far:
>> >> >
>> >> > - Most of my time (98%) is being spent in
>> >> > java.nio.Bits.copyToByteArray(long,Object,long,long) which is being
>> >> > driven by
>> >> BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()
>> >> > which from what I gather is basically reading terms from the .tim
>>file
>> >> > in blocks
>> >> >
>> >> > - I moved from Java 1.6 to 1.7 based upon what I read here:
>> >> >
>> >>
>> http://blog.vlad1.com/2011/10/05/looking-at-java-nio-buffer-performance/
>> >> > and it definitely had some positive impact (I haven't been able to
>> >> > measure this independently yet)
>> >> >
>> >> > - I changed maxDistErr from 0.000009 (which is 1m precision, per the
>> >> > docs) to 0.00045 (50m precision)
>> >> >
>> >> > - It looks to me that the .tim files are being memory mapped fully
>> >> > (i.e. they show up in pmap output); the virtual size of the JVM is
>> >> > ~18gb (heap is 6gb)
>> >> >
>> >> > - I've optimized the index but this doesn't have a dramatic impact
>>on
>> >> > performance
>> >> >
>> >> > Changing the precision and the JVM upgrade yielded a drop from ~18s
>> >> > avg query time to ~9s avg query time.. This is fantastic but I
>>want to
>> >> > get this down into the 1-2 second range.
>> >> >
>> >> > At this point it seems that I am basically bottlenecked on copying
>> >> > memory out of the mapped .tim file, which leads me to think
>> >> > that the only solution to my problem would be to read less data or
>> >> > somehow read it more efficiently..
>> >> >
>> >> > If anyone has any suggestions of where to go with this I'd love to
>> know
>> >> >
>> >> >
>> >> > thanks,
>> >> >
>> >> > steve
>> >>
>>
>>
>>
>>
>>
>> -----
>>  Author:
>> http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
>>

