lucene-solr-user mailing list archives

From "Smiley, David W." <dsmi...@mitre.org>
Subject Re: Performance question on Spatial Search
Date Tue, 30 Jul 2013 21:15:29 GMT
Luis,

field:* and field:[* TO *] are semantically equivalent -- they have the
same effect.  But they internally work differently depending on the field
type.  The field type has the chance to intercept the range query to do
something smart (FieldType.getRangeQuery(...)).  Numeric/Date (trie)
fields have a reasonably quick implementation for such queries.  Spatial
fields could be enhanced similarly but aren't (yet).  So in general you
should avoid field:* in favor of field:[* TO *].  Perhaps Solr should
redirect a field:* to the FieldType's getRangeQuery method so that there
is no difference.  Anyway, the official/best way to ask for all data in a
field (without cheating and indexing a boolean in a different field) is
field:[* TO *].
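
A minimal sketch of that advice, assuming a client-side helper that rewrites
bare field:* clauses into the range form before the query string is sent to
Solr. The name prefer_range_form is hypothetical and not part of Solr; the
regex is deliberately simple and does not understand quoted phrases or
escaping, so treat it as an illustration rather than a query parser. It only
helps for field types with a fast range implementation (trie numeric/date);
for spatial fields the boolean-field approach is still the one to use.

```python
import re

# Matches a bare "field exists" clause such as pp:* (a leading + or - is
# left untouched). It does not match the match-all query *:*, prefix
# wildcards such as name:foo*, or clauses already written as [* TO *].
# Assumption: field names look like identifiers (letters, digits, underscore).
_EXISTS_CLAUSE = re.compile(r'(?<![\w:*])([A-Za-z_]\w*):\*(?![\w*?])')


def prefer_range_form(query: str) -> str:
    """Rewrite each bare field:* clause as field:[* TO *]."""
    return _EXISTS_CLAUSE.sub(r'\1:[* TO *]', query)


if __name__ == "__main__":
    print(prefer_range_form('vt:"X12B" AND +dt:*'))
```

Running the helper twice leaves the query unchanged, since the range form no
longer matches the pattern.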

~ David

On 7/30/13 4:44 PM, "Luis Cappa Banda" <luiscappa@gmail.com> wrote:

>Hey, David,
>
>I've been reading the thread and I think it is one of the most educational
>mail threads I've read on the Solr mailing list. Just out of curiosity:
>internally, does Solr treat a query like "field:*" the same as
>"field:[* TO *]"? I expect both to return the same numFound, but I would
>like to know how Solr handles them internally.
>
>Best regards,
>
>- Luis Cappa
>
>
>2013/7/30 Smiley, David W. <dsmiley@mitre.org>
>
>> Steve,
>> The FieldCache and DocValues are irrelevant to this problem.  Solr's
>> FilterCache is relevant, and Lucene has no counterpart.  Perhaps it would
>> be cool if Solr could look for expensive field:* usages when parsing its
>> queries and rewrite them to use the FilterCache.  That's quite doable, I
>> think.  I just created an issue for it:
>> https://issues.apache.org/jira/browse/SOLR-5093 but don't expect me to
>> work on it anytime soon ;-)
>>
>>
>> ~ David
>>
>> On 7/30/13 2:02 PM, "Steven Bower" <sbower@alcyon.net> wrote:
>>
>> >I am curious why field:* walks the entire terms list.. could this be
>> >discovered from a field cache / docvalues?
>> >
>> >steve
>> >
>> >
>> >On Tue, Jul 30, 2013 at 2:00 PM, Steven Bower <sbower@alcyon.net> wrote:
>> >
>> >> Until I get the data re-fed: there was another field (a date field)
>> >> that is present exactly when the geo field is, and absent when it is
>> >> not... I tried field:* on that field and query times came down to
>> >> 2.5s.. also, just removing that filter brings the query down to 30ms..
>> >> so I'm very hopeful that with just a boolean I'll be down in the
>> >> sub-100ms range..
>> >>
>> >> steve
>> >>
>> >>
>> >> On Tue, Jul 30, 2013 at 12:02 PM, Steven Bower <sbower@alcyon.net> wrote:
>> >>
>> >>> Will give the boolean thing a shot... makes sense...
>> >>>
>> >>>
>> >>> On Tue, Jul 30, 2013 at 11:53 AM, Smiley, David W. <dsmiley@mitre.org> wrote:
>> >>>
>> >>>> I see the problem -- it's +pp:*. It may look innocent but it's a
>> >>>> performance killer.  What you're telling Lucene to do is iterate over
>> >>>> *every* term in this index to find all documents that have this data.
>> >>>> Most fields are pretty slow to do that.  Lucene/Solr does not have
>> >>>> some kind of cache for this. Instead, you should index a new boolean
>> >>>> field indicating whether or not 'pp' is populated and then do a simple
>> >>>> true check against that field.  Another approach you could do right
>> >>>> now without reindexing is to simplify the last 2 clauses of your
>> >>>> 3-clause boolean query by using the "IsDisjointTo" predicate.  But
>> >>>> unfortunately Lucene doesn't have a generic filter cache capability
>> >>>> and so this predicate has no place to cache the whole-world query it
>> >>>> does internally (each and every time it's used), so it will be slower
>> >>>> than the boolean field I suggested you add.
>> >>>>
>> >>>>
>> >>>> Never mind on LatLonType; it doesn't support JTS/Polygons.  There is
>> >>>> something close called SpatialPointVectorFieldType that could be
>> >>>> modified trivially but it doesn't support it now.
>> >>>>
>> >>>> ~ David
>> >>>>
>> >>>> On 7/30/13 11:32 AM, "Steven Bower" <sbower@alcyon.net> wrote:
>> >>>>
>> >>>> >#1 Here is my query:
>> >>>> >
>> >>>> >sort=vid asc
>> >>>> >start=0
>> >>>> >rows=1000
>> >>>> >defType=edismax
>> >>>> >q=*:*
>> >>>> >fq=recordType:"xxx"
>> >>>> >fq=vt:"X12B" AND
>> >>>> >fq=(cls:"3" OR cls:"8")
>> >>>> >fq=dt:[2013-05-08T00:00:00.00Z TO 2013-07-08T00:00:00.00Z]
>> >>>> >fq=(vid:86XXX73 OR vid:86XXX20 OR vid:89XXX60 OR vid:89XXX72 OR vid:89XXX48
>> >>>> >OR vid:89XXX31 OR vid:89XXX28 OR vid:89XXX67 OR vid:90XXX76 OR vid:90XXX33
>> >>>> >OR vid:90XXX47 OR vid:90XXX97 OR vid:90XXX69 OR vid:90XXX31 OR vid:90XXX44
>> >>>> >OR vid:91XXX82 OR vid:91XXX08 OR vid:91XXX32 OR vid:91XXX13 OR vid:91XXX87
>> >>>> >OR vid:91XXX82 OR vid:91XXX48 OR vid:91XXX34 OR vid:91XXX31 OR vid:91XXX94
>> >>>> >OR vid:91XXX29 OR vid:91XXX31 OR vid:91XXX43 OR vid:91XXX55 OR vid:91XXX67
>> >>>> >OR vid:91XXX15 OR vid:91XXX59 OR vid:92XXX95 OR vid:92XXX24 OR vid:92XXX13
>> >>>> >OR vid:92XXX07 OR vid:92XXX92 OR vid:92XXX22 OR vid:92XXX25 OR vid:92XXX99
>> >>>> >OR vid:92XXX53 OR vid:92XXX55 OR vid:92XXX27 OR vid:92XXX65 OR vid:92XXX41
>> >>>> >OR vid:92XXX89 OR vid:92XXX11 OR vid:93XXX45 OR vid:93XXX05 OR vid:93XXX98
>> >>>> >OR vid:93XXX70 OR vid:93XXX24 OR vid:93XXX39 OR vid:93XXX69 OR vid:93XXX28
>> >>>> >OR vid:93XXX79 OR vid:93XXX66 OR vid:94XXX13 OR vid:94XXX16 OR vid:94XXX10
>> >>>> >OR vid:94XXX37 OR vid:94XXX69 OR vid:94XXX29 OR vid:94XXX70 OR vid:94XXX58
>> >>>> >OR vid:94XXX08 OR vid:94XXX64 OR vid:94XXX32 OR vid:94XXX44 OR vid:94XXX56
>> >>>> >OR vid:95XXX59 OR vid:95XXX72 OR vid:95XXX14 OR vid:95XXX08 OR vid:96XXX10
>> >>>> >OR vid:96XXX54 )
>> >>>> >fq=gp:"Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0,
>> >>>> >47.0 30.0)))" AND NOT pp:"Intersects(POLYGON((47.0 30.0, 47.0 27.0,
>> >>>> >52.0 27.0, 52.0 30.0, 47.0 30.0)))" AND +pp:*
>> >>>> >
>> >>>> >Basically looking for a set of records by "vid", then checking whether
>> >>>> >its gp is in one polygon and its pp is not in another (and it has a
>> >>>> >pp)... essentially looking to see if a record moved between two
>> >>>> >polygons (gp=current, pp=prev) during a time period.
>> >>>> >
>> >>>> >#2 Yes on JTS (unless from my query above I don't) however this is only
>> >>>> >an initial use case and I suspect we'll need more complex stuff in the
>> >>>> >future
>> >>>> >
>> >>>> >#3 The data is distributed globally but along generally fixed paths and
>> >>>> >then clustering around certain areas... for example the polygon above
>> >>>> >has about 11k points (with no date filtering). So basically some areas
>> >>>> >will be very dense and most areas not, the majority of searches will be
>> >>>> >around the dense areas
>> >>>> >
>> >>>> >#4 It's very likely to be less than 1M results (with filters) .. is
>> >>>> >there any functionality loss with LatLonType fields?
>> >>>> >
>> >>>> >Thanks,
>> >>>> >
>> >>>> >steve
>> >>>> >
>> >>>> >
>> >>>> >On Tue, Jul 30, 2013 at 10:49 AM, David Smiley (@MITRE.org) <DSMILEY@mitre.org> wrote:
>> >>>> >
>> >>>> >> Steve,
>> >>>> >> (1)  Can you give a specific example of how you are specifying the
>> >>>> >> spatial query?  I'm looking to ensure you are not using "IsWithin",
>> >>>> >> which is not meant for point data.  If your query shape is a circle
>> >>>> >> or the bounding box of a circle, you should use the geofilt query
>> >>>> >> parser, otherwise use the quirky syntax that allows you to specify
>> >>>> >> the spatial predicate with "Intersects".
>> >>>> >> (2) Do you actually need JTS?  i.e. are you using Polygons, etc.
>> >>>> >> (3) How "dense" would you estimate the data is at the 50m resolution
>> >>>> >> you've configured the data?  If it's very dense then I'll tell you
>> >>>> >> how to raise the "prefix grid scan level" to a # closer to
>> >>>> >> max-levels.
>> >>>> >> (4) Do all of your searches find less than a million points,
>> >>>> >> considering all filters?  If so then it's worth comparing the
>> >>>> >> results with LatLonType.
>> >>>> >>
>> >>>> >> ~ David Smiley
>> >>>> >>
>> >>>> >>
>> >>>> >> Steven Bower wrote
>> >>>> >> > @Erick it is a lot of hw, but basically trying to create a "best
>> >>>> >> > case scenario" to take HW out of the question. Will try increasing
>> >>>> >> > heap size tomorrow.. I haven't seen it get close to the max heap
>> >>>> >> > size yet.. but it's worth trying...
>> >>>> >> >
>> >>>> >> > Note that these queries look something like:
>> >>>> >> >
>> >>>> >> > q=*:*
>> >>>> >> > fq=[date range]
>> >>>> >> > fq=geo query
>> >>>> >> >
>> >>>> >> > on the fq for the geo query i've added {!cache=false} to prevent
>> >>>> >> > it from ending up in the filter cache.. once it's in filter cache
>> >>>> >> > queries come back in 10-20ms. For my use case i need the first
>> >>>> >> > unique geo search query to come back in a more reasonable time so
>> >>>> >> > I am currently ignoring the cache.
>> >>>> >> >
>> >>>> >> > @Bill will look into that, I'm not certain it will support the
>> >>>> >> > particular queries that are being executed but I'll investigate..
>> >>>> >> >
>> >>>> >> > steve
>> >>>> >> >
>> >>>> >> >
>> >>>> >> > On Mon, Jul 29, 2013 at 6:25 PM, Erick Erickson <erickerickson@> wrote:
>> >>>> >> >
>> >>>> >> >> This is very strange. I'd expect slow queries on
>> >>>> >> >> the first few queries while these caches were
>> >>>> >> >> warmed, but after that I'd expect things to
>> >>>> >> >> be quite fast.
>> >>>> >> >>
>> >>>> >> >> For a 12G index and 256G RAM, you have on the
>> >>>> >> >> surface a LOT of hardware to throw at this problem.
>> >>>> >> >> You can _try_ giving the JVM, say, 18G but that
>> >>>> >> >> really shouldn't be a big issue, your index files
>> >>>> >> >> should be MMaped.
>> >>>> >> >>
>> >>>> >> >> Let's try the crude thing first and give the JVM
>> >>>> >> >> more memory.
>> >>>> >> >>
>> >>>> >> >> FWIW
>> >>>> >> >> Erick
>> >>>> >> >>
>> >>>> >> >> On Mon, Jul 29, 2013 at 4:45 PM, Steven Bower <smb-apache@> wrote:
>> >>>> >> >> > I've been doing some performance analysis of a spatial search
>> >>>> >> >> > use case I'm implementing in Solr 4.3.0. Basically I'm seeing
>> >>>> >> >> > search times a lot higher than I'd like them to be and I'm
>> >>>> >> >> > hoping people may have some suggestions for how to optimize
>> >>>> >> >> > further.
>> >>>> >> >> >
>> >>>> >> >> > Here are the specs of what I'm doing now:
>> >>>> >> >> >
>> >>>> >> >> > Machine:
>> >>>> >> >> > - 16 cores @ 2.8ghz
>> >>>> >> >> > - 256gb RAM
>> >>>> >> >> > - 1TB (RAID 1+0 on 10 SSD)
>> >>>> >> >> >
>> >>>> >> >> > Content:
>> >>>> >> >> > - 45M docs (not very big, only a few fields with no large
>> >>>> >> >> >   textual content)
>> >>>> >> >> > - 1 geo field (using config below)
>> >>>> >> >> > - index is 12gb
>> >>>> >> >> > - 1 shard
>> >>>> >> >> > - Using MMapDirectory
>> >>>> >> >> >
>> >>>> >> >> > Field config:
>> >>>> >> >> >
>> >>>> >> >> >
>> >>>> >> >> > <fieldType name="geo"
>> >>>> >> >> >   class="solr.SpatialRecursivePrefixTreeFieldType"
>> >>>> >> >> >   distErrPct="0.025" maxDistErr="0.00045"
>> >>>> >> >> >   spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
>> >>>> >> >> >   units="degrees"/>
>> >>>> >> >> >
>> >>>> >> >> > <field name="geopoint" indexed="true" multiValued="false"
>> >>>> >> >> >   required="false" stored="true" type="geo"/>
>> >>>> >> >> >
>> >>>> >> >> >
>> >>>> >> >> > What I've figured out so far:
>> >>>> >> >> >
>> >>>> >> >> > - Most of my time (98%) is being spent in
>> >>>> >> >> > java.nio.Bits.copyToByteArray(long,Object,long,long) which is
>> >>>> >> >> > being driven by
>> >>>> >> >> > BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()
>> >>>> >> >> > which from what I gather is basically reading terms from the
>> >>>> >> >> > .tim file in blocks
>> >>>> >> >> >
>> >>>> >> >> > - I moved from Java 1.6 to 1.7 based upon what I read here:
>> >>>> >> >> > http://blog.vlad1.com/2011/10/05/looking-at-java-nio-buffer-performance/
>> >>>> >> >> > and it definitely had some positive impact (i haven't been able
>> >>>> >> >> > to measure this independently yet)
>> >>>> >> >> >
>> >>>> >> >> > - I changed maxDistErr from 0.000009 (which is 1m precision per
>> >>>> >> >> > docs) to 0.00045 (50m precision) ..
>> >>>> >> >> >
>> >>>> >> >> > - It looks to me that the .tim files are being memory mapped
>> >>>> >> >> > fully (ie they show up in pmap output) the virtual size of the
>> >>>> >> >> > jvm is ~18gb (heap is 6gb)
>> >>>> >> >> >
>> >>>> >> >> > - I've optimized the index but this doesn't have a dramatic
>> >>>> >> >> > impact on performance
>> >>>> >> >> >
>> >>>> >> >> > Changing the precision and the JVM upgrade yielded a drop from
>> >>>> >> >> > ~18s avg query time to ~9s avg query time.. This is fantastic
>> >>>> >> >> > but I want to get this down into the 1-2 second range.
>> >>>> >> >> >
>> >>>> >> >> > At this point it seems that i am bottle-necked on copying
>> >>>> >> >> > memory out of the mapped .tim file which leads me to think that
>> >>>> >> >> > the only solution to my problem would be to read less data or
>> >>>> >> >> > somehow read it more efficiently..
>> >>>> >> >> >
>> >>>> >> >> > If anyone has any suggestions of where to go with this I'd love
>> >>>> >> >> > to know
>> >>>> >> >> >
>> >>>> >> >> >
>> >>>> >> >> > thanks,
>> >>>> >> >> >
>> >>>> >> >> > steve
>> >>>> >> >>
>> >>>> >>
>> >>>> >>
>> >>>> >>
>> >>>> >>
>> >>>> >>
>> >>>> >> -----
>> >>>> >>  Author:
>> >>>> >> http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
>> >>>> >> --
>> >>>> >> View this message in context:
>> >>>> >> http://lucene.472066.n3.nabble.com/Performance-question-on-Spatial-Search-tp4081150p4081309.html
>> >>>> >> Sent from the Solr - User mailing list archive at Nabble.com.
>> >>>> >>
>> >>>>
>> >>>>
>> >>>
>> >>
>>
>>
>
>
>-- 
>- Luis Cappa
