lucene-solr-user mailing list archives

From Luis Cappa Banda <luisca...@gmail.com>
Subject Re: Performance question on Spatial Search
Date Tue, 30 Jul 2013 21:33:31 GMT
Thank you very much, David. That was a great explanation!

Regards,

- Luis Cappa


2013/7/30 Smiley, David W. <dsmiley@mitre.org>

> Luis,
>
> field:* and field:[* TO *] are semantically equivalent -- they have the
> same effect.  But they internally work differently depending on the field
> type.  The field type has the chance to intercept the range query to do
> something smart (FieldType.getRangeQuery(...)).  Numeric/Date (trie)
> fields have a reasonably quick implementation for such queries.  Spatial
> fields could be enhanced similarly but aren't (yet).  So in general you
> should avoid field:* in favor of field:[* TO *].  Perhaps Solr should
> redirect a field:* to the FieldType's getRangeQuery method so that there
> is no difference.  Anyway, the official/best way to ask for all data in a
> field (without cheating and indexing a boolean in a different field) is
> field:[* TO *].
>
> ~ David
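
To make the two spellings concrete, here is a minimal sketch of building an "any value present" filter in both forms. The helper names (exists_filter, select_params, query_string) and the field name "dt" are only illustrative assumptions; nothing here is Solr API, it just assembles request parameters.

```python
from urllib.parse import urlencode


def exists_filter(field: str, use_range: bool = True) -> str:
    """Build a filter that matches documents with any value in `field`.

    use_range=True gives the open-ended range form recommended above;
    use_range=False gives the wildcard form, which is semantically
    equivalent but may be slower depending on the field type.
    """
    return f"{field}:[* TO *]" if use_range else f"{field}:*"


def select_params(field: str) -> list:
    """Parameters for a count-only request (hypothetical helper)."""
    return [("q", "*:*"), ("fq", exists_filter(field)), ("rows", "0")]


def query_string(field: str) -> str:
    """URL-encode the parameters for use in a /select request."""
    return urlencode(select_params(field))


print(exists_filter("dt"))                   # range form
print(exists_filter("dt", use_range=False))  # wildcard form
print(query_string("dt"))
```

Both forms should return the same numFound; only the internal execution path differs.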
>
> On 7/30/13 4:44 PM, "Luis Cappa Banda" <luiscappa@gmail.com> wrote:
>
> >Hey, David,
> >
> >I've been reading the thread and I think it is one of the most educational
> >mail threads I've read on the Solr mailing list. Just out of curiosity:
> >internally, does Solr treat a query like "field:*" the same as
> >"field:[* TO *]"? I expect both to return the same numFound, but I
> >would like to know the internal behavior of Solr.
> >
> >Best regards,
> >
> >- Luis Cappa
> >
> >
> >2013/7/30 Smiley, David W. <dsmiley@mitre.org>
> >
> >> Steve,
> >> The FieldCache and DocValues are irrelevant to this problem.  Solr's
> >> FilterCache is relevant, and Lucene has no counterpart.  Perhaps it would
> >> be cool if Solr could look for expensive field:* usages when parsing its
> >> queries and rewrite them to use the FilterCache.  That's quite doable, I
> >> think.  I just created an issue for it:
> >> https://issues.apache.org/jira/browse/SOLR-5093 but don't expect me to
> >> work on it anytime soon ;-)
> >>
> >>
> >> ~ David
> >>
> >> On 7/30/13 2:02 PM, "Steven Bower" <sbower@alcyon.net> wrote:
> >>
> >> >I am curious why field:* walks the entire terms list.. could this be
> >> >discovered from a field cache / docvalues?
> >> >
> >> >steve
> >> >
> >> >
> >> >On Tue, Jul 30, 2013 at 2:00 PM, Steven Bower <sbower@alcyon.net> wrote:
> >> >
> >> >> Until I get the data re-fed, I tried another field (a date field) that
> >> >> is populated exactly when the geo field is. Using that field:*
> >> >> instead, query times come down to 2.5s. Just removing that filter
> >> >> brings the query down to 30ms, so I'm very hopeful that with just a
> >> >> boolean I'll be down in that sub-100ms range.
> >> >>
> >> >> steve
> >> >>
> >> >>
> >> >> On Tue, Jul 30, 2013 at 12:02 PM, Steven Bower <sbower@alcyon.net> wrote:
> >> >>
> >> >>> Will give the boolean thing a shot... makes sense...
> >> >>>
> >> >>>
> >> >>> On Tue, Jul 30, 2013 at 11:53 AM, Smiley, David W. <dsmiley@mitre.org> wrote:
> >> >>>
> >> >>>> I see the problem -- it's +pp:*. It may look innocent but it's a
> >> >>>> performance killer.  What you're telling Lucene to do is iterate over
> >> >>>> *every* term in this index to find all documents that have this data.
> >> >>>> Most fields are pretty slow to do that.  Lucene/Solr does not have some
> >> >>>> kind of cache for this. Instead, you should index a new boolean field
> >> >>>> indicating whether or not 'pp' is populated and then do a simple true
> >> >>>> check against that field.  Another approach you could do right now
> >> >>>> without reindexing is to simplify the last 2 clauses of your 3-clause
> >> >>>> boolean query by using the "IsDisjointTo" predicate.  But unfortunately
> >> >>>> Lucene doesn't have a generic filter cache capability and so this
> >> >>>> predicate has no place to cache the whole-world query it does internally
> >> >>>> (each and every time it's used), so it will be slower than the boolean
> >> >>>> field I suggested you add.
> >> >>>>
> >> >>>>
> >> >>>> Never mind on LatLonType; it doesn't support JTS/Polygons.  There is
> >> >>>> something close called SpatialPointVectorFieldType that could be
> >> >>>> modified trivially but it doesn't support it now.
> >> >>>>
> >> >>>> ~ David
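
A rough sketch of the boolean-field suggestion, assuming a hypothetical boolean field named has_pp has been added to the schema and the data reindexed. The helper names and the sample point value are made up for illustration; only the gp/pp field names and the Intersects syntax come from the thread.

```python
def with_presence_flag(doc: dict, source: str = "pp", flag: str = "has_pp") -> dict:
    """Return a copy of `doc` with a boolean recording whether `source` is populated.

    `has_pp` is an assumed field name; it needs a matching boolean field
    in the schema before documents like this can be indexed.
    """
    enriched = dict(doc)  # leave the caller's document untouched
    enriched[flag] = doc.get(source) not in (None, "", [])
    return enriched


def moved_filter(polygon_wkt: str, flag: str = "has_pp") -> str:
    """Same filter as the original query, with +pp:* swapped for a cheap term check."""
    return (
        f'gp:"Intersects({polygon_wkt})" '
        f'AND NOT pp:"Intersects({polygon_wkt})" '
        f"AND {flag}:true"
    )


polygon = "POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0, 47.0 30.0))"
print(with_presence_flag({"vid": "1", "pp": "48.0 28.5"}))
print(with_presence_flag({"vid": "2"}))
print(moved_filter(polygon))
```

The point of the swap is that has_pp:true is a single-term lookup, whereas pp:* has to walk every term of the spatial field.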
> >> >>>>
> >> >>>> On 7/30/13 11:32 AM, "Steven Bower" <sbower@alcyon.net> wrote:
> >> >>>>
> >> >>>> >#1 Here is my query:
> >> >>>> >
> >> >>>> >sort=vid asc
> >> >>>> >start=0
> >> >>>> >rows=1000
> >> >>>> >defType=edismax
> >> >>>> >q=*:*
> >> >>>> >fq=recordType:"xxx"
> >> >>>> >fq=vt:"X12B" AND
> >> >>>> >fq=(cls:"3" OR cls:"8")
> >> >>>> >fq=dt:[2013-05-08T00:00:00.00Z TO 2013-07-08T00:00:00.00Z]
> >> >>>> >fq=(vid:86XXX73 OR vid:86XXX20 OR vid:89XXX60 OR vid:89XXX72 OR vid:89XXX48
> >> >>>> >OR vid:89XXX31 OR vid:89XXX28 OR vid:89XXX67 OR vid:90XXX76 OR vid:90XXX33
> >> >>>> >OR vid:90XXX47 OR vid:90XXX97 OR vid:90XXX69 OR vid:90XXX31 OR vid:90XXX44
> >> >>>> >OR vid:91XXX82 OR vid:91XXX08 OR vid:91XXX32 OR vid:91XXX13 OR vid:91XXX87
> >> >>>> >OR vid:91XXX82 OR vid:91XXX48 OR vid:91XXX34 OR vid:91XXX31 OR vid:91XXX94
> >> >>>> >OR vid:91XXX29 OR vid:91XXX31 OR vid:91XXX43 OR vid:91XXX55 OR vid:91XXX67
> >> >>>> >OR vid:91XXX15 OR vid:91XXX59 OR vid:92XXX95 OR vid:92XXX24 OR vid:92XXX13
> >> >>>> >OR vid:92XXX07 OR vid:92XXX92 OR vid:92XXX22 OR vid:92XXX25 OR vid:92XXX99
> >> >>>> >OR vid:92XXX53 OR vid:92XXX55 OR vid:92XXX27 OR vid:92XXX65 OR vid:92XXX41
> >> >>>> >OR vid:92XXX89 OR vid:92XXX11 OR vid:93XXX45 OR vid:93XXX05 OR vid:93XXX98
> >> >>>> >OR vid:93XXX70 OR vid:93XXX24 OR vid:93XXX39 OR vid:93XXX69 OR vid:93XXX28
> >> >>>> >OR vid:93XXX79 OR vid:93XXX66 OR vid:94XXX13 OR vid:94XXX16 OR vid:94XXX10
> >> >>>> >OR vid:94XXX37 OR vid:94XXX69 OR vid:94XXX29 OR vid:94XXX70 OR vid:94XXX58
> >> >>>> >OR vid:94XXX08 OR vid:94XXX64 OR vid:94XXX32 OR vid:94XXX44 OR vid:94XXX56
> >> >>>> >OR vid:95XXX59 OR vid:95XXX72 OR vid:95XXX14 OR vid:95XXX08 OR vid:96XXX10
> >> >>>> >OR vid:96XXX54 )
> >> >>>> >fq=gp:"Intersects(POLYGON((47.0 30.0, 47.0 27.0, 52.0 27.0, 52.0 30.0,
> >> >>>> >47.0 30.0)))" AND NOT pp:"Intersects(POLYGON((47.0 30.0, 47.0 27.0,
> >> >>>> >52.0 27.0, 52.0 30.0, 47.0 30.0)))" AND +pp:*
> >> >>>> >
> >> >>>> >Basically looking for a set of records by "vid", then checking whether
> >> >>>> >its gp is in one polygon and its pp is not in another (and it has a
> >> >>>> >pp)... essentially looking to see if a record moved between two
> >> >>>> >polygons (gp=current, pp=prev) during a time period.
> >> >>>> >
> >> >>>> >#2 Yes on JTS (unless from my query above I don't), however this is
> >> >>>> >only an initial use case and I suspect we'll need more complex stuff
> >> >>>> >in the future
> >> >>>> >
> >> >>>> >#3 The data is distributed globally but along generally fixed paths,
> >> >>>> >clustering around certain areas... for example the polygon above has
> >> >>>> >about 11k points (with no date filtering). So basically some areas
> >> >>>> >will be very dense and most areas not; the majority of searches will
> >> >>>> >be around the dense areas
> >> >>>> >
> >> >>>> >#4 It's very likely to be less than 1M results (with filters).. is
> >> >>>> >there any functionality loss with LatLonType fields?
> >> >>>> >
> >> >>>> >Thanks,
> >> >>>> >
> >> >>>> >steve
> >> >>>> >
> >> >>>> >
> >> >>>> >On Tue, Jul 30, 2013 at 10:49 AM, David Smiley (@MITRE.org) <DSMILEY@mitre.org> wrote:
> >> >>>> >
> >> >>>> >> Steve,
> >> >>>> >> (1) Can you give a specific example of how you are specifying the
> >> >>>> >> spatial query?  I'm looking to ensure you are not using "IsWithin",
> >> >>>> >> which is not meant for point data.  If your query shape is a circle or
> >> >>>> >> the bounding box of a circle, you should use the geofilt query parser,
> >> >>>> >> otherwise use the quirky syntax that allows you to specify the spatial
> >> >>>> >> predicate with "Intersects".
> >> >>>> >> (2) Do you actually need JTS?  i.e. are you using Polygons, etc.
> >> >>>> >> (3) How "dense" would you estimate the data is at the 50m resolution
> >> >>>> >> you've configured the data?  If it's very dense then I'll tell you how
> >> >>>> >> to raise the "prefix grid scan level" to a # closer to max-levels.
> >> >>>> >> (4) Do all of your searches find less than a million points,
> >> >>>> >> considering all filters?  If so then it's worth comparing the results
> >> >>>> >> with LatLonType.
> >> >>>> >>
> >> >>>> >> ~ David Smiley
> >> >>>> >>
> >> >>>> >>
> >> >>>> >> Steven Bower wrote
> >> >>>> >> > @Erick it is a lot of hw, but basically trying to create a "best case
> >> >>>> >> > scenario" to take HW out of the question. Will try increasing heap size
> >> >>>> >> > tomorrow.. I haven't seen it get close to the max heap size yet.. but
> >> >>>> >> > it's worth trying...
> >> >>>> >> >
> >> >>>> >> > Note that these queries look something like:
> >> >>>> >> >
> >> >>>> >> > q=*:*
> >> >>>> >> > fq=[date range]
> >> >>>> >> > fq=geo query
> >> >>>> >> >
> >> >>>> >> > On the fq for the geo query I've added {!cache=false} to prevent it from
> >> >>>> >> > ending up in the filter cache.. once it's in the filter cache, queries
> >> >>>> >> > come back in 10-20ms. For my use case I need the first unique geo search
> >> >>>> >> > query to come back in a more reasonable time, so I am currently ignoring
> >> >>>> >> > the cache.
> >> >>>> >> >
> >> >>>> >> > @Bill will look into that, I'm not certain it will support the
> >> >>>> >> > particular queries that are being executed but I'll investigate..
> >> >>>> >> >
> >> >>>> >> > steve
> >> >>>> >> >
> >> >>>> >> >
> >> >>>> >> > On Mon, Jul 29, 2013 at 6:25 PM, Erick Erickson <erickerickson@> wrote:
> >> >>>> >> >
> >> >>>> >> >> This is very strange. I'd expect slow queries on
> >> >>>> >> >> the first few queries while these caches were
> >> >>>> >> >> warmed, but after that I'd expect things to
> >> >>>> >> >> be quite fast.
> >> >>>> >> >>
> >> >>>> >> >> For a 12G index and 256G RAM, you have on the
> >> >>>> >> >> surface a LOT of hardware to throw at this problem.
> >> >>>> >> >> You can _try_ giving the JVM, say, 18G but that
> >> >>>> >> >> really shouldn't be a big issue, your index files
> >> >>>> >> >> should be MMapped.
> >> >>>> >> >>
> >> >>>> >> >> Let's try the crude thing first and give the JVM
> >> >>>> >> >> more memory.
> >> >>>> >> >>
> >> >>>> >> >> FWIW
> >> >>>> >> >> Erick
> >> >>>> >> >>
> >> >>>> >> >> On Mon, Jul 29, 2013 at 4:45 PM, Steven Bower <smb-apache@> wrote:
> >> >>>> >> >> > I've been doing some performance analysis of a spatial search use case
> >> >>>> >> >> > I'm implementing in Solr 4.3.0. Basically I'm seeing search times a lot
> >> >>>> >> >> > higher than I'd like them to be and I'm hoping people may have some
> >> >>>> >> >> > suggestions for how to optimize further.
> >> >>>> >> >> >
> >> >>>> >> >> > Here are the specs of what I'm doing now:
> >> >>>> >> >> >
> >> >>>> >> >> > Machine:
> >> >>>> >> >> > - 16 cores @ 2.8GHz
> >> >>>> >> >> > - 256GB RAM
> >> >>>> >> >> > - 1TB (RAID 1+0 on 10 SSD)
> >> >>>> >> >> >
> >> >>>> >> >> > Content:
> >> >>>> >> >> > - 45M docs (not very big, only a few fields with no large textual
> >> >>>> >> >> >   content)
> >> >>>> >> >> > - 1 geo field (using config below)
> >> >>>> >> >> > - index is 12GB
> >> >>>> >> >> > - 1 shard
> >> >>>> >> >> > - Using MMapDirectory
> >> >>>> >> >> >
> >> >>>> >> >> > Field config:
> >> >>>> >> >> >
> >> >>>> >> >> > <fieldType name="geo" class="solr.SpatialRecursivePrefixTreeFieldType"
> >> >>>> >> >> >     distErrPct="0.025" maxDistErr="0.00045"
> >> >>>> >> >> >     spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
> >> >>>> >> >> >     units="degrees"/>
> >> >>>> >> >> >
> >> >>>> >> >> > <field name="geopoint" indexed="true" multiValued="false"
> >> >>>> >> >> >     required="false" stored="true" type="geo"/>
> >> >>>> >> >> >
> >> >>>> >> >> >
> >> >>>> >> >> > What I've figured out so far:
> >> >>>> >> >> >
> >> >>>> >> >> > - Most of my time (98%) is being spent in
> >> >>>> >> >> > java.nio.Bits.copyToByteArray(long,Object,long,long) which is being
> >> >>>> >> >> > driven by
> >> >>>> >> >> > BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()
> >> >>>> >> >> > which from what I gather is basically reading terms from the .tim
> >> >>>> >> >> > file in blocks
> >> >>>> >> >> >
> >> >>>> >> >> > - I moved from Java 1.6 to 1.7 based upon what I read here:
> >> >>>> >> >> > http://blog.vlad1.com/2011/10/05/looking-at-java-nio-buffer-performance/
> >> >>>> >> >> > and it definitely had some positive impact (I haven't been able to
> >> >>>> >> >> > measure this independently yet)
> >> >>>> >> >> >
> >> >>>> >> >> > - I changed maxDistErr from 0.000009 (which is 1m precision per the
> >> >>>> >> >> > docs) to 0.00045 (50m precision) ..
> >> >>>> >> >> >
> >> >>>> >> >> > - It looks to me that the .tim files are being memory mapped fully
> >> >>>> >> >> > (i.e. they show up in pmap output); the virtual size of the JVM is
> >> >>>> >> >> > ~18gb (heap is 6gb)
> >> >>>> >> >> >
> >> >>>> >> >> > - I've optimized the index but this doesn't have a dramatic impact on
> >> >>>> >> >> > performance
> >> >>>> >> >> >
> >> >>>> >> >> > Changing the precision and the JVM upgrade yielded a drop from ~18s
> >> >>>> >> >> > avg query time to ~9s avg query time.. This is fantastic but I want to
> >> >>>> >> >> > get this down into the 1-2 second range.
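
As a sanity check on the two maxDistErr values quoted above, here is a small sketch converting degrees to approximate ground distance. It assumes a spherical Earth with a mean radius of about 6,371 km and measures along a great circle (for example, along a meridian), so treat the result as an approximation rather than what Solr computes internally.

```python
import math

# Assumption: spherical Earth, mean radius in meters.
EARTH_MEAN_RADIUS_M = 6_371_008.8


def degrees_to_meters(degrees: float) -> float:
    """Arc length in meters subtended by `degrees` along a great circle."""
    return math.radians(degrees) * EARTH_MEAN_RADIUS_M


for max_dist_err in (0.000009, 0.00045):
    meters = degrees_to_meters(max_dist_err)
    print(f"maxDistErr={max_dist_err} is roughly {meters:.1f} m")
```

Under these assumptions the two settings work out to roughly 1 m and 50 m, matching the figures in the list above.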
> >> >>>> >> >> >
> >> >>>> >> >> > At this point it seems that I am basically bottlenecked on
> >> >>>> >> >> > copying memory out of the mapped .tim file, which leads me to think
> >> >>>> >> >> > that the only solution to my problem would be to read less data or
> >> >>>> >> >> > somehow read it more efficiently..
> >> >>>> >> >> >
> >> >>>> >> >> > If anyone has any suggestions of where to go with this I'd love to
> >> >>>> >> >> > know
> >> >>>> >> >> >
> >> >>>> >> >> >
> >> >>>> >> >> > thanks,
> >> >>>> >> >> >
> >> >>>> >> >> > steve
> >> >>>> >> >>
> >> >>>> >>
> >> >>>> >>
> >> >>>> >>
> >> >>>> >>
> >> >>>> >>
> >> >>>> >> -----
> >> >>>> >>  Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
> >> >>>> >> --
> >> >>>> >> View this message in context:
> >> >>>> >> http://lucene.472066.n3.nabble.com/Performance-question-on-Spatial-Search-tp4081150p4081309.html
> >> >>>> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >> >>>> >>
> >> >>>>
> >> >>>>
> >> >>>
> >> >>
> >>
> >>
> >
> >
> >--
> >- Luis Cappa
>
>


-- 
- Luis Cappa
