lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Smiley, David W." <dsmi...@mitre.org>
Subject Re: Geospatial search in Lucene/Solr
Date Tue, 28 Dec 2010 15:47:50 GMT
(I was emailing Grant RE geospatial search and it occurred to me I should be doing this on
the Lucene dev list so I've taken it there)

On Dec 28, 2010, at 8:31 AM, Grant Ingersoll wrote:
On Dec 27, 2010, at 11:52 PM, Smiley, David W. wrote:

Hi Grant.

I saw your latest blog post at Lucid Imagination in which you mentioned you were going to
work on adding polygon search.  FWIW, I finished polygon search last week to my code base
based onhttps://issues.apache.org/jira/browse/SOLR-2155 (geohash prefix implementation).

I've scanned this, but haven't had time to look in depth.

I used the JTS library to do the heavy lifting.  I’d be happy to release the code.  I’ve
iterated more on SOLR-2155 on my local project code and plan to re-integrate it with the patch
at some point.  There was almost 2 months of time in-between me releasing SOLR-2155 and me
having other priorities but I’m back at it.

Cool.  The problem w/ JTS is it is LGPL.

At least it's not GPL.  Can you simply use JTS and make it an optional library, like Solr
does for some other libs?  There's a lot of expertise in that library that's been refined
over the last 10 years.  Even if you refuse to touch anything non-Apache licensed, I highly
recommend you look through its code to see how geospatial point-in-polygon is efficiently
done.  It has a concept of a "prepared geometry" object that is optimized to make large numbers
of point-in-polygon queries more efficient.  It is implemented by putting the line segments
of the polygon in an in-memory R-tree index.  If you'd like me to point you at specific classes,
I'd be happy to.  Better yet, I could release an update to SOLR-2155 and you could debug step
through.  FWIW  I used a separate class for the polygon search that implements my GeoShape
interface.  If a user doesn't need to do a polygon search (which is not a common requirement
of geospatial search), then the JTS library need not ship with Lucene/Solr.

Presently, I’m working on Lucene’s benchmark contrib module to evaluate the performance
of SOLR-2155 compared to the LatLon type (i.e. a pair of lat-lon range queries), and then
I’ll work on a more efficient probably non-geohash implementation but based on the same
underlying concept of a hierarchical grid.  I’m using the geonames.org<http://geonames.org/>
data set.  Unfortunately, the benchmark code seems very oriented to a generic title-body document
whereas I’m looking to create lat-lon pairs… and furthermore to create documents containing
multiple lat-lon pairs, and even furthermore a query generator that generates random box queries
centered on a random location from the data set.  I seem to be stretching the benchmark framework
beyond the use-case it was designed for and so perhaps it won’t be committable but at least
I’ll have a patch for other geospatial birds-of-a-feather like you to use.

Stretch away.  The Title/Body orientation is just a relic of what we have done in the past,
it doesn't have to stay that way.

I am interested in your thoughts on evaluating the performance of geospatial queries.  Since
reading LIA2, the Lucene's benchmark contrib module seems like the ideal way to test Lucene.
I thought about programmatically generating points but I've warmed to the idea of using geonames.org<http://geonames.org>
data as the space of possible points.  Geonames has 7.5M unique lat-lon pairs[1], including
population.  In SOLR-2155, there are multiple points per document as a key feature.  For testing,
I want to be able to configure 1 point per document for comparison to algorithms that only
support that but I must support a random variable number of them too.  Consequently, each
place does NOT correspond 1-1 to a document.  The number of documents indexed should be configurable
which will be randomly generated based on randomly picking one or more points from the input
data set.  The number of points available from the input data set should be configurable too
(i.e. < 7.5M).   Assuming that a more populace place is more likely to be referenced than
a less populace one, I want to skew choosing a place weighted by the place's population. 
This creates a much more realistic skew of documents mapped to points than an evenly distributed
one, which is important.  On the query side, I'd like to generate random queries centered
on a random one of these points, with a radius that is either 1km, 10km, 100km, 1000km, or
10000km, in order to match a wide variety of documents.  For reporting, I'd like to see a
chart of response time vs number of documents matched.

I'm perhaps half-done with implementing all this. Because I need to randomly choose points
in the data set, I can't stream it in, I need to read it all into memory as a singleton object
used by both the indexing side, and the query side (since the queries need to pick a random
point).

[1] http://www.geonames.org/about.html

~ David Smiley

Mime
View raw message