lucene-solr-user mailing list archives

From Geert-Jan Brits <gbr...@gmail.com>
Subject Re: Searching across multiple repeating fields
Date Tue, 22 Jun 2010 14:44:45 GMT
Perhaps my answer is useless, because I don't have an answer to your direct
question, but:
You *might* want to consider whether your concept of a Solr document is at
the right level of granularity, i.e.:

The problem you posted could be tackled (afaik) by defining a document as a
'sub-event' with only one daterange.
Each event doc you have now would then be replaced by several sub-event
docs.

Each sub-event doc also gets an extra field 'parent-eventid', which maps to
something like an event-id (which you're probably using already).
So several sub-event docs can point to the same event-id.

Lastly, all sub-event docs belonging to a particular event carry all the
other fields that you may have stored in that particular event doc.
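
As a rough illustration of that split, here is a minimal Python sketch. The dict layout and the names split_event, dateranges, daterange_start and daterange_end are made up for the example and are not part of any Solr API; only 'parent-eventid' comes from the proposal above.

```python
def split_event(event):
    """Turn one event (with several date ranges) into one sub-event doc per range.

    The input layout is hypothetical: a dict with an "id", a list of
    (start, end) tuples under "dateranges", and any other fields.
    """
    # Every field except the id and the ranges is copied to each sub-event.
    shared = {k: v for k, v in event.items() if k not in ("id", "dateranges")}
    docs = []
    for n, (start, end) in enumerate(event["dateranges"]):
        doc = dict(shared)
        doc["id"] = f"{event['id']}-{n}"      # unique id per sub-event
        doc["parent-eventid"] = event["id"]   # points back to the event
        doc["daterange_start"] = start
        doc["daterange_end"] = end
        docs.append(doc)
    return docs


event = {
    "id": "broadcast-1",
    "title": "News broadcast",
    "dateranges": [(19820402, 19820614), (19900000, 20000000)],
}
sub_events = split_event(event)
```

Each resulting dict would then be indexed as its own Solr document.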

Now you can query for events based on date ranges like you envisioned, but
instead of returning events you return sub-event docs. However, since all
data of the original event (except the multiple dateranges) is available in
the sub-event doc, this shouldn't really bother the client. If you need to
display all dates of an event (the only info missing from the returned
Solr doc), you could easily store them in an RDB and fetch them using the
parent-eventid.
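
To show why one daterange per doc avoids the false match from your example, here is a small Python sketch that only models the matching logic. It does not talk to Solr; the field names start and end and all function names are assumptions for illustration, and years are padded the way your field type does it (1985 becomes 19850000).

```python
# The two date ranges from the example document, one per sub-event doc.
sub_events = [
    {"id": "broadcast-1-0", "parent-eventid": "broadcast-1",
     "start": 19820402, "end": 19820614},
    {"id": "broadcast-1-1", "parent-eventid": "broadcast-1",
     "start": 19900000, "end": 20000000},
]


def matches(doc, query_date):
    """Range check against a single sub-event: start < query_date < end."""
    return doc["start"] < query_date < doc["end"]


def search(docs, query_date):
    """Return the ids of the sub-events whose own range contains the date."""
    return [d["id"] for d in docs if matches(d, query_date)]


def multivalued_match(docs, query_date):
    """Mimic the original problem: *any* start below and *any* end above."""
    return (any(d["start"] < query_date for d in docs)
            and any(d["end"] > query_date for d in docs))


hits_1985 = search(sub_events, 19850000)              # no sub-event matches
hits_1995 = search(sub_events, 19950000)              # only the 1990-2000 one
false_hit = multivalued_match(sub_events, 19850000)   # True: the unwanted match
```

With one range per doc, the start and end being compared always belong to the same range, so 1985 no longer slips through.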

The only caveat I see is that multiple sub-events with the same
'parent-eventid' might get returned for a particular query.
This, however, depends on the type of queries you envision, i.e.:
1)  If you always issue queries with date filters, and *assuming* that
sub-events of a particular event don't temporally overlap, you will never
get multiple sub-events returned.
2)  If 1) doesn't hold, and assuming you *do* mind multiple sub-events of
the same actual event, you could try to use Field Collapsing on
'parent-eventid' to return only the first sub-event per parent-eventid that
matches the rest of your query. (Note, however, that Field Collapsing is
only available as a patch at the moment:
http://wiki.apache.org/solr/FieldCollapsing)
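
If you'd rather not apply the patch, the same effect can be approximated in the application layer. The Python sketch below is only a client-side stand-in for what collapsing on 'parent-eventid' would give you; it is not the Field Collapsing patch's API, and the function name collapse and the sample result list are invented for the example.

```python
def collapse(docs, field="parent-eventid"):
    """Keep only the first doc seen for each value of `field`, in result order."""
    seen = set()
    kept = []
    for doc in docs:
        key = doc[field]
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


# Hypothetical result list in the order the search returned it.
results = [
    {"id": "e1-0", "parent-eventid": "e1"},
    {"id": "e2-0", "parent-eventid": "e2"},
    {"id": "e1-1", "parent-eventid": "e1"},
]
collapsed = collapse(results)
```

Keep in mind that collapsing after the fact changes how many results a page holds, so paging and hit counts would need extra care.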

Not sure if this helped you at all, but at the very least it was a nice
conceptual exercise ;-)

Cheers,
Geert-Jan


2010/6/22 Mark Allan <mark.allan@ed.ac.uk>

> Hi all,
>
> Firstly, I apologise for the length of this email but I need to describe
> properly what I'm doing before I get to the problem!
>
> I'm working on a project just now which requires the ability to store and
> search on temporal coverage data - ie. a field which specifies a date range
> during which a certain event took place.
>
> I hunted around for a few days and couldn't find anything which seemed to
> fit, so I had a go at writing my own field type based on solr.PointType.
>  It's used as follows:
>  schema.xml
>        <fieldType name="temporal" class="solr.TemporalCoverage" dimension="2" subFieldSuffix="_i"/>
>        <field name="daterange" type="temporal" indexed="true" stored="true" multiValued="true"/>
>  data.xml
>        <add>
>        <doc>
>        ...
>        <field name="daterange">1940,1945</field>
>        </doc>
>        </add>
>
> Internally, this gets stored as:
>    <arr name="daterange"><str>1940,1945</str></arr>
>    <int name="daterange_0_i">19400000</int>
>    <int name="daterange_1_i">19450000</int>
>
> In due course, I'll declare the subfields as a proper date type, but in the
> meantime, this works absolutely fine.  I can search for an individual date
> and Solr will check (queryDate > daterange_0 AND queryDate < daterange_1 )
> and the correct documents are returned.  My code also allows the user to
> input a date range in the query but I won't complicate matters with that
> just now!
>
> The problem arises when a document has more than one "daterange" field
> (imagine a news broadcast which covers a variety of topics and hence time
> periods).
>
> A document with two daterange fields
>        <doc>
>        ...
>        <field name="daterange">19820402,19820614</field>
>        <field name="daterange">1990,2000</field>
>        </doc>
> gets stored internally as
>    <arr name="daterange"><str>19820402,19820614</str><str>1990,2000</str></arr>
>    <arr name="daterange_0_i"><int>19820402</int><int>19900000</int></arr>
>    <arr name="daterange_1_i"><int>19820614</int><int>20000000</int></arr>
>
> In this situation, searching for 1985 should yield zero results as it is
> contained within neither daterange, however, the above document is returned
> in the result set.  What Solr is doing is checking that the queryDate (1985)
> is greater than *any* of the values in daterange_0 AND queryDate is less
> than *any* of the values in daterange_1.
>
> How can I get Solr to respect the positions of each item in the daterange_0
> and _1 arrays?  Ideally I'd like the search to use the following logic, thus
> preventing the above document from being returned in a search for 1985:
>        (queryDate > daterange_0[0] AND queryDate < daterange_1[0]) OR
>        (queryDate > daterange_0[1] AND queryDate < daterange_1[1])
>
> Someone else had a very similar problem recently on the mailing list with a
> multiValued PointType field but the thread went cold without a final
> solution.
>
> While I could filter the results when they get back to my application
> layer, it seems like it's not really the right place to do it.
>
> Any help getting Solr to respect the positions of items in arrays would be
> very gratefully received.
>
> Many thanks,
> Mark
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
>
