lucene-solr-user mailing list archives

From Mark Allan <mark.al...@ed.ac.uk>
Subject Re: Searching across multiple repeating fields
Date Wed, 23 Jun 2010 08:52:03 GMT
Cheers, Geert-Jan, that's very helpful.

We won't always be searching with dates and we wouldn't want  
duplicates to show up in the results, so your second suggestion looks  
like a good workaround if I can't solve the actual problem.  I didn't  
know about FieldCollapsing, so I'll definitely keep it in mind.

Thanks
Mark

On 22 Jun 2010, at 3:44 pm, Geert-Jan Brits wrote:

> Perhaps my answer is useless, because I don't have an answer to your
> direct question, but you *might* want to consider whether your concept
> of a Solr document is at the right level of granularity, i.e.:
>
> The problem you posted could be tackled (afaik) by defining a document
> as a 'sub-event' with only one daterange. Each event-doc you have now
> would then be replaced by several sub-event docs.
>
> Additionally, each sub-event doc gets an extra field 'parent-eventid',
> which maps to something like an event-id (which you're probably using).
> So several sub-event docs can point to the same event-id.
>
> Lastly, all sub-event docs belonging to a particular event carry all the
> other fields that you may have stored in that particular event-doc.
>
> Now you can query for events based on date ranges like you envisioned,
> but instead of returning events you return sub-event docs. However,
> since all data of the original event (except the multiple dateranges)
> is available in the sub-event doc, this shouldn't really bother the
> client. If you need to display all dates of an event (the only info
> missing from the returned Solr doc) you could easily store them in an
> RDB and fetch them using the defined parent-eventid.
>
> The only caveat I see is that multiple sub-events with the same
> 'parent-eventid' might get returned for a particular query.
> This, however, depends on the type of queries you envision, i.e.:
> 1)  If you always issue queries with date filters, and *assuming* that
> sub-events of a particular event don't temporally overlap, you will
> never get multiple sub-events returned.
> 2)  If 1) doesn't hold, and assuming you *do* mind multiple sub-events
> of the same actual event, you could try to use Field Collapsing on
> 'parent-eventid' to return only the first sub-event per parent-eventid
> that matches the rest of your query. (Note, however, that Field
> Collapsing is a patch at the moment:
> http://wiki.apache.org/solr/FieldCollapsing)
>
> Not sure if this helped you at all, but at the very least it was a nice
> conceptual exercise ;-)
>
> Cheers,
> Geert-Jan
>
>
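For anyone following the thread, here is a rough sketch of the sub-event
layout described above, plus a client-side stand-in for collapsing on
'parent-eventid'. It is plain Python and does not call Solr or the Field
Collapsing patch. The "start" and "end" field names, the "ev1-0" style
sub-event ids and both helper functions are hypothetical and only there
for illustration.

```python
from typing import Dict, List, Tuple


def split_into_sub_events(event: Dict, ranges: List[Tuple[int, int]]) -> List[Dict]:
    """Return one sub-event doc per date range, repeating the shared fields."""
    docs = []
    for n, (start, end) in enumerate(ranges):
        doc = dict(event)                     # every other event field is copied
        doc["parent-eventid"] = event["id"]   # points back to the original event
        doc["id"] = f"{event['id']}-{n}"      # hypothetical sub-event id scheme
        doc["start"] = start                  # hypothetical single-valued fields
        doc["end"] = end
        docs.append(doc)
    return docs


def collapse_by_parent(hits: List[Dict]) -> List[Dict]:
    """Keep only the first hit per parent-eventid, preserving result order."""
    seen = set()
    collapsed = []
    for hit in hits:
        if hit["parent-eventid"] not in seen:
            seen.add(hit["parent-eventid"])
            collapsed.append(hit)
    return collapsed


event = {"id": "ev1", "title": "News broadcast"}
sub_events = split_into_sub_events(
    event, [(19820402, 19820614), (19900000, 20000000)]
)
```

Each sub-event doc then has exactly one start/end pair, so a date query
can no longer mix the start of one range with the end of another.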
> 2010/6/22 Mark Allan <mark.allan@ed.ac.uk>
>
>> Hi all,
>>
>> Firstly, I apologise for the length of this email, but I need to
>> describe properly what I'm doing before I get to the problem!
>>
>> I'm working on a project just now which requires the ability to store
>> and search on temporal coverage data, i.e. a field which specifies a
>> date range during which a certain event took place.
>>
>> I hunted around for a few days and couldn't find anything which seemed
>> to fit, so I had a go at writing my own field type based on
>> solr.PointType. It's used as follows:
>> schema.xml
>>       <fieldType name="temporal" class="solr.TemporalCoverage"
>>                  dimension="2" subFieldSuffix="_i"/>
>>       <field name="daterange" type="temporal" indexed="true"
>>              stored="true" multiValued="true"/>
>> data.xml
>>       <add>
>>       <doc>
>>       ...
>>       <field name="daterange">1940,1945</field>
>>       </doc>
>>       </add>
>>
>> Internally, this gets stored as:
>>   <arr name="daterange"><str>1940,1945</str></arr>
>>   <int name="daterange_0_i">19400000</int>
>>   <int name="daterange_1_i">19450000</int>
>>
>> In due course, I'll declare the subfields as a proper date type, but in
>> the meantime, this works absolutely fine.  I can search for an
>> individual date and Solr will check
>>       (queryDate > daterange_0 AND queryDate < daterange_1)
>> and the correct documents are returned.  My code also allows the user
>> to input a date range in the query, but I won't complicate matters with
>> that just now!
>>
>> The problem arises when a document has more than one "daterange" field
>> (imagine a news broadcast which covers a variety of topics and hence
>> time periods).
>>
>> A document with two daterange fields
>>       <doc>
>>       ...
>>       <field name="daterange">19820402,19820614</field>
>>       <field name="daterange">1990,2000</field>
>>       </doc>
>> gets stored internally as
>>   <arr name="daterange"><str>19820402,19820614</str><str>1990,2000</str></arr>
>>   <arr name="daterange_0_i"><int>19820402</int><int>19900000</int></arr>
>>   <arr name="daterange_1_i"><int>19820614</int><int>20000000</int></arr>
>>
>> In this situation, searching for 1985 should yield zero results as it
>> is contained within neither daterange; however, the above document is
>> returned in the result set.  What Solr is doing is checking that the
>> queryDate (1985) is greater than *any* of the values in daterange_0 AND
>> queryDate is less than *any* of the values in daterange_1.
>>
>> How can I get Solr to respect the positions of each item in the
>> daterange_0 and _1 arrays?  Ideally I'd like the search to use the
>> following logic, thus preventing the above document from being returned
>> in a search for 1985:
>>       (queryDate > daterange_0[0] AND queryDate < daterange_1[0]) OR
>>       (queryDate > daterange_0[1] AND queryDate < daterange_1[1])
>>
>> Someone else had a very similar problem recently on the mailing list
>> with a multiValued PointType field, but the thread went cold without a
>> final solution.
>>
>> While I could filter the results when they get back to my application
>> layer, it seems like it's not really the right place to do it.
>>
>> Any help getting Solr to respect the positions of items in arrays would
>> be very gratefully received.
>>
>> Many thanks,
>> Mark
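
To make the difference between the two matching rules concrete, here is
a small Python sketch. It is not Solr code and not a Solr-side fix; it
only reproduces the two semantics described in the original message,
using the example document from it. The function names are made up, and
padding a bare year to eight digits is an assumption based on the stored
values shown above (1940 becoming 19400000).

```python
from typing import List, Tuple


def to_int(date: str) -> int:
    """Pad a partial date such as '1985' to eight digits (19850000)."""
    return int(date.ljust(8, "0"))


def parse_ranges(values: List[str]) -> List[Tuple[int, int]]:
    """Turn 'start,end' strings into (start, end) integer pairs."""
    pairs = []
    for value in values:
        start, end = value.split(",")
        pairs.append((to_int(start), to_int(end)))
    return pairs


def matches_any_to_any(query: int, ranges: List[Tuple[int, int]]) -> bool:
    """Observed behaviour: starts and ends are tested independently."""
    starts = [s for s, _ in ranges]
    ends = [e for _, e in ranges]
    return any(query > s for s in starts) and any(query < e for e in ends)


def matches_pairwise(query: int, ranges: List[Tuple[int, int]]) -> bool:
    """Desired behaviour: a start is only tested against its own end."""
    return any(s < query < e for s, e in ranges)


ranges = parse_ranges(["19820402,19820614", "1990,2000"])
query = to_int("1985")
print(matches_any_to_any(query, ranges))  # True: the false positive
print(matches_pairwise(query, ranges))    # False: 1985 is in neither range
```

The any-to-any version matches because 1985 is later than the first
start (19820402) and earlier than the second end (20000000), even
though those two values belong to different ranges.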


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

