lucene-solr-user mailing list archives

From Mark Allan <mark.al...@ed.ac.uk>
Subject Searching across multiple repeating fields
Date Tue, 22 Jun 2010 13:53:06 GMT
Hi all,

Firstly, I apologise for the length of this email but I need to  
describe properly what I'm doing before I get to the problem!

I'm working on a project which requires the ability to store and  
search on temporal coverage data, i.e. a field which specifies a  
date range during which a certain event took place.

I hunted around for a few days and couldn't find anything which seemed  
to fit, so I had a go at writing my own field type based on  
solr.PointType.  It's used as follows:
   schema.xml
	<fieldType name="temporal" class="solr.TemporalCoverage" dimension="2" subFieldSuffix="_i"/>
	<field name="daterange" type="temporal" indexed="true" stored="true" multiValued="true"/>
   data.xml
	<add>
	<doc>
	...
	<field name="daterange">1940,1945</field>
	</doc>
	</add>

Internally, this gets stored as:
     <arr name="daterange"><str>1940,1945</str></arr>
     <int name="daterange_0_i">19400000</int>
     <int name="daterange_1_i">19450000</int>
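
As a rough illustration of that mapping (a Python sketch only, not the
actual Java field type; encode_date and encode_range are hypothetical
names, and right-padding with zeros to eight digits is an assumption
inferred from the stored values shown above):

```python
def encode_date(value):
    """Right-pad a partial date (e.g. a bare year) with zeros to YYYYMMDD width."""
    return int(value.strip().ljust(8, "0"))


def encode_range(raw):
    """Split a 'start,end' string into the two integer subfield values."""
    start, end = raw.split(",")
    return encode_date(start), encode_date(end)


print(encode_range("1940,1945"))
```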

In due course, I'll declare the subfields as a proper date type, but  
in the meantime, this works absolutely fine.  I can search for an  
individual date and Solr will check (queryDate > daterange_0 AND  
queryDate < daterange_1 ) and the correct documents are returned.  My  
code also allows the user to input a date range in the query but I  
won't complicate matters with that just now!

The problem arises when a document has more than one "daterange" field  
(imagine a news broadcast which covers a variety of topics and hence  
time periods).

A document with two daterange fields
	<doc>
	...
	<field name="daterange">19820402,19820614</field>
	<field name="daterange">1990,2000</field>
	</doc>
gets stored internally as
     <arr name="daterange"><str>19820402,19820614</str><str>1990,2000</str></arr>
     <arr name="daterange_0_i"><int>19820402</int><int>19900000</int></arr>
     <arr name="daterange_1_i"><int>19820614</int><int>20000000</int></arr>

In this situation, searching for 1985 should yield zero results, as it  
falls within neither date range.  However, the above document is  
returned in the result set.  What Solr is doing is checking that the  
queryDate (1985) is greater than *any* of the values in daterange_0  
AND that queryDate is less than *any* of the values in daterange_1.
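
The false match can be reproduced outside Solr with a small Python
sketch.  This only models the behaviour described above and is not
Solr code; matches_independently is a hypothetical name, and 1985 is
assumed to be encoded as 19850000 in the same way as the stored
values:

```python
# Values of the two multi-valued subfields for the example document.
starts = [19820402, 19900000]  # daterange_0_i
ends = [19820614, 20000000]    # daterange_1_i


def matches_independently(query_date, starts, ends):
    """Each clause may be satisfied by any value of its own subfield."""
    after_some_start = any(query_date > s for s in starts)
    before_some_end = any(query_date < e for e in ends)
    return after_some_start and before_some_end


# 19850000 is after the 1982 start and before the 2000 end, so the
# document matches even though neither range contains 1985.
print(matches_independently(19850000, starts, ends))  # True
```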

How can I get Solr to respect the positions of each item in the  
daterange_0 and _1 arrays?  Ideally I'd like the search to use the  
following logic, thus preventing the above document from being  
returned in a search for 1985:
	(queryDate > daterange_0[0] AND queryDate < daterange_1[0]) OR
	(queryDate > daterange_0[1] AND queryDate < daterange_1[1])
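
Expressed as a Python sketch, the position-aware logic pairs each
start with the end at the same index.  This is only an illustration
of the desired semantics, not something Solr provides;
matches_pairwise is a hypothetical name, and the subfield values are
assumed to stay in the same order as the original daterange entries:

```python
# Values of the two multi-valued subfields for the example document.
starts = [19820402, 19900000]  # daterange_0_i
ends = [19820614, 20000000]    # daterange_1_i


def matches_pairwise(query_date, starts, ends):
    """Match only if one (start, end) pair at the same index contains the date."""
    return any(s < query_date < e for s, e in zip(starts, ends))


print(matches_pairwise(19850000, starts, ends))  # False: in neither range
print(matches_pairwise(19950000, starts, ends))  # True: inside 1990-2000
```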

Someone else had a very similar problem recently on the mailing list  
with a multiValued PointType field but the thread went cold without a  
final solution.

While I could filter the results when they get back to my application  
layer, that doesn't seem like the right place to do it.

Any help getting Solr to respect the positions of items in arrays  
would be very gratefully received.

Many thanks,
Mark


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

