lucene-solr-user mailing list archives

From Harald Kirsch <Harald.Kir...@raytion.com>
Subject Re: How to search for phrase "IAE_UPC_0001"
Date Mon, 04 Aug 2014 06:33:06 GMT
This all depends on how the tokenizers take your URLs apart. To quickly 
see what ended up in the index, go to a core in the UI, select Schema 
Browser, select the field containing your URLs, click on "Load Term Info".

In your case, for the field holding the URL you could try to switch to a 
tokenizer that defines tokens as a sequence of alphanumeric characters, 
roughly [a-z0-9]+ plus diacritics. In particular punctuation and 
separation characters like dash, underscore, slash, dot and the like 
would never be part of a token, i.e. they don't make a difference.
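
As a rough illustration of that tokenization rule, here is a small Python
sketch. It is not Solr's tokenizer; the function name `tokenize`, the
regular expression and the lowercasing step are assumptions made for this
example only.

```python
import re

# Hypothetical stand-in for an "alphanumeric only" tokenizer: a token is a
# run of letters (including accented ones) and digits. The underscore is
# explicitly excluded, so it separates tokens just like a dash or a slash.
TOKEN_RE = re.compile(r"[^\W_]+")

def tokenize(text):
    """Split text into lowercase alphanumeric tokens (lowercasing is an assumption)."""
    return [t.lower() for t in TOKEN_RE.findall(text)]

print(tokenize("http://host/docs/IAE_UPC_0001.pdf"))
print(tokenize("IAE-UPC-0001"))
```

With a rule like this, both spellings of the reference end up as the same
three tokens, which is what makes the phrase query independent of the
delimiter.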

Then you can search the url parts with a phrase query 
(https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser#TheStandardQueryParser-SpecifyingTermsfortheStandardQueryParserwhich)

like

  url:"IAE-UPC-0001"

At query time, just as during indexing, the dashes are dropped, leaving
three tokens, namely IAE, UPC and 0001. In addition, they have to appear
in that order. Naturally this will then match any of:

   "IAE_UPC_0001"
   "IAE UPC 0001"
   "IAE/UPC+0001"
   "IAE\UPC\0001"
   "IAE.UPC,0001"

Depending on how your URLs are structured, there is the chance for false 
positives, of course.
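
The matching behaviour and the false-positive risk can be sketched in a
few lines of Python. This is a simplified model, not how Lucene
evaluates phrase queries internally; `tokenize`, `phrase_match` and the
example URLs are hypothetical names and data invented for illustration.

```python
import re

def tokenize(text):
    """Assumed tokenizer: lowercase runs of letters and digits, no underscore."""
    return [t.lower() for t in re.findall(r"[^\W_]+", text)]

def phrase_match(query, document):
    """True if the query's tokens occur consecutively and in order in the document."""
    q, d = tokenize(query), tokenize(document)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

variants = ["IAE_UPC_0001", "IAE UPC 0001", "IAE/UPC+0001",
            r"IAE\UPC\0001", "IAE.UPC,0001"]
for v in variants:
    print(v, phrase_match("IAE-UPC-0001", v))

# Possible false positive: the same tokens spread over path segments.
print(phrase_match("IAE-UPC-0001", "http://host/iae/upc/0001/index.html"))

# Order matters: the same tokens in a different order do not match.
print(phrase_match("IAE-UPC-0001", "http://host/UPC_IAE_0001.pdf"))
```

Every variant in the list matches, the path-segment URL matches as well
(the false positive), and the reordered reference does not.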

The Really Good Thing here is that you don't need to use wildcards.

I have not yet looked at the wildcard query implementation in
Solr/Lucene, but with the commercial search engines I know, wildcards
are a great way to lose the confidence of your users, because they just
don't work as expected by anyone who does not know the implementation.
Either they deliver only partial results, or they kill the performance,
or they even go OOM. Unless the Solr committers have done something
really ingenious, Solr/Lucene has the same problems.

Harald.





On 31.07.2014 18:31, Paul Rogers wrote:
> Hi Guys
>
> I have a Solr application searching on data uploaded by Nutch.  The search
> I wish to carry out is for a particular document reference contained within
> the "url" field, e.g. IAE-UPC-0001.
>
> The problem is that the file names that comprise the urls are not
> consistent, so a url might contain the reference as IAE-UPC-0001 or
> IAE_UPC_0001 (ie using either the minus or underscore as the delimiter) but
> not both.
>
> I have created the query (in the solr admin interface):
>
> url:"IAE-UPC-0001"
>
> which works (returning the single expected document), as do:
>
> url:"IAE*UPC*0001"
> url:"IAE?UPC?0001"
>
> when the doc ref is in the format IAE-UPC-0001 (ie using the minus sign as
> a delimiter).
>
> However:
>
> url:"IAE_UPC_0001"
> url:"IAE*UPC*0001"
> url:"IAE?UPC?0001"
>
> do not work (returning zero documents) when the doc ref is in the format
> IAE_UPC_0001 (ie using the underscore character as the delimiter).
>
> I'm assuming the underscore is a special character but have tried looking
> at the solr wiki but can't find anything to say what the problem is.  Also
> the minus sign also has a specific meaning but is nullified by adding the
> quotes.
>
> Can anyone suggest what I'm doing wrong?
>
> Many thanks
>
> Paul
>

-- 
Harald Kirsch
Raytion GmbH
Kaiser-Friedrich-Ring 74
40547 Duesseldorf
Fon +49 211 53883-216
Fax +49-211-550266-19
http://www.raytion.com
