lucene-solr-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: How to search for phrase "IAE_UPC_0001"
Date Mon, 04 Aug 2014 22:09:48 GMT
The standard tokenizer treats underscore as a valid token character, not a 
delimiter.

The word delimiter filter will treat underscore as a delimiter though.
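
As a rough illustration of the difference (a Python sketch, not Solr code; the two
helper functions are hypothetical stand-ins that only approximate the real
analyzers with regular expressions):

```python
import re

def standard_like_tokens(text):
    # Approximation of the standard tokenizer for this case only:
    # letters, digits and underscore stay in one token; hyphen splits.
    return re.findall(r"[A-Za-z0-9_]+", text)

def word_delimiter_like_parts(token):
    # Approximation of the word delimiter filter: split on anything
    # that is not a letter or digit, including underscore.
    return [part for part in re.split(r"[^A-Za-z0-9]+", token) if part]

for ref in ("IAE-UPC-0001", "IAE_UPC_0001"):
    tokens = standard_like_tokens(ref)
    parts = [p for t in tokens for p in word_delimiter_like_parts(t)]
    print(ref, "tokenizer:", tokens, "after delimiter split:", parts)
```

In this sketch the hyphenated form is already three tokens after tokenizing,
while the underscore form stays a single token until the delimiter split runs.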

Make sure your query-time WDF does not have preserveOriginal="1" - but the 
index-time WDF should have preserveOriginal="1". Otherwise, the query phrase 
will generate an extra token which will participate in the matching and 
might cause a mismatch.
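
A deliberately naive Python sketch of why the extra token matters. It treats the
token stream as a flat list and ignores Lucene token positions, so it is only a
model of the idea; wdf_like and contains_run are hypothetical helpers, not Solr
APIs:

```python
import re

def wdf_like(token, preserve_original):
    # Split on non-alphanumeric characters; optionally keep the
    # unsplit token as well, loosely mimicking preserveOriginal="1".
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
    if preserve_original and parts != [token]:
        return [token] + parts
    return parts

def contains_run(indexed, query):
    # True if the query tokens appear consecutively, in order,
    # somewhere in the indexed token list.
    n = len(query)
    return any(indexed[i:i + n] == query for i in range(len(indexed) - n + 1))

ref = "IAE_UPC_0001"
index_with_original = wdf_like(ref, preserve_original=True)
index_without_original = wdf_like(ref, preserve_original=False)
query_plain = wdf_like(ref, preserve_original=False)
query_with_extra = wdf_like(ref, preserve_original=True)

# Recommended setup: original kept at index time, not at query time.
print(contains_run(index_with_original, query_plain))
# Extra query token that has no counterpart in the indexed tokens.
print(contains_run(index_without_original, query_with_extra))
```

In this model the first check succeeds and the second fails, because the query
carries a token that the indexed list does not contain.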

-- Jack Krupansky

-----Original Message----- 
From: Paul Rogers
Sent: Monday, August 4, 2014 5:55 PM
To: solr-user@lucene.apache.org
Subject: Re: How to search for phrase "IAE_UPC_0001"

Hi Guys

Thanks for the replies.  I've had a look at the WordDelimiterFilterFactory
and the Term Info for the url field.  It seems that all the terms exist and
I now understand that each url is being broken up using the delimiters
specified.  But I think I'm still missing something.

Am I correct in assuming the minus sign (-) is also a delimiter?

If so why then does  url:"IAE-UPC-0001" return a result (when the url
contains the substring IAE-UPC-0001) whereas  url:"IAE_UPC_0001" doesn't
(when the url contains the substring IAE_UPC_0001)?

Secondly if the url has indeed been broken into the terms IAE UPC and 0001
why do all the searches suggested or tried succeed when the delimiter is a
minus sign (-) but not when the delimiter is an underscore (_), returning
zero matches?

Finally, shouldn't the query url:"IAE UPC 0001"~1 work since all it is
looking for is the three terms?

Many thanks for any enlightenment.

P




On 4 August 2014 01:33, Harald Kirsch <Harald.Kirsch@raytion.com> wrote:

> This all depends on how the tokenizers take your URLs apart. To quickly
> see what ended up in the index, go to a core in the UI, select Schema
> Browser, select the field containing your URLs, click on "Load Term Info".
>
> In your case, for the field holding the URL you could try to switch to a
> tokenizer that defines tokens as a sequence of alphanumeric characters,
> roughly [a-z0-9]+ plus diacritics. In particular punctuation and
> separation characters like dash, underscore, slash, dot and the like
> would never be part of a token, i.e. they don't make a difference.
>
> Then you can search the url parts with a phrase query (
> https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser#TheStandardQueryParser-SpecifyingTermsfortheStandardQueryParserwhich
> ) like
>
>  url:"IAE-UPC-0001"
>
> In the same way as during indexing, the dashes are removed to end up with
> three tokens, namely IAE, UPC and 0001. Further they have to be in that
> order. Naturally this will then match anything like:
>
>   "IAE_UPC_0001"
>   "IAE UPC 0001"
>   "IAE/UPC+0001"
>   "IAE\UPC\0001"
>   "IAE.UPC,0001"
>
> Depending on how your URLs are structured, there is the chance for false
> positives, of course.
>
> The Really Good Thing here is that you don't need to use wildcards.
>
> I have not yet looked at the wildcard-queries implementation in
> Solr/Lucene, but with the commercial search engines I know, they are a
> great way to lose the confidence of your users, because they just don't
> work as expected by anyone not knowing the implementation. Either they
> deliver only partial results or they kill the performance or they even go
> OOM. If Solr committers have not done something really ingenious,
> Solr/Lucene does have the same problems.
>
> Harald.
>
>
>
>
>
>
> On 31.07.2014 18:31, Paul Rogers wrote:
>
>> Hi Guys
>>
>> I have a Solr application searching on data uploaded by Nutch.  The
>> search I wish to carry out is for a particular document reference
>> contained within the "url" field, e.g. IAE-UPC-0001.
>>
>> The problem is that the file names that make up the URLs are not
>> consistent, so a url might contain the reference as IAE-UPC-0001 or
>> IAE_UPC_0001 (i.e. using either the minus or underscore as the
>> delimiter) but not both.
>>
>> I have created the query (in the solr admin interface):
>>
>> url:"IAE-UPC-0001"
>>
>> which works (returning the single expected document), as do:
>>
>> url:"IAE*UPC*0001"
>> url:"IAE?UPC?0001"
>>
>> when the doc ref is in the format IAE-UPC-0001 (ie using the minus sign
>> as a delimiter).
>>
>> However:
>>
>> url:"IAE_UPC_0001"
>> url:"IAE*UPC*0001"
>> url:"IAE?UPC?0001"
>>
>> do not work (returning zero documents) when the doc ref is in the format
>> IAE_UPC_0001 (ie using the underscore character as the delimiter).
>>
>> I'm assuming the underscore is a special character. I have looked at the
>> solr wiki but can't find anything to say what the problem is. The minus
>> sign also has a specific meaning, but that is nullified by adding the
>> quotes.
>>
>> Can anyone suggest what I'm doing wrong?
>>
>> Many thanks
>>
>> Paul
>>
>>
> --
> Harald Kirsch
> Raytion GmbH
> Kaiser-Friedrich-Ring 74
> 40547 Duesseldorf
> Fon +49 211 53883-216
> Fax +49-211-550266-19
> http://www.raytion.com
> 

