lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: How to search for phrase "IAE_UPC_0001"
Date Mon, 18 Aug 2014 21:58:19 GMT
NP, glad you're making forward progress!

Erick

On Mon, Aug 18, 2014 at 12:31 PM, Paul Rogers <paul.rogers6@gmail.com> wrote:
> Hi Erick
>
> Thanks for the assist.  Did as you suggested (tho' I used Nutch).  Cleared
> out solr's index and Nutch's crawl DB and then emptied all the documents
> out of the web server bar 10 of each type (IAE-UPC-#### and IAE_UPC_####).
>  Then crawled the site using Nutch.
>
> Then confirmed that all 20 docs had been uploaded and that a *:* search
> returned all 20 docs.
>
> Now when I do a url search on either (for example) q=url:"IAE-UPC-220" or
> q="IAE_UPC_0001" I get a result returned for each, i.e. it now works as
> expected.
>
> So it seems I now need to figure out why Nutch isn't crawling the documents.
>
> Again many thanks.
>
> P
>
>
>
>
> On 18 August 2014 11:22, Erick Erickson <erickerickson@gmail.com> wrote:
>
>> I'd pull Nutch out of the mix here as a test. Create
>> some test docs (use the exampleDocs directory?) and
>> go from there at least long enough to ensure that Solr
>> does what you expect if the data gets there properly.
>>
>> You can set this up in about 10 minutes, and test it
>> in about 15 more. May save you endless hours.
>>
>> Because you're conflating two issues here:
>> 1> whether Nutch is sending the data
>> 2> whether Solr is indexing and searching as you expect.
>>
>> Some of the Solr/Lucene analysis chains do transformations
>> that may not be what you assume, particularly things
>> like StandardTokenizer and WordDelimiterFilterFactory.
>>
>> So I'd take the time to see that the values you're dealing
>> with are behaving as you expect. The admin/analysis page
>> will help you a _lot_ here.
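
A minimal sketch of the kind of direct check suggested above, assuming a local
Solr on the default port with a core named "collection1" (both are assumptions,
not details from this thread). It only builds the /select URLs to paste into a
browser or curl; it does not contact Solr.

```python
from urllib.parse import urlencode

# Assumed values for a local test install; adjust to your own setup.
SOLR_BASE = "http://localhost:8983/solr"
CORE = "collection1"  # hypothetical core name


def select_url(query, rows=10):
    """Build a /select URL for the given query string (no request is sent)."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{SOLR_BASE}/{CORE}/select?{params}"


# A match-all query first, then one phrase query per naming convention.
for q in ['*:*', 'url:"IAE-UPC-0001"', 'url:"IAE_UPC_0001"']:
    print(select_url(q))
```

If the match-all query does not return the test documents, the problem is on
the indexing side (issue 1 above) rather than in the analysis chain (issue 2).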
>>
>> Best,
>> Erick
>>
>>
>>
>>
>> On Mon, Aug 18, 2014 at 7:16 AM, Paul Rogers <paul.rogers6@gmail.com> wrote:
>> > Hi Guys
>> >
>> > I've been checking into this further and have deleted the index a couple
>> > of times and rebuilt it with the suggestions you've supplied.
>> >
>> > I had a bit of an epiphany last week and decided to check if the document
>> > I was searching for was actually in the index (did this by writing the
>> > results of a *:* query to a file and grep'ing for the 'IAE_UPC_0001'
>> > string).  It seems it isn't!!  Not sure if it was in the original index
>> > or not, tho' I suspect not.
>> >
>> > As far as I can see anything with the reference in the form IAE_UPC_####
>> > has not been indexed while those with the reference in the form
>> > IAE-UPC-#### have.  Not sure if that's a coincidence or not.
>> >
>> > Need to see if I can get the docs into the index and then check if the
>> > search works or not.  Will see if the guys on the Nutch list can shed any
>> > light.
>> >
>> > All the best.
>> >
>> > P
>> >
>> >
>> > On 4 August 2014 17:09, Jack Krupansky <jack@basetechnology.com> wrote:
>> >
>> >> The standard tokenizer treats underscore as a valid token character, not
>> >> a delimiter.
>> >>
>> >> The word delimiter filter will treat underscore as a delimiter though.
>> >>
>> >> Make sure your query-time WDF does not have preserveOriginal="1" - but
>> >> the index-time WDF should have preserveOriginal="1". Otherwise, the query
>> >> phrase will generate an extra token which will participate in the
>> >> matching and might cause a mismatch.
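
A rough Python sketch of the behaviour described above. It is not Solr code:
standard_tokenize and word_delimiter are hypothetical stand-ins that only mimic
the underscore/hyphen handling, and the real WordDelimiterFilter also splits on
case changes and letter/digit transitions, which this ignores.

```python
import re


def standard_tokenize(text):
    # Stand-in for StandardTokenizer: underscore is a token character,
    # hyphen is not, so only the hyphenated form is split here.
    return re.findall(r"\w+", text)


def word_delimiter(token, preserve_original=False):
    # Simplified stand-in for the word delimiter filter: split on anything
    # that is not a letter or digit (so underscore is a delimiter).
    parts = re.findall(r"[A-Za-z0-9]+", token)
    if preserve_original and parts != [token]:
        # The unsplit token is emitted in addition to its parts.
        return [token] + parts
    return parts


def analyze(text, preserve_original=False):
    out = []
    for tok in standard_tokenize(text):
        out.extend(word_delimiter(tok, preserve_original))
    return out


print(analyze("IAE-UPC-0001"))
print(analyze("IAE_UPC_0001"))
print(analyze("IAE_UPC_0001", preserve_original=True))
```

In this simplified model both forms end up as the same three parts, and
preserving the original adds one extra token for the underscore form only,
which is the extra query-time token warned about above.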
>> >>
>> >> -- Jack Krupansky
>> >>
>> >> -----Original Message----- From: Paul Rogers
>> >> Sent: Monday, August 4, 2014 5:55 PM
>> >>
>> >> To: solr-user@lucene.apache.org
>> >> Subject: Re: How to search for phrase "IAE_UPC_0001"
>> >>
>> >> Hi Guys
>> >>
>> >> Thanks for the replies.  I've had a look at the WordDelimiterFilterFactory
>> >> and the Term Info for the url field.  It seems that all the terms exist
>> >> and I now understand that each url is being broken up using the
>> >> delimiters specified.  But I think I'm still missing something.
>> >>
>> >> Am I correct in assuming the minus sign (-) is also a delimiter?
>> >>
>> >> If so why then does  url:"IAE-UPC-0001" return a result (when the url
>> >> contains the substring IAE-UPC-0001) whereas  url:"IAE_UPC_0001" doesn't
>> >> (when the url contains the substring IAE_UPC_0001)?
>> >>
>> >> Secondly if the url has indeed been broken into the terms IAE UPC and 0001
>> >> why do all the searches suggested or tried succeed when the delimiter is
>> >> a minus sign (-) but not when the delimiter is an underscore (_),
>> >> returning zero matches?
>> >>
>> >> Finally, shouldn't the query url:"IAE UPC 0001"~1 work since all it is
>> >> looking for is the three terms?
>> >>
>> >> Many thanks for any enlightenment.
>> >>
>> >> P
>> >>
>> >>
>> >>
>> >>
>> >> On 4 August 2014 01:33, Harald Kirsch <Harald.Kirsch@raytion.com> wrote:
>> >>
>> >>> This all depends on how the tokenizers take your URLs apart. To quickly
>> >>> see what ended up in the index, go to a core in the UI, select Schema
>> >>> Browser, select the field containing your URLs, click on "Load Term
>> >>> Info".
>> >>>
>> >>> In your case, for the field holding the URL you could try to switch to
>> >>> a tokenizer that defines tokens as a sequence of alphanumeric
>> >>> characters, roughly [a-z0-9]+ plus diacritics. In particular punctuation
>> >>> and separation characters like dash, underscore, slash, dot and the like
>> >>> would never be part of a token, i.e. they don't make a difference.
>> >>>
>> >>> Then you can search the url parts with a phrase query (
>> >>> https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser#TheStandardQueryParser-SpecifyingTermsfortheStandardQueryParserwhich) like
>> >>>
>> >>>  url:"IAE-UPC-0001"
>> >>>
>> >>> In the same way as during indexing, the dashes are removed to end up
>> >>> with three tokens, namely IAE, UPC and 0001. Further they have to be in
>> >>> that order. Naturally this will then match anything like:
>> >>>
>> >>>   "IAE_UPC_0001"
>> >>>   "IAE UPC 0001"
>> >>>   "IAE/UPC+0001"
>> >>>   "IAE\UPC\0001"
>> >>>   "IAE.UPC,0001"
>> >>>
>> >>> Depending on how your URLs are structured, there is the chance for false
>> >>> positives, of course.
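
The alphanumeric-tokenizer idea above can be sketched in a few lines of Python.
This is an illustration only, not how Solr implements phrase queries: tokens
and phrase_match are hypothetical helpers, lowercasing is an added assumption,
diacritics are left out, and the example.com URLs are made up.

```python
import re


def tokens(text):
    # Tokens are runs of [a-z0-9]; every other character is a separator.
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_match(query, field_value):
    # True if the query tokens occur consecutively, in order, in the field.
    q = tokens(query)
    f = tokens(field_value)
    n = len(q)
    return n > 0 and any(f[i:i + n] == q for i in range(len(f) - n + 1))


urls = [
    "http://example.com/docs/IAE-UPC-0001.pdf",
    "http://example.com/docs/IAE_UPC_0001.pdf",
    "http://example.com/IAE/UPC/0001/index.html",  # a possible false positive
    "http://example.com/docs/UPC-IAE-0001.pdf",    # wrong order, no match
]
for url in urls:
    print(phrase_match("IAE-UPC-0001", url), url)
```

The third URL shows the false-positive risk: path segments that happen to
form the same token sequence match just like the document reference does.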
>> >>>
>> >>> The Really Good Thing here is that you don't need to use wildcards.
>> >>>
>> >>> I have not yet looked at the wildcard-queries implementation in
>> >>> Solr/Lucene, but with the commercial search engines I know, they are a
>> >>> great way to lose the confidence of your users, because they just don't
>> >>> work as expected by anyone not knowing the implementation. Either they
>> >>> deliver only partial results or they kill the performance or they even
>> >>> go OOM. If Solr committers have not done something really ingenious,
>> >>> Solr/Lucene does have the same problems.
>> >>>
>> >>> Harald.
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On 31.07.2014 18:31, Paul Rogers wrote:
>> >>>
>> >>>> Hi Guys
>> >>>>
>> >>>> I have a Solr application searching on data uploaded by Nutch.  The
>> >>>> search I wish to carry out is for a particular document reference
>> >>>> contained within the "url" field, e.g. IAE-UPC-0001.
>> >>>>
>> >>>> The problem is that the file names that comprise the URLs are not
>> >>>> consistent, so a url might contain the reference as IAE-UPC-0001 or
>> >>>> IAE_UPC_0001 (i.e. using either the minus or underscore as the
>> >>>> delimiter) but not both.
>> >>>>
>> >>>> I have created the query (in the solr admin interface):
>> >>>>
>> >>>> url:"IAE-UPC-0001"
>> >>>>
>> >>>> which works (returning the single expected document), as do:
>> >>>>
>> >>>> url:"IAE*UPC*0001"
>> >>>> url:"IAE?UPC?0001"
>> >>>>
>> >>>> when the doc ref is in the format IAE-UPC-0001 (i.e. using the minus
>> >>>> sign as a delimiter).
>> >>>>
>> >>>> However:
>> >>>>
>> >>>> url:"IAE_UPC_0001"
>> >>>> url:"IAE*UPC*0001"
>> >>>> url:"IAE?UPC?0001"
>> >>>>
>> >>>> do not work (returning zero documents) when the doc ref is in the
>> >>>> format IAE_UPC_0001 (i.e. using the underscore character as the
>> >>>> delimiter).
>> >>>>
>> >>>> I'm assuming the underscore is a special character. I have tried
>> >>>> looking at the Solr wiki but can't find anything to say what the
>> >>>> problem is.  The minus sign also has a specific meaning, but that is
>> >>>> nullified by adding the quotes.
>> >>>>
>> >>>> Can anyone suggest what I'm doing wrong?
>> >>>>
>> >>>> Many thanks
>> >>>>
>> >>>> Paul
>> >>>>
>> >>>>
>> >>> --
>> >>> Harald Kirsch
>> >>> Raytion GmbH
>> >>> Kaiser-Friedrich-Ring 74
>> >>> 40547 Duesseldorf
>> >>> Fon +49 211 53883-216
>> >>> Fax +49-211-550266-19
>> >>> http://www.raytion.com
>> >>>
>> >>>
>> >>
>>
