lucene-solr-user mailing list archives

From Paul Rogers <paul.roge...@gmail.com>
Subject Re: How to search for phrase "IAE_UPC_0001"
Date Mon, 18 Aug 2014 19:31:39 GMT
Hi Erick

Thanks for the assist.  Did as you suggested (tho' I used Nutch).  Cleared
out Solr's index and Nutch's crawl DB and then emptied all the documents
out of the web server bar 10 of each type (IAE-UPC-#### and IAE_UPC_####).
Then crawled the site using Nutch.

Then confirmed that all 20 docs had been uploaded and that a *:* search
returned all 20 docs.

Now when I do a url search on either (for example) q=url:"IAE-UPC-220" or
q="IAE_UPC_0001" I get a result returned for each, ie it now works as
expected.
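For anyone repeating the check from a script rather than the admin UI, here is a minimal sketch of how the two queries above could be turned into request URLs. The host, port and core name in SOLR_SELECT are assumptions for illustration and are not taken from this thread; only the query strings are.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port and core name are assumptions.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"

def select_url(query, rows=10):
    """Return a /select URL with the query string percent-encoded."""
    params = {"q": query, "wt": "json", "rows": rows}
    return SOLR_SELECT + "?" + urlencode(params)

# The two example queries from this message, exactly as typed.
for q in ('url:"IAE-UPC-220"', '"IAE_UPC_0001"'):
    print(select_url(q))
```

The encoding step matters because the quotes and the colon would otherwise have to be escaped by hand.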

So seems I now need to figure out why Nutch isn't crawling the documents.

Again many thanks.

P




On 18 August 2014 11:22, Erick Erickson <erickerickson@gmail.com> wrote:

> I'd pull Nutch out of the mix here as a test. Create
> some test docs (use the exampleDocs directory?) and
> go from there at least long enough to ensure that Solr
> does what you expect if the data gets there properly.
>
> You can set this up in about 10 minutes, and test it
> in about 15 more. May save you endless hours.
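One possible way to produce such test docs is sketched below: it builds ten documents per naming style as a JSON array that could then be sent to Solr's JSON update handler. The field names `id` and `url`, the example.com host and the `.pdf` suffix are assumptions for illustration only and need to match whatever the real schema defines.

```python
import json

def make_test_docs(count=10):
    # Build `count` documents for each naming style (hyphen and underscore).
    # Field names and the example.com URL are hypothetical.
    docs = []
    for n in range(1, count + 1):
        for sep in ("-", "_"):
            ref = f"IAE{sep}UPC{sep}{n:04d}"
            docs.append({"id": ref, "url": f"http://example.com/docs/{ref}.pdf"})
    return docs

# Serialise the documents as a JSON array ready to be posted.
payload = json.dumps(make_test_docs(), indent=2)
print(payload)
```

With a known, small document set like this, a missing hit points at the analysis chain rather than at the crawler.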
>
> Because you're conflating two issues here:
> 1> whether Nutch is sending the data
> 2> whether Solr is indexing and searching as you expect.
>
> Some of the Solr/Lucene analysis chains do transformations
> that may not be what you assume, particularly things
> like StandardTokenizer and WordDelimiterFilterFactory.
>
> So I'd take the time to see that the values you're dealing
> with are behaving as you expect. The admin/analysis page
> will help you a _lot_ here.
>
> Best,
> Erick
>
>
>
>
> On Mon, Aug 18, 2014 at 7:16 AM, Paul Rogers <paul.rogers6@gmail.com>
> wrote:
> > Hi Guys
> >
> > I've been checking into this further and have deleted the index a couple
> > of times and rebuilt it with the suggestions you've supplied.
> >
> > I had a bit of an epiphany last week and decided to check if the document
> > I was searching for was actually in the index (did this by doing a *:*
> > query to a file and grep'ing for the 'IAE_UPC_0001' string).  It seems it
> > isn't!!  Not sure if it was in the original index or not, tho' I suspect
> > not.
> >
> > As far as I can see anything with the reference in the form IAE_UPC_####
> > has not been indexed while those with the reference in the form
> > IAE-UPC-#### have.  Not sure if that's a coincidence or not.
> >
> > Need to see if I can get the docs into the index and then check if the
> > search works or not.  Will see if the guys on the Nutch list can shed any
> > light.
> >
> > All the best.
> >
> > P
> >
> >
> > On 4 August 2014 17:09, Jack Krupansky <jack@basetechnology.com> wrote:
> >
> >> The standard tokenizer treats underscore as a valid token character,
> >> not a delimiter.
> >>
> >> The word delimiter filter will treat underscore as a delimiter though.
> >>
> >> Make sure your query-time WDF does not have preserveOriginal="1" - but
> >> the index-time WDF should have preserveOriginal="1". Otherwise, the query
> >> phrase will generate an extra token which will participate in the
> >> matching and might cause a mismatch.
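The difference between the two stages can be approximated with a small sketch. This is not Lucene's implementation: the regular expressions are rough ASCII-only stand-ins, the function names are made up for illustration, and the real word delimiter filter has further options (such as splitting on case changes and letter/digit transitions) that are ignored here.

```python
import re

def standard_tokens(text):
    # Rough stand-in for the standard tokenizer on ASCII input: an underscore
    # stays inside a token, while a hyphen ends the token.
    return re.findall(r"[A-Za-z0-9_]+", text)

def word_delimiter(token, preserve_original=False):
    # Rough stand-in for the word delimiter filter: split on anything that is
    # not a letter or digit, optionally keeping the unsplit token as well.
    parts = re.findall(r"[A-Za-z0-9]+", token)
    if preserve_original and parts != [token]:
        return [token] + parts
    return parts

for ref in ("IAE-UPC-0001", "IAE_UPC_0001"):
    tokens = standard_tokens(ref)
    # Index side keeps the original token; query side does not.
    index_side = [t for tok in tokens for t in word_delimiter(tok, preserve_original=True)]
    query_side = [t for tok in tokens for t in word_delimiter(tok)]
    print(ref, tokens, index_side, query_side)
```

In this approximation the underscore form survives tokenization as a single token and is only split by the filter, which is where the preserveOriginal setting starts to matter.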
> >>
> >> -- Jack Krupansky
> >>
> >> -----Original Message----- From: Paul Rogers
> >> Sent: Monday, August 4, 2014 5:55 PM
> >>
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: How to search for phrase "IAE_UPC_0001"
> >>
> >> Hi Guys
> >>
> >> Thanks for the replies.  I've had a look at the WordDelimiterFilterFactory
> >> and the Term Info for the url field.  It seems that all the terms exist
> >> and I now understand that each url is being broken up using the delimiters
> >> specified.  But I think I'm still missing something.
> >>
> >> Am I correct in assuming the minus sign (-) is also a delimiter?
> >>
> >> If so why then does  url:"IAE-UPC-0001" return a result (when the url
> >> contains the substring IAE-UPC-0001) whereas  url:"IAE_UPC_0001" doesn't
> >> (when the url contains the substring IAE_UPC_0001)?
> >>
> >> Secondly if the url has indeed been broken into the terms IAE UPC and
> >> 0001 why do all the searches suggested or tried succeed when the
> >> delimiter is a minus sign (-) but not when the delimiter is an underscore
> >> (_), returning zero matches?
> >>
> >> Finally, shouldn't the query url:"IAE UPC 0001"~1 work since all it is
> >> looking for is the three terms?
> >>
> >> Many thanks for any enlightenment.
> >>
> >> P
> >>
> >>
> >>
> >>
> >> On 4 August 2014 01:33, Harald Kirsch <Harald.Kirsch@raytion.com>
> >> wrote:
> >>
> >>> This all depends on how the tokenizers take your URLs apart. To quickly
> >>> see what ended up in the index, go to a core in the UI, select Schema
> >>> Browser, select the field containing your URLs, click on "Load Term
> >>> Info".
> >>>
> >>> In your case, for the field holding the URL you could try to switch to
> >>> a tokenizer that defines tokens as a sequence of alphanumeric characters,
> >>> roughly [a-z0-9]+ plus diacritics. In particular punctuation and
> >>> separation characters like dash, underscore, slash, dot and the like
> >>> would never be part of a token, i.e. they don't make a difference.
> >>>
> >>> Then you can search the url parts with a phrase query
> >>> (https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser#TheStandardQueryParser-SpecifyingTermsfortheStandardQueryParserwhich)
> >>> like
> >>>
> >>>  url:"IAE-UPC-0001"
> >>>
> >>> In the same way as during indexing, the dashes are removed to end up
> >>> with three tokens, namely IAE, UPC and 0001. Further they have to be in
> >>> that order. Naturally this will then match anything like:
> >>>
> >>>   "IAE_UPC_0001"
> >>>   "IAE UPC 0001"
> >>>   "IAE/UPC+0001"
> >>>   "IAE\UPC\0001"
> >>>   "IAE.UPC,0001"
> >>>
> >>> Depending on how your URLs are structured, there is the chance for
> >>> false positives, of course.
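The matching behaviour described above can be sketched as follows. This is a toy model, not Solr code: the function names are invented for illustration, only ASCII letters and digits are treated as token characters (diacritics are left out), and the example URLs in the last two lines are made up.

```python
import re

def alnum_tokens(text):
    # Tokens are maximal runs of ASCII letters/digits, lower-cased; all
    # punctuation and separator characters are dropped.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def phrase_match(phrase, text):
    # True if the phrase tokens occur consecutively and in order in the text.
    want, have = alnum_tokens(phrase), alnum_tokens(text)
    if not want:
        return False
    return any(have[i:i + len(want)] == want
               for i in range(len(have) - len(want) + 1))

variants = ["IAE_UPC_0001", "IAE UPC 0001", "IAE/UPC+0001",
            r"IAE\UPC\0001", "IAE.UPC,0001"]
for v in variants:
    print(v, phrase_match("IAE-UPC-0001", v))

# A possible false positive: the same three tokens as separate path segments.
print(phrase_match("IAE-UPC-0001", "http://example.com/IAE/UPC/0001/index.html"))
# Order still matters, so a reordered reference does not match.
print(phrase_match("IAE-UPC-0001", "UPC-IAE-0001"))
```

The path-segment case shows why the false-positive risk depends on how the URLs are structured.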
> >>>
> >>> The Really Good Thing here is that you don't need to use wildcards.
> >>>
> >>> I have not yet looked at the wildcard-queries implementation in
> >>> Solr/Lucene, but with the commercial search engines I know, they are a
> >>> great way to lose the confidence of your users, because they just don't
> >>> work as expected by anyone not knowing the implementation. Either they
> >>> deliver only partial results or they kill the performance or they even
> >>> go OOM. If Solr committers have not done something really ingenious,
> >>> Solr/Lucene does have the same problems.
> >>>
> >>> Harald.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On 31.07.2014 18:31, Paul Rogers wrote:
> >>>
> >>>> Hi Guys
> >>>>
> >>>> I have a Solr application searching on data uploaded by Nutch.  The
> >>>> search I wish to carry out is for a particular document reference
> >>>> contained within the "url" field, e.g. IAE-UPC-0001.
> >>>>
> >>>> The problem is that the file names that comprise the urls are not
> >>>> consistent, so a url might contain the reference as IAE-UPC-0001 or
> >>>> IAE_UPC_0001 (ie using either the minus or underscore as the
> >>>> delimiter) but not both.
> >>>>
> >>>> I have created the query (in the solr admin interface):
> >>>>
> >>>> url:"IAE-UPC-0001"
> >>>>
> >>>> which works (returning the single expected document), as do:
> >>>>
> >>>> url:"IAE*UPC*0001"
> >>>> url:"IAE?UPC?0001"
> >>>>
> >>>> when the doc ref is in the format IAE-UPC-0001 (ie using the minus
> >>>> sign as a delimiter).
> >>>>
> >>>> However:
> >>>>
> >>>> url:"IAE_UPC_0001"
> >>>> url:"IAE*UPC*0001"
> >>>> url:"IAE?UPC?0001"
> >>>>
> >>>> do not work (returning zero documents) when the doc ref is in the
> >>>> format IAE_UPC_0001 (ie using the underscore character as the
> >>>> delimiter).
> >>>>
> >>>> I'm assuming the underscore is a special character.  I have tried
> >>>> looking at the Solr wiki but can't find anything to say what the problem
> >>>> is.  The minus sign also has a specific meaning but is nullified by
> >>>> adding the quotes.
> >>>>
> >>>> Can anyone suggest what I'm doing wrong?
> >>>>
> >>>> Many thanks
> >>>>
> >>>> Paul
> >>>>
> >>>>
> >>> --
> >>> Harald Kirsch
> >>> Raytion GmbH
> >>> Kaiser-Friedrich-Ring 74
> >>> 40547 Duesseldorf
> >>> Fon +49 211 53883-216
> >>> Fax +49-211-550266-19
> >>> http://www.raytion.com
> >>>
> >>>
> >>
>
