lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Searching for tokens does not return any results
Date Fri, 02 May 2014 21:56:59 GMT
Glad to hear it!

You shouldn't really have to customize the analyzer to get it to behave
as it would if you had just used Solr to ingest the documents; chain the
same components together, which is all Solr does anyway. Of course, you
may have special needs that are better served by more customization.
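
For a concrete starting point, here's an untested sketch against the 4.4
APIs that mirrors the index-time chain from your schema (the class name is
made up, and I've left out the stop/protword/duplicate filters for
brevity, so adjust the flags and filters to match your schema exactly):

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
    import org.apache.lucene.util.Version;

    // Index-time chain: whitespace tokenizer -> word delimiter -> lowercase,
    // i.e. the same components the schema's <analyzer type="index"> declares.
    public class SchemaMatchingAnalyzer extends Analyzer {
      @Override
      protected TokenStreamComponents createComponents(String field, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_44, reader);
        TokenStream stream = new WordDelimiterFilter(source,
            WordDelimiterFilter.GENERATE_WORD_PARTS
            | WordDelimiterFilter.GENERATE_NUMBER_PARTS
            | WordDelimiterFilter.CATENATE_WORDS
            | WordDelimiterFilter.CATENATE_NUMBERS
            | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE
            | WordDelimiterFilter.PRESERVE_ORIGINAL,
            null); // no protected-words set in this sketch
        stream = new LowerCaseFilter(Version.LUCENE_44, stream);
        return new TokenStreamComponents(source, stream);
      }
    }

Pass this analyzer to your IndexWriterConfig and the terms that land in
the index should line up with what the admin/analysis page predicts.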

TermsComponent is a useful tool. Note that you can also get raw terms
from the admin/schema-browser page: identify your field, then click the
"show term info" button. That technique is somewhat limited, though;
I'll admit the schema-browser page is mostly useful for very small
indexes and/or test cases. I also vaguely remember something not being
right with the schema-browser at one point, so it might not work as I
expect in 4.4.
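
For example, with the stock /terms handler, a request along these lines
(the core name is hypothetical and the prefix is just illustrative; note
it's lowercase because of the LowerCaseFilter) returns the raw indexed
terms:

    http://localhost:8983/solr/collection1/terms?terms.fl=DBASE_LOCAT_NM_TEXT&terms.prefix=crd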

Best,
Erick

On Fri, May 2, 2014 at 1:56 PM, Yetkin Ozkucur <Yetkin.Ozkucur@asg.com> wrote:
> Erick, Koji, Ahmet:
>
> Thank you all for your answers! I think I found the problem and I am on
> the right track to fix it.
>
> 1- As you suggested, the problem was in the Java code populating the
> index. The analyzer in the Java code had to be consistent with the one
> defined in Solr. I was able to achieve my goal by creating a slightly
> customized analyzer.
> 2- Being able to see the tokens in the index was key to debugging the
> problem. I downloaded Luke (well, a tweaked version of it for Lucene
> 4.4) to be able to see the tokens. I did not know Solr had that terms
> component. That is a good tip too.
>
> Have a good weekend.
>
> Thanks,
> Yetkin
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Friday, May 02, 2014 11:57 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Searching for tokens does not return any results
>
> bq:  but this index was created using a Java program using Lucene interface
>
> Elaborating a bit on Koji's comment...
>
> The fact that you used Lucene to index the doc means that the analysis
> page is almost, but not quite entirely, useless on the indexing side.
> It's looking at your field definition in schema.xml and running your
> input stream through the indexing portion of your analysis chain
> constructed from the schema. What's actually in your index, though, was
> put there by raw Lucene. So your Lucene program _must_ create an
> analysis chain that is absolutely identical to what's in your schema
> for the admin/analysis page to be accurate.
>
> Quick test: go to your "admin/schema browser" page, or use the
> TermsComponent (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
> or Luke, to examine the actual tokens in your field. My bet is that
> you'll see the actual terms are not what you expect, and almost
> certainly not what the admin/analysis page shows on the index side.
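>
> Or dump the tokens straight from Java. A minimal untested sketch against
> the 4.4 APIs (class name made up); if the index was built with
> StandardAnalyzer, I'd expect this to print a single "crd_prod" token,
> since the standard tokenizer doesn't split on underscores:
>
>     import java.io.StringReader;
>
>     import org.apache.lucene.analysis.TokenStream;
>     import org.apache.lucene.analysis.standard.StandardAnalyzer;
>     import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>     import org.apache.lucene.util.Version;
>
>     // Print the tokens an analyzer actually produces for one field value.
>     public class DumpTokens {
>       public static void main(String[] args) throws Exception {
>         StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_44);
>         TokenStream ts = analyzer.tokenStream("DBASE_LOCAT_NM_TEXT",
>             new StringReader("CRD_PROD"));
>         CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
>         ts.reset();
>         while (ts.incrementToken()) {
>           System.out.println(term.toString()); // likely "crd_prod", one token
>         }
>         ts.end();
>         ts.close();
>         analyzer.close();
>       }
>     }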
>
> Keeping an independent Lucene program that puts data into your index
> with raw Lucene aligned with your schema is, as you can see, something
> of a problem. If at all possible, consider letting Solr do the indexing
> and sending it documents with SolrJ; here's a reference:
> https://cwiki.apache.org/confluence/display/solr/Using+SolrJ
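>
> For instance, a bare-bones untested SolrJ 4.x sketch (URL, core name,
> and field values are placeholders):
>
>     import org.apache.solr.client.solrj.impl.HttpSolrServer;
>     import org.apache.solr.common.SolrInputDocument;
>
>     // Send one document to Solr; the index-time analysis from schema.xml
>     // is applied on the server, so it can't drift out of sync.
>     public class IndexWithSolrJ {
>       public static void main(String[] args) throws Exception {
>         HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
>         SolrInputDocument doc = new SolrInputDocument();
>         doc.addField("id", "doc1");
>         doc.addField("DBASE_LOCAT_NM_TEXT", "CRD_PROD");
>         server.add(doc);
>         server.commit();
>         server.shutdown();
>       }
>     }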
>
> By the way, I want to compliment you on your post. You did all the
> right things:
> - defined your problem clearly
> - added the critical bit (index created with Lucene); this is especially relevant, I think
> - illustrated the input and output
> - told us what the problem was
> - gave us the field definitions
> - showed the results of some of your investigation
>
> Best,
> Erick
>
> On Thu, May 1, 2014 at 7:31 AM, Koji Sekiguchi <koji@r.email.ne.jp> wrote:
>> Hi Yetkin, welcome!
>>
>> I think Lucene's StandardAnalyzer is the problem you are facing.
>>
>> Why don't you add another field using StandardAnalyzer and see how it
>> tokenizes CRD_PROD on the Solr admin GUI?
>>
>> I forget the details, but we can use a Lucene Analyzer in schema.xml
>> with something like this:
>>
>> <fieldType ...>
>>   <analyzer class="solr.StandardAnalyzer"/>
>> </fieldType>
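>>
>> Spelled out, an untested sketch might look like the following (the type
>> name is made up, and I believe the analyzer needs the fully qualified
>> class name here, since the "solr." shorthand is meant for factories):
>>
>>   <fieldType name="text_std" class="solr.TextField">
>>     <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
>>   </fieldType>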
>>
>> Koji
>> --
>> http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html
>>
>>
>> (2014/05/01 23:04), Yetkin Ozkucur wrote:
>>>
>>> Hello everyone,
>>>
>>> I am new to Solr and this is my first post on this list.
>>> I have been working on this problem for a couple of days. I tried
>>> everything I found on Google, but it looks like I am missing something.
>>>
>>> Here is my problem:
>>> I have a field called DBASE_LOCAT_NM_TEXT. It contains values like
>>> "CRD_PROD". The goal is to be able to search this field either by the
>>> exact string "CRD_PROD" or by part of it (tokenized on "_"), like
>>> "CRD" or "PROD".
>>>
>>> Currently:
>>> This query returns results: q=DBASE_LOCAT_NM_TEXT:CRD_PROD
>>> But this does not: q=DBASE_LOCAT_NM_TEXT:CRD
>>> I want to understand why the second query does not return any results.
>>>
>>> Here is how I configured the field:
>>> <field name="DBASE_LOCAT_NM_TEXT" type="text_general" indexed="true"
>>> stored="true" required="false" multiValued="false"/>
>>>
>>> And here is how I configured the field type:
>>>
>>>   <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>>>     <analyzer type="index">
>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>       <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
>>>               generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>>               catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>       <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>     </analyzer>
>>>     <analyzer type="query">
>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>       <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
>>>               generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>>               catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>>       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>       <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>     </analyzer>
>>>   </fieldType>
>>>
>>> I am also using the analysis panel in the Solr admin console. It
>>> shows this:
>>> WT      CRD_PROD
>>>
>>> WDF     CRD_PROD
>>>         CRD
>>>         PROD
>>>         CRDPROD
>>>
>>> SF      CRD_PROD
>>>         CRD
>>>         PROD
>>>         CRDPROD
>>>
>>> LCF     crd_prod
>>>         crd
>>>         prod
>>>         crdprod
>>>
>>> SKMF    crd_prod
>>>         crd
>>>         prod
>>>         crdprod
>>>
>>> RDTF    crd_prod
>>>         crd
>>>         prod
>>>         crdprod
>>>
>>>
>>> I am not sure if it is related, but this index was created by a Java
>>> program using the Lucene interface. It used StandardAnalyzer for
>>> writing, and the field was configured as tokenized, indexed, and
>>> stored. Does this affect the Solr configuration?
>>>
>>> Can you please help me understand what I am missing and how I can
>>> debug it?
>>>
>>> Thanks,
>>> Yetkin
>>>
>>
>>
>>
