lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Searching for tokens does not return any results
Date Fri, 02 May 2014 15:56:40 GMT
bq:  but this index was created using a Java program using Lucene interface

Elaborating a bit on Koji's comment...

The fact that you used Lucene to index the doc means that the analysis
page is almost, but not quite entirely, useless on the indexing side.
It's looking at your field definition in schema.xml and running your
input stream through the indexing portion of your analysis chain
constructed from the schema. What's actually in your index though was
put there by raw Lucene. So your Lucene program _must_ create an
analysis chain that is absolutely identical to what's in your schema
for the admin/analysis page to be accurate.
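
To make the mismatch concrete, here is a rough Python model of the two chains. It is a sketch under stated assumptions, not Lucene code: the regex `\w+` stands in for StandardAnalyzer on the assumption that it keeps CRD_PROD as one token (which would explain the reported behavior), and `schema_query_terms` is a simplified stand-in for the query side of text_general. All function names are made up for illustration.

```python
import re

def lucene_side_terms(text):
    # Assumed stand-in for StandardAnalyzer: the regex \w+ treats "_" as a
    # word character, so CRD_PROD stays one token; then lowercase.
    return {token.lower() for token in re.findall(r"\w+", text)}

def schema_query_terms(text):
    # Assumed stand-in for the text_general query chain: whitespace tokens,
    # split on "_" while preserving the original, then lowercase.
    terms = set()
    for token in text.split():
        terms.add(token.lower())
        terms.update(part.lower() for part in token.split("_") if part)
    return terms

def matches(query, indexed_terms):
    # A query hits only if at least one of its terms exists in the index.
    return bool(schema_query_terms(query) & indexed_terms)

# What the raw Lucene program is assumed to have written for this value.
indexed_terms = lucene_side_terms("CRD_PROD")
print(indexed_terms)
print(matches("CRD_PROD", indexed_terms))
print(matches("CRD", indexed_terms))
```

In this model the index holds only the single term crd_prod, so the CRD_PROD query matches through its preserved original while the CRD query has nothing to match against.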

Quick test: go to your "admin/schema browser" page or use the
TermsComponent (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
or Luke to examine the actual tokens in your field. My bet is that
you'll see that the actual terms are not what you expect and almost
certainly not what the admin/analysis page shows on the index side.
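
If you go the TermsComponent route, the request can be built along these lines. This is a minimal sketch: the host, port and core name (`localhost:8983`, `collection1`) are assumptions, as is having a `/terms` request handler configured in solrconfig.xml, and `terms_url` is just a helper name chosen for illustration.

```python
from urllib.parse import urlencode

def terms_url(base_url, field, prefix="", limit=20):
    # Standard TermsComponent parameters: terms.fl names the field to
    # inspect, terms.prefix narrows the listing, terms.limit caps it.
    params = {
        "terms": "true",
        "terms.fl": field,
        "terms.limit": limit,
        "wt": "json",
    }
    if prefix:
        params["terms.prefix"] = prefix
    return f"{base_url}/terms?{urlencode(params)}"

# Assumed base URL; adjust host and core name to your installation.
url = terms_url("http://localhost:8983/solr/collection1",
                "DBASE_LOCAT_NM_TEXT", prefix="crd")
print(url)
```

Fetching that URL lists the terms actually stored in the field that start with "crd", which you can compare against what the analysis page claims.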

Keeping an independent program that writes to your index with raw
Lucene in sync with your schema is, as you can see, something of a
problem. If at all possible, consider letting Solr do the indexing
and sending it documents with SolrJ. Here's a reference:
https://cwiki.apache.org/confluence/display/solr/Using+SolrJ
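
The point is that the document goes through Solr, so the schema's analysis chain is what gets applied at index time. SolrJ is the Java route; as a language-neutral illustration of the same idea, here is a sketch that prepares a JSON post to Solr's /update handler over plain HTTP instead. The base URL, the core name and the `id` field are assumptions, and `build_update_request` is a hypothetical helper. The request is only constructed here, not sent.

```python
import json
from urllib.request import Request

def build_update_request(base_url, docs):
    # Solr's update handler accepts a JSON array of documents;
    # commit=true makes them searchable right after the request.
    body = json.dumps(docs).encode("utf-8")
    request = Request(f"{base_url}/update?commit=true", data=body)
    request.add_header("Content-Type", "application/json")
    return request

# Assumed unique key field "id"; the text field is the one from this thread.
docs = [{"id": "1", "DBASE_LOCAT_NM_TEXT": "CRD_PROD"}]
request = build_update_request("http://localhost:8983/solr/collection1", docs)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it to a running Solr instance.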

By the way, I want to compliment you on your post. You did all the right things:
- defined your problem clearly
- added the critical bit (index created with Lucene). This is especially relevant, I think
- illustrated the input and output
- told us what the problem was
- gave us the field definitions
- showed the results of some of your investigation

Best
Erick

On Thu, May 1, 2014 at 7:31 AM, Koji Sekiguchi <koji@r.email.ne.jp> wrote:
> Hi Yetkin, welcome!
>
> I think StandardAnalyzer of Lucene is the problem you are facing.
>
> Why don't you have another field using StandardAnalyzer and see how it
> tokenizes CRD_PROD
> on Solr admin GUI?
>
> I forget the details, but we can use a Lucene Analyzer in schema.xml,
> something like this:
>
> <fieldType ...>
>    <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
> </fieldType>
>
> Koji
> --
> http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html
>
>
> (2014/05/01 23:04), Yetkin Ozkucur wrote:
>>
>> Hello everyone,
>>
>> I am new to SOLR and this is my first post to this list.
>> I have been working on this problem for a couple of days. I tried
>> everything I found on Google but it looks like I am missing something.
>>
>> Here is my problem:
>> I have a field called: DBASE_LOCAT_NM_TEXT
>> It contains values like: CRD_PROD
>> The goal is to be able to search this field either by putting the exact
>> string "CRD_PROD" or part of it (tokenized by "_")  like "CRD" or "PROD"
>>
>> Currently:
>> This query returns results: q=DBASE_LOCAT_NM_TEXT:CRD_PROD
>> But this does not: q=DBASE_LOCAT_NM_TEXT:CRD
>> I want to understand why the second query does not return any results
>>
>> Here is how I configured the field:
>> <field name="DBASE_LOCAT_NM_TEXT" type="text_general" indexed="true"
>> stored="true" required="false" multiValued="false"/>
>>
>> And here is how I configured the field type:
>>      <fieldType name="text_general" class="solr.TextField"
>> positionIncrementGap="100">
>>        <analyzer type="index">
>>        <filter class="solr.WordDelimiterFilterFactory"
>> preserveOriginal="1" generateWordParts="1" generateNumberParts="1"
>> catenateWords="1" catenateNumbers="1" catenateAll="0"
>> splitOnCaseChange="1"/>
>>          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>          <filter class="solr.StopFilterFactory"  ignoreCase="true"
>> words="stopwords.txt"/>
>>           <filter class="solr.LowerCaseFilterFactory"/>
>>          <filter class="solr.KeywordMarkerFilterFactory"
>> protected="protwords.txt"/>
>>          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>        </analyzer>
>>        <analyzer type="query">
>>          <filter class="solr.WordDelimiterFilterFactory"
>> preserveOriginal="1" generateWordParts="1" generateNumberParts="1"
>> catenateWords="0" catenateNumbers="0" catenateAll="0"
>> splitOnCaseChange="1"/>
>>          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>          <filter class="solr.StopFilterFactory" ignoreCase="true"
>> words="stopwords.txt"/>
>>
>>          <filter class="solr.LowerCaseFilterFactory"/>
>>          <filter class="solr.KeywordMarkerFilterFactory"
>> protected="protwords.txt"/>
>>          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>
>>        </analyzer>
>>      </fieldType>
>>
>> I am also using the analysis panel in the SOLR admin console. It shows
>> this:
>> WT      CRD_PROD
>>
>> WDF     CRD_PROD
>>         CRD
>>         PROD
>>         CRDPROD
>>
>> SF      CRD_PROD
>>         CRD
>>         PROD
>>         CRDPROD
>>
>> LCF     crd_prod
>>         crd
>>         prod
>>         crdprod
>>
>> SKMF    crd_prod
>>         crd
>>         prod
>>         crdprod
>>
>> RDTF    crd_prod
>>         crd
>>         prod
>>         crdprod
>>
>>
>> I am not sure if it is related or not but this index was created using a
>> Java program using Lucene interface. It used StandardAnalyzer for writing
>> and the field was configured as tokenized, indexed and stored.  Does this
>> affect the SOLR configuration?
>>
>> Can you please help me understand what I am missing and how I can debug
>> it?
>>
>> Thanks,
>> Yetkin
>>
>
>
>
