lucene-solr-user mailing list archives

From Erol Akarsu <eaka...@gmail.com>
Subject Re: Luke and SOLR search giving different results
Date Mon, 03 Dec 2012 18:30:50 GMT
Jack,

I have the following in schema.xml, which defines "features" with the text_tr field type.

But unfortunately, the search for "baş" still fails.

<field name="features" type="text_tr" indexed="true" stored="true" multiValued="true"/>
<copyField source="features" dest="text"/>

<fieldType name="text_tr" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_tr.txt" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Turkish"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_tr.txt" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Turkish"/>
  </analyzer>
</fieldType>
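
For reference, here is a rough sketch of what Jack's diagnosis below would imply for the schema. It rests on an assumption this thread does not confirm: that the catch-all "text" field is still declared with a non-Turkish type such as text_general, as in the stock example schema. If so, one option is to declare "text" with text_tr, so that both the copied content and the default-field query go through the Turkish analysis chain:

```xml
<!-- Sketch only. Assumption: "text" is the default search field and is
     currently declared with a non-Turkish type such as text_general.
     The indexed/stored/multiValued values below are illustrative. -->
<field name="text" type="text_tr" indexed="true" stored="false" multiValued="true"/>
<copyField source="features" dest="text"/>
```

This applies Turkish analysis to every field copied into "text", which may not suit non-Turkish content. An alternative that leaves the schema untouched is to query the Turkish field explicitly, for example q=features:baş. As Jack notes further down, a full reindex is needed after any change to the schema fields or field types.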



On Mon, Dec 3, 2012 at 1:15 PM, Jack Krupansky <jack@basetechnology.com> wrote:

> Ah! See where it says "<str name="parsedquery_toString">text:baş</str>"?
> Your query is against the "text" field, which probably doesn't have the
> Turkish analysis.
>
> There is probably a copyField from "features" to "text". You use the
> "text_tr" field type for "features", but probably not for the "text" field.
>
>
> -- Jack Krupansky
>
> -----Original Message----- From: Erol Akarsu
> Sent: Monday, December 03, 2012 1:06 PM
>
> To: solr-user@lucene.apache.org
> Subject: Re: Luke and SOLR search giving different results
>
> Jack,
>
> I had already set up the Tomcat server for UTF-8 encoding. I added
> URIEncoding="UTF-8" to all <Connector ..> elements in server.xml in Tomcat
> 7.
>
> As you can see below, when I search for the word "baş" in debug mode, I get
> an empty response. But when I search for the word "baştan", I get the
> correct response.
>
> It seems to me that the Turkish analyzer is not being used in the SOLR
> search, because only the full word "baştan" matches, not the root word
> "baş". Probably an English analyzer is being used, and it cannot find the
> root word. For example, in Luke, if I change "Analyzer to use for query
> parsing" to EnglishAnalyzer, it cannot find the word "baş"; it can only
> with TurkishAnalyzer. I guess SOLR is not using TurkishAnalyzer.
>
> Is this assumption true? I could not find any other reason.
>
>
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
>    <lst name="responseHeader">
>        <int name="status">0</int>
>        <int name="QTime">58</int>
>        <lst name="params">
>            <str name="debugQuery">true</str>
>            <str name="q">baş</str>
>            <str name="wt">xml</str>
>        </lst>
>    </lst>
>    <result name="response" numFound="0" start="0" />
>    <lst name="debug">
>        <str name="rawquerystring">baş</str>
>        <str name="querystring">baş</str>
>        <str name="parsedquery">text:baş</str>
>        <str name="parsedquery_toString">text:baş</str>
>        <lst name="explain" />
>        <str name="QParser">LuceneQParser</str>
>        <lst name="timing">
>            <double name="time">38.0</double>
>            <lst name="prepare">
>                <double name="time">16.0</double>
>                <lst name="org.apache.solr.handler.component.QueryComponent">
>                    <double name="time">3.0</double>
>                </lst>
>                <lst name="org.apache.solr.handler.component.FacetComponent">
>                    <double name="time">0.0</double>
>                </lst>
>                <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
>                    <double name="time">0.0</double>
>                </lst>
>                <lst name="org.apache.solr.handler.component.HighlightComponent">
>                    <double name="time">0.0</double>
>                </lst>
>                <lst name="org.apache.solr.handler.component.StatsComponent">
>                    <double name="time">0.0</double>
>                </lst>
>                <lst name="org.apache.solr.handler.component.DebugComponent">
>                    <double name="time">0.0</double>
>                </lst>
>            </lst>
>            <lst name="process">
>                <double name="time">10.0</double>
>                <lst name="org.apache.solr.handler.component.QueryComponent">
>                    <double name="time">0.0</double>
>                </lst>
>                <lst name="org.apache.solr.handler.component.FacetComponent">
>                    <double name="time">0.0</double>
>                </lst>
>                <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
>                    <double name="time">0.0</double>
>                </lst>
>                <lst name="org.apache.solr.handler.component.HighlightComponent">
>                    <double name="time">0.0</double>
>                </lst>
>                <lst name="org.apache.solr.handler.component.StatsComponent">
>                    <double name="time">0.0</double>
>                </lst>
>                <lst name="org.apache.solr.handler.component.DebugComponent">
>                    <double name="time">10.0</double>
>                </lst>
>            </lst>
>        </lst>
>    </lst>
> </response>
>
> <response>
>    <lst name="responseHeader">
>        <int name="status">0</int>
>        <int name="QTime">2</int>
>        <lst name="params">
>            <str name="debugQuery">true</str>
>            <str name="q">baştan</str>
>            <str name="wt">xml</str>
>        </lst>
>    </lst>
>    <result name="response" numFound="1" start="0">
>        <doc>
>            <str name="url">htt://111.a.b1</str>
>            <str name="id">6H500F0XXXX</str>
>            <str name="lang">tr</str>
>            <str name="name">Maxtor DiamondMax 11 - hard drive - 500 GB - SATA-300
>            </str>
>            <str name="manu">Maxtor Corp.</str>
>            <str name="manu_id_s">maxtor</str>
>            <arr name="cat">
>                <str>electronics</str>
>                <str>hard drive</str>
>            </arr>
>            <arr name="features">
>                <str>SATA 3.0Gb/s, NCQ</str>
>                <str>8.5ms seek</str>
>                <str>16MB cache</str>
>                <str>
>                    Firmalarsa "Nasılsa buldum oynatacak ünlüyü, neyleyim senaryoyu!" diyerek
>                    baştan savma reklamlarla kotarmaya bakıyor işi. Futbolcu Arda Turan
>                    ve büyük umutlarla Türkiye'ye getirilen Paris Hilton'un oynatıldığı
>                    giyim firması reklamı da tam bir fiyasko. Birbirinden ünlü bu iki
>                    ismin oynadığı reklam Arda'nın kabinde papağan gibi tekrarladığı
>                    "My darling!" repliği, sonunda Paris'i görünce anlam veremediğimiz
>                    uyduruk bayılma sahnesi, bir de Paris'in ancak 5 kez izledikten
>                    sonra anlaşılan "Paris seçti, firma yaptı, Arda bayıldı."
>                    sözleriyle kazındı hafızalara, "Keşke unutabilsek!" dedirterek.
>                </str>
>            </arr>
>            <float name="price">350.0</float>
>            <str name="price_c">350,USD</str>
>            <int name="popularity">6</int>
>            <bool name="inStock">true</bool>
>            <date name="manufacturedate_dt">2006-02-13T15:26:37Z</date>
>            <long name="_version_">1420300467908378624</long>
>        </doc>
>    </result>
>    <lst name="debug">
>        <str name="rawquerystring">baştan</str>
>        <str name="querystring">baştan</str>
>        <str name="parsedquery">text:baştan</str>
>        <str name="parsedquery_toString">text:baştan</str>
>        <lst name="explain">
>            <str name="6H500F0XXXX">
>                0.028767452 = (MATCH) weight(text:baştan in 0) [DefaultSimilarity], result of:
>                0.028767452 = fieldWeight in 0, product of:
>                1.0 = tf(freq=1.0), with freq of:
>                1.0 = termFreq=1.0
>                0.30685282 = idf(docFreq=1, maxDocs=1)
>                0.09375 = fieldNorm(doc=0)
>            </str>
>        </lst>
>        <str name="QParser">LuceneQParser</str>
>        <lst name="timing">
>            <double name="time">2.0</double>
>            <lst name="prepare">
>                <double name="time">1.0</double>
>                <lst name="org.apache.solr.handler.component.QueryComponent">
>                    <double name="time">1.0</double>
>                </lst>
>                <lst name="org.apache.solr.handler.component.FacetComponent">
>                    <double name="time">0.0</double>
>                </lst>
>                <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
>                    <double name="time">0.0</double>
>                </lst>
>                <lst name="org.apache.solr.handler.component.HighlightComponent">
>                    <double name="time">0.0</double>
>                </lst>
>                <lst name="org.apache.solr.handler.component.StatsComponent">
>                    <double name="time">0.0</double>
>                </lst>
>                <lst name="org.apache.solr.handler.component.DebugComponent">
>                    <double name="time">0.0</double>
>                </lst>
>            </lst>
>            <lst name="process">
>                <double name="time">1.0</double>
>                <lst name="org.apache.solr.handler.component.QueryComponent">
>                    <double name="time">0.0</double>
>                </lst>
>                <lst name="org.apache.solr.handler.component.FacetComponent">
>                    <double name="time">0.0</double>
>                </lst>
>                <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
>                    <double name="time">0.0</double>
>                </lst>
>                <lst name="org.apache.solr.handler.component.HighlightComponent">
>                    <double name="time">0.0</double>
>                </lst>
>                <lst name="org.apache.solr.handler.component.StatsComponent">
>                    <double name="time">0.0</double>
>                </lst>
>                <lst name="org.apache.solr.handler.component.DebugComponent">
>                    <double name="time">1.0</double>
>                </lst>
>            </lst>
>        </lst>
>    </lst>
> </response>
>
> On Mon, Dec 3, 2012 at 12:30 PM, Jack Krupansky <jack@basetechnology.com> wrote:
>
>> Two points:
>>
>> 1. Possibly an encoding problem with your container? Is UTF-8 encoding
>> enabled?
>> 2. Add &debugQuery=true to your query (from the browser) and see if the
>> parsedquery has the expected term that matches what Luke reports for the
>> index and what Solr Admin Analysis also reports for index analysis.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Erol Akarsu
>> Sent: Monday, December 03, 2012 11:35 AM
>>
>> To: solr-user@lucene.apache.org
>> Subject: Re: Luke and SOLR search giving different results
>>
>> Jack,
>>
>> Yes.
>>
>> I expect SOLR to give the same search results as Luke does.
>>
>> The term analysis in SOLR gives the correct answer, as expected. But SOLR
>> does not return the correct search results.
>>
>> I don't know why.
>>
>> Erol Akarsu
>>
>> On Mon, Dec 3, 2012 at 11:21 AM, Jack Krupansky <jack@basetechnology.com> wrote:
>>
>>> So, does that highlight the problem for you or not? Is the term analyzed
>>> as you expected?
>>>
>>> -- Jack Krupansky
>>>
>>> From: Erol Akarsu
>>> Sent: Monday, December 03, 2012 8:44 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Luke and SOLR search giving different results
>>>
>>> Jack,
>>>
>>> Thanks for the help.
>>>
>>> I removed the SOLR data folder and indexed this sample doc from scratch,
>>> so it is now the only document in SOLR.
>>>
>>> When I ran the analysis, I could see that the stemming is correct: the
>>> words "bul", "baş", "gör" and "umut" appear in the SF row.
>>> I attached the analysis screens.
>>>
>>> Erol Akarsu
>>>
>>>
>>> On Sun, Dec 2, 2012 at 11:00 PM, Jack Krupansky <jack@basetechnology.com> wrote:
>>>
>>>   Have you tried using the Solr Admin Analysis page, using the word and a
>>> few words of context for index analysis and the word alone for query
>>> analysis?
>>>
>>>   And be sure to fully reindex if you change ANYTHING in the schema fields
>>> or field types.
>>>
>>>   -- Jack Krupansky
>>>
>>>   From: Erol Akarsu
>>>   Sent: Sunday, December 02, 2012 10:38 PM
>>>   To: solr-user@lucene.apache.org
>>>   Subject: Luke and SOLR search giving different results
>>>
>>>
>>>   Hi,
>>>
>>>   I am trying to use SOLR for the Turkish language for my research.
>>>
>>>   Instead of using language identification, I manually assigned the Turkish
>>> language to a sample test document. I configured the SOLR schema.xml and
>>> activated the part below. I added the attached document
>>> testTurkishDoc.xml, which is inserted into the SOLR database.
>>>
>>>   But searching the raw Lucene index through Luke and searching through the
>>> SOLR 4.0 GUI give different results. In picture Selection_006.png, the
>>> word "baş" is listed as a top term. I searched for the word "baş" in Luke
>>> and got the only document as the result, shown in Selection_004.png.
>>>
>>>   But in the SOLR GUI, I get an empty result for the word "baş", as shown
>>> in picture Selection_002.png.
>>>
>>>   In the text, the features field contains the word "baştan", which is
>>> derived from the root word "baş" in Turkish grammar. Somehow, the SOLR GUI
>>> searches differently than Luke. I could not figure out why SOLR does not
>>> find the word while Luke does. The same thing happens for the words
>>> "umut", "bul" and "gör".
>>>
>>>   I would appreciate it if you could help me get the same results from the
>>> SOLR UI.
>>>
>>>
>>>   <field name="features">
>>>          Firmalarsa "Nasılsa buldum oynatacak ünlüyü, neyleyim
>>> senaryoyu!"
>>> diyerek baştan savma reklamlarla kotarmaya bakıyor işi. Futbolcu Arda
>>> Turan
>>> ve büyük umutlarla Türkiye'ye getirilen Paris Hilton'un oynatıldığı giyim
>>> firması reklamı da tam bir fiyasko. Birbirinden ünlü bu iki ismin
>>> oynadığı
>>> reklam Arda'nın kabinde papağan gibi tekrarladığı "My darling!" repliği,
>>> sonunda Paris'i görünce anlam veremediğimiz uyduruk bayılma sahnesi, bir
>>> de
>>> Paris'in ancak 5 kez izledikten sonra anlaşılan "Paris seçti, firma
>>> yaptı,
>>> Arda bayıldı." sözleriyle kazındı hafızalara, "Keşke unutabilsek!"
>>> dedirterek.
>>>     </field>
>>>
>>>
>>>
>>>   Added to schema.xml for SOLR:
>>>
>>>   <field name="features" type="text_tr" indexed="true" stored="true" multiValued="true"/>
>>>   <fieldType name="text_tr" class="solr.TextField" positionIncrementGap="100">
>>>         <analyzer type="index">
>>>           <tokenizer class="solr.StandardTokenizerFactory"/>
>>>           <filter class="solr.TurkishLowerCaseFilterFactory"/>
>>>           <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_tr.txt" enablePositionIncrements="true"/>
>>>           <filter class="solr.SnowballPorterFilterFactory" language="Turkish"/>
>>>         </analyzer>
>>>         <analyzer type="query">
>>>           <tokenizer class="solr.StandardTokenizerFactory"/>
>>>           <filter class="solr.TurkishLowerCaseFilterFactory"/>
>>>           <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_tr.txt" enablePositionIncrements="true"/>
>>>           <filter class="solr.SnowballPorterFilterFactory" language="Turkish"/>
>>>         </analyzer>
>>>       </fieldType>
>>>
