lucene-solr-user mailing list archives

From Martin Wunderlich <martin...@gmx.net>
Subject Re: Applying Tokenizers and Filters to CopyFields
Date Wed, 25 Mar 2015 21:13:40 GMT
Thanks a lot, Ahmet. I’ve just read up on the query fields (qf) parameter and it sounds good.
Since the field contents are currently all identical, I can’t really test it yet.
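
For reference, a minimal sketch of how qf could be wired in, assuming the three field names from the schema quoted below. The handler name "/search_all" is hypothetical, and this fragment has not been run against the index discussed here. The same two parameters can also be passed per request (defType=edismax&qf=...).

```xml
<!-- solrconfig.xml (sketch). "/search_all" is a made-up handler name;
     the field names are taken from the schema quoted further down. -->
<requestHandler name="/search_all" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- use the extended DisMax query parser -->
    <str name="defType">edismax</str>
    <!-- qf: the fields the user's terms are searched in; each field
         applies its own query-time analyzer to the terms -->
    <str name="qf">original stopwords_removed expanded</str>
  </lst>
</requestHandler>
```

With such a handler, a query for "Sprachen" would be analyzed separately for each listed field, so the synonym expansion defined for "expanded" would only apply to that field.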

Cheers, 

Martin
 



> On 25.03.2015 at 21:27, Ahmet Arslan <iorixxx@yahoo.com.INVALID> wrote:
> 
> Hi Martin,
> 
> fq means filter query. Maybe you want to use the qf (query fields) parameter of edismax?
> 
> 
> 
> On Wednesday, March 25, 2015 9:23 PM, Martin Wunderlich <martin_wu@gmx.net> wrote:
> Hi all, 
> 
> I am wondering what the process is for applying Tokenizers and Filters (as defined in
> the FieldType definition) to field contents that result from CopyFields. To be more specific,
> in my Solr instance, I would like to support query expansion by two means: removing stop words
> and adding inflected word forms as synonyms.
> 
> To use a specific example, let’s say I have the following sentence to be indexed (from
> a Wittgenstein manuscript):
> 
> "Was zum Wesen der Welt gehört, kann die Sprache nicht ausdrücken."
> 
> 
> This sentence will be indexed in a field called "original" that is defined as follows:
> 
> <field name="original" type="text_windex_original" indexed="true" stored="true" required="true"/>
> 
>    <fieldType name="text_windex_original" class="solr.TextField" positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>      </analyzer>
>    </fieldType>
> 
> 
> Then, in order to create fields for the two types of query expansion, I have set up specific
fields for this: 
> 
> - one field where stopwords are removed both from the indexed content and from the query. So,
> if the user is searching for a phrase like "der Sprache", Solr should still find the
> segment above, because the determiners ("der" and "die") are removed prior to indexing
> and prior to querying, respectively. This field is defined as follows:
> 
> <field name="stopwords_removed" type="text_stopwords_removed" indexed="true" stored="true" required="true"/>
> 
>    <fieldType name="text_stopwords_removed" class="solr.TextField" positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" format="snowball"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" format="snowball"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>    </fieldType>
> 
> 
> - a second field where synonyms are added to the query so that more segments will be
> found. For instance, if the user is searching for the plural form "Sprachen", Solr should
> return the segment above, due to this entry in the synonyms file: "Sprache,Sprach,Sprachen".
> This field is defined as follows:
> 
> <field name="expanded" type="text_expanded" indexed="true" stored="true" required="true"/>
> 
>    <fieldType name="text_expanded" class="solr.TextField" positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" format="snowball"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" format="snowball"/>
>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms_de.txt" ignoreCase="true" expand="true"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>    </fieldType>
> 
> Finally, to avoid having to specify three fields with identical content in the import
> documents, I am defining the two fields for query expansion as copyFields:
> 
>  <copyField source="original" dest="stopwords_removed"/>
>  <copyField source="original" dest="expanded"/>
> 
> Now, my expectation would be as follows:
> - during import, two temporary fields are created by copying content from the original field
> - these two temporary fields are then pre-processed as per the definitions above
> - the pre-processed version of the text is added to the index
> - then, the user can search for "Sprache", "sprache", "Sprachen" or "der Sprache"
>   and will always get the segment above as a matching result.
> 
> However, what actually happens is that I get matches only for "Sprache" and "sprache".
> 
> The other thing that strikes me as odd is that when I restrict the search to one of the
> fields only, using the "fq" parameter, I get no results. For instance:
> http://localhost:8983/solr/windex/select?q=Sprache&fq=original&wt=json&indent=true
> 
> will return no matches. I would have expected that, using the fq parameter, the user can specify
> what type of search (s)he would like to carry out: a standard search (field original) or an
> expanded search (one of the other two fields).
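
A note on the request above: fq takes a query (typically of the form field:value), so fq=original is parsed as a search for the term "original" in the default field, not as a field selector. One possible way to let the user choose the type of search is to route each type to its own handler with a different qf. The sketch below assumes edismax and uses two hypothetical handler names; passing qf directly on each request would work just as well.

```xml
<!-- solrconfig.xml (sketch). Both handler names are made up for illustration. -->

<!-- standard search: only the unmodified field is queried -->
<requestHandler name="/search_standard" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">original</str>
  </lst>
</requestHandler>

<!-- expanded search: stopword-stripped and synonym-expanded fields -->
<requestHandler name="/search_expanded" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">stopwords_removed expanded</str>
  </lst>
</requestHandler>
```

Because the values sit under "defaults", a client could still override qf on an individual request.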
> 
> For debugging, I have checked the analysis and the results seem OK (posted below).
> Apologies for the long post, but I am really a bit stuck here (even after doing a lot
> of reading and googling). It is probably something simple that I am missing.
> Thanks a lot in advance for any help. 
> 
> Cheers, 
> 
> Martin
> 
> 
> ST  | Was | zum | Wesen | der | Welt | gehört | kann | die | Sprache | nicht | ausdrücken
> SF  | Was | zum | Wesen |     | Welt | gehört | kann | die | Sprache | nicht | ausdrücken
> LCF | was | zum | wesen |     | welt | gehört | kann | die | sprache | nicht | ausdrücken

