lucene-solr-user mailing list archives

From MitchK <mitc...@web.de>
Subject Re: Doing Shingle but also keep special single word
Date Mon, 23 Aug 2010 21:28:45 GMT

No, I mean that you use an additional indexed field for searching (e.g.
whitespace-tokenized, so that every whitespace-separated word becomes a
token).
So you have two fields (a shingle-token field and a single-token field),
and you can search across both of them.
This provides several benefits: for example, you can boost the shingle field
at query time, since a match in the shingle field means that an exact phrase
matched.
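
A rough sketch of how such a two-field setup might look in schema.xml. The
type and field names (text_shingle, text_single, content_shingle,
content_single) are made up for illustration, and the analyzer chains are only
one plausible choice:

```xml
<!-- Hypothetical type: whitespace tokens combined into 2-word shingles -->
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- outputUnigrams="false": single words are NOT kept in this field -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
  </analyzer>
</fieldType>

<!-- Hypothetical type: one token per whitespace-separated word -->
<fieldType name="text_single" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field names; the same text is indexed into both fields -->
<field name="content_single" type="text_single" indexed="true" stored="true"/>
<field name="content_shingle" type="text_shingle" indexed="true" stored="false"/>
<copyField source="content_single" dest="content_shingle"/>
```

With the dismax handler, a query-time boost of the shingle field could then be
written as qf=content_shingle^2.0 content_single (the boost value 2.0 is
arbitrary).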

Additionally: You can search with single-word-queries as well as
multi-word-queries.
Furthermore you can apply synonyms to your single-token-field. 
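
As a sketch of the synonym idea, the analyzer of the single-token field type
could include a SynonymFilterFactory. The type name text_single and the file
name synonyms.txt are assumptions:

```xml
<!-- Hypothetical single-token type with synonyms applied -->
<fieldType name="text_single" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- synonyms.txt is an assumed file name in the conf directory -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```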

If you want to keep your index as small as possible but as large as needed,
try to understand Lucene's Similarity implementation in order to decide
whether you can set the field options omitNorms="true" or
omitTermFreqAndPositions="true".
http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/search/Similarity.html
Keep in mind what you lose if you set one of those options.
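
Both options are attributes of a field (or field type) declaration in
schema.xml. A minimal sketch, with made-up field names and an assumed type
text_single:

```xml
<!-- omitNorms="true": no length normalization and no index-time boosts
     are stored for this field -->
<field name="content_single" type="text_single" indexed="true" stored="false"
       omitNorms="true"/>

<!-- omitTermFreqAndPositions="true": term frequencies and positions are not
     stored, so phrase queries on this field no longer work -->
<field name="keywords" type="text_single" indexed="true" stored="false"
       omitTermFreqAndPositions="true"/>
```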

A small example of the consequences of setting omitNorms="true":
doc1: "this is a short example doc"
doc2: "this is a longer example doc for presenting the effect of omitNorms"

If you search for "doc" with omitNorms=false, your response will look
like this:
doc1,
doc2
This is because the norm value for doc1 is larger than the norm value for
doc2, since doc1 is shorter than doc2 (have a look at the provided link).

If omitNorms=true, the scores for both docs will be equal, since "doc" occurs
exactly once in each of them.
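
To put numbers on the example: assuming Lucene's DefaultSimilarity, no
index-time boosts, and whitespace tokenization without stopword removal, doc1
has 6 terms and doc2 has 12 terms, so the length norms work out as:

```latex
\[
\mathrm{lengthNorm}(d) = \frac{1}{\sqrt{\mathrm{numTerms}(d)}}, \qquad
\mathrm{lengthNorm}(\mathrm{doc1}) = \frac{1}{\sqrt{6}} \approx 0.41, \qquad
\mathrm{lengthNorm}(\mathrm{doc2}) = \frac{1}{\sqrt{12}} \approx 0.29
\]
```

Norms are stored in a single byte per field and document, so the values
actually used at search time are only approximately these numbers.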

Kind regards,
- Mitch


scott chu wrote:
> 
> I don't quite understand the additional-field way. Do you mean making
> another field that stores the special words but is not indexed?
> 
> Scott
> 
> ----- Original Message ----- 
> From: "MitchK" <mitch91@web.de>
> To: <solr-user@lucene.apache.org>
> Sent: Sunday, August 22, 2010 11:48 PM
> Subject: Re: Doing Shingle but also keep special single word
> 
> 
>>
>> Hi,
>>
>> The keepword filter is no solution for this problem, since one would then
>> have to maintain a word dictionary. As explained, this would require too
>> much effort.
>>
>> You can easily add outputUnigrams=true and check analysis.jsp for
>> this field, so you can see how much bigger a single field will become
>> with this option.
>> However, I am quite sure that the difference between using
>> outputUnigrams=true and indexing into a separate field is negligible.
>>
>> I would suggest doing it the additional-field way, since this gives you
>> more flexibility in boosting the different fields.
>>
>> Unfortunately, I haven't understood your explanation of the use case.
>> But it sounds a little bit like tagging?
>>
>> Kind regards,
>> - Mitch
>>
>>
>> iorixxx wrote:
>>>
>>>> Won't setting outputUnigrams="true" make the index about twice as
>>>> large as when it's set to false?
>>>
>>> Sure, the index will be bigger. I didn't know that this is a problem for
>>> you. But if you have a list of special single words that you want to keep,
>>> KeepWordFilter can eliminate the other tokens, so the index size will be
>>> okay.
>>>
>>>>
>>>> Scott
>>>>
>>>> ----- Original Message -----
>>>> From: "Ahmet Arslan" <iorixxx@yahoo.com>
>>>> To: <solr-user@lucene.apache.org>
>>>> Sent: Saturday, August 21, 2010 1:15 AM
>>>> Subject: Re: Doing Shingle but also keep special single word
>>>>
>>>>
>>>> >> I am building an index with the Shingle filter. We know it's minimum
>>>> >> 2-gram, but I also want to keep some special single words, e.g. IBM,
>>>> >> Microsoft, etc. I.e. I want to do a minimum 2-gram but also want to
>>>> >> have these single words in my index. Is it possible?
>>>> >
>>>> > Doesn't the outputUnigrams="true" parameter work for you?
>>>> >
>>>> > After that you can apply <filter class="solr.KeepWordFilterFactory"
>>>> > words="keepwords.txt" ignoreCase="true"/> with keepwords.txt
>>>> > containing IBM and Microsoft.
>>>> >
>>>> >
>>>> >
>>>> >
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>> -- 
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/Doing-Shingle-but-also-keep-special-single-word-tp1241204p1276506.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
> 
> 
> 
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Doing-Shingle-but-also-keep-special-single-word-tp1241204p1300497.html
Sent from the Solr - User mailing list archive at Nabble.com.
