lucene-solr-user mailing list archives

From M W <davidtw...@gmail.com>
Subject SynonymFilterFactory and Punctuation
Date Wed, 20 Mar 2013 22:52:48 GMT
I have been reading threads all day regarding this topic and nothing
seems to work the way it says it should. :)  I appreciate any and all
help in this matter.

Solr 4 is working perfectly for me in all regards, with this one exception.

My requirement for Solr 4 is very simple.  I am storing a document,
such as a job description, in a text_general field.

I have added a SynonymFilterFactory filter so that I can map C++
=> cplusplus and c# => csharp during indexing and querying.

Here is the field definition:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="punctuation-whitelist.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="punctuation-whitelist.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Here are the contents of punctuation-whitelist.txt:

c++ => cplusplus
C# => csharp

I have only one document indexed for this test. When I search for
resume_text:C++, I get the following result, which is the same result
I get when I search for just resume_text:c.

You can see from the highlighting that Solr is matching on the "C" only.


<response>
	<lst name="responseHeader">
		<int name="status">0</int>
		<int name="QTime">20</int>
	</lst>
	<result name="response" numFound="1" start="0" maxScore="0.16273327">
		<doc>
			<arr name="resume_text">
				<str>C++ Developer with c# experience, including .net</str>
			</arr>
		</doc>
	</result>
	<lst name="highlighting">
		<lst name="208645">
			<arr name="resume_text">
				<str>&lt;em&gt;C&lt;/em&gt;++ Developer with &lt;em&gt;c&lt;/em&gt;# experience, including .net</str>
			</arr>
		</lst>
	</lst>
</response>

If I use the Analysis tool in the Solr Web UI, putting "C#" or "C++"
into the Index or Query boxes translates to just "C" in all filters
and tokenizers in the analysis output.
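
The Analysis output above is consistent with StandardTokenizerFactory
discarding "+" and "#" before SynonymFilterFactory ever sees them, so the
synonym rules never match. A minimal, untested sketch of one possible
direction, assuming that is the cause: rewrite the terms with a
MappingCharFilterFactory, which runs before the tokenizer. The field type
name text_general_punct and the file name mapping-punctuation.txt are
hypothetical, and only the index analyzer is shown (the query analyzer
would need the same charFilter).

```xml
<!-- Sketch only. mapping-punctuation.txt is a hypothetical file that
     would sit next to schema.xml and contain, for example:
       "C++" => "cplusplus"
       "c++" => "cplusplus"
       "C#"  => "csharp"
       "c#"  => "csharp"
     The mapping is case sensitive, hence both spellings. -->
<fieldType name="text_general_punct" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Char filters run on the raw text, before the tokenizer strips punctuation -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-punctuation.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this arrangement the SynonymFilterFactory entry would no longer be
needed for these two terms, since the text reaches the tokenizer already
rewritten. Existing documents would have to be reindexed for the change to
take effect.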

Can someone please explain the _best_ way to accomplish what I am
trying to do, which is to accurately index, search, and highlight text
containing terms like C++ and C#?  I am looking for the "right way", and
it's okay if I have started down the wrong path.

:)

Thank you.
Dave
