lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Unstemming after solr.PorterStemFilterFactory
Date Wed, 20 Jan 2010 00:39:07 GMT
This is completely unnecessary. Fields can be both indexed and
stored, and the operations are orthogonal.

That is, when you specify that a field is indexed, its value is run
through an analyzer and the resulting *tokens* are indexed, after any
stemming, lowercasing, etc.

Stored means that the original value, before any analysis
whatsoever, is put in a completely separate location.
It's only there for retrieval and display to the user. It's as if
a copy of the original text were put into one place, and the
tokens were put in another.

Consider the problem of book titles. If I have a title "The Old
Man and the Sea", I want to display that title as a result of
searching for "old sea man". Rather than force the separate
storage to be done programmatically, Solr allows you to
specify these two options. So if I specify indexing and storing,
the tokens "old" "man" "sea" (assuming lowercasing,
stopword removal, etc) are added to the searchable index.
"The Old Man and the Sea" is copied somewhere else, and
when you ask for the *value* of the field, you get "The Old Man
and the Sea". This stored part of the index is never searched; it
is there solely for retrieval/display.
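
As a rough sketch of the title example, the schema.xml entries might look
something like the following. The field name "title" and the type name
"title_text" are made up for illustration, and the stopword removal assumes
that stopwords.txt contains "the" and "and".

```xml
<!-- Hypothetical sketch: the names "title" and "title_text" are illustrative
     and do not come from the schema quoted in this thread. -->

<!-- Goes in the types section of schema.xml. The analyzer only affects
     what is indexed (the searchable tokens). -->
<fieldType name="title_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Assumes stopwords.txt lists "the" and "and". -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Goes in the fields section. indexed="true" makes the analyzed tokens
     searchable; stored="true" keeps the untouched original for retrieval. -->
<field name="title" type="title_text" indexed="true" stored="true"/>
```

With stored="false" the field would still be searchable, but asking for its
value in the results would return nothing for it.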

I'd really get a copy of the book; it'll save you lots of time and
effort.

HTH
Erick

On Tue, Jan 19, 2010 at 5:45 PM, Bogdan Vatkov <bogdan.vatkov@gmail.com> wrote:

> I am using fields like:
>  <field name="msg_body" type="body_text" termVectors="true" indexed="true" stored="true"/>
> which contain multi-line text, not just single strings. What does "stored
> values" mean?
> I am relatively new to Solr.
>
> I solved my issue by copying and extending
> the SnowballPorterFilterFactory class to
> create SnowballPorterWithUnstemLowerCaseFilterFactory.
> I added lowercasing inside the factory since I need to capture the original
> terms, store them in a side file, and only then lowercase and stem.
>
>    <fieldType name="body_text" class="solr.TextField" positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.StopFilterFactory"
>                ignoreCase="true"
>                words="stopwords.txt"
>                enablePositionIncrements="true"
>                />
>        <filter class="solr.WordDelimiterFilterFactory"
>                generateWordParts="1" generateNumberParts="1" catenateWords="1"
>                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> <!--        <filter class="solr.LowerCaseFilterFactory"/> -->
> <!--        <filter class="solr.SnowballPorterFilterFactory"
>             language="English" protected="protwords.txt"/> -->
>        <filter class="org.bogdan.solr.analysis.SnowballPorterWithUnstemLowerCaseFilterFactory"
>                language="English" protected="protwords.txt" unstemmed="unstemmed.txt"/>
>      </analyzer>
>
> I was wondering if there is an easier way (without doing this custom filter
> that I did).
>
> Best regards,
> Bogdan
>
> On Wed, Jan 20, 2010 at 12:38 AM, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
>
> > Bogdan,
> >
> > You can get them from stored values of your fields, if you are storing
> > them.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> >
> >
> >
> > ----- Original Message ----
> > > From: Bogdan Vatkov <bogdan.vatkov@gmail.com>
> > > To: solr-user@lucene.apache.org
> > > Sent: Tue, January 19, 2010 5:28:51 PM
> > > Subject: Unstemming after solr.PorterStemFilterFactory
> > >
> > > Hi,
> > >
> > > I am indexing with the solr.PorterStemFilterFactory included but then I need
> > > to access the unstemmed versions of the terms, what would be the easiest way
> > > to get the unstemmed version?
> > > Thanks in advance.
> > >
> > > Best regards,
> > > Bogdan
> > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > > Bogdan
> >
> >
>
>
> --
> Best regards,
> Bogdan
>
