lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Unstemming after solr.PorterStemFilterFactory
Date Wed, 20 Jan 2010 16:54:46 GMT
Ah, OK. I take the "unnecessary" comment back. If you require
the original form of the tokens (not just the original text), then you
do have to do something to preserve them, so I think you're on
the right track....

FWIW
Erick

On Wed, Jan 20, 2010 at 9:38 AM, Bogdan Vatkov <bogdan.vatkov@gmail.com> wrote:

> Hi Erick,
>
> I think I realize that, and I am actually using this: I am using the
> stemmed, lowercased etc. tokens from the stored "term vectors", and
> additionally I am using the field values.
> But the field values differ from the tokens in their level of
> granularity.
> When I access the term vector for my field "body" I get the tokens:
> "old", "man", "sea" (the rest is stopwords)
> while if I use document.getter methods for my field I get the value of the
> field "body", which is:
> "The Old Man and the Sea"
> But what I actually need is the original version of the tokens and not
> the field value itself. In that example I need:
> "Old", "Man", "Sea"
> and not
> "The Old Man and the Sea"
> That is why I had to write my own version of that filter, so that during
> token transformation (stemming, lowercasing) I store a map from the
> filtered term to the original term.
>
> I am using Apache Mahout to read from the Solr index (term vectors) and
> cluster Solr documents based on these terms (tokens). The clustering
> process itself works with the stemmed, lowercased terms, while at the end
> I want to present the original terms. The only way I found is to use this
> stemmed-term-to-original-token map, which I build during stemming.
> Am I missing some existing method to access stored tokens before they get
> stemmed?
>
> Best regards,
> Bogdan
>
> On Wed, Jan 20, 2010 at 2:39 AM, Erick Erickson <erickerickson@gmail.com> wrote:
>
> > This is completely unnecessary. Fields can be both indexed and
> > stored, and the operations are orthogonal.
> >
> > That is, when you specify that a field is indexed, it is run through
> > an analyzer and the *tokens* are indexed, after any
> > stemming, casing, etc.
> >
> > Stored means that the original value, before any analysis
> > whatsoever, is put in a completely separate location.
> > It's only there for retrieval and display to the user. It's as if
> > a copy of the original text was put into one place, and the
> > tokens were put in another.
> >
> > Consider the problem of book titles. If I have a title "The Old
> > Man and the Sea", I want to display that title as a result of
> > searching for "old sea man". Rather than force the separate
> > storage to be done programmatically, Solr allows you to
> > specify these two options. So if I specify indexing and storing,
> > the tokens "old" "man" "sea" (assuming lowercasing,
> > stopword removal, etc) are added to the searchable index.
> > "The Old Man and the Sea" is copied somewhere else, and
> > when you ask for the *value* of the field, you get "The Old Man
> > and the Sea". This stored part of the index is never searched; it
> > is solely there for retrieval/display.
> >
> > I'd really get a copy of the book, it'll save you lots of time and
> > effort.
> >
> > HTH
> > Erick
> >
> > On Tue, Jan 19, 2010 at 5:45 PM, Bogdan Vatkov <bogdan.vatkov@gmail.com> wrote:
> >
> > > I am using fields like:
> > >  <field name="msg_body" type="body_text" termVectors="true"
> > >         indexed="true" stored="true"/>
> > > which contain multi-line text, not just single strings. What does
> > > "stored values" mean?
> > > I am relatively new to Solr.
> > >
> > > I solved my issue by copying the SnowballPorterFilterFactory class and
> > > extending it into a new class,
> > > SnowballPorterWithUnstemLowerCaseFilterFactory.
> > > I added lowercasing inside the factory because I need to capture the
> > > original terms, store them in a side file, and only then lowercase
> > > and stem.
> > >
> > >    <fieldType name="body_text" class="solr.TextField"
> > > positionIncrementGap="100">
> > >      <analyzer type="index">
> > >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >        <filter class="solr.StopFilterFactory"
> > >                ignoreCase="true"
> > >                words="stopwords.txt"
> > >                enablePositionIncrements="true"
> > >                />
> > >        <filter class="solr.WordDelimiterFilterFactory"
> > >                generateWordParts="1" generateNumberParts="1"
> > >                catenateWords="1" catenateNumbers="1" catenateAll="0"
> > >                splitOnCaseChange="1"/>
> > > <!--        <filter class="solr.LowerCaseFilterFactory"/> -->
> > > <!--        <filter class="solr.SnowballPorterFilterFactory"
> > >                     language="English" protected="protwords.txt"/> -->
> > >        <filter
> > >            class="org.bogdan.solr.analysis.SnowballPorterWithUnstemLowerCaseFilterFactory"
> > >            language="English" protected="protwords.txt"
> > >            unstemmed="unstemmed.txt"/>
> > >      </analyzer>
> > >
> > > I was wondering if there is an easier way (without writing a custom
> > > filter like the one I did).
> > >
> > > Best regards,
> > > Bogdan
> > >
> > > On Wed, Jan 20, 2010 at 12:38 AM, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
> > >
> > > > Bogdan,
> > > >
> > > > You can get them from stored values of your fields, if you are
> > > > storing them.
> > > >
> > > > Otis
> > > > --
> > > > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> > > >
> > > >
> > > >
> > > > ----- Original Message ----
> > > > > From: Bogdan Vatkov <bogdan.vatkov@gmail.com>
> > > > > To: solr-user@lucene.apache.org
> > > > > Sent: Tue, January 19, 2010 5:28:51 PM
> > > > > Subject: Unstemming after solr.PorterStemFilterFactory
> > > > >
> > > > > Hi,
> > > > >
> > > > > > I am indexing with the solr.PorterStemFilterFactory included, but
> > > > > > then I need to access the unstemmed versions of the terms. What
> > > > > > would be the easiest way to get the unstemmed version?
> > > > > Thanks in advance.
> > > > >
> > > > > Best regards,
> > > > > Bogdan
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Bogdan
> > > >
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > > Bogdan
> > >
> >
>
>
>
> --
> Best regards,
> Bogdan
>

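A note for readers following the thread: the approach Bogdan describes, recording each original token alongside its stemmed, lowercased form during analysis, can be sketched in plain Java without any Solr or Lucene classes. This is only an illustrative sketch under stated assumptions, not the actual SnowballPorterWithUnstemLowerCaseFilterFactory. The class name UnstemMapSketch, the methods normalize and buildUnstemMap, and the tiny stopword set are all hypothetical, and normalize is a crude stand-in (lowercasing plus stripping a trailing "s"), not the Porter algorithm.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class UnstemMapSketch {

    // Hypothetical stand-in for the real analysis chain (lowercasing + stemming).
    // It lowercases and strips one trailing "s"; it is NOT the Porter stemmer.
    static String normalize(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        if (lower.length() > 3 && lower.endsWith("s")) {
            return lower.substring(0, lower.length() - 1);
        }
        return lower;
    }

    // Builds a map from each normalized term to the original tokens that
    // produced it, skipping stopwords (compared case-insensitively).
    static Map<String, Set<String>> buildUnstemMap(
            String text, Set<String> stopwords, Function<String, String> normalizer) {
        Map<String, Set<String>> map = new LinkedHashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty() || stopwords.contains(token.toLowerCase(Locale.ROOT))) {
                continue;
            }
            map.computeIfAbsent(normalizer.apply(token), k -> new LinkedHashSet<>())
               .add(token);
        }
        return map;
    }

    public static void main(String[] args) {
        // Assumed stopword list, just enough for the example title.
        Set<String> stop = new HashSet<>(Arrays.asList("the", "and"));
        Map<String, Set<String>> m =
                buildUnstemMap("The Old Man and the Sea", stop, UnstemMapSketch::normalize);
        System.out.println(m); // prints {old=[Old], man=[Man], sea=[Sea]}
    }
}
```

The keys are the normalized terms a clustering step would work with, and the values are the surface forms to show at the end. The values are sets because several original tokens (for example "Sea" and "Seas") can collapse to the same normalized term.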