lucene-solr-user mailing list archives

From Bogdan Vatkov <bogdan.vat...@gmail.com>
Subject Re: Unstemming after solr.PorterStemFilterFactory
Date Wed, 20 Jan 2010 14:38:22 GMT
Hi Erick,

I understand that, and I am already using both: the stemmed, lowercased
tokens from the stored term vectors, and the stored field values.
But the field values and the tokens differ in granularity.
When I access the term vector for my field "body" I get the tokens:
"old", "man", "sea" (the rest are stopwords)
while if I use the document getter methods for that field I get the value of
the field "body", which is:
"The Old Man and the Sea"
What I actually need is the original form of each token, not the
field value itself. In this example I need:
"Old", "Man", "Sea"
and not
"The Old Man and the Sea"
That is why I wrote my own version of the filter: during token
transformation (lowercasing, stemming) it stores a map of filtered term
-to- original term.
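
A minimal sketch of that idea in plain Java, with no Lucene or Solr
dependency. The class name, the normalize() helper and the naive
"strip a trailing s" rule are illustrative assumptions standing in for the
real Snowball/Porter chain; this is not the actual custom filter factory.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class UnstemMapSketch {

    // Hypothetical stand-in for the analysis chain: lowercases and strips a
    // trailing "s" from longer words. A real setup would stem with Snowball here.
    static String normalize(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        if (lower.length() > 3 && lower.endsWith("s")) {
            return lower.substring(0, lower.length() - 1);
        }
        return lower;
    }

    // normalized term -> (original surface form -> occurrence count)
    private final Map<String, Map<String, Integer>> originals = new LinkedHashMap<>();

    // Returns the normalized terms (what would be indexed) and records each
    // original token on the side, before it is lowercased and stemmed.
    List<String> analyze(String text, Set<String> stopwords) {
        List<String> terms = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty() || stopwords.contains(token.toLowerCase(Locale.ROOT))) {
                continue;
            }
            String term = normalize(token);
            originals.computeIfAbsent(term, k -> new LinkedHashMap<>())
                     .merge(token, 1, Integer::sum);
            terms.add(term);
        }
        return terms;
    }

    // Most frequent original form of a term; the first one seen wins ties.
    // Falls back to the term itself when nothing was recorded for it.
    String display(String term) {
        Map<String, Integer> forms = originals.get(term);
        if (forms == null) {
            return term;
        }
        String best = term;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : forms.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        UnstemMapSketch sketch = new UnstemMapSketch();
        List<String> terms = sketch.analyze("The Old Man and the Sea", Set.of("the", "and"));
        for (String term : terms) {
            System.out.println(term + " -> " + sketch.display(term));
        }
    }
}
```

Keeping a count per original form is one possible design choice: several
surface forms can collapse to the same stem, so something has to decide which
one is shown for a cluster label.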

I am using Apache Mahout to read term vectors from the Solr index and cluster
Solr documents based on these terms (tokens). The clustering process
itself works with the stemmed, lowercased terms, but at the end I want to
present the original terms, and the only way I have found is this
stemmed-term-to-original-token map, which I build during stemming.
Am I missing some existing way to access the tokens as they were before
stemming?

Best regards,
Bogdan

On Wed, Jan 20, 2010 at 2:39 AM, Erick Erickson <erickerickson@gmail.com> wrote:

> This is completely unnecessary. Fields can be both indexed and
> stored, and the operations are orthogonal.
>
> That is, when you specify that a field is indexed, it is run through
> an analyzer and the *tokens* are indexed, after any
> stemming, casing, etc.
>
> Stored means that the original value, before any analysis
> whatsoever, is put in a completely separate location.
> It's only there for retrieval and display to the user. It's as if
> a copy of the original text was put into one place, and the
> tokens were put in another.
>
> Consider the problem of book titles. If I have a title "The Old
> Man and the Sea", I want to display that title as a result of
> searching for "old sea man". Rather than force the separate
> storage to be done programmatically, SOLR allows you to
> specify these two options. So if I specify indexing and storing,
> the tokens "old" "man" "sea" (assuming lowercasing,
> stopword removal, etc) are added to the searchable index.
> "The Old Man and the Sea" is copied somewhere else, and
> when you ask for the *value* of the field, you get "The Old Man
> and the Sea". This stored part of the index is never searched, it
> is solely there for retrieval/display.
>
> I'd really get a copy of the book, it'll save you lots of time and
> effort.
>
> HTH
> Erick
>
> On Tue, Jan 19, 2010 at 5:45 PM, Bogdan Vatkov <bogdan.vatkov@gmail.com> wrote:
>
> > I am using fields like:
> >  <field name="msg_body" type="body_text" termVectors="true"
> >         indexed="true" stored="true"/>
> > which contain multi-line text, not just single strings, what does "stored
> > values" mean?
> > I am relatively new to Solr
> >
> > I solved my issue by copy/pasting and enhancing
> > the SnowballPorterFilterFactory class by
> > creating SnowballPorterWithUnstemLowerCaseFilterFactory.
> > I added lowercasing inside the factory since I need to capture the
> > original terms, store them in a side file, and only then lowercase and stem.
> >
> >    <fieldType name="body_text" class="solr.TextField" positionIncrementGap="100">
> >      <analyzer type="index">
> >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >        <filter class="solr.StopFilterFactory"
> >                ignoreCase="true"
> >                words="stopwords.txt"
> >                enablePositionIncrements="true"
> >                />
> >        <filter class="solr.WordDelimiterFilterFactory"
> >                generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > <!--        <filter class="solr.LowerCaseFilterFactory"/> -->
> > <!--        <filter class="solr.SnowballPorterFilterFactory"
> >                language="English" protected="protwords.txt"/> -->
> >        <filter class="org.bogdan.solr.analysis.SnowballPorterWithUnstemLowerCaseFilterFactory"
> >                language="English" protected="protwords.txt" unstemmed="unstemmed.txt"/>
> >      </analyzer>
> >
> > I was wondering if there is an easier way (without doing this custom
> > filter that I did).
> >
> > Best regards,
> > Bogdan
> >
> > On Wed, Jan 20, 2010 at 12:38 AM, Otis Gospodnetic <
> > otis_gospodnetic@yahoo.com> wrote:
> >
> > > Bogdan,
> > >
> > > You can get them from stored values of your fields, if you are storing
> > > them.
> > >
> > > Otis
> > > --
> > > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> > >
> > >
> > >
> > > ----- Original Message ----
> > > > From: Bogdan Vatkov <bogdan.vatkov@gmail.com>
> > > > To: solr-user@lucene.apache.org
> > > > Sent: Tue, January 19, 2010 5:28:51 PM
> > > > Subject: Unstemming after solr.PorterStemFilterFactory
> > > >
> > > > Hi,
> > > >
> > > > I am indexing with the solr.PorterStemFilterFactory included but then I
> > > > need to access the unstemmed versions of the terms, what would be the
> > > > easiest way to get the unstemmed version?
> > > > Thanks in advance.
> > > >
> > > > Best regards,
> > > > Bogdan