lucene-solr-user mailing list archives

From Bogdan Vatkov <bogdan.vat...@gmail.com>
Subject Re: Unstemming after solr.PorterStemFilterFactory
Date Wed, 20 Jan 2010 17:43:05 GMT
Thanks! It is good to know I did not do something in vain :)

On Wed, Jan 20, 2010 at 6:54 PM, Erick Erickson <erickerickson@gmail.com> wrote:

> Ah, OK. I take the "unnecessary" comment back. If you require
> the original form of the tokens (not just the original text), then you
> do have to do something to preserve them, so I think you're on
> the right track....
>
> FWIW
> Erick
>
> On Wed, Jan 20, 2010 at 9:38 AM, Bogdan Vatkov <bogdan.vatkov@gmail.com> wrote:
>
> > Hi Erick,
> >
> > I think I realize that, and I am actually using this: I am using the
> > stemmed, cased, etc. tokens from the stored "term vectors", and
> > additionally I am using the field values.
> > But the field values differ from the tokens in their level of
> > granularity.
> > When I access the term vector for my field "body" I get the tokens:
> > "old", "man", "sea" (the rest are stopwords)
> > while if I use the document getter methods for my field I get the value
> > of the field "body", which is:
> > "The Old Man and the Sea"
> > But what I actually need is the original version of the tokens and not
> > the field value itself. In that example I need:
> > "Old", "Man", "Sea"
> > and not
> > "The Old Man and the Sea"
> > That is why I had to write my own version of that filter, so that during
> > token transformation (stemming, lowercasing) I store a map of filtered
> > term to original term.
> >
> > I am using Apache Mahout to read from the Solr index (term vectors) and
> > cluster Solr documents based on these terms (tokens). The clustering
> > process itself works with the stemmed, lowercased terms, while at the end
> > I want to present the original terms, and the only way I found is by
> > using this stemmed-term-to-original-token map, which I build during
> > stemming.
> > Am I missing some existing method to access stored tokens before they get
> > stemmed?
> >
> > Best regards,
> > Bogdan
> >
> > On Wed, Jan 20, 2010 at 2:39 AM, Erick Erickson <erickerickson@gmail.com> wrote:
> >
> > > This is completely unnecessary. Fields can be both indexed and
> > > stored, and the operations are orthogonal.
> > >
> > > That is, when you specify that a field is indexed, it is run through
> > > an analyzer and the *tokens* are indexed, after any
> > > stemming, casing, etc.
> > >
> > > Stored means that the original value, before any analysis
> > > whatsoever, is put in a completely separate location.
> > > It's only there for retrieval and display to the user. It's as if
> > > a copy of the original text was put into one place, and the
> > > tokens were put in another.
> > >
> > > Consider the problem of book titles. If I have a title "The Old
> > > Man and the Sea", I want to display that title as a result of
> > > searching for "old sea man". Rather than force the separate
> > > storage to be done programmatically, SOLR allows you to
> > > specify these two options. So if I specify indexing and storing,
> > > the tokens "old" "man" "sea" (assuming lowercasing,
> > > stopword removal, etc) are added to the searchable index.
> > > "The Old Man and the Sea" is copied somewhere else, and
> > > when you ask for the *value* of the field, you get "The Old Man
> > > > and the Sea". This stored part of the index is never searched; it
> > > > is solely there for retrieval/display.
> > >
> > > I'd really get a copy of the book, it'll save you lots of time and
> > > effort.
> > >
> > > HTH
> > > Erick
> > >
> > > > On Tue, Jan 19, 2010 at 5:45 PM, Bogdan Vatkov <bogdan.vatkov@gmail.com> wrote:
> > >
> > > > > I am using fields like:
> > > > >  <field name="msg_body" type="body_text" termVectors="true"
> > > > >         indexed="true" stored="true"/>
> > > > > which contain multi-line text, not just single strings. What does
> > > > > "stored values" mean?
> > > > > I am relatively new to Solr.
> > > >
> > > > > I solved my issue by copying and enhancing the
> > > > > SnowballPorterFilterFactory class to create
> > > > > SnowballPorterWithUnstemLowerCaseFilterFactory.
> > > > > I added lowercasing inside the factory since I need to capture the
> > > > > original terms, store them in a side file, and only then lowercase
> > > > > and stem.
> > > >
> > > > >    <fieldType name="body_text" class="solr.TextField"
> > > > >               positionIncrementGap="100">
> > > > >      <analyzer type="index">
> > > > >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > > >        <filter class="solr.StopFilterFactory"
> > > > >                ignoreCase="true"
> > > > >                words="stopwords.txt"
> > > > >                enablePositionIncrements="true"
> > > > >                />
> > > > >        <filter class="solr.WordDelimiterFilterFactory"
> > > > >                generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > > > >                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > > > >        <!-- <filter class="solr.LowerCaseFilterFactory"/> -->
> > > > >        <!-- <filter class="solr.SnowballPorterFilterFactory"
> > > > >                     language="English" protected="protwords.txt"/> -->
> > > > >        <filter class="org.bogdan.solr.analysis.SnowballPorterWithUnstemLowerCaseFilterFactory"
> > > > >                language="English" protected="protwords.txt"
> > > > >                unstemmed="unstemmed.txt"/>
> > > > >      </analyzer>
> > > >
> > > > > I was wondering if there is an easier way (without doing this custom
> > > > > filter that I did).
> > > >
> > > > Best regards,
> > > > Bogdan
> > > >
> > > > On Wed, Jan 20, 2010 at 12:38 AM, Otis Gospodnetic <
> > > > otis_gospodnetic@yahoo.com> wrote:
> > > >
> > > > > Bogdan,
> > > > >
> > > > > > You can get them from stored values of your fields, if you are
> > > > > > storing them.
> > > > >
> > > > > Otis
> > > > > --
> > > > > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> > > > >
> > > > >
> > > > >
> > > > > ----- Original Message ----
> > > > > > From: Bogdan Vatkov <bogdan.vatkov@gmail.com>
> > > > > > To: solr-user@lucene.apache.org
> > > > > > Sent: Tue, January 19, 2010 5:28:51 PM
> > > > > > Subject: Unstemming after solr.PorterStemFilterFactory
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > > I am indexing with the solr.PorterStemFilterFactory included, but
> > > > > > > then I need to access the unstemmed versions of the terms. What
> > > > > > > would be the easiest way to get the unstemmed version?
> > > > > > Thanks in advance.
> > > > > >
> > > > > > Best regards,
> > > > > > Bogdan
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > > Bogdan
> > > > >
> > > > >
> > > >
> > > >
> > >
> >
> >
>



-- 
Best regards,
Bogdan
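
For readers of the archive, the stemmed-term-to-original-token map discussed in this thread can be sketched roughly as follows. This is only an illustration and not the actual SnowballPorterWithUnstemLowerCaseFilterFactory: the class name UnstemMapSketch, the tiny stop list, and the toyStem method (which merely strips a trailing "s" instead of running the Porter/Snowball stemmer) are all hypothetical, and the map is kept in memory rather than written to a side file such as unstemmed.txt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class UnstemMapSketch {
    // Hypothetical stop list; a real setup would read stopwords.txt.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "and", "a", "of"));

    // Stemmed term -> every original surface form that produced it.
    private final Map<String, Set<String>> unstemmed = new LinkedHashMap<>();

    // Toy stand-in for a real stemmer (NOT Porter): drops one trailing "s".
    static String toyStem(String term) {
        return term.length() > 3 && term.endsWith("s")
                ? term.substring(0, term.length() - 1)
                : term;
    }

    // Splits on whitespace, drops stopwords, lowercases and stems each token,
    // and records the original form before it is transformed.
    public List<String> analyze(String text) {
        List<String> indexed = new ArrayList<>();
        for (String original : text.trim().split("\\s+")) {
            if (original.isEmpty()) {
                continue;
            }
            String lower = original.toLowerCase(Locale.ROOT);
            if (STOPWORDS.contains(lower)) {
                continue;
            }
            String stemmed = toyStem(lower);
            unstemmed.computeIfAbsent(stemmed, k -> new LinkedHashSet<>()).add(original);
            indexed.add(stemmed);
        }
        return indexed;
    }

    // Looks up the original forms recorded for a stemmed term.
    public Set<String> originalsOf(String stemmed) {
        return unstemmed.getOrDefault(stemmed, Collections.emptySet());
    }

    public static void main(String[] args) {
        UnstemMapSketch sketch = new UnstemMapSketch();
        System.out.println(sketch.analyze("The Old Man and the Sea")); // [old, man, sea]
        System.out.println(sketch.originalsOf("old"));                 // [Old]
    }
}
```

One stemmed term can map to several original forms, which is why the sketch keeps a set per term; a clustering front end would then have to pick one of them (for example the most frequent) for display.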
