lucene-solr-user mailing list archives

From Bogdan Vatkov <bogdan.vat...@gmail.com>
Subject Re: Stopwords not working as expected
Date Sun, 03 Jan 2010 02:31:48 GMT
@Mahout experts: could you please elaborate on that?
It seems that I am successfully stopping quite a few words with the stopwords
mechanism in Solr (I do not get search results when querying for stopwords
through the localhost/solr/select interface), but this somehow is not effective
when the Solr index gets converted to vectors by the
org.apache.mahout.utils.vectors.lucene.Driver class.
As a result I get clusters which contain (and are even mainly driven by) the
stopwords...
I am still not an expert in reading from a Lucene index - is it possible that
the vector generation does some "raw" reading of the Solr/Lucene index and
thus picks up the stopwords?
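
To make the question concrete, here is a toy sketch of the stored-versus-indexed
distinction described in the quoted reply below. It is plain Python, not the
Lucene, Solr, or Mahout API; the class, the function names, and the three-word
stopword set are all made up for illustration. It only shows how a consumer that
re-tokenizes the stored text could see stopwords that the index itself never
contains; it says nothing about what the Mahout Driver actually reads.

```python
# Toy model only: this is NOT the Lucene, Solr, or Mahout API.
STOPWORDS = {"the", "a", "of"}  # hypothetical stand-in for stopwords.txt


def analyze(text, stopwords=STOPWORDS):
    """Lowercase, split on whitespace, and drop stopwords.

    A crude stand-in for the analysis stack of the "text" field type.
    """
    return [t for t in text.lower().split() if t not in stopwords]


class ToyIndex:
    def __init__(self):
        self.stored = {}    # doc id -> original text, exactly as sent in
        self.postings = {}  # term -> set of doc ids, built from analyzed tokens

    def add(self, doc_id, text):
        # The stored copy keeps the raw text, stopwords included.
        self.stored[doc_id] = text
        # The indexed side only ever sees the analyzed tokens.
        for term in analyze(text):
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, word):
        return sorted(self.postings.get(word.lower(), set()))

    def indexed_terms(self):
        return sorted(self.postings)


index = ToyIndex()
index.add(1, "The history of the Lucene index")

# Searching for a stopword finds nothing, matching the behaviour seen via /select.
stopword_hits = index.search("the")

# Terms read from the indexed side contain no stopwords.
terms_from_index = index.indexed_terms()

# Terms rebuilt from the stored text without the analyzer still contain them.
terms_from_stored = sorted(set(index.stored[1].lower().split()))
```

In this toy model, vectors built from terms_from_index would be free of
stopwords, while vectors built from terms_from_stored would not.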

Best regards,
Bogdan

On Sun, Jan 3, 2010 at 3:51 AM, Lance Norskog <goksron@gmail.com> wrote:

> Fields are both stored and indexed. The stored copy is exactly what
> you sent in. The index is built with the "text" type's analysis stack
> and is not stored. This output has the stopwords removed. The output
> is not stored in one place, but parts of it are scattered around the
> Lucene index data structures.  When you search for one of these
> stopwords, you should not get any documents.
>
> On Sat, Jan 2, 2010 at 5:20 PM, Bogdan Vatkov <bogdan.vatkov@gmail.com>
> wrote:
> > Hi,
> >
> > I am using a default (example) configuration of Solr and there the
> > stopwording seems to be enabled for both indexing and querying of fields of
> > type "text".
> > I have a custom field which is of the "text" type.
> > I have extended the stopwords.txt file with lots of words but when I index
> > some documents the index contains stopwords - I can see this with the Luke
> > tool.
> > Am I supposed to see these terms in the index after they are declared in the
> > stopwords.txt file?
> > What could be wrong?
> >
> > Best regards,
> > Bogdan
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>



-- 
Best regards,
Bogdan
