lucene-java-user mailing list archives

From Manjula Wijewickrema <manjul...@gmail.com>
Subject Re: bigram problem
Date Thu, 03 Jul 2014 04:06:09 GMT
Dear Parnab,

Thanks a lot for your guidance. I prefer to follow the second method, as I
have already indexed the bigrams using ShingleFilterWrapper. However, I have
no idea how to use NGramTokenizer here. Could you please write
one or two lines of code showing how to use NGramTokenizer for
bigrams?

Thanks,
Manjula.


On Wed, Jul 2, 2014 at 7:05 PM, parnab kumar <parnab.2007@gmail.com> wrote:

> TF is straightforward: you can count the number of occurrences in the
> doc by simple string matching. For IDF you need the total number of docs in
> the collection and the number of docs containing the bigram. reader.maxDoc()
> will give you the total number of docs in the collection. To count the
> docs containing the bigram, use a phrase query with the slop factor set to 0.
> The number of docs the IndexSearcher returns for that phrase query is
> the number of docs containing the bigram. I hope this is fine.
>
> Alternatively, use NGramTokenizer (with n=2 in your case) while
> indexing. In that case, each bigram can be interpreted as a normal Lucene
> term.
>
> Thanks,
> Parnab
>
>
> On Wed, Jul 2, 2014 at 8:45 AM, Manjula Wijewickrema <manjula53@gmail.com>
> wrote:
>
> > Hi,
> >
> > Could you please explain how to determine the tf-idf score for bigrams? My
> > program is able to index and search bigrams correctly, but it does not
> > calculate the tf-idf for bigrams. If someone can, please help me to resolve
> > this.
> >
> > Regards,
> > Manjula.
> >
>
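
The calculation described in the thread (count occurrences of the bigram in
one doc for TF, then combine the total number of docs with the number of docs
containing the bigram for IDF) can be sketched in plain Java without any
Lucene dependency. The class name BigramTfIdf, its method names, the
whitespace tokenization, and the ln(N / df) form of IDF are all assumptions
made for illustration; none of them come from the thread or from Lucene, and
Lucene's own similarity classes weight terms differently. Two points to keep
in mind when mapping this back to Lucene: NGramTokenizer produces character
n-grams, so word bigrams come from the shingle classes (ShingleFilter or
ShingleAnalyzerWrapper) instead, and reader.maxDoc() also counts deleted docs
that have not yet been merged away, whereas reader.numDocs() does not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Lucene.
class BigramTfIdf {

    // Lowercase the text, split on whitespace, and join adjacent words
    // into word bigrams such as "new york".
    static List<String> bigrams(String text) {
        List<String> result = new ArrayList<>();
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            result.add(words[i] + " " + words[i + 1]);
        }
        return result;
    }

    // Raw term frequency: how often the (lowercase) bigram occurs in one doc.
    static int tf(String bigram, String doc) {
        int count = 0;
        for (String b : bigrams(doc)) {
            if (b.equals(bigram)) {
                count++;
            }
        }
        return count;
    }

    // Document frequency: how many docs contain the bigram at least once.
    static int df(String bigram, List<String> docs) {
        int count = 0;
        for (String doc : docs) {
            if (bigrams(doc).contains(bigram)) {
                count++;
            }
        }
        return count;
    }

    // Assumed IDF form: ln(N / df), defined as 0 when no doc has the bigram.
    static double idf(String bigram, List<String> docs) {
        int docFreq = df(bigram, docs);
        return docFreq == 0 ? 0.0 : Math.log((double) docs.size() / docFreq);
    }

    // tf-idf of a bigram for one doc within the given collection.
    static double tfIdf(String bigram, String doc, List<String> docs) {
        return tf(bigram, doc) * idf(bigram, docs);
    }
}
```

If the bigrams are already indexed as single terms (for example via the
shingle classes), the two collection-level numbers used above should be
available directly from the index: reader.docFreq(new Term(field, bigram))
for the number of docs containing the bigram, and reader.numDocs() for the
collection size. The field name and the exact bigram string depend on how the
index was built.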
