lucene-java-user mailing list archives

From Tommaso Teofili <tommaso.teof...@gmail.com>
Subject Re: POS tagging in Lucene
Date Wed, 19 Oct 2016 10:04:45 GMT
I think it might be helpful to handle POS tags as TypeAttributes, so that
the input and output texts would be cleaner and you could still filter and
retrieve tokens by type (e.g. with TypeTokenFilter).
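
To make the idea concrete, here is a rough plain-Java sketch. It is not
Lucene code: Token, parse, dropStopWords and keepTypes are made-up names for
illustration, and the stop-word list is an assumption. In Lucene the term
would presumably live in the term attribute and the tag in the
TypeAttribute, with TypeTokenFilter doing the type-based filtering. The point
is only that once the tag is carried next to the term instead of inside it,
ordinary term handling (lowercasing, stop words) works on clean text again.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration only; no Lucene classes are used here.
class PosTokenSketch {

    // Hypothetical stand-in for a token that carries a term plus a type.
    static final class Token {
        final String term;
        final String type;

        Token(String term, String type) {
            this.term = term;
            this.type = type;
        }

        @Override
        public String toString() {
            return term + "/" + type;
        }
    }

    // Matches "word[TAG]": group 1 is the word, group 2 is the tag.
    private static final Pattern TAGGED = Pattern.compile("(.+)\\[([^\\[\\]]+)\\]");

    // Splits on whitespace and moves each tag out of the term text.
    static List<Token> parse(String taggedText) {
        List<Token> tokens = new ArrayList<>();
        for (String raw : taggedText.trim().split("\\s+")) {
            Matcher m = TAGGED.matcher(raw);
            if (!m.matches()) {
                continue; // skip anything that carries no tag
            }
            String term = m.group(1).toLowerCase(Locale.ROOT);
            if (term.endsWith("'s")) {
                term = term.substring(0, term.length() - 2); // strip possessive
            }
            tokens.add(new Token(term, m.group(2)));
        }
        return tokens;
    }

    // Stop-word removal on the clean term; the tag plays no part here.
    static List<Token> dropStopWords(List<Token> tokens, Set<String> stopWords) {
        List<Token> kept = new ArrayList<>();
        for (Token t : tokens) {
            if (!stopWords.contains(t.term)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Keeps only tokens whose type is in the given set.
    static List<Token> keepTypes(List<Token> tokens, Set<String> types) {
        List<Token> kept = new ArrayList<>();
        for (Token t : tokens) {
            if (types.contains(t.type)) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String input = "I[PRP] love[VBP] Lucene's[NNP] library[NN] "
                + "It[PRP] is[VBZ] perfect[JJ]";
        List<Token> analysed = dropStopWords(parse(input), Set.of("i", "it", "is"));
        System.out.println(analysed);
        // prints [love/VBP, lucene/NNP, library/NN, perfect/JJ]
        System.out.println(keepTypes(analysed, Set.of("NN", "NNP")));
        // prints [lucene/NNP, library/NN]
    }
}
```

The first line of output corresponds to the desired output in the quoted
message below, with each tag held as a separate field rather than as part of
the token text.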

My 2 cents,
Tommaso


On Wed, 19 Oct 2016 at 11:56, Niki Pavlopoulou <niki@exonar.com> wrote:

> Hi Steve,
>
> thank you for your answer. I created a custom Lucene Analyser in the end.
> Just to clarify on what I mean, Lucene works perfectly for pure words, but
> since it does not support POS tagging some workaround needs to be done for
> the analysis of tokens with POS tags. For example:
>
> Input without POS tags: "I love Lucene's library. It is perfect."
> Output: List(love, lucene, library, perfect)
>
> Input with POS tags: "I[PRP] love[VBP] Lucene's[NNP] library[NN] It[PRP]
> is[VBZ] perfect[JJ]"
> Output: List(i[prp], love[vbp], lucene's[nnp], library[nn], it[prp],
> is[vbz], perfect[jj])
> *Desired output*: List(love[vbp], lucene[nnp], library[nn], perfect[jj])
>
> If one does the POS tagging after the analysis, then the tags might be
> wrong, because the original syntax has been lost. This is why the POS
> tagging needs to happen early on, with the analysis taking place afterwards.
>
> Regards,
> Niki.
>
> On 18 October 2016 at 19:59, Steve Rowe <sarowe@gmail.com> wrote:
>
> > Hi Niki,
> >
> > > On Oct 18, 2016, at 7:27 AM, Niki Pavlopoulou <niki@exonar.com> wrote:
> > >
> > > Hi all,
> > >
> > > I am using Lucene and OpenNLP for POS tagging. I would like to support
> > > bigrams with POS tags as well. For example, I would like something like
> > > this:
> > >
> > > Input: (I[PRP], am[VBP], using[VBG], Lucene[NNP])
> > > Output: (I[PRP] am[VBP], am[VBP] using[VBG], using[VBG] Lucene[NNP])
> > >
> > > The problem above is that I do not have "pure" tokens, like "I", "am",
> > > etc., so the analysis could be wrong if I add the POS tags as an input
> > > in Lucene. Is there a way to solve this, apart from creating my own
> > > custom Lucene analyser?
> >
> > To create your bigrams, check out ShingleFilter:
> > <http://lucene.apache.org/core/6_2_1/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html>
> >
> > I’m not sure what you mean by “the analysis could be wrong if I add the
> > POS tags as an input in Lucene” - can you give an example?
> >
> > You may be interested in the work-in-progress addition of OpenNLP
> > integration with Lucene here:
> > <https://issues.apache.org/jira/browse/LUCENE-2899>
> >
> > --
> > Steve
> > www.lucidworks.com
