lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "chris.b" <>
Subject Basic Named Entity Indexing
Date Wed, 12 Dec 2007 09:44:52 GMT

I'm not even sure if it can be considered Named Entity Recognition, but what
the hell...

so here's my problem...
I was asked to retrieve a the named entities out of a collection of
documents, and I've thought of two ways of doing so (not sure if either of
them work)...
a) index the documents by wrapping the whitespace analyzer with
ngramanalyzerwrapper and then retrieving only the words which have 3 or more
characters and start with a capital, filtering the "garbage" manually.
b) creating my own analyzer which will only index ngrams that start with
capital letters and then retrieving the indexed words.

which would be the best solution (i know that neither of them is very
correct, but it's supposed to be a quick fix :), and if it's the second one,
how would i go about creating my own analyzer? (i've read lucene in action
and it wasn't much help :s)

Thanks in advance, 
View this message in context:
Sent from the Lucene - Java Users mailing list archive at

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message