lucene-java-user mailing list archives

From "Doron Cohen" <cdor...@gmail.com>
Subject Re: Basic Named Entity Indexing
Date Tue, 08 Jan 2008 20:51:41 GMT
Hi Chris,

The null pointer exception can be caused by not checking
newToken for null after this line (next() returns null once the
input is exhausted):
    Token newToken = input.next();

I think Hoss meant to call next() on the input as long as returned
tokens do not satisfy the check for being a named entity.

Also, this code assumes white space inside the token, which you won't
have since you are using a WhitespaceAnalyzer.
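
To illustrate the point, here is a small standalone sketch. It does not use Lucene; the tokenize() method is only a hypothetical stand-in that breaks text on Character.isWhitespace, which is how I understand whitespace tokenization to behave. It shows that no token coming out of such a split can contain white space, so the isWhitespace() test inside the filter can never succeed.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceSplitDemo {

    // Hypothetical stand-in for whitespace tokenization (not Lucene code):
    // a token is a maximal run of non-whitespace characters.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // White space ends the current token and is itself discarded.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // True if any character of s is white space.
    public static boolean containsWhitespace(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isWhitespace(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (String t : tokenize("Doron Cohen wrote to Chris")) {
            System.out.println(t + " containsWhitespace=" + containsWhitespace(t));
        }
    }
}
```

Every token reports containsWhitespace=false, so a multi-word name has to be assembled from several consecutive tokens rather than found inside one.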

For returning single word names I think something like this should work:

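   Token t;
   // Skip tokens until one starts with an upper case character;
   // the null check stops the loop at the end of the stream.
   while ((t = input.next()) != null
          && !Character.isUpperCase(t.termText().charAt(0))) {
   }
   return t;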

For identifying two consecutive tokens that each start with an upper case
character and returning them as a single name, a bit more code is required.
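
As a rough idea of what that extra code might look like, here is a minimal sketch. To keep it runnable on its own it works on an Iterator<String> of terms instead of a Lucene TokenStream, and the class name CapitalizedPairFilter and its helper methods are made up for this example. The only real addition over the single-word loop is a one-token lookahead buffer, so that a token read ahead but not merged is not lost.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class CapitalizedPairFilter {
    private final Iterator<String> input; // stand-in for the wrapped token stream
    private String pending;               // token read ahead but not yet consumed

    public CapitalizedPairFilter(Iterator<String> input) {
        this.input = input;
    }

    private static boolean startsUpper(String s) {
        return s != null && !s.isEmpty() && Character.isUpperCase(s.charAt(0));
    }

    // Returns the buffered token if there is one, else the next input term,
    // or null when the input is exhausted.
    private String read() {
        if (pending != null) {
            String p = pending;
            pending = null;
            return p;
        }
        return input.hasNext() ? input.next() : null;
    }

    // Returns the next name (one or two capitalized terms), or null at the end.
    public String next() {
        String t;
        while ((t = read()) != null && !startsUpper(t)) {
            // skip terms that do not start with an upper case character
        }
        if (t == null) {
            return null;
        }
        String following = read();
        if (startsUpper(following)) {
            return t + " " + following; // merge the pair into one name
        }
        pending = following; // push back; it is re-examined on the next call
        return t;
    }

    // Convenience helper for the example: drain the filter into a list.
    public static List<String> collect(List<String> terms) {
        CapitalizedPairFilter filter = new CapitalizedPairFilter(terms.iterator());
        List<String> names = new ArrayList<String>();
        for (String name = filter.next(); name != null; name = filter.next()) {
            names.add(name);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(collect(Arrays.asList(
                "Doron", "Cohen", "wrote", "to", "Chris", "about", "Lucene")));
    }
}
```

For the sample input this prints [Doron Cohen, Chris, Lucene]. Note that the sketch merges at most two terms, so three capitalized words in a row come out as a pair followed by a single word. In a real TokenFilter you would also have to decide what offsets and position increment the merged token gets, which this sketch ignores.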

Btw, I don't understand why you need the NGram wrapper.

HTH, Doron

On Jan 8, 2008 5:05 PM, chris.b <omelhornomedomundo@gmail.com> wrote:

>
> Following your suggestion (I think), I built a tokenfilter with the
> following code for next():
>
>        public final Token next() throws IOException {
>                Token newToken = input.next();
>                termText = newToken.termText();
>                Character tempChar = termText.charAt(0);
>                if(Character.isUpperCase(tempChar)) {
>                        for(int current = 0; current < termText.length(); current++){
>                                Character currentChar = termText.charAt(current);
>                                if (Character.isWhitespace(currentChar) & Character.isUpperCase(currentChar + 1) & current != termText.length()) {
>                                        return newToken;
>                                }
>                        }
>                }
>                return null;
>        }
>
> -----------
> and in calling this filter, i'm also calling NGramAnalyzerWrapper wrapping
> WhitespaceAnalyzer (these two work together), but when building my index i
> get the following error:
>
> Exception in thread "main" java.lang.NullPointerException
>        at rem.NamedEntityTokenFilter.next(NamedEntityTokenFilter.java:21)
>        at org.apache.lucene.index.DocumentWriter.invertDocument(DocumentWriter.java:219)
>        at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:95)
>        at org.apache.lucene.index.IndexWriter.buildSingleDocSegment(IndexWriter.java:1013)
>        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1001)
>        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:983)
>        at ancorpMethods.Handlers.handleDOC(Handlers.java:92)
>        at ancorpMethods.Handlers.handleDir(Handlers.java:32)
>        at ancorpMethods.Handlers.handleDir(Handlers.java:30)
>        at ancorpMethods.Handlers.handleDir(Handlers.java:30)
>        at ancorpMethods.Handlers.handleDir(Handlers.java:30)
>        at ancorpMethods.Handlers.handleDir(Handlers.java:30)
>        at Base.Indexer.indexCapitalNgrams(Indexer.java:155)
>        at Base.Indexer.main(Indexer.java:81)
>
> ----------
> am I forgetting something or am I going the wrong way? :|
>
>
