lucene-java-user mailing list archives

From Tri Cao <>
Subject Finding words not followed by other words
Date Fri, 11 Jul 2014 15:25:14 GMT
This is actually a tough problem in general: disambiguating polysemous terms (word sense
disambiguation). In your case, I think you'll probably need some named entity resolution to
differentiate "George Washington" from "George Washington Carver", as they are two different entities.

Do you have a list of all the entity names in your corpus (either manually curated or produced
by some pattern matching)? If you do, one thing you can do is write a tokenizer that emits one
token for each entity. So, for example, the string "George Washington" emits a token like
_George_Washington_, "George Washington Carver" emits _George_Washington_Carver_, etc. A query
for the single token _George_Washington_ then no longer matches "George Washington Carver".
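
To make the idea concrete, here is a minimal sketch in plain Java (no Lucene dependency) of
the longest-match step such a tokenizer would need. The class name EntityTokenizer, its
tokenize method, and the case-sensitive exact matching are all assumptions for illustration;
in a real index this logic would have to be wired into your analysis chain, which is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: merges known entity names into single tokens. */
final class EntityTokenizer {
    private final Set<String> entities = new HashSet<>();
    private int maxWords = 1; // length (in words) of the longest known entity

    EntityTokenizer(List<String> entityNames) {
        for (String name : entityNames) {
            String[] words = name.trim().split("\\s+");
            entities.add(String.join(" ", words));
            maxWords = Math.max(maxWords, words.length);
        }
    }

    /** Splits on non-alphanumerics, then greedily matches the longest entity first. */
    List<String> tokenize(String text) {
        String[] words = Arrays.stream(text.split("[^\\p{L}\\p{N}]+"))
                .filter(w -> !w.isEmpty())
                .toArray(String[]::new);
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            int matched = 0;
            // Try the longest candidate first so "George Washington Carver"
            // wins over its prefix "George Washington".
            for (int len = Math.min(maxWords, words.length - i); len >= 1; len--) {
                String candidate = String.join(" ", Arrays.copyOfRange(words, i, i + len));
                if (entities.contains(candidate)) {
                    matched = len;
                    break;
                }
            }
            if (matched > 0) {
                // Emit one token per entity, e.g. _George_Washington_
                String joined = String.join("_", Arrays.copyOfRange(words, i, i + matched));
                tokens.add("_" + joined + "_");
                i += matched;
            } else {
                tokens.add(words[i]); // ordinary word, passed through unchanged
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        EntityTokenizer t = new EntityTokenizer(
                List.of("George Washington", "George Washington Carver"));
        System.out.println(t.tokenize("George Washington Carver blah blah blah."));
        System.out.println(t.tokenize("George Washington blah blah blah."));
    }
}
```

With tokens like these, the first sample sentence yields _George_Washington_Carver_ and the
second yields _George_Washington_, so a plain term query on _George_Washington_ would find
documents 2 and 3 from the question below but not document 1. The same entity list has to be
applied at query time as well.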

There are some open source NLP libraries that do this, but the quality varies, as it will
most likely depend on your domain and training data set.

Hope this helps,

On Jul 11, 2014, at 07:20 AM, Michael Ryan <> wrote:

I'm trying to solve the following problem...

I have 3 documents that contain the following contents:
1: "George Washington Carver blah blah blah."
2: "George Washington blah blah blah."
3: "George Washington Carver blah blah blah. George Washington blah blah blah."

I want to create a query that matches documents 2 and 3, but not 1. That is, I want to find
documents that mention "George Washington". It's okay if they also mention "George Washington
Carver", but I don't want documents that only mention "George Washington Carver". So simply
doing something like this does not solve it:
"George Washington" NOT "George Washington Carver"

Is there a Query type that does this out of the box? I've looked at the various types of span
queries, but none of them seem to do this. I think it should be theoretically possible given
the position data that Lucene stores...

