lucene-java-user mailing list archives

From: Michael Ryan <mr...@moreover.com>
Subject: RE: Finding words not followed by other words
Date: Fri, 11 Jul 2014 16:54:42 GMT

Thanks for the response. I don't necessarily know ahead of time which entities will be searched
on, or even whether the things being searched for are truly entities. Still, for
a lot of cases where I do know what users want to search on, this could help...

-Michael

From: Tri Cao [mailto:tmcao@me.com]
Sent: Friday, July 11, 2014 11:25 AM
To: java-user@lucene.apache.org
Cc: java-user@
Subject: Finding words not followed by other words

This is actually a tough problem in general: word sense disambiguation (polysemy). In your case,
I think you'll probably need to do some named entity resolution to differentiate
"George Washington" from "George Washington Carver", as they are two different entities.

Do you have a list of all the entity names in your corpus (either manually curated or produced by some
pattern matching)? If you do, one thing you can do is write a tokenizer that emits one
token for each entity. So, for example, the string "George Washington" emits a token like _George_Washington_,
"George Washington Carver" emits _George_Washington_Carver_, etc.
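
[Editorial sketch] The entity-token idea above can be illustrated in plain Java. This is a minimal sketch under stated assumptions, not a Lucene Tokenizer or TokenFilter: the class name EntityTokenSketch and the method tokenize are hypothetical, the entity list is assumed to be known up front, tokens are lowercased, and the longest entity that matches at a position wins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical standalone illustration; not part of Lucene. */
final class EntityTokenSketch {

    /**
     * Splits text on runs of non-letter/non-digit characters and replaces each
     * known entity with a single token, preferring the longest match.
     */
    static List<String> tokenize(String text, List<String> entities) {
        String splitter = "[^\\p{L}\\p{N}]+";
        String[] words = text.toLowerCase(Locale.ROOT).split(splitter);

        // Pre-split the entity names with the same rules as the text.
        List<String[]> entityWords = new ArrayList<>();
        for (String entity : entities) {
            entityWords.add(entity.toLowerCase(Locale.ROOT).split(splitter));
        }

        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            if (words[i].isEmpty()) { // leading separator yields an empty word
                i++;
                continue;
            }
            String[] best = null;
            for (String[] candidate : entityWords) {
                if (matchesAt(words, i, candidate)
                        && (best == null || candidate.length > best.length)) {
                    best = candidate;
                }
            }
            if (best == null) {
                out.add(words[i]); // ordinary word, emitted unchanged
                i++;
            } else {
                out.add("_" + String.join("_", best) + "_"); // one token per entity
                i += best.length;
            }
        }
        return out;
    }

    private static boolean matchesAt(String[] words, int start, String[] entity) {
        if (start + entity.length > words.length) {
            return false;
        }
        for (int k = 0; k < entity.length; k++) {
            if (!words[start + k].equals(entity[k])) {
                return false;
            }
        }
        return true;
    }
}
```

With both names in the entity list, a document that only mentions "George Washington Carver" never produces the _george_washington_ token, so a plain term query on that token would skip it.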

There are some open source NLP libraries that do this, but the quality varies, as it will
most likely depend on your domain and training data set.

Hope this helps,
Tri

On Jul 11, 2014, at 07:20 AM, Michael Ryan <mryan@moreover.com> wrote:
I'm trying to solve the following problem...

I have 3 documents that contain the following contents:
1: "George Washington Carver blah blah blah."
2: "George Washington blah blah blah."
3: "George Washington Carver blah blah blah. George Washington blah blah blah."

I want to create a query that matches documents 2 and 3, but not 1. That is, I want to find
documents that mention "George Washington". It's okay if they also mention "George Washington
Carver", but I don't want documents that only mention "George Washington Carver". So simply
doing something like this does not solve it:
"George Washington" NOT "George Washington Carver"

Is there a Query type that does this out of the box? I've looked at the various types of span
queries, but none of them seem to do this. I think it should be theoretically possible given
the position data that Lucene stores...

-Michael
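
[Editorial sketch] The matching rule described in the question (at least one occurrence of the phrase that is not immediately followed by a given word) can be written down directly over token positions. This is a minimal plain-Java sketch, not a Lucene Query: the class PhraseNotFollowedSketch and the method matches are hypothetical, and the phrase words and the excluded word are assumed to be passed in lowercase. In Lucene itself, a SpanNotQuery whose include and exclude clauses are SpanNearQuery phrases looks like a plausible way to express the same rule, but that mapping is an assumption and is not tested here.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical standalone illustration of the position-based rule. */
final class PhraseNotFollowedSketch {

    /**
     * Returns true if some occurrence of the phrase in the text is NOT
     * immediately followed by excludedNext.
     */
    static boolean matches(String text, List<String> phrase, String excludedNext) {
        String[] words = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (int i = 0; i + phrase.size() <= words.length; i++) {
            // Check whether the phrase starts at position i.
            boolean hit = true;
            for (int k = 0; k < phrase.size() && hit; k++) {
                hit = words[i + k].equals(phrase.get(k));
            }
            if (!hit) {
                continue;
            }
            // Look at the token right after the phrase, if there is one.
            int next = i + phrase.size();
            boolean followed = next < words.length && words[next].equals(excludedNext);
            if (!followed) {
                return true; // one "clean" occurrence is enough
            }
        }
        return false;
    }
}
```

Called with the phrase ["george", "washington"] and the excluded word "carver", this accepts documents 2 and 3 from the example and rejects document 1, which is the behavior the question asks for.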