lucene-java-user mailing list archives

From "Chong, Herb" <>
Subject RE: inter-term correlation [was Re: Vector Space Model in Lucene?]
Date Mon, 17 Nov 2003 13:53:25 GMT
i have a program written in Icon that does basic sentence splitting. with about 5 heuristics
and one small lookup table, i can get well over 90% accuracy doing sentence boundary detection
on email. for well-edited English text, like newswires, i can manage closer to 99%. this is
all that is needed for significantly improving a search engine's performance when the query
engine respects sentence boundaries. incidentally, the GATE Information Extraction framework
cites some references that indicate that for named entity feature extraction, their system
can exceed the ability of trained humans to detect and classify named entities if only one
person does the detection. collaborating humans are still better, but no one has the time
for that in practical applications.
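
As a rough illustration of the kind of splitter described above, here is a minimal sketch in Python. It is not the Icon program from this message: the three heuristics and the abbreviation table are assumptions chosen for illustration, and the names `ABBREVIATIONS`, `CANDIDATE` and `split_sentences` are hypothetical.

```python
import re

# Hypothetical lookup table: tokens that end in a period but usually do not
# end a sentence. This is NOT the table used by the Icon program.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof", "e.g", "i.e", "etc", "vs", "inc"}

# Heuristic 1 (assumed): a boundary candidate is a run of ., ! or ?,
# optionally followed by closing quotes or brackets, and then whitespace.
# Requiring whitespace keeps "3.50" and "example.com" in one piece.
CANDIDATE = re.compile(r"[.!?]+[\"')\]]*\s+")


def split_sentences(text):
    """Split text into sentences using a few simple heuristics."""
    sentences = []
    start = 0
    for match in CANDIDATE.finditer(text):
        # The word immediately before the punctuation run.
        words = text[start:match.start()].split()
        last_word = words[-1].lstrip("(\"'[") if words else ""
        is_period = text[match.start()] == "."

        # Heuristic 2 (assumed): a known abbreviation does not end a sentence.
        if is_period and last_word.lower() in ABBREVIATIONS:
            continue
        # Heuristic 3 (assumed): a single capital letter is treated as an
        # initial, as in "J. Doe".
        if is_period and len(last_word) == 1 and last_word.isupper():
            continue

        sentences.append(text[start:match.end()].strip())
        start = match.end()

    # Whatever follows the last accepted boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

Calling `split_sentences(text)` returns a list of sentence strings. The accuracy figures quoted above apply to the author's program, not to this sketch.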

you probably know, since you know about Markov chains, that within-sentence term correlation,
and hence the language model, is different from that across sentences. linguists have known this
for a very long time. it isn't hard to put this capability into a search engine, but it absolutely
breaks down unless sentence boundary information is stored for use at query time.
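
To make the point about storing sentence boundaries concrete, here is a minimal sketch in Python. It assumes the text has already been split into sentences and uses a plain in-memory dictionary, not Lucene's index format or API. The function names `build_sentence_index` and `same_sentence_hits` are hypothetical.

```python
from collections import defaultdict


def build_sentence_index(sentences):
    """Map each lowercased term to the set of sentence numbers containing it."""
    index = defaultdict(set)
    for sentence_no, sentence in enumerate(sentences):
        for token in sentence.lower().split():
            term = token.strip(".,;:!?\"'()")
            if term:  # skip tokens that were pure punctuation
                index[term].add(sentence_no)
    return index


def same_sentence_hits(index, term_a, term_b):
    """Return the sentence numbers in which both terms occur together."""
    # .get() avoids creating empty entries for unknown terms.
    return sorted(index.get(term_a, set()) & index.get(term_b, set()))
```

With the sentence number kept next to each term, a query can tell two terms in the same sentence apart from two terms that are merely close together across a sentence break. Without that stored information, the distinction cannot be recovered at query time.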


-----Original Message-----
From: Andrzej Bialecki []
Sent: Friday, November 14, 2003 5:54 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

Well ... Sure, nothing can replace a human mind. But believe it or not, 
there are studies which show that even human experts can differ 
significantly on what the key-phrases of a given text are. So, 
the results are never clear cut with humans either...

So, in this sense, a heuristic tool for sentence splitting and key-phrase 
detection can go a long way. For example, the application I mentioned 
uses quite a few heuristic rules (+ Markov chains as heavier 
ammunition :-), and it comes up with the following phrases for your 
email discussion (the text quoted below):

(lang=EN): NLP, trainable rule-based tagging, natural language 
processing, apache, NLP expert

Now, this set of key-phrases does reflect the main noun-phrases in the 
text... which means I have a practical and tangible benefit from NLP. 
QED ;-)

Best regards,
