lucene-java-user mailing list archives

From Bob Carpenter
Subject Re: Keyphrase Extraction
Date Tue, 08 May 2007 21:25:08 GMT
Mark Miller wrote:
> The only commercial options that I have seen do not have a web presence 
> (that I know of or can find) and I don't recall the company names (only 
> peripherally involved).

Are we talking about Yahoo's buzz index and
Amazon's SIPs or CAPs?

I actually think the most interesting application
of this is in the search engine, built by
Fast Search (Lucene competitor) and Elsevier (publisher).
They extract phrases relative to a query (I have guesses
as to how they do this quickly) and show those for
query refinement.  For instance, the query "text data
mining" finds the following "keywords" (and then some):

	association rules	
	case-based reasoning	
	computational biology	
	data integration	
	data visualization	
	information access	
	information filtering	
	information integration

The standard way to tackle this problem (see, e.g.,
Manning and Schuetze's 1999 NLP textbook) is to
look for collocations -- sequences of terms that occur
together more often than chance would predict under
standard independence tests (e.g. a t-test or
chi-squared test).  That is, do "data" and
"visualization" show up together more often than you
would expect them to in the results of the query
"text data mining"?
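
To make the independence test concrete, here is a minimal
sketch in Python of Pearson's chi-squared statistic over the
2x2 contingency table for each adjacent word pair.  The
function name and the whitespace tokenization are my own
assumptions for illustration, not anything from the packages
mentioned in this thread.

```python
from collections import Counter

def chi_squared_bigrams(tokens):
    """Score adjacent word pairs by Pearson's chi-squared on a 2x2 table.

    Hypothetical helper for illustration; returns {(w1, w2): score}.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = sum(bigrams.values())          # total number of bigram positions
    first, second = Counter(), Counter()
    for (w1, w2), count in bigrams.items():
        first[w1] += count             # times w1 occupies the first slot
        second[w2] += count            # times w2 occupies the second slot

    scores = {}
    for (w1, w2), o11 in bigrams.items():
        o12 = first[w1] - o11          # w1 followed by some other word
        o21 = second[w2] - o11         # w2 preceded by some other word
        o22 = n - o11 - o12 - o21      # neither w1 first nor w2 second
        denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        if denom == 0:
            continue                   # degenerate table, no evidence either way
        expected = (o11 + o12) * (o11 + o21) / n
        if o11 <= expected:
            continue                   # keep only pairs seen MORE than expected
        scores[(w1, w2)] = n * (o11 * o22 - o12 * o21) ** 2 / denom
    return scores

if __name__ == "__main__":
    text = "text data mining uses data visualization and data visualization tools"
    ranked = sorted(chi_squared_bigrams(text.split()).items(),
                    key=lambda kv: kv[1], reverse=True)
    for pair, score in ranked[:5]:
        print(pair, round(score, 2))
```

On tiny inputs almost every pair looks significant, so in
practice you would also require a minimum count before
trusting the score.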

Although Manning and Schuetze
don't really discuss it, you can also compare one
corpus to another (e.g. today's news to the last
month's to see what's newly hot today, or the top
1000 hits for a query relative to a whole collection).
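
The corpus-versus-corpus comparison can be sketched the same
way.  The version below is one simple choice among many: it
treats the background corpus as giving the expected rate of
each bigram and computes a binomial z-score for its count in
the foreground corpus.  The add-one smoothing, the min_count
cutoff, and all names here are assumptions made for the
example.

```python
import math
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs."""
    return Counter(zip(tokens, tokens[1:]))

def newly_frequent(foreground, background, min_count=2):
    """Rank foreground bigrams by how far they exceed the background rate.

    Hypothetical helper; returns [(phrase, z_score)] sorted by z descending.
    """
    fg = bigram_counts(foreground)
    bg = bigram_counts(background)
    n_fg = sum(fg.values())
    n_bg = sum(bg.values())
    vocab = len(set(fg) | set(bg))     # distinct bigrams seen anywhere

    results = []
    for phrase, k in fg.items():
        if k < min_count:
            continue                   # heuristic pruning of rare phrases
        # Add-one smoothed background probability, so phrases unseen in
        # the background still get a nonzero expected rate.
        p = (bg[phrase] + 1) / (n_bg + vocab)
        z = (k - n_fg * p) / math.sqrt(n_fg * p * (1 - p))
        results.append((phrase, z))
    return sorted(results, key=lambda r: r[1], reverse=True)

if __name__ == "__main__":
    last_month = ("the cat sat on the mat " * 10).split()
    today = "the cat sat on the new phone the new phone".split()
    for phrase, z in newly_frequent(today, last_month):
        print(phrase, round(z, 2))
```

The same function covers both cases mentioned above: pass
today's news against last month's, or the top hits for a
query against the whole collection.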

You can find pretty much every version ever
put forward implemented in Ted Pedersen's
n-grams package:

which is in Perl with lots of doc and manuals
and papers with all the (very easy) math.

These techniques are also very easy to implement,
as in first-exercise-in-an-undergrad-CS-class
easy.  The only real issues are (a) scaling
and (b) heuristic pruning.  Popular pruning options
include using only nouns (as determined by a part
of speech tagger), only capitalized phrases,
or even phrases appearing after "the".  With enough
pruning, scaling's easy.
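
Two of those pruning heuristics need no tagger at all.  Here
is a rough sketch of candidate generation for phrases that
follow "the" and for runs of capitalized tokens; the function
names and the max_len parameter are hypothetical, and the
noun-only filter is left out because it needs a real
part-of-speech tagger.

```python
def phrases_after_the(tokens, max_len=2):
    """Collect the max_len tokens that immediately follow each "the"."""
    out = []
    for i, tok in enumerate(tokens):
        if tok.lower() == "the":
            phrase = tokens[i + 1:i + 1 + max_len]
            if len(phrase) == max_len:   # skip phrases cut off by end of input
                out.append(tuple(phrase))
    return out

def capitalized_phrases(tokens):
    """Collect maximal runs of two or more capitalized tokens."""
    out, run = [], []
    for tok in list(tokens) + [""]:      # empty sentinel flushes the last run
        if tok[:1].isupper():
            run.append(tok)
        else:
            if len(run) >= 2:
                out.append(tuple(run))
            run = []
    return out

if __name__ == "__main__":
    words = "We met the data visualization team at Fast Search".split()
    print(phrases_after_the(words))
    print(capitalized_phrases(words))
```

Note that the capitalization filter will also pick up
sentence-initial words when they happen to precede a name, so
a real version would want sentence boundaries as well.
Candidates that survive either filter would then be scored
with whatever significance test you prefer.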

We provide a tutorial in LingPipe:

And here's a blog entry comparing our hypothesis
testing approach to a standard mutual-info based
method (discussed by Matthew Hurst, when he was
at Nielsen BuzzMetrics):

- Bob Carpenter
