lucene-general mailing list archives

From Ted Dunning <>
Subject Re: Catogarization is possible in Lucene?
Date Fri, 26 Jun 2009 17:22:29 GMT
It sounds to me that what you are trying to do is information extraction.

Lucene can consume the output of such a system, but it does not do the
extraction itself.  Often, full-scale named entity extraction is not really
necessary, and in those cases a phrase query with slop (a proximity query) in
Lucene can help you out.  For instance, you might search for documents that
have the string "Authors:" within 10 words of a particular name; in the query
parser syntax that is something like "authors jackson"~10, depending on your
analyzer.  That will only retrieve documents, however; it would not, say, fill
in an author table in a database.  The two approaches can help each other:
simple pre-processing when documents are first indexed makes such searches
more effective, and the search in turn helps information extraction by
finding the documents that are likely to contain the information you need to
extract.
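
As a rough illustration of the "simple pre-processing" idea, here is a
minimal Java sketch that pulls the names out of an "Authors:" line using
only java.util.regex.  It is not a Lucene feature and not a substitute for
GATE or LingPipe; the class name AuthorExtractor, the method extractAuthors,
and the pattern itself are hypothetical, and the pattern assumes that the
name list ends at the first period or line break.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AuthorExtractor {
    // Assumed layout: the label "Author:" or "Authors:", then a name list
    // that runs up to the first period or newline. Names containing
    // periods (e.g. "J. Smith") would be cut short by this simple pattern.
    private static final Pattern AUTHORS_LINE =
            Pattern.compile("\\b(Authors?):\\s*([^.\\n]+)");

    /** Returns the names after the first "Authors:" label, or an empty list. */
    public static List<String> extractAuthors(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = AUTHORS_LINE.matcher(text);
        if (m.find()) {
            // Split the name list on commas and on the standalone word "and".
            for (String part : m.group(2).split("\\s*,\\s*|\\s+and\\s+")) {
                String name = part.trim();
                if (!name.isEmpty()) {
                    names.add(name);
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String text = "Authors: Micheal Jackson, Daniel O Reily and Harsha.";
        // Prints: [Micheal Jackson, Daniel O Reily, Harsha]
        System.out.println(extractAuthors(text));
    }
}
```

The extracted names could then be stored in a separate field at indexing
time, or written to a database table, while the proximity search is used to
find candidate documents in the first place.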

I would recommend you look at the GATE system (if you want open source) or
LingPipe (if you can pay commercial prices or are doing research).

On Fri, Jun 26, 2009 at 5:14 AM, Harsha1 <> wrote:

> Hi,
> I went through the overview of Lucene and found it's somewhat related to
> text searching and other stuff.
> Please let me know if following can be done.
> Suppose i have a paragraph,
> This is test program. I have done this using regex and some other function
> in groovy. But what I am looking is some kind of feature or template or
> anything wherein I just mention the pattern in which i am interested in.
> Based on the pattern mention groovy should automatically categorize the
> fields.  Authors: Micheal Jackson, Daniel O Reily and Harsha.
> Format we are looking at is,
> In this case,
> TITLE = Authors,
> NAME1 = Micheal Jackson
> NAME2 = Daniel O Reily
> NAME3 = Harsha
> Like this, when I pass some paragraph, these fields (TITLE: NAME1 NAME2
> NAME3) should be categorized automatically. Is it possible? (I have done
> this in Java using regular expressions, but we don't want to code from
> scratch; we want some language feature that will do this automatically,
> or with less code.)

Ted Dunning, CTO

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
858-414-0013 (m)
408-773-0220 (fax)
