nutch-dev mailing list archives

From "Jérôme Charron"
Subject [Proposal] New Lucene sub-project
Date Fri, 07 Apr 2006 08:26:54 GMT
Hi all,

While chatting with Chris Mattmann, it became evident to us that there
is a need for a new sub-project within Lucene.

For now, the Lucene sub-projects involved in Nutch are:
1. Lucene-java - The basis for search technology
2. Hadoop - The distributed computing platform
3. Nutch - The search engine that relies on Lucene and Hadoop.

Since Nutch contains some value-added pieces of code that focus on content,
we think it would be a good idea to split them out of Nutch into a new
sub-project based on content analysis / manipulation. The components we have
identified are:

1. MimeType Repository
2. Language Identifier
3. Content Signature (MD5Signature / TextProfileSignature / ...)
(4. Generic Meta Data Infrastructure)
(5. Charset Detector)
(6. Parse Plugins Framework)
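
To illustrate what one of these components could look like outside of Nutch, here is a minimal sketch of a content signature (item 3) that depends only on the JDK. The class name ContentSignatureSketch and the method md5Signature are hypothetical names chosen for this example. They are not the existing Nutch MD5Signature API, and the sketch makes no claim about how Nutch computes its signatures.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical standalone sketch: not the Nutch MD5Signature class.
public class ContentSignatureSketch {

    // Returns the MD5 digest of the raw content as a lowercase hex string.
    // Two documents with identical bytes get the same signature.
    public static String md5Signature(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is not expected.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] page = "some fetched page content".getBytes(StandardCharsets.UTF_8);
        System.out.println(md5Signature(page));
    }
}
```

An exact digest like this only detects byte-identical content. A profile-based signature such as TextProfileSignature is meant to tolerate small differences, which this sketch does not attempt.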

The idea is to expose these pieces of code as a standalone lib, since we
are convinced they could be useful
in many projects other than Nutch.
The benefits will be to have some code more widely used / tested /
If this proposal is accepted, we have a candidate name for this new project:
Tika (it comes from my son ;-) )

Any comment is welcome.

