nutch-dev mailing list archives

From Chris Mattmann <chris.mattm...@jpl.nasa.gov>
Subject Re: Tika update
Date Wed, 16 Aug 2006 14:22:49 GMT
Hi Jukka,

Thanks for your email. Indeed, there was discussion on the Lucene PMC email
list about the Tika project. It was decided by the powers that be to
discuss it more on the Nutch mailing list before moving forward with any
vote on making Tika a sub-project of Apache Lucene. With regard to that, my
action was to send the Tika proposal to the nutch-dev list and help to
start up a discussion on Tika, to get feedback from the community. Seeing as
you lit the fire under this (thanks!), it's only appropriate for
me to send out the Tika project proposal sent to the Lucene PMC. So, here it
is, attached. I'd love to hear feedback from the Nutch community on what it
thinks of such a project.

Cheers,
   Chris



On 8/16/06 4:06 AM, "Jukka Zitting" <jukka.zitting@gmail.com> wrote:

> Hi,
>
> There was recently discussion on perhaps starting a new Lucene
> sub-project, named Tika, to create a general-purpose library from the
> parser components and other features in Nutch that might interest a
> wider audience. To keep things rolling we've created a temporary
> staging area for the project at http://code.google.com/p/tika/ on
> Google Code, and I've started to flesh out a potential project
> structure using Maven 2.
>
> Note that the project materials in svn refer to the project as "Apache
> Tika" even though the project has *not* been officially accepted. The
> reason for this is that the Google Code project is just a temporary
> staging ground and I wanted to give a better idea of what the project
> could look like if accepted. The jury is still out on whether to start
> a project like this, so any comments and feedback on the idea are very
> much welcome.
>
> Most, if not all, code in Tika will be based on existing code from
> Nutch and other Apache projects, so I'm not sure if the project needs
> to go through the Incubator if accepted by the Lucene PMC.
>
> So far the tika source tree contains just a modified version of my
> TextExtractor code from the Apache Jackrabbit project, and Jérôme is
> planning to add some of his stuff. The source tree at Google Code
> should be considered just a playground for bringing things together
> and discussing ideas, before migrating back to ASF infrastructure.
>
> BR,
>
> Jukka Zitting
>
> --
> Yukatan - http://yukatan.fi/ - info@yukatan.fi
> Software craftsmanship, JCR consulting, and Java development


