tika-dev mailing list archives

From "Jukka Zitting" <jukka.zitt...@gmail.com>
Subject Re: Questions
Date Fri, 29 Jun 2007 22:36:15 GMT

On 6/29/07, Grant Ingersoll <gsingers@apache.org> wrote:
> I was wondering if you had a todo list or something somewhere?  I
> have been loosely following the discussions here and see the general
> outline of what the goals are here:
> http://www.mail-archive.com/tika-dev@incubator.apache.org/msg00024.html
> (Tika discussions in Amsterdam)

That's probably the most complete todo list lookalike for now. There's
some gradual progress going on, but we are still in a formative phase
where not even some basic practices on svn use, etc. have emerged, so
I wouldn't put too much weight on any single message.

> Here's where I am at:  I am considering extracting the Nutch parsing
> plugins for a project I am undertaking and wrapping them for my own
> purposes, but knowing Tika is around, I would just as soon do this in
> the context of Tika, or at least try to help out that way and have it
> become a part of Tika.  I have not looked at Lius yet.  I guess I am
> wondering if you have some interfaces in mind that you want to hook
> into, or is the Nutch model (or Lius model) already going to serve as
> the main model?  I pretty much think the Nutch model has everything I
> need at the moment, but I don't want to carry around the whole set of
> Nutch dependencies.  I am not worried about content detection at this
> point so much as extraction.
> Is the plan to adopt a similar plugin approach as Nutch?

There seems to be a general consensus that the existing solutions like
Nutch are a good starting point but need some modifications before
they satisfy all the goals of Tika, but few specific decisions have
yet been made.

> So, I guess the question is what can I do at this point to help?
> Should I just go ahead with my needs and then give it back as a patch
> and you can decide what to do with it from there?  I am in somewhat
> of a hurry to get the basics working in the next week or so.

I would recommend that you just go forward with your plan and don't
wait for us. :-) One thing you may want to take a look at is "Lius
Lite" in the Tika issue tracker, which contains a trimmed version of
the Lius framework, but if you are already familiar with Nutch, then it
probably makes more sense to stick with that. I believe the eventual
Tika framework will end up incorporating concepts from both Nutch and
Lius (among others).

It would certainly be interesting to see what you end up with and
perhaps hear a brief summary of the main issues and concerns you
encountered. This is exactly the sort of stuff that Tika should
support, so your contributions would be very much welcome!


Jukka Zitting
