nutch-dev mailing list archives

From Andrzej Bialecki
Subject Re: Nutch dev. plans
Date Sun, 26 Jul 2009 16:09:45 GMT
Kirby Bohling wrote:

> I think you're correct about it being worthwhile.  I've got a Git
> repository that I use for my work, I'll see about setting up a GitHub
> account and start to use that as a public place to put some of my stuff so you
> can see it.  Unfortunately, I have some proprietary stuff that I can't
> contribute back (most of which you don't want anyways).  I do have
> bugfixes for core issues that I do have permission to contribute.
> It'd be much easier for me to use Git to migrate the work back and
> forth between work and there.  It's also much smoother for me to
> develop a series of "easy to review" patches using it.

This is ok at this early stage - although sooner or later the patches
need to appear in JIRA and be submitted with a license grant under the ASL.

>>> I'm guessing that Tika isn't ready for this.  Given that it's an
>>> Apache and/or Lucene project, it can probably be addressed.  My guess
>>> is that a number of the libraries they depend upon won't be.
>> I think we would like Tika to function as an OSGi plugin (or a group of
>> plugins?) out of the box so that we could avoid having to wrap it ourselves.
> I think Tika as one plugin would lead to a charge of "bloat", given
> all the formats it currently supports that you now ship as plugins.

The cumulative weight of our plugins is also significant.

> Long term, do you see Nutch just supporting everything Tika does "out
> of the box" and including all of the dependencies, thus folding most
> of the parser plug-ins into one?  My understanding is that Tika is
> nothing more than a port of the Nutch library into a single unified
> and reusable library.  We might need help/support from Tika if the
> answer is to split them up.

IMHO it would be good to include all parsers, but provide a mechanism
for a la carte configuration of active parsers, and a mechanism for
using other parsers packaged as OSGi plugins instead of the Tika ones.
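
One way such a la carte selection might look is sketched below in Java. Everything here is hypothetical (the ParserSelection and ParserInfo names, the parser ids, the regex-based include list); it is not Nutch or Tika API. It only illustrates the idea: an include pattern picks the active parsers, and an externally packaged parser replaces a bundled one for the same MIME type.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Pattern;

/** Hypothetical sketch of a la carte parser selection; not Nutch or Tika API. */
final class ParserSelection {

    /** Hypothetical descriptor: parser id, the MIME type it handles, and
     *  whether it comes from an external plugin rather than the bundled set. */
    static final class ParserInfo {
        final String id;
        final String mimeType;
        final boolean external;

        ParserInfo(String id, String mimeType, boolean external) {
            this.id = id;
            this.mimeType = mimeType;
            this.external = external;
        }
    }

    /**
     * Returns MIME type -> chosen parser id. Only parsers whose id fully matches
     * includeRegex are active. The first active parser for a MIME type wins,
     * except that an external parser replaces a bundled one.
     */
    static Map<String, String> select(List<ParserInfo> available, String includeRegex) {
        Pattern include = Pattern.compile(includeRegex);
        Map<String, String> chosen = new TreeMap<>();
        Set<String> handledExternally = new HashSet<>();
        for (ParserInfo p : available) {
            if (!include.matcher(p.id).matches()) {
                continue; // not enabled in the configuration
            }
            boolean taken = chosen.containsKey(p.mimeType);
            boolean overridesBundled = p.external && !handledExternally.contains(p.mimeType);
            if (!taken || overridesBundled) {
                chosen.put(p.mimeType, p.id);
                if (p.external) {
                    handledExternally.add(p.mimeType);
                }
            }
        }
        return chosen;
    }
}
```

With an include pattern such as "tika-(pdf|html)|custom-pdf", a bundled RTF parser would stay inactive and the external "custom-pdf" parser would be preferred over the bundled PDF one.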

> I'd love to help.  I've mostly fought along the edges of this problem,
> rather than worked on it directly.  I've written an OSGi service or
> two, but I'm not sure it correctly handled all of the lifecycle issues
> and other critical details.
> I've played with your current system, and I know you'll have problems
> with OSGi, pretty much straight out of the box.  I wanted a docx
> parser, so I upgraded to Tika 0.3 and packaged the latest POI jars in
> a new plug-in, and I had pretty much exactly the problem I described
> with Class.forName() with the current plug-in system, because Tika
> uses Class.forName().  Tika was in the core class-loader, and the
> classes I needed were only in my docx plugin (core can't see system
> plugins).  So Tika 0.3 couldn't find them.  There are also a couple of
> small bug fixes for core in the API that I have, that it'd be nice to
> see get integrated, then we could upgrade to Tika 0.3 at least.

Tika is already at 0.4; maybe some things have changed.
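
The visibility problem described in the quoted paragraph can be reproduced in isolation. The Java sketch below is illustrative only and uses hypothetical names (PluginClassLookup, isolatedLoader); it is not the Nutch plug-in system or Tika code. It assumes a "core" loader that sees only JDK classes, and lets the demo class itself play the role of a class that lives only in a plugin. The one-argument Class.forName(name) resolves through the caller's defining class loader, which is why a library sitting in core cannot find plugin classes; the three-argument form takes the loader explicitly.

```java
/**
 * Sketch of the class-visibility problem: code loaded by a "core" loader
 * cannot resolve classes that only a plugin loader knows about.
 * All names here are hypothetical, not Nutch or Tika API.
 */
final class PluginClassLookup {

    /** Stand-in for a "core" loader: its parent is the bootstrap loader only,
     *  so it sees JDK classes but not application (here: "plugin") classes. */
    static ClassLoader isolatedLoader() {
        return new ClassLoader(null) { };
    }

    /** Resolves a class through an explicit loader; returns null if that
     *  loader cannot see the class. */
    static Class<?> loadOrNull(ClassLoader loader, String className) {
        try {
            // Three-argument form: the lookup starts at 'loader' instead of
            // at the class loader of the calling class.
            return Class.forName(className, false, loader);
        } catch (ClassNotFoundException e) {
            return null;
        }
    }

    static boolean canSee(ClassLoader loader, String className) {
        return loadOrNull(loader, className) != null;
    }

    public static void main(String[] args) {
        // This class plays the role of a class that exists only in a plugin.
        String pluginClass = PluginClassLookup.class.getName();
        ClassLoader core = isolatedLoader();
        ClassLoader plugin = PluginClassLookup.class.getClassLoader();

        System.out.println("core sees java.lang.String: " + canSee(core, "java.lang.String"));
        System.out.println("core sees plugin class:     " + canSee(core, pluginClass));
        System.out.println("plugin sees plugin class:   " + canSee(plugin, pluginClass));
    }
}
```

A common workaround in this situation is for the library to accept a loader from its caller, or to consult the thread context class loader, rather than rely on its own defining loader. Whether Tika 0.4 does either is not something this sketch claims.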

> I'll go hack on this tonight and tomorrow and see where I get.  I
> think it's likely that Tika (or the dependent libraries), will need
> significant work on packaging and the like.  I'm assuming that Felix
> is the OSGi implementation you'd like to use by default?

No idea - I played briefly with both Felix and Equinox, the key word
being "played" ... ;) Equinox has fewer dependencies, if I'm not mistaken?

> I know somebody was fairly well along with this conversion 3-4 years
> ago.  Sami Siren is the name I associated with that.  Anybody know
> where all of that ended up?  If nothing else, the boilerplate Ant
> changes would be nice to have.
>
> How do you feel about build system modifications?  It'd be much nicer
> to use OSGi in a toolchain where dependency resolution was done for
> us.  I've looked at Ivy, but I couldn't seem to get it working.  The
> documentation and tutorials were just a bit terse, and I know how to
> deal with Maven.  I use Maven at my work all the time.  When it works
> it's glorious; when you hit a bug, it can be a showstopper.
> However, I know for a lot of folks it is a non-starter.

I acknowledge that Maven may be superior to Ant at tracking dependencies
... let's leave it at that ;)

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
