nutch-dev mailing list archives

From Kirby Bohling <>
Subject Re: Nutch dev. plans
Date Sat, 25 Jul 2009 23:11:40 GMT
Comments inline below:

On Sat, Jul 25, 2009 at 2:23 PM, Andrzej Bialecki<> wrote:
> Kirby Bohling wrote:
>> On Fri, Jul 17, 2009 at 5:21 PM, Andrzej Bialecki<> wrote:
>>> Doğacan Güney wrote:
>>>>> There's no specific design yet except I can't stand the existing plugin
>>>>> framework anymore ... ;) I started reading on OSGI and it seems that
>>>>> supports the functionality that we need, and much more - it certainly
>>>>> looks
>>>>> like a better alternative than maintaining our plugin system beyond 1.x
>>>>> ...
>>>> Couldn't agree more with the "can't stand plugin framework" :D
>>>> Any good links on OSGI stuff?
>>> I found this:
> Hi Kirby,
> Thanks for your insights - please see my comments below.
>> Plugins are called Bundles in OSGi parlance, but I'll use plugin as
>> that's the term used by Nutch.
>> I have done quite a bit of OSGi work (I used to develop RCP
>> applications for a living).  OSGi is great, as long as you plan on not
>> using reflection to retrieve classes directly, and you don't plan on
>> using a library that uses it directly.
>> Pretty much every usage like this:
>> Class<?> clazz = Class.forName(stringFromConfig);
>> // Code to create an object using this class...
>> Will fail, unless the code is very classloader aware.  So if you're
>> going to switch over to using OSGi (which I think would be wonderful),
>> you'll want to ensure that you can deal with all of the third-party
>> libraries.  I haven't played much with any of the Declarative Services
>> stuff (I think that was slated for OSGi, but it might have just been
>> an Eclipse extension).
> This is an important issue - so I think we need first to do some
> experiments, and continue development on a branch for a while ... Still the
> whole ecosystem that OSGI offers is worth the trouble IMHO.

I think you're correct about it being worthwhile.  I've got a git
repository that I use for my work; I'll see about setting up a GitHub
repository and using that as a public place to put some of my stuff so you
can see it.  Unfortunately, I have some proprietary stuff that I can't
contribute back (most of which you don't want anyways).  I do have
bugfixes for core issues that I do have permission to contribute.
It'd be much easier for me to use Git to migrate the work back and
forth between work and there.  It's also much smoother for me to
develop a series of "easy to review" patches using it.

>> OSGi uses classloader segmentation to allow multiple conflicting
>> versions of the same code inside the same project.  So having a
>> pattern like:
>> Plugin A: nutch.api (Which contains say the interface Parser { })
>> Plugin B: parser.word (which has class WordParser implements Parser)
>> Plugin B has to depend on Plugin A so it can see the parser.  In this
>> case, Plugin A can't have code that uses Class.forName("WordParser");
>> OSGi changes the default classloader delegation, you can only see
>> classes in plugins you depend upon, and cycles in the dependencies are
>> not allowed.
> If I understand it correctly, this is pretty much how it's supposed to work
> in our current plugin system ... only it's more primitive and it's got some
> warts ;)

That's a fair and accurate statement.
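
For what it's worth, the visibility rule can be shown with nothing but
the JDK.  This is only a rough sketch: WordParserStub and VisibilityDemo
are names I made up for illustration, and the parent of the system
classloader merely stands in for "a plugin that does not depend on
yours".  It is not how OSGi or Nutch actually wires its loaders.

```java
// Hypothetical stand-in for a class that lives only in a plugin.
class WordParserStub {}

class VisibilityDemo {
    // Returns true if 'loader' can resolve the class with the given name.
    static boolean canSee(ClassLoader loader, String className) {
        try {
            // 'false' = do not run static initializers, we only test visibility.
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = WordParserStub.class.getName();
        // The loader that defined the class can always find it again.
        ClassLoader own = WordParserStub.class.getClassLoader();
        // Assumption: the parent of the system loader plays the role of a
        // loader that has no dependency on the "plugin" holding the class.
        ClassLoader parent = ClassLoader.getSystemClassLoader().getParent();
        System.out.println("own loader sees it:    " + canSee(own, name));
        System.out.println("parent loader sees it: " + canSee(parent, name));
    }
}
```

Delegation only goes towards the parent, never back down, which is the
same one-way street as "you can only see plugins you depend upon".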

>> If you want to do that, you end up having to do:
>> ClassLoader loader = ParserRegistry.lookupPlugin("WordParser");
>> Class<?> clazz = Class.forName("WordParser", true, loader);
>> OSGi has some SPI-like way to have a plugin note the fact that it
>> contributes an implementation of the Parser interface.  Eclipse 3.x
>> built its Extension/ExtensionPoint system on top of it.  I believe
>> they are called services in "raw" OSGi.
>> It's not a huge deal to write that yourself for API's you implement.
>> The problem is that it can be difficult to integrate really useful
>> third-party libraries that don't account for this change in
>> classloader behaviour.  At points it can make it very problematic to
>> use a specific XML parser that has the features you want (or that some
>> library you want to use really wants), because they do this sort of
>> thing all the time.
> This doesn't sound too much different from what we do already in Nutch
> plugins.

Yes.  I think that's accurate.
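
To make the registry idea above concrete, here is a minimal sketch of
what I mean.  Everything in it is hypothetical (the ParserRegistry
class, its method names, and the DemoWordParser stand-in); it is
neither the Nutch plugin API nor the OSGi service registry, just the
bare pattern of remembering which classloader owns which
implementation and handing that loader to Class.forName().

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry: maps an implementation class name to the
// classloader of the plugin that contributes it.
class ParserRegistry {
    private static final Map<String, ClassLoader> LOADERS = new ConcurrentHashMap<>();

    // Called by a plugin when it starts, to announce an implementation.
    static void register(String className, ClassLoader pluginLoader) {
        LOADERS.put(className, pluginLoader);
    }

    // Finds the loader that can actually see the named class.
    static ClassLoader lookupPlugin(String className) throws ClassNotFoundException {
        ClassLoader loader = LOADERS.get(className);
        if (loader == null) {
            throw new ClassNotFoundException("No plugin registered for " + className);
        }
        return loader;
    }

    // Loads the class through the owning plugin's loader and instantiates it
    // via its no-argument constructor.
    static Object newInstance(String className) throws ReflectiveOperationException {
        ClassLoader loader = lookupPlugin(className);
        Class<?> clazz = Class.forName(className, true, loader);
        return clazz.getDeclaredConstructor().newInstance();
    }
}

// Stand-in for a parser class that would normally live in a plugin jar.
class DemoWordParser {
    @Override
    public String toString() {
        return "DemoWordParser";
    }
}

class RegistryDemo {
    public static void main(String[] args) throws Exception {
        String name = DemoWordParser.class.getName();
        // In a real system the plugin would register itself on activation.
        ParserRegistry.register(name, DemoWordParser.class.getClassLoader());
        System.out.println(ParserRegistry.newInstance(name));
    }
}
```

The point is only that the caller never relies on its own classloader
to resolve the name, which is what the one-argument Class.forName()
does.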

>> I'm guessing that Tika isn't ready for this.  Given that it's an
>> Apache and/or Lucene project, it can probably be addressed.  My guess
>> is that a number of the libraries they depend upon won't be.
> I think we would like Tika to function as an OSGI plugin (or a group of
> plugins?) out of the box so that we could avoid having to wrap it ourselves.

I think Tika as one plugin would lead to a charge of "bloat", given
all the formats it currently supports that you now ship as plugins.
Long term, do you see Nutch just supporting everything Tika does "out
of the box" and including all of the dependencies, thus folding most
of the parser plug-ins into one?  My understanding is that Tika is
nothing more than a port of the Nutch library into a single, unified,
re-usable library.  We might need help/support from Tika if the
answer is to split them up.

>> You can use fragments to get away from that (a fragment requires a
>> host bundle, the fragment's classes are loaded using the same
>> classloader as the host), but doing that defeats a lot of the
>> reason for using OSGi (at least in terms of allowing you to use
>> multiple conflicting libraries in the same application).
> Thank you again for the comments - I'm a newbie to OSGI, so I'll probably
> start with small experiments and see how it goes. If you think you could
> help us with this by providing some guidance or help with the design then
> that would be great.

I'd love to help.  I've mostly fought along the edges of this problem,
rather than worked on it directly.  I've written an OSGi service or
two, but I'm not sure it correctly handled all of the lifecycle issues
and other critical details.

I've played with your current system, and I know you'll have problems
with OSGi, pretty much straight out of the box.  I wanted a docx
parser, so I upgraded to Tika 0.3 and packaged the latest POI jars in
a new plug-in, and I had pretty much exactly the problem I described
with Class.forName() with the current plug-in system, because Tika
uses Class.forName().  Tika was in the core class-loader, and the
classes I needed were only in my docx plugin (core can't see system
plugins).  So Tika 0.3 couldn't find them.  I also have a couple of
small bug fixes for core in the API that it'd be nice to see
integrated; then we could at least upgrade to Tika 0.3.

I'll go hack on this tonight and tomorrow and see where I get.  I
think it's likely that Tika (or the dependent libraries), will need
significant work on packaging and the like.  I'm assuming that Felix
is the OSGi implementation you'd like to use by default?

I know somebody was fairly well along with this conversion 3-4 years
ago.  Sami Siren is the name I associated with that.  Anybody know
where all of that ended up?  If nothing else, the boilerplate Ant
changes would be nice to have.

How do you feel about build system modifications?  It'd be much nicer
to use OSGi in a toolchain where dependency resolution was done for
us.  I've looked at Ivy, but I couldn't seem to get it working.  The
documentation and tutorials were just a bit terse, and I know how to
deal with Maven.  I use Maven at my work all the time.  When it works
it's glorious; when you hit a bug, it can be a show stopper.
However, I know for a lot of folks it is a non-starter.


> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>  Contact: info at sigram dot com
