nutch-dev mailing list archives

From Andrzej Bialecki
Subject Re: Update on Integration with Tika
Date Tue, 17 Nov 2009 15:21:47 GMT
Julien Nioche wrote:
>     Well ... let's consider this: in the past we used to put things
>     under /lib/ when they were being used by more than a few plugins.
>     Then we started using library-only plugins (e.g. lib-xml,
>     lib-nekohtml, etc). There is a mechanism that allows us to export
>     any classes from a plugin so that they are visible to the rest of
>     the framework.
>     It looks to me like we could be better off by putting all parts of
>     Tika in a single plugin, and then in Nutch core use a new extension
>     point just for the purpose of mimetype detection. This facade
>     (MimeDetectors) would use the Tika plugin if available, or some
>     other (null?) mechanism otherwise. At the same time Tika would be
>     happy to configure itself having all tika-core and parsers available
>     under the same classloader, and it would define two extension points
>     - one for mimetype detection, and another for parsing. What do you
>     think?
> I haven't looked yet at the way extension points work, so I don't really
> have an idea of how difficult this would be. Some of Tika's classes
> (mostly MimeType) are used explicitly in several places of the core;
> would we need to hide them behind non-Tika objects in order not to have
> direct dependencies?

Yes, that was my idea.
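
To make that concrete, here is a rough sketch of what such a facade could look like. Only the name MimeDetectors comes from the proposal above; the MimeDetector interface, the NullMimeDetector fallback and the method signatures are hypothetical, and the plugin lookup is reduced to an Optional rather than using the real extension point machinery. The point is only that core code would see plain strings instead of Tika's MimeType:

```java
import java.util.Optional;

// Hypothetical extension point: core code depends on this interface and on
// plain String mimetypes only, never on Tika's MimeType class.
interface MimeDetector {
    String detect(String fileName, byte[] content);
}

// Hypothetical "null" mechanism used when no detector plugin is available.
final class NullMimeDetector implements MimeDetector {
    static final String UNKNOWN = "application/octet-stream";

    @Override
    public String detect(String fileName, byte[] content) {
        return UNKNOWN;
    }
}

// Facade living in the core. It delegates to the plugin-provided detector if
// there is one and falls back to the null detector otherwise.
final class MimeDetectors {
    private final MimeDetector delegate;

    // Assumption: the caller has already looked up the plugin (if any);
    // the actual plugin repository lookup is not shown here.
    MimeDetectors(Optional<MimeDetector> pluginDetector) {
        this.delegate = pluginDetector.orElseGet(NullMimeDetector::new);
    }

    String detect(String fileName, byte[] content) {
        return delegate.detect(fileName, content);
    }
}
```

With something along these lines, the Tika plugin would supply its own MimeDetector implementation wrapping tika-core, and the places in the core that use MimeType today would call MimeDetectors.detect(...) instead.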

> I suppose we could try to make progress on the Tika plugin as it is now
> (i.e. with the workaround I described earlier) and refactor things at a
> later stage using the extension points. Makes sense?

We could, but if we can figure out a cleaner solution now, then we 
should follow it instead of committing that workaround and then having 
to refactor it ...

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
