nutch-dev mailing list archives

From Andrzej Bialecki
Subject Re: Update on Integration with Tika
Date Mon, 16 Nov 2009 20:00:14 GMT
Julien Nioche wrote:
> Hi,
> I came across the classloader issue that you mentioned but got 
> everything to work OK by duplicating the class TikaConfiguration into 
> the package used by my plugin. The lib tika-core goes into the main /lib 
> dir of nutch while tika-parsers jar goes into the lib dir of the plugin. 
> I now have a first version of the Tika plugin which does some very basic 
> text and metadata extraction.

This is confusing. Could you please explain why various Tika parts need 
to be put in different places? Also, the word "duplication" raises a red 
flag ...

> What shall we do about the HTMLParseFilters? Get the generic TikaParser 
> to create a DOM representation and pass it to the HTMLParseFilters as it 
> is done now? Modify the HTMLParseFilters so that they use SAX events so 
> that we can forward them from Tika? Any other suggestions?

The benefit of using a DOM tree in HTMLParseFilters is that it's easier to 
extract / remove parts of the tree without keeping track of the context, 
which is the most complicated part of working with SAX - this context 
tracking would have to be reimplemented in many plugins ... The downside 
is of course the memory footprint - but we do limit the max size of the 
documents elsewhere (in the protocol plugins). So I'd vote to keep using 
DOM for now.
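
To illustrate the context-tracking point, here is a minimal sketch using only the JDK's JAXP classes. It is not Nutch or Tika code: the class name DomVsSax, both method names, and the sample markup are made up for the example. Both methods return the text of a document with one kind of element skipped. The DOM version just removes the subtrees, while the SAX version has to carry a depth counter, and every plugin doing similar work would need its own equivalent.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; none of these names come from Nutch or Tika.
public class DomVsSax {

    // DOM: drop the unwanted subtrees, then read what is left.
    // No state has to be kept while walking the document.
    static String textViaDom(String xml, String skipTag) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList hits = doc.getElementsByTagName(skipTag);
        // The NodeList is live, so keep removing the first match until none remain.
        while (hits.getLength() > 0) {
            Node n = hits.item(0);
            n.getParentNode().removeChild(n);
        }
        return doc.getDocumentElement().getTextContent();
    }

    // SAX: the handler must remember whether it is inside a skipped element,
    // including nested children, so it tracks a depth counter by hand.
    static String textViaSax(String xml, String skipTag) throws Exception {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private int skipDepth = 0; // > 0 while inside a skipped subtree

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (skipDepth > 0 || qName.equals(skipTag)) {
                    skipDepth++;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (skipDepth > 0) {
                    skipDepth--;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (skipDepth == 0) {
                    out.append(ch, start, length);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html><body><p>keep</p>"
                + "<script>drop<b>me</b></script><p>this</p></body></html>";
        System.out.println(textViaDom(xml, "script"));
        System.out.println(textViaSax(xml, "script"));
    }
}
```

The DOM version pays for its simplicity by holding the whole tree in memory, which is the footprint trade-off mentioned above.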

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
Contact: info at sigram dot com
