nutch-dev mailing list archives

From Andrzej Bialecki
Subject Re: Thoughts on Parser design and dependencies
Date Fri, 18 Aug 2006 18:40:48 GMT
Jukka Zitting wrote:

> The Parser interface is also bound to the ideas of fetching content
> from the network and indexing it using a standard content model
> through the Content and Parse dependencies. For the Tika project I'd
> like to look for ways to generalize this, as neither of these ideas
> apply for example to the needs of the Apache Jackrabbit project. My
> TextExtractor proposal avoids these dependencies by using just a
> binary stream, a content type and an optional character encoding to
> produce a single text stream, but that approach fails to support more
> structured index content models. I'm trying to find a solution that
> combines the best parts of both approaches.

A very important aspect of the Parser interface (or actually, the Parse 
and Content classes) is that they each may contain arbitrary metadata. 
This is required for discovering and passing around both the original 
metadata (such as protocol headers, document properties, etc.), and other 
secondary content (such as data from external sources, or derived metadata).

Simply returning a String doesn't cut it. Returning a java.util.Map may 
be an option, if you use standard Metadata constants as keys - still, 
Nutch would have to repackage this anyway into a Writable. And we would 
lose a nice property of the current Metadata class, which is the ability 
to tolerate minor syntax variations and to store multiple values per key.
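To illustrate the two properties mentioned above (multiple values per key, and tolerance for minor variations in key syntax), here is a minimal sketch. It is not the Nutch Metadata class: the name SketchMetadata and the normalization rule (ignore case and any non-alphanumeric characters) are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; this is NOT the Nutch Metadata class.
class SketchMetadata {
    // Normalized key -> all values added under any spelling of that key.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Assumed normalization rule: ignore case and non-alphanumeric characters,
    // so "Content-Type", "content_type" and "CONTENT TYPE" share one entry.
    static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    // Appends a value instead of overwriting, so a key can hold several values.
    void add(String name, String value) {
        values.computeIfAbsent(normalize(name), k -> new ArrayList<>()).add(value);
    }

    // Returns every value stored for the key, or an empty list if there is none.
    List<String> getValues(String name) {
        List<String> found = values.get(normalize(name));
        if (found == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(found);
    }

    // Returns the first value stored for the key, or null if there is none.
    String get(String name) {
        List<String> found = getValues(name);
        return found.isEmpty() ? null : found.get(0);
    }

    public static void main(String[] args) {
        SketchMetadata meta = new SketchMetadata();
        meta.add("Content-Type", "text/html");
        meta.add("content_type", "text/plain");
        System.out.println(meta.getValues("CONTENT TYPE"));
    }
}
```

A plain java.util.Map<String, String> gives you neither behaviour: a second put() replaces the first value, and a differently spelled key creates a separate entry.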

> Ideally I'd like to see a parser implementation in Tika that avoids
> the Nutch dependencies but can still be used in Nutch without changing
> any of the existing code or configuration files. Something like a
> TikaParser adapter class might be needed to achieve that.

It seems to me that such an adapter is unavoidable. Most probably similar 
adapters would have to be used for all other dependencies (Configurable, 
etc.). The big question is how to minimize the intermediate object 
creation, and to come up with interfaces that are robust enough to 
support all current use cases in Nutch, but at the same time don't 
introduce too many layers of delegation...
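
A rough sketch of what such an adapter could look like follows. Every type here is a hypothetical stand-in invented for the example (SketchExtractor for a stream-based extractor without host dependencies, HostContent / HostParse / HostParser for the host-side types); none of them are the actual Nutch or Tika interfaces.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Library side (hypothetical): knows nothing about the host's classes.
interface SketchExtractor {
    Reader extract(InputStream stream, String contentType, String encoding) throws IOException;
}

// Host side (hypothetical stand-ins for the host's content and parse types).
final class HostContent {
    final byte[] bytes;
    final String contentType;
    final String encoding; // may be null
    HostContent(byte[] bytes, String contentType, String encoding) {
        this.bytes = bytes;
        this.contentType = contentType;
        this.encoding = encoding;
    }
}

final class HostParse {
    final String text;
    HostParse(String text) { this.text = text; }
}

interface HostParser {
    HostParse parse(HostContent content);
}

// Trivial extractor for the demo: decodes the bytes as text (UTF-8 if no encoding is given).
final class PlainTextExtractor implements SketchExtractor {
    public Reader extract(InputStream stream, String contentType, String encoding) {
        Charset charset = encoding == null ? StandardCharsets.UTF_8 : Charset.forName(encoding);
        return new InputStreamReader(stream, charset);
    }
}

// The adapter: implements the host interface and delegates to the extractor.
final class ExtractorAdapter implements HostParser {
    private final SketchExtractor delegate;

    ExtractorAdapter(SketchExtractor delegate) { this.delegate = delegate; }

    public HostParse parse(HostContent content) {
        StringBuilder text = new StringBuilder();
        // Intermediate objects created per call: a stream, a reader, a buffer, a builder.
        try (Reader reader = delegate.extract(
                new ByteArrayInputStream(content.bytes), content.contentType, content.encoding)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new HostParse(text.toString());
    }

    public static void main(String[] args) {
        HostParser parser = new ExtractorAdapter(new PlainTextExtractor());
        byte[] bytes = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(parser.parse(new HostContent(bytes, "text/plain", "UTF-8")).text);
    }
}
```

Even this small sketch shows both concerns: each call allocates several intermediate objects, and it only carries text across the boundary, so any metadata would need yet another mapping step.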

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
Contact: info at sigram dot com
