nutch-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Thoughts on Parser design and dependencies
Date Fri, 18 Aug 2006 21:28:16 GMT
Sami Siren wrote:
> Andrzej Bialecki wrote:
>> Jukka Zitting wrote:
>>
>>> The Parser interface is also bound to the ideas of fetching content
>>> from the network and indexing it using a standard content model
>>> through the Content and Parse dependencies. For the Tika project I'd
>>> like to look for ways to generalize this, as neither of these ideas
>>> apply for example to the needs of the Apache Jackrabbit project. My
>>> TextExtractor proposal avoids these dependencies by using just a
>>> binary stream, a content type and an optional character encoding to
>>> produce a single text stream, but that approach fails to support more
>>> structured index content models. I'm trying to find a solution that
>>> combines the best parts of both approaches.
>>
>> A very important aspect of the Parser interface (or actually, the 
>> Parse and Content classes) is that they each may contain arbitrary 
>> metadata. This is required for discovering and passing around both 
>> the original metadata (such as protocol headers, document properties, 
>> etc), and other secondary content (such as data from external 
>> sources, or derived metadata).
>>
>> Simply returning a String doesn't cut it. Returning a java.util.Map 
>> may be an option, if you use standard Metadata constants as keys - 
>> still, Nutch would have to repackage this anyway into a Writable. And 
>> we would lose a nice property of the current Metadata class, which is 
>> the ability to tolerate minor syntax variations and to store multiple 
>> values per key.
>>
> The tolerance for syntax variations should, instead of being written 
> into the metadata object, live in a separate class, perhaps implemented 
> as a decorator around the actual metadata. In fact, the places where 
> Nutch needs to take advantage of this functionality (actually in the 
> case of HTTP headers only??) are fewer than those where we know exactly 
> the names of the metadata keys (because we put them there).
>
> I'd +1 if we went for a Map as an interface to metadata and at the 
> same time perhaps changed the Crawldb's metadata to the same metadata 
> implementation or a subclass of it.

Hmm. Please keep in mind that we need to use a Writable, both for the 
Map itself and also for every value that we put there. I'm worried that 
this could lead to excessive re-packaging of all objects coming out of 
Parsers, from their original formats (Map<String, String>) to MapWritable.

Since the goal here is to get rid of dependencies on Nutch or Hadoop, 
Nutch will have to do this conversion itself, because Tika would not 
support Writable.
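
To make the repackaging cost concrete, here is a minimal sketch of what 
that conversion could look like on the Nutch side. It deliberately does 
not use Hadoop: SketchWritable, SketchText and SketchMapWritable are 
hypothetical stand-ins that only mimic the shape of a Writable, and the 
parser output is assumed to be a plain Map<String, String>.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a Hadoop-style Writable (not the real interface).
interface SketchWritable {
    void write(DataOutput out) throws IOException;
}

// Hypothetical wrapper that makes a single String value serializable.
final class SketchText implements SketchWritable {
    private final String value;

    SketchText(String value) {
        this.value = value;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(value);
    }
}

// Hypothetical writable map: the map itself and every value are writable.
final class SketchMapWritable implements SketchWritable {
    private final Map<String, SketchWritable> entries = new LinkedHashMap<>();

    void put(String key, SketchWritable value) {
        entries.put(key, value);
    }

    int size() {
        return entries.size();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(entries.size());
        for (Map.Entry<String, SketchWritable> e : entries.entrySet()) {
            out.writeUTF(e.getKey());
            e.getValue().write(out);
        }
    }

    // The repackaging step: one new wrapper object per parsed value.
    static SketchMapWritable fromPlainMap(Map<String, String> parsed) {
        SketchMapWritable result = new SketchMapWritable();
        for (Map.Entry<String, String> e : parsed.entrySet()) {
            result.put(e.getKey(), new SketchText(e.getValue()));
        }
        return result;
    }

    // Convenience for experiments: serialize into an in-memory buffer.
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream data = new DataOutputStream(buffer);
            write(data);
            data.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point of the sketch is only that fromPlainMap() has to touch every 
entry and allocate a wrapper for each value, for every parsed document.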

>
> Perhaps we could even go for Map<String,String>, or is there actually 
> some use case for having multiple values for a single key?

The original motivation for this was HTTP headers and meta tags, which 
can have multiple values. Another case is language identification, where 
the same key may have multiple values coming from different sources. 
Additionally, MapWritable supports any Writable, which is quite handy 
for storing non-string data and avoiding conversion to/from strings.
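
As an illustration of the multiple-values case, here is a minimal sketch 
of a container that keeps every value added under a key, in insertion 
order. The class name MultiValuedMetadataSketch and the "language" key 
are made up for this example; this is not the existing Nutch Metadata 
class, and it only covers String values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical container, for illustration only.
final class MultiValuedMetadataSketch {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Appends a value instead of overwriting the previous one.
    void add(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Returns all values for the key, or an empty list if there are none.
    List<String> getValues(String key) {
        List<String> found = values.get(key);
        if (found == null) {
            return Collections.<String>emptyList();
        }
        return Collections.unmodifiableList(found);
    }

    // Returns the first value for the key, or null if there is none.
    String getFirst(String key) {
        List<String> found = values.get(key);
        return (found == null || found.isEmpty()) ? null : found.get(0);
    }
}
```

With this, a language reported by an HTTP header and a different one 
reported by a meta tag can both be kept, e.g. add("language", "en") 
followed by add("language", "pl"), whereas a Map<String, String> would 
keep only the last one.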

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


