nutch-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Antwort: Re: parse-plugins.xml
Date Fri, 04 Aug 2006 14:50:42 GMT
marcel.schnippe@provinzial.com wrote:
>>> to index, then we may think of either (a) removing it from the default
>>>       
>> +1
>>     
> -1
> This is not the right way. Better to keep parse-text as the default parser,
> but not fall back to parse-text automatically when the custom parser fails.
> The custom parser (PDF in this case) can itself choose to retry with
> parse-text.
>   

... which involves the first step, which we just did, i.e. removing 
parse-text from parse-plugins.xml entry for PDF. The enhancement that 
you propose makes sense of course, but it's the next step. Would you 
like to prepare a patch for this?
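
To make the proposed enhancement concrete, here is a minimal sketch of a parser that decides for itself whether to retry as plain text, instead of relying on an automatic fallback. Everything in it is an assumption for illustration: `ContentParser`, `ExplicitFallbackParser` and `looksLikePdf` are hypothetical names, not Nutch classes, and the retry policy (retry only when the content lacks the `%PDF-` header) is just one possible choice.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical parser abstraction for this sketch; NOT the Nutch Parser API.
interface ContentParser {
    Optional<String> parse(byte[] content);
}

// Wraps a primary parser and retries with a text parser only when it chooses to.
class ExplicitFallbackParser implements ContentParser {
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private final ContentParser primary;    // e.g. a PDF parser
    private final ContentParser textParser; // e.g. a plain-text parser

    ExplicitFallbackParser(ContentParser primary, ContentParser textParser) {
        this.primary = primary;
        this.textParser = textParser;
    }

    @Override
    public Optional<String> parse(byte[] content) {
        Optional<String> result = primary.parse(content);
        if (result.isPresent()) {
            return result;
        }
        // Assumed policy: content that really is a PDF but failed to parse is
        // dropped; content that only claimed to be a PDF is retried as text.
        if (!looksLikePdf(content)) {
            return textParser.parse(content);
        }
        return Optional.empty();
    }

    // True when the content starts with the "%PDF-" header bytes.
    static boolean looksLikePdf(byte[] content) {
        if (content.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (content[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The point of the design is that the decision lives in the parser, so a broken binary PDF never reaches the text parser by default.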


> +1, this won't work
>
> But what about: C) A better default parser is needed. It could
> determine if the analysed byte stream is statistically human language and 
>   

Of which humans? Statistical profiles for, say, Devanagari, Kanji, 
Arabic and Latin are somewhat different ...

> decide to use the byte stream, or to drop it if it's binary. Maybe it could
> filter language words contained in binary data (like .exe).
>   

What you probably mean is something equivalent to Unix strings(1). I 
have a plugin that implements this, which I could contribute if there's 
interest.
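
For readers unfamiliar with strings(1), the idea can be sketched as follows. This is an illustrative approximation only, not the plugin mentioned above: the class name `StringsSketch` and its methods are hypothetical, it handles only printable ASCII (so it does nothing useful for the non-Latin scripts discussed earlier), and it takes the minimum run length as a parameter, where strings(1) defaults to 4.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for this sketch; not part of Nutch.
class StringsSketch {

    // Returns every run of at least minLen consecutive printable ASCII bytes.
    static List<String> extract(byte[] data, int minLen) {
        List<String> runs = new ArrayList<>();
        int start = -1; // index where the current printable run began, or -1
        // Iterate one step past the end so a run that reaches EOF is flushed.
        for (int i = 0; i <= data.length; i++) {
            boolean printable = i < data.length && isPrintable(data[i]);
            if (printable) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                if (i - start >= minLen) {
                    runs.add(new String(data, start, i - start, StandardCharsets.US_ASCII));
                }
                start = -1;
            }
        }
        return runs;
    }

    // Printable here means space through '~', plus tab. Bytes are signed in
    // Java, so values above 0x7f are negative and fail this test.
    static boolean isPrintable(byte b) {
        return (b >= 0x20 && b <= 0x7e) || b == '\t';
    }

    public static void main(String[] args) {
        byte[] sample = {0x00, 'h', 'e', 'l', 'l', 'o', 0x01, 'a', 'b',
                         (byte) 0xff, 'w', 'o', 'r', 'l', 'd'};
        for (String s : extract(sample, 4)) {
            System.out.println(s);
        }
    }
}
```

Short runs such as "ab" in the sample are discarded because they fall below the minimum length, which is what keeps most random binary noise out of the result.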

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


