tika-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Boilerpipe is nice, but what about readability?
Date Sun, 02 Jan 2011 20:23:05 GMT
Hi Otis - thanks for the nudge.

Hi Benson - yes, something like this would be useful.

My personal preference for how to integrate things like this into Tika
is to create a ContentHandler. Then it's trivial to use for extracting
body content, and you can use the TeeContentHandler to add it in
parallel with other handlers.

See BoilerpipeContentHandler in Tika for one example of this approach,
though that code got a bit messy when I changed it to support
including markup.
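
To make the idea concrete, here is a minimal sketch of the handler approach
using only the SAX classes that ship with the JDK. ParagraphTextHandler and
its extract helper are hypothetical names made up for illustration; this is
not the readability port and not how BoilerpipeContentHandler is written. It
simply keeps the text found inside <p> elements of the XHTML event stream,
which is the kind of stream a Tika parser emits.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical stand-in for a readability-style handler: it collects only
 * the text that appears inside <p> elements and ignores everything else.
 */
class ParagraphTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int paragraphDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isParagraph(localName, qName)) {
            paragraphDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isParagraph(localName, qName)) {
            paragraphDepth--;
            text.append('\n'); // separate paragraphs with a newline
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (paragraphDepth > 0) {
            text.append(ch, start, length);
        }
    }

    // Accept either form so the handler works with or without namespaces.
    private static boolean isParagraph(String localName, String qName) {
        return "p".equals(localName) || "p".equals(qName);
    }

    public String getText() {
        return text.toString().trim();
    }

    /** Test helper: feeds a well-formed XHTML string through the JDK SAX parser. */
    static String extract(String xhtml) {
        try {
            ParagraphTextHandler handler = new ParagraphTextHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getText();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    // With Tika on the classpath, the wiring would look roughly like this
    // (an untested sketch, shown as a comment so the block stays JDK-only):
    //
    //   ContentHandler body = new BodyContentHandler();
    //   ParagraphTextHandler main = new ParagraphTextHandler();
    //   parser.parse(stream, new TeeContentHandler(body, main), metadata, context);
}
```

Because the handler only depends on SAX events, it can be tried out against
any well-formed XHTML string before hooking it up to a real parser.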

-- Ken

On Jan 2, 2011, at 10:55am, Otis Gospodnetic wrote:

> Somehow this nice offer didn't seem to attract any responses -
> http://search-lucene.com/m/ZTMKyJXNR92
>
> +1 for this patch.
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> ----- Original Message ----
>> From: Benson Margulies <bimargulies@gmail.com>
>> To: dev@tika.apache.org
>> Sent: Thu, November 4, 2010 9:02:10 AM
>> Subject: Boilerpipe is nice, but what about readability?
>>
>> I just coded a Java port of the arclabs 'readability' javascript code,
>> which has a very strong reputation as a device for grabbing the useful
>> content from newsy web pages.
>>
>> I could contribute it to Tika, if (a) you wanted it, and (b) there was
>> some reasonable way to decide or configure which one to use.
>>

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





