lucene-solr-user mailing list archives

From Ken Krugler
Subject Re: boilerpipe solr tika howto please
Date Fri, 14 Jan 2011 17:04:45 GMT
Hi Arno,

On Jan 14, 2011, at 3:57am, arnaud gaudinat wrote:

> Hello,
> I would like to use BoilerPipe (a very good program which cleans the  
> html content from surplus "clutter").
> I saw that BoilerPipe is inside Tika 0.8 and so should be accessible  
> from solr, am I right?
> How can I activate BoilerPipe in Solr? Do I need to change  
> solrconfig.xml ( with  
> org.apache.solr.handler.extraction.ExtractingRequestHandler)?
> Or do I need to modify some code inside Solr?
> I saw something like TikaCLI -F in the tika forum (

> ) is it the right way?

You need to add the BoilerpipeContentHandler into Tika's content  
handler chain.

Which I'm pretty sure means you'd need to modify Solr, e.g. (in trunk)  
the TikaEntityProcessor.getHtmlHandler() method. I'd try something like:

	return new BoilerpipeContentHandler(new ContentHandlerDecorator(....

Though from a quick look at that code, I'm curious why it doesn't use  
BodyContentHandler, versus the current ContentHandlerDecorator.
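
To illustrate what "adding a handler into the content handler chain" means, here is a minimal sketch using only the JDK's SAX classes. It is not Tika or Solr code: TextCollector and SkipNavHandler are hypothetical stand-ins, and the "drop everything inside <nav>" rule is just a toy filter. In Tika, the outer role would be played by BoilerpipeContentHandler wrapping another ContentHandler, as in the line above.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class HandlerChainSketch {

    /** Inner handler (hypothetical): collects all character data it receives. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Outer handler (hypothetical): drops events inside <nav>, forwards the rest. */
    static class SkipNavHandler extends DefaultHandler {
        private final ContentHandler delegate;
        private int navDepth = 0; // > 0 while inside a <nav> element

        SkipNavHandler(ContentHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            if ("nav".equals(qName)) {
                navDepth++;
            } else if (navDepth == 0) {
                delegate.startElement(uri, localName, qName, atts);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            if ("nav".equals(qName)) {
                navDepth--;
            } else if (navDepth == 0) {
                delegate.endElement(uri, localName, qName);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (navDepth == 0) {
                delegate.characters(ch, start, length);
            }
        }
    }

    /** Parses the markup through the chain: parser -> SkipNavHandler -> TextCollector. */
    static String extract(String xml) throws Exception {
        TextCollector collector = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new SkipNavHandler(collector));
        return collector.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract("<html><nav>Home | About</nav><p>Main text</p></html>"));
    }
}
```

The parser only ever talks to the outermost handler, so wrapping one more handler around the existing one changes what reaches the inner handler without touching the parser itself.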

-- Ken

Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g
