lucene-solr-user mailing list archives

From arnaud gaudinat <arnaud.gaudi...@gmail.com>
Subject Re: boilerpipe solr tika howto please
Date Mon, 17 Jan 2011 11:17:00 GMT
Thanks Ken,
this is what I wanted to know. I'm not very familiar with this kind of 
modification, but I will try it and ask you for more information 
if needed.
Regards,

Arno

On 14.01.2011 18:04, Ken Krugler wrote:
> Hi Arno,
>
> On Jan 14, 2011, at 3:57am, arnaud gaudinat wrote:
>
>> Hello,
>>
>> I would like to use BoilerPipe (a very good program which cleans the 
>> html content from surplus "clutter").
>> I saw that BoilerPipe is inside Tika 0.8 and so should be accessible 
>> from solr, am I right?
>>
>> How can I activate BoilerPipe in Solr? Do I need to change 
>> solrconfig.xml (with 
>> org.apache.solr.handler.extraction.ExtractingRequestHandler)?
>>
>> Or do I need to modify some code inside Solr?
>>
>> I saw something like TikaCLI -F in the tika forum 
>> (http://www.lucidimagination.com/search/document/242ce3a17f30f466/boilerpipe_integration).
>> Is it the right way?
>
> You need to add the BoilerpipeContentHandler into Tika's content 
> handler chain.
>
> Which I'm pretty sure means you'd need to modify Solr, e.g. (in trunk) 
> the TikaEntityProcessor.getHtmlHandler() method. I'd try something like:
>
>     return new BoilerpipeContentHandler(new ContentHandlerDecorator(....
>
> Though from a quick look at that code, I'm curious why it doesn't use 
> BodyContentHandler, versus the current ContentHandlerDecorator.
>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
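
For readers unfamiliar with what "adding a handler into the content handler chain" means, here is a minimal sketch using only the SAX classes that ship with the JDK. It does not use Tika or Boilerpipe: DropNavHandler is a hypothetical stand-in for BoilerpipeContentHandler (it simply drops text inside <nav> elements), and TextCollector stands in for whatever handler Solr passes downstream. All class and method names here are assumptions made for illustration, not Solr or Tika code.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class HandlerChainSketch {

    /** Innermost handler: collects whatever character data reaches it. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /**
     * Hypothetical stand-in for BoilerpipeContentHandler: wraps another
     * handler and forwards element and character events to it, except
     * for anything inside a <nav> element.
     */
    static class DropNavHandler extends DefaultHandler {
        private final ContentHandler delegate;
        private int navDepth = 0;

        DropNavHandler(ContentHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) throws SAXException {
            if ("nav".equals(qName)) navDepth++;
            if (navDepth == 0) delegate.startElement(uri, local, qName, atts);
        }

        @Override
        public void endElement(String uri, String local, String qName)
                throws SAXException {
            if (navDepth == 0) delegate.endElement(uri, local, qName);
            if ("nav".equals(qName)) navDepth--;
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            if (navDepth == 0) delegate.characters(ch, start, length);
        }
    }

    /** Parses the XML through the chain: parser -> DropNavHandler -> TextCollector. */
    static String extract(String xml) {
        try {
            TextCollector collector = new TextCollector();
            ContentHandler chain = new DropNavHandler(collector);
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(chain);
            reader.parse(new InputSource(new StringReader(xml)));
            return collector.text.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><nav>Home | About</nav>"
                    + "<p>Article text</p></body></html>";
        System.out.println(extract(page)); // prints "Article text"
    }
}
```

Ken's suggestion has the same shape: the Boilerpipe handler wraps the handler Solr already builds, so the parser feeds the wrapper and only the filtered events reach the inner handler.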