lucene-solr-user mailing list archives

From Adam Estrada <estrada.a...@gmail.com>
Subject Re: boilerpipe solr tika howto please
Date Fri, 14 Jan 2011 16:54:02 GMT
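There is another way to ingest the data: use the DataImportHandler (DIH).
Check out the HTMLStripTransformer, which strips the HTML markup from any
field flagged with stripHTML="true":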

      <entity name="CDC"
        pk="link"
        dataSource="filedatasource"
        url="http://www2c.cdc.gov/podcasts/createrss.asp?t=r&amp;c=19"
        processor="XPathEntityProcessor"
        forEach="/rss/channel | /rss/channel/item"
        transformer="DateFormatTransformer,HTMLStripTransformer">

        <!-- Channel-level values; commonField copies them onto every item -->
        <field column="source"       xpath="/rss/channel/title"       commonField="true" />
        <field column="source-link"  xpath="/rss/channel/link"        commonField="true" />
        <field column="subject"      xpath="/rss/channel/description" commonField="true" />

        <!-- Item-level values -->
        <field column="title"        xpath="/rss/channel/item/title" />
        <field column="link"         xpath="/rss/channel/item/link" />
        <!-- stripHTML is handled by HTMLStripTransformer -->
        <field column="description"  xpath="/rss/channel/item/description" stripHTML="true" />
        <field column="creator"      xpath="/rss/channel/item/creator" />
        <field column="item-subject" xpath="/rss/channel/item/subject" />
        <field column="author"       xpath="/rss/channel/item/author" />
        <field column="comments"     xpath="/rss/channel/item/comments" />
        <!-- dateTimeFormat is handled by DateFormatTransformer -->
        <field column="pubdate"      xpath="/rss/channel/item/pubDate"
               dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss z" />
        <field column="dcdate"       xpath="/rss/channel/item/date"
               dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
        <field column="lat"          xpath="/rss/channel/item/lat" />
        <field column="lng"          xpath="/rss/channel/item/long" />
      </entity>
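
For context, the entity above has to sit inside a DIH configuration file that
also declares the data source it refers to. A minimal sketch of that wrapper
follows. The file name data-config.xml and the URLDataSource type are
assumptions (the entity fetches an RSS feed over HTTP, so a URL-based source
seems the natural fit); only the name "filedatasource" comes from the entity
itself.

```xml
<!-- data-config.xml (assumed file name) -->
<dataConfig>
  <!-- Assumption: a URL-based data source, because the entity's url
       attribute points at an HTTP feed. The name must match the one
       the entity refers to. -->
  <dataSource name="filedatasource" type="URLDataSource" encoding="UTF-8" />
  <document>
    <!-- The <entity name="CDC" ...> element shown above goes here. -->
  </document>
</dataConfig>
```

To use it, solrconfig.xml needs a request handler of class
org.apache.solr.handler.dataimport.DataImportHandler whose "config" default
points at this file. If that handler is registered at /dataimport (the usual
choice, but an assumption here), a request to /dataimport?command=full-import
starts the import.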

On Fri, Jan 14, 2011 at 11:10 AM, arnaud gaudinat
<arnaud.gaudinat@gmail.com> wrote:

> I just looked at TagSoup, and it seems to clean up bad HTML tags to produce
> a well-formed HTML file.
> What BoilerPipe does is try to eliminate HTML content that is not part
> of the useful content for a human reader (i.e. navigation, ads,
> comments...).
> Take a look here: http://boilerpipe-web.appspot.com/ and try it with one of
> your URLs.
>
> Another application of this type is 'Readability', which is aimed more at
> end users (http://lab.arc90.com/experiments/readability/).
>
>
> On 14.01.2011 16:51, Adam Estrada wrote:
>
>> Is there a drastic difference between this and TagSoup, which is already
>> included in Solr?
>>
>> On Fri, Jan 14, 2011 at 6:57 AM, arnaud gaudinat
>> <arnaud.gaudinat@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I would like to use BoilerPipe (a very good program that strips surplus
>>> "clutter" from HTML content).
>>> I saw that BoilerPipe is included in Tika 0.8 and so should be accessible
>>> from Solr, am I right?
>>>
>>> How can I activate BoilerPipe in Solr? Do I need to change solrconfig.xml
>>> (with org.apache.solr.handler.extraction.ExtractingRequestHandler)?
>>>
>>> Or do I need to modify some code inside Solr?
>>>
>>> I saw something like TikaCLI -F in the Tika forum
>>> (http://www.lucidimagination.com/search/document/242ce3a17f30f466/boilerpipe_integration).
>>> Is that the right way?
>>>
>>> Thanks in advance,
>>>
>>> Arno.
>>>
>>>
>>>
>
