manifoldcf-user mailing list archives

From Arcadius Ahouansou <arcad...@menelic.com>
Subject Re: Extracting Content from Web Crawler using the new PipeLine
Date Thu, 23 Oct 2014 15:57:01 GMT
Hello Abe-San.

Thank you for the response.

The BoilerPipe library I was referring to helps remove common, repetitive
page components such as menu items, headings, footers, etc. from the crawled
content.

There is a Solr patch that I have been using at
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SOLR-3808

I thought it would be good to have ManifoldCF do this instead.

It would also be interesting for ManifoldCF to be able to extract the content
of HTML tags such as div, h1, ... as Solr does.
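
For reference, the Solr-side approach can be sketched roughly as below. This
is only an illustration of the ExtractingRequestHandler "capture" and
"fmap.<tag>" request parameters; the Solr URL, the core name, and the target
field names (heading_h1_txt, heading_h2_txt) are assumptions made up for the
example, not something taken from an actual setup.

```python
from urllib.parse import urlencode

# Assumption: base URL and core name are hypothetical placeholders.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/collection1/update/extract"


def build_extract_url(doc_id, tags, field_suffix="_txt"):
    """Build an /update/extract URL that captures the given HTML tags.

    For each tag, "capture=<tag>" asks Solr Cell to index that element's
    text separately, and "fmap.<tag>=<field>" renames it to a Solr field.
    The "heading_<tag>_txt" field naming is a hypothetical convention.
    """
    params = [("literal.id", doc_id), ("commit", "true")]
    for tag in tags:
        params.append(("capture", tag))
        params.append((f"fmap.{tag}", f"heading_{tag}{field_suffix}"))
    return f"{SOLR_EXTRACT_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # The document body itself would be POSTed to this URL.
    print(build_extract_url("doc1", ["h1", "h2"]))
```

The function only builds the request URL; sending the crawled page to it is
left out, since that depends on the client in use.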

Thanks
On 23 Oct 2014 07:03, "Shinichiro Abe" <shinichiro.abe.1@gmail.com> wrote:

> Hi Arcadius,
>
> > - use Tika's BoilerPipe to get cleaner content from web sites?
> Yes, Tika extractor will remove tags in html
> and send content and metadata to downstream pipeline/output connection.
>
> > - What about extracting specific HTML tags such as all h1 or h2 and map
> > them to a Solr field?
> No, currently it can map only metadata which is extracted by Tika to Solr
> field.
> For h1, h2, p tags etc,  Tika extractor doesn't capture them and doesn't
> treat them as metadata.
> Currently, to capture these tags and map them to fields,
> we have to use Solr's ExtractingRequestHandler (CAPTURE_ELEMENTS param).
>
> Regards,
> Shinichiro Abe
>
> On 2014/10/23, at 10:21, Arcadius Ahouansou <arcadius@menelic.com> wrote:
>
> >
> > Hello.
> >
> > Given that we now have pipelines in ManifoldCF, how feasible is it to:
> >
> > - use Tika's BoilerPipe to get cleaner content from web sites?
> > - What about extracting specific HTML tags such as all h1 or h2 and map
> > them to a Solr field?
> >
> > Thank you very much.
> >
> > Arcadius.
> >
>
>
