Hi Abe-san,

Is this capability a configurable function of Tika? If so, we could add Tika configuration to the Tika Extractor.

Karl

On Thu, Oct 23, 2014 at 2:03 AM, Shinichiro Abe <shinichiro.abe.1@gmail.com> wrote:
Hi Arcadius,

> - use Tika's BoilerPipe to get cleaner content from web sites?
Yes, the Tika extractor removes the HTML tags
and sends the content and metadata to the downstream pipeline/output connection.

> - What about extracting specific HTML tags such as all h1 or h2 and map them to a Solr field?
No, currently it can only map metadata extracted by Tika to Solr fields.
The Tika extractor doesn't capture tags such as h1, h2, or p, and doesn't treat them as metadata.
For now, to capture these tags and map them to fields,
we have to use Solr's ExtractingRequestHandler (the CAPTURE_ELEMENTS param).
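
As a rough sketch of what such a request could look like: CAPTURE_ELEMENTS is the constant behind the "capture" request parameter, and "fmap.<name>" maps a captured element to a field. The Solr URL, document id, and field names below are made-up placeholders, and I am assuming the handler is registered at /update/extract and that "capture" may be repeated once per element.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, field_map):
    """Build an ExtractingRequestHandler URL that captures XHTML elements.

    field_map maps an element name (e.g. "h1") to a target Solr field.
    """
    params = [("literal.id", doc_id)]
    for element, field in field_map.items():
        # Ask Solr to capture the text of this element separately.
        params.append(("capture", element))
        # Map the captured element to the desired Solr field.
        params.append((f"fmap.{element}", field))
    params.append(("commit", "true"))
    return f"{base_url}/update/extract?{urlencode(params)}"


url = build_extract_url(
    "http://localhost:8983/solr/collection1",      # hypothetical core URL
    "doc1",                                        # hypothetical document id
    {"h1": "heading1_txt", "h2": "heading2_txt"},  # hypothetical field names
)
print(url)
```

The HTML document itself would then be POSTed to that URL as the request body, so this only applies when Solr does the extraction rather than the Tika extractor in the pipeline.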

Regards,
Shinichiro Abe

On 2014/10/23, at 10:21, Arcadius Ahouansou <arcadius@menelic.com> wrote:

>
> Hello.
>
> Given that we now have pipelines in ManifoldCF, how feasible is it to:
>
> - use Tika's BoilerPipe to get cleaner content from web sites?
> - What about extracting specific HTML tags such as all h1 or h2 and map them to a Solr field?
>
> Thank you very much.
>
> Arcadius.
>