tika-dev mailing list archives

From "Aleksei Udalov (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-2610) Extend HtmlMapper isDiscardElement method with Attributes parameter
Date Mon, 19 Mar 2018 11:12:00 GMT

     [ https://issues.apache.org/jira/browse/TIKA-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aleksei Udalov updated TIKA-2610:
---------------------------------
    Description: 
Currently, discarding HTML elements by attribute value or existence requires implementing
a custom handler whose logic is very similar to what we have in
org.apache.tika.parser.html.HtmlHandler. An example from one of our projects:
{code:html}
<div data-meta-no-index>Some content to be ignored by custom search indexer (Tika parser)</div>
{code}
This could be done much more easily by continuing to use HtmlHandler and setting into the
ParseContext an instance of HtmlMapper that overrides a (newly added)
isDiscardElement(String name, Attributes attributes) method.

  was:
Currently, disregarding HTML elements by attribute value or existence requires implementing
a custom handler whose logic is very similar to what we have in
org.apache.tika.parser.html.HtmlHandler. An example from one of our projects:
{code:html}
<div data-meta-no-index>Some content to be ignored by custom search indexer (Tika parser)</div>
{code}
This could be done much more easily by continuing to use HtmlHandler and setting into the
ParseContext an instance of HtmlMapper that overrides a (newly added)
isDiscardElement(String name, Attributes attributes) method.


> Extend HtmlMapper isDiscardElement method with Attributes parameter
> -------------------------------------------------------------------
>
>                 Key: TIKA-2610
>                 URL: https://issues.apache.org/jira/browse/TIKA-2610
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.17
>            Reporter: Aleksei Udalov
>            Priority: Major
>
> Currently, discarding HTML elements by attribute value or existence requires implementing
> a custom handler whose logic is very similar to what we have in
> org.apache.tika.parser.html.HtmlHandler. An example from one of our projects:
> {code:html}
> <div data-meta-no-index>Some content to be ignored by custom search indexer (Tika parser)</div>
> {code}
> This could be done much more easily by continuing to use HtmlHandler and setting into the
> ParseContext an instance of HtmlMapper that overrides a (newly added)
> isDiscardElement(String name, Attributes attributes) method.
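
The attribute-based check requested above could look roughly like the sketch below. It
does not use Tika itself: AttributeAwareMapper is a hypothetical stand-in for an HtmlMapper
extended with the proposed two-argument method, and NoIndexMapper and DiscardByAttributeDemo
are illustrative names. Only the org.xml.sax types are real JDK APIs.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical stand-in for HtmlMapper with the proposed method (not the Tika interface).
interface AttributeAwareMapper {
    boolean isDiscardElement(String name, Attributes attributes);
}

// Discards any element carrying the marker attribute, whatever its value.
final class NoIndexMapper implements AttributeAwareMapper {
    static final String MARKER = "data-meta-no-index";

    @Override
    public boolean isDiscardElement(String name, Attributes attributes) {
        // Attributes.getIndex(qName) returns -1 when the attribute is absent.
        return attributes != null && attributes.getIndex(MARKER) >= 0;
    }
}

class DiscardByAttributeDemo {
    public static void main(String[] args) {
        AttributeAwareMapper mapper = new NoIndexMapper();

        // <div data-meta-no-index> : marker present with an empty value
        AttributesImpl marked = new AttributesImpl();
        marked.addAttribute("", NoIndexMapper.MARKER, NoIndexMapper.MARKER, "CDATA", "");

        // <div class="body"> : marker absent
        AttributesImpl plain = new AttributesImpl();
        plain.addAttribute("", "class", "class", "CDATA", "body");

        System.out.println(mapper.isDiscardElement("div", marked)); // true
        System.out.println(mapper.isDiscardElement("div", plain));  // false
    }
}
```

With the proposed change, the same override would presumably live in an HtmlMapper
implementation placed into the ParseContext, so that HtmlHandler can keep being used as is.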



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
