tika-dev mailing list archives

From "Harinder (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2755) Allow Tika to skip extraction of <img> tags in HTML
Date Fri, 12 Oct 2018 18:17:00 GMT
Harinder created TIKA-2755:
------------------------------

             Summary: Allow Tika to skip extraction of <img> tags in HTML
                 Key: TIKA-2755
                 URL: https://issues.apache.org/jira/browse/TIKA-2755
             Project: Tika
          Issue Type: Improvement
          Components: server
    Affects Versions: 1.19.1
            Reporter: Harinder
         Attachments: TestForImageTag.html

We use Tika Server to extract text from HTML files. Tika renders the alt text of each <img>
tag in the extracted text as _[image: this is the alt text of the image]_. That marker gets
indexed into Solr and shows up in the results when we generate document summaries at query
time (via Solr’s highlight functionality).

If you PUT the attached HTML file (TestForImageTag.html) to /tika, it returns the following response:
{code:java}
[image: Return to the homepage]
This is a test{code}
It would be nice to have just this instead:
{code:java}
This is a test{code}
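
Until Tika offers such an option, the markers could be stripped on the client side before the text is sent to Solr. The following is only a minimal sketch of that workaround, not a Tika feature: the function name `strip_image_markers` is hypothetical, and the pattern assumes that markers always have the form shown above and that the alt text contains no `]` character.

```python
import re

# Matches the marker format quoted in this report: "[image: <alt text>]".
# Assumption: the alt text itself never contains a "]" character.
IMAGE_MARKER = re.compile(r"\[image: [^\]]*\]")


def strip_image_markers(text: str) -> str:
    """Remove "[image: ...]" markers from text extracted by Tika.

    Lines that consist only of a marker are dropped entirely;
    lines without a marker are passed through unchanged.
    """
    kept = []
    for line in text.splitlines():
        if IMAGE_MARKER.search(line) is None:
            kept.append(line)  # no marker on this line, keep as-is
            continue
        cleaned = IMAGE_MARKER.sub("", line)
        if cleaned.strip():
            kept.append(cleaned)  # marker was inline, keep the remaining text
        # otherwise the line held only a marker, so it is dropped
    return "\n".join(kept)


if __name__ == "__main__":
    response = "[image: Return to the homepage]\nThis is a test"
    print(strip_image_markers(response))
```

Applied to the response above, this leaves only the `This is a test` line. It is a stopgap: any genuine document text that happens to match the same pattern would be removed too, which is why an option in Tika itself would be preferable.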



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
