tika-dev mailing list archives

From "Harinder (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-2755) Allow Tika to skip extraction of <img> tags in HTML
Date Fri, 12 Oct 2018 20:54:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16648426#comment-16648426 ]

Harinder edited comment on TIKA-2755 at 10/12/18 8:53 PM:
----------------------------------------------------------

Sorry for not being clear. I am using Tika Server.

Here is what I do:

1. Start Tika Server 
{code:java}
java -jar tika-server-1.19.1.jar{code}
2. Send a PUT request to [http://localhost:9998/tika] via Postman

    Content-Type: text/html
    Send the file in the body of the request

3. See that Tika's response has [image: ] tags in it

 

It is clear from your comment that the -t option is what I need. How can I configure that
in Tika Server?

 

EDIT: Curl example

curl -X PUT --data-binary @TestForImageTag.html http://localhost:9998/tika --header "Content-type: text/html"
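
For anyone who would rather reproduce this from a script than from Postman or curl, here is a minimal Python sketch of the same PUT request. It uses only the standard library. The function names `build_tika_request` and `extract_text` are made up for illustration, and decoding the response as UTF-8 is an assumption about the server's output, not something confirmed above.

```python
import urllib.request

# Endpoint from the steps above; change it if your server listens elsewhere.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(html_bytes: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request with the HTML file as the body."""
    return urllib.request.Request(
        url,
        data=html_bytes,
        method="PUT",
        headers={"Content-Type": "text/html"},
    )


def extract_text(html_bytes: bytes, url: str = TIKA_URL) -> str:
    """Send the request to a running Tika Server and return the response body.

    Assumption: the response is plain text encoded as UTF-8.
    """
    with urllib.request.urlopen(build_tika_request(html_bytes, url)) as resp:
        return resp.read().decode("utf-8")
```

With the server from step 1 running, `extract_text(open("TestForImageTag.html", "rb").read())` should return the same text that Postman and curl show.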


> Allow Tika to skip extraction of <img> tags in HTML
> ---------------------------------------------------
>
>                 Key: TIKA-2755
>                 URL: https://issues.apache.org/jira/browse/TIKA-2755
>             Project: Tika
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 1.19.1
>            Reporter: Harinder
>            Priority: Major
>         Attachments: TestForImageTag.html
>
>
> We are using Tika Server to extract text from HTML files. Tika extracts the alt text
> of image tags present in HTML files as _[image: this is the alt text of the image]_. This
> ends up in Solr and shows up in the results when we generate document summaries at query time
> (via Solr’s highlight functionality).
> If you PUT the attached HTML file to /tika, it will return the following response:
> {code:java}
> [image: Return to the homepage]
> This is a test{code}
> It would be nice to have just this instead:
> {code:java}
> This is a test {code}
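
Until the server itself can skip these, one possible client-side workaround is to strip the markers from the response before it is indexed. This is only a sketch: the helper name `strip_image_markers` is hypothetical, and the regular expression assumes that markers always look like `[image: ...]` and that the alt text never contains a `]` character.

```python
import re

# Matches a bracketed marker such as "[image: Return to the homepage]".
# Assumption: the alt text itself never contains a "]" character.
IMAGE_MARKER = re.compile(r"\[image:[^\]]*\]")


def strip_image_markers(text: str) -> str:
    """Remove [image: ...] markers and drop lines that held only a marker."""
    cleaned_lines = []
    for line in text.splitlines():
        stripped = IMAGE_MARKER.sub("", line)
        # Keep the line unless removing a marker left it empty.
        if stripped.strip() or not IMAGE_MARKER.search(line):
            cleaned_lines.append(stripped)
    return "\n".join(cleaned_lines)
```

Applied to the response quoted above, `strip_image_markers("[image: Return to the homepage]\nThis is a test")` yields `"This is a test"`. A side effect to be aware of: any ordinary text that happens to look like `[image: ...]` would be removed as well.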



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
