tika-dev mailing list archives

From "Johan (Jira)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-3007) Heic images are detected as "application/mp4" when using tika as server
Date Tue, 17 Dec 2019 10:39:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998072#comment-16998072 ]

Johan commented on TIKA-3007:


OK, we see that your call above does indeed work, but we still have some questions about how
this relates to Content-Type in either the server version or the app version.
*Question 1:* Should the -j option (output metadata as JSON) work on heic/heif images the same
way as on other file types?

We took the images from {{tika-parsers/src/test/resources/test-documents}} as examples
to explain the different results we see, which seem to be inconsistent.

-d on the app or /detect/stream on the server returns image/heic, so that is good:

java -jar tika-app-1.23.jar -d ~/Desktop/testHEIF.heic

curl -X PUT --data-binary @"$HOME/Desktop/testHEIF.heic" http://localhost:9998/detect/stream

-j on the app does not return anything for heic/heif images, but it does for a normal jpg:

java -jar tika-app-1.23.jar -j ~/Desktop/testHEIF.heic
# nothing
java -jar tika-app-1.23.jar -j ~/Desktop/baseball.jpg
{"Blue Colorant":"(0.1492, 0.0632, 0.7446)","Bl ...

Now that seems weird to us, because if you just ask for -m (metadata without JSON formatting) it
seems to work. It also works for -J, which gets the metadata for all embedded files.

java -jar tika-app-1.23.jar -m /Users/butsjoh/Desktop/testHEIF.heic
Content-Length: 13706
Content-Type: application/mp4
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.mp4.MP4Parser
resourceName: testHEIF.heic

java -jar tika-app-1.23.jar -J /Users/butsjoh/Desktop/testHEIF.heic

So is it expected that -j does not return anything while -m does? According to the CLI docs,
-j just returns the metadata in JSON format ("Output metadata in JSON").


*Question 2:* What is the rationale behind the difference between Content-Type and mime-type?

I am referring back to question 1 here: in the output of the -m and -J cases, application/mp4
is listed as the Content-Type for the heic/heif file. If we use the server and
ask for http://localhost:9998/meta/Content-Type, we also get back application/mp4. We would like
to understand why you consider the Content-Type to be different from the mime-type. If we only
ask for the metadata (-m, -J or -j on the app jar, /meta on the server), it does not contain
any information about the detected mime type at all, so we cannot identify this file as image/heic.
That is also why I initially created this ticket: we were still getting application/mp4
back because we were using /meta instead of /detect/stream.

Can you please explain the rationale behind this difference? The documentation does not
really say anything about it. To us it does not make sense that heic/heif images would still be
handled as application/mp4, and that you need to use the CLI or server differently
to get correct detection.
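
To make the difference easy to reproduce, here is a minimal shell sketch that sends the same file to the two server endpoints mentioned above and prints both answers side by side. The function names and the TIKA_URL variable are made up for illustration; only the endpoints and the port come from this comment. The response format of /meta/Content-Type is not assumed here, so the value is printed as returned.

```shell
# Base URL of a running tika-server (port as used earlier in this comment).
TIKA_URL="${TIKA_URL:-http://localhost:9998}"

# Ask the detector only: PUT the raw bytes to /detect/stream.
detect_type() {
    curl -s -X PUT --data-binary "@$1" "$TIKA_URL/detect/stream"
}

# Ask for the parsed metadata field: PUT the same bytes to /meta/Content-Type.
parsed_type() {
    curl -s -X PUT --data-binary "@$1" "$TIKA_URL/meta/Content-Type"
}

# Print both answers for one file (hypothetical helper).
compare_types() {
    printf 'detect/stream:     %s\n' "$(detect_type "$1")"
    printf 'meta/Content-Type: %s\n' "$(parsed_type "$1")"
}
```

Calling {{compare_types ~/Desktop/testHEIF.heic}} against tika-server-1.23 should, going by the results described above, show image/heic on the first line and application/mp4 on the second.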



> Heic images are detected as "application/mp4" when using tika as server
> -----------------------------------------------------------------------
>                 Key: TIKA-3007
>                 URL: https://issues.apache.org/jira/browse/TIKA-3007
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.23
>            Reporter: Johan
>            Priority: Blocker
> Related to https://issues.apache.org/jira/browse/TIKA-2942
> It seems the detection of heic images is working for the standalone jar (tika-app-1.23)
> but not for the server component (tika-server-1.23).
> tika-app-1.23.jar from [https://archive.apache.org/dist/tika/] detects the image as
> image/heic, but the server component tika-server-1.23.jar still returns
> "application/mp4". Any clue what might be going wrong? Has the code been added only to
> the tika jar client and not to the server?

This message was sent by Atlassian Jira
