tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2722) Don't call Date.toString (Possible issue with JDK 11)
Date Wed, 05 Sep 2018 06:34:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16603979#comment-16603979 ]

Nick Burch commented on TIKA-2722:

Currently, Tika stores all metadata internally as Strings. For typed properties, the getters
and setters convert between the native types and the strings, so that you can, for example,
get a {{Date}} back if you want one. (This also lets you get all metadata irrespective of
type if you want. Other approaches for storage have been suggested, but none has won the
argument for a change just yet!)
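
To make the string-backed idea concrete, here is a minimal sketch. It is not Tika's {{Metadata}} class: {{StringBackedMetadata}} and its method names are hypothetical, and it assumes ISO 8601 via {{java.time.Instant}} as the stored form, which is not necessarily what Tika writes.

```java
import java.time.Instant;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for illustration only; NOT Tika's Metadata class.
class StringBackedMetadata {
    // Everything is held as a String, whatever the logical type is.
    private final Map<String, String> values = new HashMap<>();

    // Untyped access: always available, regardless of the property type.
    void set(String name, String value) {
        values.put(name, value);
    }

    String get(String name) {
        return values.get(name);
    }

    // Typed setter: converts the Date to a locale-independent string.
    // Assumption: ISO 8601 in UTC, as produced by Instant.toString().
    void setDate(String name, Date date) {
        values.put(name, date.toInstant().toString());
    }

    // Typed getter: parses the stored string back into a Date.
    Date getDate(String name) {
        String raw = values.get(name);
        return raw == null ? null : Date.from(Instant.parse(raw));
    }

    public static void main(String[] args) {
        StringBackedMetadata metadata = new StringBackedMetadata();
        metadata.setDate("created", new Date(0L));
        System.out.println(metadata.get("created"));               // the raw string form
        System.out.println(metadata.getDate("created").getTime()); // the typed form
    }
}
```

The point of the design is that {{Date.toString}} never appears: the typed setter picks the string form, so the default locale of the JVM does not leak into the stored value.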


For {{Date}} properties, there's a bunch of logic in Tika that tries to take care of the formatting,
thread safety etc. See {{org.apache.tika.utils.DateUtils.formatDate}} for the full details.
That should all be going via {{String.format(Locale.ROOT, ....)}} to avoid any locale issues.
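
As a rough illustration of that approach (a sketch, not a copy of {{DateUtils.formatDate}}; the class name {{RootLocaleDateFormat}}, the helper {{formatUtc}} and the exact format pattern are assumptions), formatting through {{Locale.ROOT}} keeps the digits ASCII even when the default locale is "ar-EG":

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper for illustration; not Tika's actual implementation.
class RootLocaleDateFormat {

    // Formats a Date as an ISO 8601 timestamp in UTC.
    // A fresh Calendar is created per call, so no mutable state is shared
    // between threads.
    static String formatUtc(Date date) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"), Locale.ROOT);
        cal.setTime(date);
        // Locale.ROOT makes %d emit ASCII digits whatever the default locale is.
        return String.format(Locale.ROOT, "%04d-%02d-%02dT%02d:%02d:%02dZ",
                cal.get(Calendar.YEAR),
                cal.get(Calendar.MONTH) + 1, // Calendar months are zero-based
                cal.get(Calendar.DAY_OF_MONTH),
                cal.get(Calendar.HOUR_OF_DAY),
                cal.get(Calendar.MINUTE),
                cal.get(Calendar.SECOND));
    }

    public static void main(String[] args) {
        // Simulate the environment from the report.
        Locale.setDefault(Locale.forLanguageTag("ar-EG"));
        Date epoch = new Date(0L);
        System.out.println(formatUtc(epoch)); // locale-independent
        System.out.println(epoch);            // Date.toString, shown for comparison
    }
}
```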


For PDFs specifically, for the well-known typed Date properties, we ought to be getting a
{{Calendar}} back from PDFBox, then getting a {{Date}} object from that to set on the {{Metadata}}
object, which then formats it internally, with no {{toString}} calls. If you've found a case
where that route isn't being followed, a small PDF and possibly a unit test to show it would
be great, so we can fix that!

> Don't call Date.toString (Possible issue with JDK 11)
> -----------------------------------------------------
>                 Key: TIKA-2722
>                 URL: https://issues.apache.org/jira/browse/TIKA-2722
>             Project: Tika
>          Issue Type: Bug
>         Environment: Tika 1.18, JDK 11 with locale set to "ar-EG".  
>            Reporter: David Smiley
>            Priority: Major
> I'm troubleshooting [a test failure in Apache Lucene/Solr|https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/22799/] "extracting"
> contrib that occurs in JDK 11 with locale "ar-EG".  JDK 8 & 9 pass; I don't know about
> JDK 10. It has to do with extracting date metadata from a PDF, particularly the created date
> but perhaps others too.
> I stepped through the code into Tika and I think I've found out where the troublesome
> code is.  First note PDFParser line 271: {{addMetadata(metadata, "created", info.getCreationDate());}}.
> That addMetadata overload variant will call toString on a Date.  IMO that's asking for trouble
> since the output of that is Locale-dependent.  I think that's okay to show to a user but not
> for machine-to-machine information exchange.  In the case of the test, it yielded this odd
> looking date string:
> Thu Nov 13 18:35:51 GMT+٠٥:٠٠ 2008
> I pasted that in and it looks consistent with what I see in IntelliJ and in Jenkins logs;
> hopefully will post correctly to JIRA.  The odd part is the hour & minutes relative to
> GMT.  I won't be certain until after I click "Create".
> Perhaps this problem is also indicative of a JDK 11 bug?  Nevertheless I think Tika should
> avoid calling Date.toString().

This message was sent by Atlassian JIRA
