tika-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2722) Don't call Date.toString (Possible issue with JDK 11)
Date Wed, 05 Sep 2018 14:58:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604516#comment-16604516
] 

Uwe Schindler commented on TIKA-2722:
-------------------------------------

[~dsmiley]: I think this is a bug in Java 11. I know there were some changes to how time zones
are formatted. According to their docs, time zones are now printed according to the selected
locale or, if none is given, the default one. This is fine in most cases, but it seems to affect
locales whose digits are non-ASCII. Previously, time zones that have no name (numeric offset
only) seem to have been printed with ASCII digits. Nevertheless, only the time zone is printed
with locale-dependent digits, not the date itself (reason: no date formatter is used; toString
just concatenates integers to format the date, for compatibility reasons).

Did you send Rory O'Donnell a note? He can speed up the assignment of the JDK issue ID.

IMHO: TIKA should stop using java.util.Date and should go for java.time APIs, maybe start
with using Instant instead of Date.
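
A minimal sketch of what I mean, assuming the metadata value only has to be machine-readable.
The class and the toIsoString helper are hypothetical names for illustration, not existing Tika
code. Instant.toString() always produces ISO-8601 in UTC with ASCII digits, independent of the
default locale:

```java
import java.time.Instant;
import java.util.Date;
import java.util.Locale;

class DateMetadataSketch {
    // Hypothetical helper (not part of Tika): convert a legacy Date to a
    // locale-independent ISO-8601 string instead of calling Date.toString().
    static String toIsoString(Date date) {
        return date.toInstant().toString();
    }

    public static void main(String[] args) {
        // Simulate the failing environment from the issue.
        Locale.setDefault(Locale.forLanguageTag("ar-EG"));
        // Same point in time as the date in the report (18:35:51 at GMT+05:00).
        Date created = Date.from(Instant.parse("2008-11-13T13:35:51Z"));
        System.out.println(toIsoString(created)); // 2008-11-13T13:35:51Z
    }
}
```

The offset of the original value is lost here because Instant is always UTC. If callers need
the offset, it would have to be carried separately.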

> Don't call Date.toString (Possible issue with JDK 11)
> -----------------------------------------------------
>
>                 Key: TIKA-2722
>                 URL: https://issues.apache.org/jira/browse/TIKA-2722
>             Project: Tika
>          Issue Type: Bug
>         Environment: Tika 1.18, JDK 11 with locale set to "ar-EG".  
>            Reporter: David Smiley
>            Priority: Major
>
> I'm troubleshooting [a test failure in Apache Lucene/Solr|https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/22799/] "extracting"
contrib that occurs in JDK 11 with locale "ar-EG".  JDK 8 & 9 pass; I don't know about
JDK 10. It has to do with extracting date metadata from a PDF, particularly the created date
but perhaps others too.
> I stepped through the code into Tika and I think I've found out where the troublesome
code is.  First note PDFParser line 271: {{addMetadata(metadata, "created", info.getCreationDate());}}.
 That addMetadata overload variant will call toString on a Date.  IMO that's asking for trouble
since the output of that is Locale-dependent.  I think that's okay to show to a user but not
for machine-to-machine information exchange.  In the case of the test, it yielded this odd
looking date string:
> Thu Nov 13 18:35:51 GMT+٠٥:٠٠ 2008
> I pasted that in and it looks consistent with what I see in IntelliJ and in Jenkins logs;
hopefully will post correctly to JIRA.  The odd part is the hour & minutes relative to
GMT.  I won't be certain until after I click "Create".
> Perhaps this problem is also indicative of a JDK 11 bug?  Nevertheless I think Tika should
avoid calling Date.toString().
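
For the case described in the issue, where the offset relative to GMT is part of the value, a
possible replacement for Date.toString() could look like the following sketch. It assumes the
offset is known to the caller; the class and the formatWithOffset helper are hypothetical and
not taken from Tika. DateTimeFormatter.ISO_OFFSET_DATE_TIME is a predefined JDK formatter that
does not use locale-specific digits:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Date;
import java.util.Locale;

class OffsetDateSketch {
    // Hypothetical helper (not part of Tika): format a Date at a given offset
    // as ISO-8601, e.g. for machine-to-machine metadata exchange.
    static String formatWithOffset(Date date, ZoneOffset offset) {
        return DateTimeFormatter.ISO_OFFSET_DATE_TIME.format(date.toInstant().atOffset(offset));
    }

    public static void main(String[] args) {
        // Same default locale as in the failing test run.
        Locale.setDefault(Locale.forLanguageTag("ar-EG"));
        Date created = Date.from(Instant.parse("2008-11-13T13:35:51Z"));
        // The report shows this date at GMT+05:00.
        System.out.println(formatWithOffset(created, ZoneOffset.ofHours(5)));
        // 2008-11-13T18:35:51+05:00
    }
}
```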



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
