tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-1325) Move the font metadata definitions to properties
Date Mon, 09 Jun 2014 15:41:02 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025287#comment-14025287 ]

Tim Allison edited comment on TIKA-1325 at 6/9/14 3:39 PM:
-----------------------------------------------------------

Doh!  Same issue for those of us in non-standard land. :)

Failed tests:   testTTFParsing(org.apache.tika.parser.font.FontParsersTest): expected:<1904-01-01T0[0]:00:00Z>
but was:<1904-01-01T0[5]:00:00Z>

As of now, FontBox is setting the Calendar to my default time zone:
1904-01-01T00:00:00 (US Eastern, UTC-5 on that date)

When setTimeZone(UTC) in formatDate is called, this converts the calendar to UTC and the value
is now: 1904-01-01T05:00:00Z 
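The shift can be reproduced in isolation. This is a minimal sketch, not the actual FontBox or Tika code: `formatUtc` and `epochIn` are hypothetical names, and the fixed offset `GMT-05:00` stands in for a US Eastern default time zone.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

class TimeZoneShiftDemo {

    // Hypothetical stand-in for formatDate(Calendar): renders the instant in UTC.
    static String formatUtc(Calendar calendar) {
        SimpleDateFormat format =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(calendar.getTime());
    }

    // Builds midnight on 1904-01-01 as wall-clock time in the given zone.
    static Calendar epochIn(TimeZone zone) {
        Calendar calendar = new GregorianCalendar(zone);
        calendar.clear();
        calendar.set(1904, Calendar.JANUARY, 1, 0, 0, 0);
        return calendar;
    }

    public static void main(String[] args) {
        // Calendar created in UTC: prints 1904-01-01T00:00:00Z
        System.out.println(formatUtc(epochIn(TimeZone.getTimeZone("UTC"))));
        // Calendar created five hours behind UTC: prints 1904-01-01T05:00:00Z
        System.out.println(formatUtc(epochIn(TimeZone.getTimeZone("GMT-05:00"))));
    }
}
```

The same wall-clock midnight is a different instant in each zone, so normalizing to UTC exposes the offset as the `T05` that the test reports.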

I like the addition of formatDate(Calendar ...), and I like that it converts/normalizes to
UTC.

For this one test case, though, I think we need to add some modifications to the test case
until PDFBOX-2122 is fixed.  One simple thing we could do (given that we know the source of
the issue) is to set the default time zone to UTC before parser.parse:

{code}
        // Until PDFBOX-2122 is fixed, we need to set a common default
        // time zone for the sake of this test.
        TimeZone defaultTimeZone = TimeZone.getDefault();
        TimeZone.setDefault(TimeZone.getTimeZone("UTC"));

        try {
            parser.parse(stream, handler, metadata, context);
        } finally {
            // Restore the original default even if parsing fails,
            // so the change cannot leak into other tests.
            TimeZone.setDefault(defaultTimeZone);
            stream.close();
        }
{code}
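If more than one test ends up needing this workaround, the save/set/restore steps could be wrapped in a small helper. The following is only a sketch under that assumption: `DefaultTimeZoneScope` and `withDefault` are hypothetical names, not part of Tika or PDFBox.

```java
import java.util.TimeZone;
import java.util.function.Supplier;

class DefaultTimeZoneScope {

    // Runs body with the JVM default time zone temporarily replaced,
    // restoring the previous default even if body throws.
    static <T> T withDefault(TimeZone temporary, Supplier<T> body) {
        TimeZone original = TimeZone.getDefault();
        TimeZone.setDefault(temporary);
        try {
            return body.get();
        } finally {
            TimeZone.setDefault(original);
        }
    }

    public static void main(String[] args) {
        String seen = withDefault(TimeZone.getTimeZone("UTC"),
                () -> TimeZone.getDefault().getID());
        // Inside the scope the default is UTC: prints UTC
        System.out.println(seen);
    }
}
```

Note that `TimeZone.setDefault` changes the default for the whole JVM, so this is only safe when tests are not run in parallel within the same process.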


was (Author: tallison@mitre.org):
Doh!  Same issue for those of us in non-standard land. :)

Failed tests:   testTTFParsing(org.apache.tika.parser.font.FontParsersTest): expected:<1904-01-01T0[0]:00:00Z>
but was:<1904-01-01T0[5]:00:00Z>

As of now, FontBox is setting the Calendar to my default timezone:
1904-01-01T00:00:00 (EDT)

When setTimeZone(UTC) in formatDate is called, this converts the calendar to UTC and the value
is now: 1904-01-01T05:00:00Z 

I like the addition of formatDate(Calendar ...), and I like that it converts/normalizes to
UTC.

For this one test case, though, I think we need to add some modifications to the test case
until PDFBOX-2122 is fixed.  One simple thing we could do (given that we know the source of
the issue) is to set the default time zone to UTC before parser.parse:

{code}
         //until PDFBOX-2122 is fixed, we need to set a common default
         //for the sake of this test.
        TimeZone.setDefault(TimeZone.getTimeZone("UTC"));

        try {
            parser.parse(stream, handler, metadata, context);
        } finally {
            stream.close();
        }

{code}

> Move the font metadata definitions to properties
> ------------------------------------------------
>
>                 Key: TIKA-1325
>                 URL: https://issues.apache.org/jira/browse/TIKA-1325
>             Project: Tika
>          Issue Type: Improvement
>          Components: metadata, parser
>    Affects Versions: 1.5, 1.6
>            Reporter: Nick Burch
>         Attachments: TIKA-1325_TimeZone.patch
>
>
> As noticed while working on TIKA-1182, the AFM font parser has a bunch of hard-coded
> strings it uses as metadata keys, while the TTF font parser doesn't have many.
> We should switch these to being proper Properties, with definitions from a well-known
> standard (+ compatibility fallbacks), and have both use largely the same set.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
