tika-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dave Meikle (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-103) Excel parsing ignores cell formating
Date Tue, 05 Jan 2010 01:26:54 GMT

    [ https://issues.apache.org/jira/browse/TIKA-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796474#action_12796474
] 

Dave Meikle commented on TIKA-103:
----------------------------------

I am not sure how others feel about this issue but for me it is one I would like addressed
as I have some applications that are parsing excel files containing various types of formatting
and it would be good to have the 'as-is' value within the parsed content.

As can be seen above, I have attached an initial patch to have TikaHSSFListener use the FormatTrackingHSSFListener
proxy class to handle the current POI supported formatting - this moves us on some what leaving
the outstanding support something to progress within POI[1]. The reason I have attached the
patch instead of committing directly is that I would like to propose the following:

* This initial support is included in the up-coming 0.6 release
* An issue is raised against POI, and any fixes to support the other formatting is progressed
there.

Not sure what you all think?

I am going to have a track through to see how Niall got on, if he managed to get the time.

Cheers,
Dave

[1] I will continue to see if there are any user model features we can use to add further
support from the current POI code base.

> Excel parsing ignores cell formating
> ------------------------------------
>
>                 Key: TIKA-103
>                 URL: https://issues.apache.org/jira/browse/TIKA-103
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Niall Pemberton
>         Attachments: testEXCEL-formats.xls, tika-103_initial_patch.diff
>
>
> Unfortunately Excel stores dates as the number of days since 1900 (or 1904, but ignore
that atm)  with the time element being stored in the fractional part of the numeric value.
So for example 19 Jan 2008 04:35:01 is stored as Double value 39466.190980358806.  The only
way to make sense of the data is to look at the formatting on the cell. Although dates are
the worst case, it also affects other numeric values - currencies, percentages, scientific,
fractions and worst of all custom formats.
> POI recognises 49 "built in" formats of excel and for those it has the limited capability
of determining whether a numeric cell is a date or not and if it is, a utility to convert
to a java date, something like:
>     if (HSSFDateUtil.isCellDateFormatted(cell)) {
>         Date date = HSSFDateUtil.getJavaDate(cell.getNumericCellValue());
>     }
> The current ExcelParser implementation takes no account of the data format and IMO is
going to severly limit how useful that implementation is. I'm also think that the above while
improving the situation slightly is still not great. I asked about this on the POI dev list
a couple of days ago[1] and the only light is someone posted a format parser a few months
back. It sounds like POI will accept that contribution if it has unit tests. So I'm going
to try and find time to do that. If the data format can be properly parsed then it means being
able to extract it in the format the users sees it within Excel - which IMO would be the ideal
situation.
> [1] http://www.mail-archive.com/dev@poi.apache.org/msg00582.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message