tika-dev mailing list archives

From "Nick C (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1513) Add mime detection and parsing for dbf files
Date Sun, 10 Apr 2016 20:56:25 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15234276#comment-15234276 ]

Nick C commented on TIKA-1513:

I wrote the detector from scratch a couple of months ago because matching on the 0x03 signature byte alone caused too many false positives.
For the parser I ended up using jdbf, but found some bugs. One was that the parser would error
if inputStream.read(...) returned fewer bytes than required (the code needs to
use something like IOUtils.readFully).
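
To illustrate the short-read problem: InputStream.read(byte[], int, int) may return fewer bytes than requested even when more data is coming, so a header reader has to loop. The following is only a minimal sketch of that loop, not jdbf's or Tika's code; the class name ReadFullySketch and the TrickleStream helper are hypothetical and exist purely to simulate a stream that hands back one byte per call.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

class ReadFullySketch {

    /** Fills buf completely, looping over short reads; throws EOFException if the stream ends first. */
    static void readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) {
                throw new EOFException("needed " + buf.length + " bytes, got " + off);
            }
            off += n;
        }
    }

    /** Hypothetical test stream that returns at most one byte per read call. */
    static class TrickleStream extends InputStream {
        private final InputStream in;

        TrickleStream(byte[] data) {
            this.in = new ByteArrayInputStream(data);
        }

        @Override
        public int read() throws IOException {
            return in.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return in.read(b, off, Math.min(len, 1));
        }
    }
}
```

In practice an existing helper such as Commons IO's IOUtils.readFully(InputStream, byte[]) or the JDK's DataInputStream.readFully(byte[]) does the same job.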

The logic I used was
- Validate the signature
- Validate the header last update date (Is the month between 1 and 12 and is the day valid
for that month)
- Validate the header size by dividing by 32 and making sure there aren't more than 255 fields
- Calculate the file size using the record count, header length and record length from the
header, making sure it's less than 4GB. If I can get the input stream length without reading
the entire stream (TikaInputStream.hasLength or metadata.content_length), I make sure the calculated
size matches (or is within 2 bytes).
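
The checks above could be sketched roughly as follows. This is not the detector code referred to in this comment; it is an illustration under several assumptions: the class and method names are hypothetical, the field offsets follow the commonly documented dBASE III header layout (version byte at 0, YY MM DD at 1 to 3, little-endian record count at 4, header length at 8, record length at 10), the two accepted version bytes are an illustrative subset, and a negative streamLength is used here to mean "length unknown".

```java
import java.time.YearMonth;

class DbfHeaderCheck {

    // 4GB ceiling on the calculated file size, as described in the list above.
    static final long MAX_SIZE = 4L * 1024 * 1024 * 1024;

    /** Reads an unsigned little-endian integer of the given width in bytes. */
    static long le(byte[] b, int off, int width) {
        long v = 0;
        for (int i = width - 1; i >= 0; i--) {
            v = (v << 8) | (b[off + i] & 0xFF);
        }
        return v;
    }

    /** Returns true if the first 32 header bytes pass every plausibility check. */
    static boolean looksLikeDbf(byte[] h, long streamLength) {
        if (h.length < 32) return false;

        // 1. Signature (assumption: only two dBASE III variants are accepted here).
        int version = h[0] & 0xFF;
        if (version != 0x03 && version != 0x83) return false;

        // 2. Last-update date: month in 1..12 and day valid for that month.
        int year = 1900 + (h[1] & 0xFF);
        int month = h[2] & 0xFF;
        int day = h[3] & 0xFF;
        if (month < 1 || month > 12) return false;
        if (day < 1 || day > YearMonth.of(year, month).lengthOfMonth()) return false;

        long records = le(h, 4, 4);
        long headerLen = le(h, 8, 2);
        long recordLen = le(h, 10, 2);

        // 3. Header size: 32-byte fixed part plus 32 bytes per field descriptor.
        //    Integer division drops the one-byte terminator after the descriptors.
        long fields = (headerLen - 32) / 32;
        if (fields < 1 || fields > 255) return false;

        // 4. Calculated file size must stay under 4GB and, when the real length
        //    is known, match it to within 2 bytes.
        long expected = headerLen + records * recordLen;
        if (expected >= MAX_SIZE) return false;
        if (streamLength >= 0 && Math.abs(expected - streamLength) > 2) return false;

        return true;
    }
}
```

The point of stacking the checks is that each one alone is weak (many files start with 0x03), but few non-dbf files pass all of them together.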

I'll put the code up on github tomorrow and get a list of the jdbf bugs.

> Add mime detection and parsing for dbf files
> --------------------------------------------
>                 Key: TIKA-1513
>                 URL: https://issues.apache.org/jira/browse/TIKA-1513
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>             Fix For: 1.13
> I just came across an Apache licensed dbf parser that is available on [maven|https://repo1.maven.org/maven2/org/jamel/dbf/dbf-reader/0.1.0/dbf-reader-0.1.0.pom].
> Let's add dbf parsing to Tika.
> Any other recommendations for alternate parsers?

This message was sent by Atlassian JIRA