tika-dev mailing list archives

From "Joey Hong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1817) Extracts entire file content for ASCII DXF files
Date Fri, 01 Jan 2016 02:36:39 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076229#comment-15076229 ]

Joey Hong commented on TIKA-1817:

Oh, also, regarding the implementation of binary DXF: I made my best guess at what the file
would look like based on some online tutorials on the format, but I had trouble finding actual
binary DXF files to test my implementation on. Does anyone know where I can get
my hands on some sample files?
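
For anyone sorting through candidate sample files, here is a minimal sketch of how a binary DXF could be told apart from an ASCII one. It assumes the 22-byte sentinel described in the Autodesk DXF reference ("AutoCAD Binary DXF" followed by CR, LF, SUB, NUL). The class and method names are hypothetical and are not taken from the Tika patch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BinaryDxfSniffer {

    // Assumed sentinel: 18 ASCII characters plus CR, LF, SUB (0x1A), NUL = 22 bytes.
    private static final byte[] SENTINEL =
            "AutoCAD Binary DXF\r\n\u001a\0".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given leading bytes of a file start with the sentinel. */
    public static boolean isBinaryDxf(byte[] prefix) {
        if (prefix == null || prefix.length < SENTINEL.length) {
            return false;
        }
        // Compare only the first 22 bytes; anything after the sentinel is ignored.
        return Arrays.equals(Arrays.copyOf(prefix, SENTINEL.length), SENTINEL);
    }
}
```

This only checks the header; it says nothing about whether the rest of the file parses correctly.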

> Extracts entire file content for ASCII DXF files
> ------------------------------------------------
>                 Key: TIKA-1817
>                 URL: https://issues.apache.org/jira/browse/TIKA-1817
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.11
>            Reporter: Zoltan Toth
>         Attachments: SMA-Controller.dxf, house design.dxf, jcsample-screendump.jpg, jcsample.dxf
> By definition, ASCII DXF files are encoded in plain text.  However, the vast majority
> of their content is not intended to be human readable (see https://en.wikipedia.org/wiki/AutoCAD_DXF).
> Unfortunately, for these files Tika simply "extracts" the entire content of the file instead
> of only the human-readable portions (i.e. comments etc.) that a CAD tool would render.  This results
> in massive amounts of rubbish data being returned, with dire consequences for applications
> that rely on this.
> It would be nice if only the human-readable text fields were extracted.  Failing this,
> it would still be nice if no text were extracted from these files at all.
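
As an illustration of the behaviour requested above, here is a rough sketch of pulling only the human-readable strings out of an ASCII DXF. It is not the Tika implementation. It assumes the file is a sequence of (group code, value) line pairs, that group code 999 marks a comment, and that TEXT and MTEXT entities carry their strings in group codes 1 and 3. The class name DxfTextSketch is made up for this example, and other text-bearing entities are deliberately ignored.

```java
import java.util.ArrayList;
import java.util.List;

public class DxfTextSketch {

    /** Collects comments and TEXT/MTEXT strings from ASCII DXF content. */
    public static List<String> extractText(String dxf) {
        List<String> out = new ArrayList<>();
        String[] lines = dxf.split("\\r?\\n");
        String entity = ""; // value of the most recent group code 0

        // Walk the file two lines at a time: group code, then its value.
        for (int i = 0; i + 1 < lines.length; i += 2) {
            int code;
            try {
                code = Integer.parseInt(lines[i].trim());
            } catch (NumberFormatException e) {
                break; // not a well-formed pair; stop rather than guess
            }
            String value = lines[i + 1];

            if (code == 0) {
                entity = value.trim();          // start of a new entity or section marker
            } else if (code == 999) {
                out.add(value);                 // comment
            } else if ((code == 1 || code == 3)
                    && (entity.equals("TEXT") || entity.equals("MTEXT"))) {
                out.add(value);                 // text carried by a TEXT/MTEXT entity
            }
        }
        return out;
    }
}
```

Everything else (coordinates, layer names, handles) is skipped, so a drawing with no text entities or comments yields an empty list, which matches the fallback of extracting nothing at all.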

This message was sent by Atlassian JIRA
