tika-dev mailing list archives

From "Joey Hong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1817) Extracts entire file content for ASCII DXF files
Date Fri, 01 Jan 2016 02:36:39 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076229#comment-15076229
] 

Joey Hong commented on TIKA-1817:
---------------------------------

One more thing, regarding the binary DXF implementation: I made my best guess at the file
layout based on some online tutorials on the format, but I had trouble finding actual
binary DXF files to test my implementation against. Does anyone know where I can get
some sample files?
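
For anyone wanting to check whether a candidate sample really is binary DXF, here is a
minimal sketch. It assumes the layout described in the Autodesk DXF reference, where a
binary file begins with a 22-byte sentinel: the text "AutoCAD Binary DXF" followed by
CR, LF, SUB (0x1A) and NUL. The class and method names are hypothetical and are not
taken from the Tika code base.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: checks for the binary DXF sentinel.
class BinaryDxfSniffer {

    // Builds the assumed 22-byte sentinel: 18 ASCII characters plus CR, LF, SUB, NUL.
    static byte[] sentinel() {
        byte[] text = "AutoCAD Binary DXF".getBytes(StandardCharsets.US_ASCII);
        byte[] s = Arrays.copyOf(text, text.length + 4);
        s[text.length] = 0x0D;     // CR
        s[text.length + 1] = 0x0A; // LF
        s[text.length + 2] = 0x1A; // SUB
        s[text.length + 3] = 0x00; // NUL
        return s;
    }

    // Returns true if the given leading bytes of a file start with the sentinel.
    static boolean isBinaryDxf(byte[] head) {
        byte[] s = sentinel();
        return head.length >= s.length
                && Arrays.equals(Arrays.copyOf(head, s.length), s);
    }
}
```

An ASCII DXF file normally starts with a group code line instead, so it would fail
this check.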

> Extracts entire file content for ASCII DXF files
> ------------------------------------------------
>
>                 Key: TIKA-1817
>                 URL: https://issues.apache.org/jira/browse/TIKA-1817
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.11
>            Reporter: Zoltan Toth
>         Attachments: SMA-Controller.dxf, house design.dxf, jcsample-screendump.jpg, jcsample.dxf
>
>
> By definition, ASCII DXF files are encoded in plain text. However, the vast majority
> of their content is not intended to be human-readable (see https://en.wikipedia.org/wiki/AutoCAD_DXF).
> Unfortunately, for these files Tika simply "extracts" the entire content of the file instead
> of only the human-readable portions (comments, text labels, etc.) that a CAD tool would render.
> This results in massive amounts of rubbish data being returned, with dire consequences for
> applications that rely on the extracted text.
> It would be nice if only the human-readable text fields were extracted. Failing that,
> it would still be preferable if no text were extracted from these files at all.
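
The behaviour requested above can be illustrated with a rough sketch. An ASCII DXF
file is a sequence of line pairs: a numeric group code followed by its value. The
sketch below keeps only comment values (group code 999) and the text of TEXT and MTEXT
entities (group codes 1 and 3), and discards everything else. Which entities count as
"human-readable" is an assumption made here for illustration; other entity types can
also carry text. The class name is hypothetical and this is not the parser that ships
with Tika.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Tika's implementation: pulls human-readable
// strings out of ASCII DXF content given as a single string.
class DxfTextSketch {

    static List<String> extractText(String dxf) {
        List<String> out = new ArrayList<>();
        String[] lines = dxf.split("\\R"); // split on any line terminator
        String entity = "";                // most recent group code 0 value

        // Walk the file as (group code, value) pairs.
        for (int i = 0; i + 1 < lines.length; i += 2) {
            int code;
            try {
                code = Integer.parseInt(lines[i].trim());
            } catch (NumberFormatException e) {
                break; // pairs are out of sync; give up rather than emit rubbish
            }
            String value = lines[i + 1];

            if (code == 0) {
                entity = value.trim();     // start of a new entity or section marker
            } else if (code == 999) {
                out.add(value.trim());     // DXF comment
            } else if ((code == 1 || code == 3)
                    && (entity.equals("TEXT") || entity.equals("MTEXT"))) {
                out.add(value.trim());     // text string of a text entity
            }
        }
        return out;
    }
}
```

Layer names, coordinates and section markers are skipped because their group codes
never match, so geometry-only entities such as LINE contribute nothing to the output.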



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
