tika-dev mailing list archives

From "Matthew Caruana Galizia (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2473) PCX and DCX image support
Date Fri, 06 Oct 2017 10:32:00 GMT
Matthew Caruana Galizia created TIKA-2473:
---------------------------------------------

             Summary: PCX and DCX image support
                 Key: TIKA-2473
                 URL: https://issues.apache.org/jira/browse/TIKA-2473
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.16
            Reporter: Matthew Caruana Galizia


It's straightforward in theory to implement support for PCX and DCX. Support already exists
in Commons Imaging as well as in ImageIO via TwelveMonkeys.

In practice, however, I'm not really sure how to implement support. We obviously want to OCR
the images, but Tesseract has no support for the format. So where do we do the conversion
to a BufferedImage? I tried to find out how JBIG2 files are handled, but I can't find
that anywhere.
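
For illustration only, the conversion step itself could look roughly like the sketch below. It assumes an ImageIO plugin for PCX (for example the TwelveMonkeys one) is on the classpath, so that ImageIO.read can decode the file, and it re-encodes the result as PNG, which Tesseract can read. The class and method names are hypothetical and not part of Tika; where this step should live in Tika is still the open question.

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

// Hypothetical helper, not an existing Tika class.
public class PcxToPng {

    /**
     * Decodes any file ImageIO has a reader for and re-encodes it as PNG.
     * PCX input only works if a PCX ImageIO plugin is registered (assumption).
     */
    public static File convertToPng(File input, File output) throws IOException {
        // ImageIO.read returns null when no registered reader claims the file.
        BufferedImage image = ImageIO.read(input);
        if (image == null) {
            throw new IOException("No ImageIO reader available for " + input);
        }
        // ImageIO.write returns false when no writer exists for the format name.
        if (!ImageIO.write(image, "png", output)) {
            throw new IOException("No PNG writer available");
        }
        return output;
    }
}
```

The resulting PNG file could then be handed to the OCR step in place of the original image. DCX files can contain several pages, so they would need a reader-based loop rather than a single ImageIO.read call.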



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
