tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2167) Image processing causes OCR to fail
Date Mon, 07 Nov 2016 12:43:59 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15644077#comment-15644077 ]

Tim Allison commented on TIKA-2167:
-----------------------------------

Thank you for opening this.  Can you give more info on what is failing?  What is the stacktrace?  How are you running Tika (tika-app, tika-server, API)?

When I run this in trunk via the API:
{noformat}
    @Test
    public void testTiff() throws Exception {
        XMLResult r = getXML("simple.tiff");
        System.out.println(r.xml);
    }
{noformat}

I get this: 
{noformat}
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.ocr.TesseractOCRParser" />
<meta name="Content-Type" content="image/tiff" />
<title></title>
</head>
<body><div class="ocr">HEAVY
METAL

</div>
<html>

<meta name="Strip Byte Counts" content="23139 2217 bytes" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.ocr.TesseractOCRParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.image.TiffParser" />
<meta name="Compression" content="LZW" />
<meta name="File Modified Date" content="Mon Nov 07 12:38:12 +00:00 2016" />
<meta name="Predictor" content="2" />
<meta name="tiff:SamplesPerPixel" content="3" />
<meta name="Unknown tag (0x0153)" content="1 1 1" />
<meta name="tiff:ImageLength" content="165" />
<meta name="Samples Per Pixel" content="3 samples/pixel" />
<meta name="Inter Color Profile" content="[3144 values]" />
<meta name="Image Height" content="165 pixels" />
<meta name="Strip Offsets" content="8 23147" />
<meta name="Orientation" content="Top, left side (Horizontal / normal)" />
<meta name="tiff:Orientation" content="1" />
<meta name="Planar Configuration" content="Chunky (contiguous for each subsampling pixel)" />
<meta name="Image Width" content="306 pixels" />
<meta name="Photometric Interpretation" content="RGB" />
<meta name="File Size" content="28710 bytes" />
<meta name="Rows Per Strip" content="142 rows/strip" />
<meta name="File Name" content="apache-tika-434186512334376884.tmp" />
<meta name="tiff:BitsPerSample" content="8" />
<meta name="tiff:ImageWidth" content="306" />
<meta name="Content-Type" content="image/tiff" />
<meta name="Bits Per Sample" content="8 8 8 bits/component/pixel" />
<title></title>

<body /></html></body></html>
{noformat}

Looks like we need to fix the xhtml (just opened TIKA-2169), but I'm not getting a fail...
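
The output above is still well-formed XML, since every tag is balanced, but a second <html> element is nested inside the outer <body>, which is presumably the xhtml problem. A rough sketch of how a test could flag that, using only the JDK's SAX parser; the class name, method name, and the trimmed-down sample strings are made up for illustration and are not part of Tika:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not a Tika class: counts <html> elements in parser output.
public class NestedHtmlCheck {

    // Shortened stand-in for the output above: an inner <html> inside the outer <body>.
    static final String NESTED =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
        + "<meta name=\"Content-Type\" content=\"image/tiff\" /><title></title></head>"
        + "<body><div class=\"ocr\">HEAVY\nMETAL\n</div>"
        + "<html><meta name=\"Compression\" content=\"LZW\" /><title></title>"
        + "<body /></html></body></html>";

    // What a single, un-nested document would look like (assumed shape).
    static final String CLEAN =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
        + "<meta name=\"Content-Type\" content=\"image/tiff\" /><title></title></head>"
        + "<body><div class=\"ocr\">HEAVY\nMETAL\n</div></body></html>";

    /** Returns how many elements with local name "html" the document contains. */
    static int countHtmlElements(String xml) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("html".equals(localName)) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so localName is populated
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse output", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        System.out.println("nested sample: " + countHtmlElements(NESTED)); // prints 2
        System.out.println("clean sample:  " + countHtmlElements(CLEAN));  // prints 1
    }
}
```

A count greater than 1 means a second document got written into the first one's body. A plain well-formedness check would not catch this, because the nested tags still close in the right order.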

> Image processing causes OCR to fail
> -----------------------------------
>
>                 Key: TIKA-2167
>                 URL: https://issues.apache.org/jira/browse/TIKA-2167
>             Project: Tika
>          Issue Type: Bug
>          Components: ocr
>    Affects Versions: 1.14
>         Environment: Mac OS X 10.11.6; Java 1.8.0_45; tesseract 3.04.01; ImageMagick 6.9.6-2
>            Reporter: Matthew Caruana Galizia
>            Priority: Critical
>              Labels: convert, image, ocr, tiff
>         Attachments: simple.tiff
>
>
> Image processing before OCR is enabled by default in the OCR configuration properties file. Unless this is disabled, running Tika on a simple TIFF image (attached) with two clear words fails. When image processing is disabled, it succeeds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
