tika-dev mailing list archives

From "Sandeepan (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2261) TikaOcr giving different result across platforms
Date Wed, 08 Feb 2017 13:56:41 GMT
Sandeepan created TIKA-2261:
-------------------------------

             Summary: TikaOcr giving different result across platforms
                 Key: TIKA-2261
                 URL: https://issues.apache.org/jira/browse/TIKA-2261
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.14
            Reporter: Sandeepan


Hi,

I am using Tika to parse every type of file, and it works great for non-image files.

My local machine is a Mac, but I deploy on Ubuntu 14.04. On the command line, I get the
same result on both platforms.
Example Command
tesseract 3.jpg ouput -l eng -psm 1 txt

But when I run the OCR through Java code, the results differ considerably between the two
platforms, and the quality is worse on Ubuntu.

Sample Code

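        // Needs: org.apache.tika.parser.AutoDetectParser, org.apache.tika.sax.BodyContentHandler,
        //        org.apache.tika.metadata.Metadata, java.io.FileInputStream, java.io.InputStream
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
        Metadata metadata = new Metadata();
        // try-with-resources closes the stream once parsing is done
        try (InputStream in = new FileInputStream(path)) {
            parser.parse(in, handler, metadata);
        }
        parsedText = handler.toString();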
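
To put a number on "very different", the parsedText obtained on each platform could be compared with a plain edit-distance measure. The following is only a minimal sketch using the Java standard library; the class name OcrDiff, the helper names, and the two sample strings are hypothetical and do not come from Tika or from the actual OCR output.

```java
public class OcrDiff {
    /** Levenshtein edit distance between two strings, using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]; 1.0 means the two texts are identical. */
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / longest;
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for parsedText captured on each platform.
        String macText = "Invoice total 120";
        String ubuntuText = "lnvoice tota1 120";
        System.out.println("similarity = " + similarity(macText, ubuntuText));
    }
}
```

The running time grows with the product of the two text lengths, so for long documents it may be more practical to compare page by page or line by line.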

On Mac:
$ tesseract -v
tesseract 3.04.01
 leptonica-1.74.1
  libjpeg 8d : libpng 1.6.28 : libtiff 4.0.7 : zlib 1.2.8

On Ubuntu:
ubuntu@ubuntu-4gb-postprocess:~$ tesseract -v
tesseract 3.04.01
 leptonica-1.74.1
  libjpeg 8d : libpng 1.6.28 : libtiff 4.0.7 : zlib 1.2.8

I am not able to figure out what the issue is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
