tika-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2696) Support output of Tesseract OSD output for psm mode 0
Date Wed, 10 Oct 2018 14:15:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645023#comment-16645023 ]

ASF GitHub Bot commented on TIKA-2696:

tballison commented on issue #246: TIKA-2696 Add support for OSD output, contributed by @4U6U57
URL: https://github.com/apache/tika/pull/246#issuecomment-428588697
   Sorry for my delay.  
   Three things.
   1) What would you think of "parsing" the output into metadata fields?  That info feels
to me far more like metadata than "content".
   2) Do we know as of which version of Tesseract the OSD info is written to an .osd file rather
than to stdout?  On my ancient version (see below), that info is dumped to stderr, not written
to an .osd file.  Should we just rely on users having a more modern version that writes an
.osd file?
   3) If we do parse this info, do we have a sense of how much it changes across versions?
   This is what I see with an old version of Tesseract:
   Tesseract Open Source OCR Engine v3.04.00 with Leptonica
   Orientation: 0
   Orientation in degrees: 0
   Orientation confidence: 28.65
   Script: 1
   Script confidence: 144.00
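
   To make point 1 concrete, here is a minimal sketch of turning OSD output into key/value pairs
that could later be copied into metadata fields. It is not Tika code: the class name OsdOutputParser
and its parse method are hypothetical, and it assumes every useful line has the "Key: value" shape
seen in the samples in this thread. Values stay as strings because the "Script" field is numeric in
the 3.04.00 output above but a name ("Latin") in newer output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika: splits Tesseract OSD output into key/value pairs.
final class OsdOutputParser {

    private OsdOutputParser() {
    }

    // Returns the "Key: value" lines of the OSD output in the order they appear.
    static Map<String, String> parse(String osdOutput) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : osdOutput.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // version banner, warnings, and blank lines carry no colon
            }
            String key = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (!key.isEmpty() && !value.isEmpty()) {
                fields.put(key, value);
            }
        }
        return fields;
    }
}
```

   Skipping lines without a colon drops the version banner and the resolution warnings, which
is one way to tolerate some of the variation across versions raised in point 3.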

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:

> Support output of Tesseract OSD output for psm mode 0
> -----------------------------------------------------
>                 Key: TIKA-2696
>                 URL: https://issues.apache.org/jira/browse/TIKA-2696
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr
>            Reporter: August Valera
>            Priority: Minor
> TIKA-2357 added support for additional PSM (page segmentation modes) for Tesseract OCR,
including mode 0, which is {{Orientation and script detection (OSD) only}}, meaning it does
not perform OCR, just outputs orientation and script information.
> An example usage of mode 0:
> {code:java}
> $ tesseract infile.png outfile --psm 0 -l osd
> {code}
> In this mode, the usual {{outfile.txt}} is not created. Instead, and similar to other
modes that run OSD in addition to extraction, the result is an {{outfile.osd}} file, like:
> {code:java}
> Page 1
> Warning. Invalid resolution 0 dpi. Using 70 instead.
> Estimating resolution as 212
> Page number: 0
> Orientation in degrees: 0
> Rotate: 0
> Orientation confidence: 13.73
> Script: Latin
> Script confidence: 4.78
> {code}
> However, {{TesseractOCRParser#parse(...)}} is [coded|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java#L437] to
only read the contents of {{outfile.txt}} (alternatively {{outfile.hocr}}) in all modes, so
mode 0 outputs nothing regardless of input.
> This is consistent with Tika's goal to output extracted text, but against the intention
of the user expecting OSD output.
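
The mismatch described above comes down to which file is read after Tesseract exits. As a rough sketch only, the choice could be made from the PSM value. The class TesseractOutputFiles and its resolve method are hypothetical and are not the actual TesseractOCRParser code. The sketch also assumes a Tesseract version that writes an .osd file in mode 0, which, per the comment above, older versions such as 3.04.00 do not.

```java
import java.io.File;

// Hypothetical helper, not Tika's implementation: picks the file Tesseract is expected to write.
final class TesseractOutputFiles {

    private TesseractOutputFiles() {
    }

    // outputBase is the output name passed to tesseract, for example "outfile".
    static File resolve(File outputBase, int psm, boolean hocr) {
        String extension;
        if (psm == 0) {
            extension = ".osd";  // OSD-only mode: no .txt or .hocr file is created
        } else if (hocr) {
            extension = ".hocr";
        } else {
            extension = ".txt";
        }
        return new File(outputBase.getPath() + extension);
    }
}
```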

This message was sent by Atlassian JIRA
