tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-2021) Improving accuracy of Tesseract parser for Serial Number and Part Number (Numeric) Extraction
Date Fri, 08 Jul 2016 19:34:11 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368198#comment-15368198 ]

Tim Allison edited comment on TIKA-2021 at 7/8/16 7:33 PM:
-----------------------------------------------------------

Any chance you could make the check for python static and remove the e.printStackTrace()s?
 Thank you!

Wait... it would also be good to apply this to 2.x.


was (Author: tallison@mitre.org):
Any chance you could make the check for python static and removing the e.printStackTrace()s?
 Thank you!

> Improving accuracy of Tesseract parser for Serial Number and Part Number (Numeric) Extraction
> ---------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2021
>                 URL: https://issues.apache.org/jira/browse/TIKA-2021
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr, parser
>            Reporter: Zarana Parekh
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.14
>
>
> The Tesseract OCR parser works well with images containing English text. However, there is
> room for improvement with alphanumeric and numeric content, which requires training
> Tesseract on the relevant cases in order to better extract content from images. Such a customization
> can help extract serial numbers from images of counterfeit electronics and support
> other applications focussing on atypical textual content.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
