tika-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2021) Improving accuracy of Tesseract parser for Serial Number and Part Number (Numeric) Extraction
Date Thu, 07 Jul 2016 08:51:11 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15365830#comment-15365830 ]

Hudson commented on TIKA-2021:
------------------------------

SUCCESS: Integrated in Tika-trunk #1079 (See [https://builds.apache.org/job/Tika-trunk/1079/])
fix for TIKA-2021 contributed by Zarana Parekh (zaranaparekh17: rev 48b27d219f791ee14f1e0ffa18e4e80583f3df54)
* tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* tika-bundle/pom.xml
* tika-parsers/pom.xml
* tika-parsers/src/main/resources/org/apache/tika/parser/ocr/rotation.py
* tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
* tika-parsers/src/main/resources/org/apache/tika/parser/ocr/TesseractOCRConfig.properties
fix for TIKA-2021 contributed by Zarana Parekh (zaranaparekh17: rev de84d71b145045792b8a3bd175634251623188dc)
* tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* tika-bundle/pom.xml
Record TIKA-2021 change. (mattmann: rev 636060eb6c4a2ea4960ccc045f8bc5ae159c9117)
* CHANGES.txt


> Improving accuracy of Tesseract parser for Serial Number and Part Number (Numeric) Extraction
> ---------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2021
>                 URL: https://issues.apache.org/jira/browse/TIKA-2021
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr, parser
>            Reporter: Zarana Parekh
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.14
>
>
> The Tesseract OCR parser works well on images containing English text. However, there is
> room for improvement on alphanumeric and numeric content, which requires training
> Tesseract on the relevant cases in order to better extract content from images. Such
> customization can help in extracting serial numbers from images of counterfeit electronics
> and in other applications focussing on atypical textual content.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
