tika-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2169) Fix xhtml in combination OCR+metadata extraction from images
Date Mon, 28 Nov 2016 16:49:59 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702456#comment-15702456 ]

Hudson commented on TIKA-2169:
------------------------------

UNSTABLE: Integrated in Jenkins build Tika-trunk #1145 (See [https://builds.apache.org/job/Tika-trunk/1145/])
TIKA-2169 -- fix xhtml markup caused by bug in OCR parser (tallison: rev 2df8567ffc688a29de1394a208e651961a8ab53a)
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* (edit) tika-core/src/test/java/org/apache/tika/TikaTest.java


> Fix xhtml in combination OCR+metadata extraction from images
> ------------------------------------------------------------
>
>                 Key: TIKA-2169
>                 URL: https://issues.apache.org/jira/browse/TIKA-2169
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>             Fix For: 2.0, 1.15
>
>
> In trunk, I'm getting an embedded html entity for the image's metadata when Tesseract is available:
> <html>
> ocr content
>  <html>
>  ...metadata
> </html>
> </html>
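
As an illustration of the symptom quoted above, here is a minimal sketch of how a nested <html> element could be detected in serialized XHTML output. The class NestedHtmlCheck and its methods are hypothetical names for this example only. They are not part of Tika and not the check added in the commit. The sketch assumes the XHTML is already available as a String.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Tika.
final class NestedHtmlCheck {

    // Matches an opening "<html" tag followed by whitespace, ">" or "/".
    // Closing tags ("</html>") do not match because they start with "</".
    private static final Pattern HTML_OPEN =
            Pattern.compile("<html(?=[\\s>/])", Pattern.CASE_INSENSITIVE);

    private NestedHtmlCheck() {
    }

    // Counts opening <html> tags in the given serialized XHTML.
    static int countHtmlOpenTags(String xhtml) {
        Matcher matcher = HTML_OPEN.matcher(xhtml);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // True if more than one opening <html> tag is present,
    // which is the symptom described in the report.
    static boolean hasNestedHtml(String xhtml) {
        return countHtmlOpenTags(xhtml) > 1;
    }
}
```

Applied to the output quoted in the report, hasNestedHtml would return true; for output with a single root <html> element it would return false. A text-level count like this is only a rough check and does not validate the document.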



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
