tika-dev mailing list archives

From "Rafael Ferreira (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2543) No content extraction for application/x-webarchive format
Date Wed, 17 Oct 2018 15:10:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653711#comment-16653711 ]

Rafael Ferreira commented on TIKA-2543:
---------------------------------------

This seems to be a more widespread issue than I imagined: extracting content from any plist
does not seem to work at the moment. Trying to parse a Pages file (Pages version 7.2) triggers
the EmptyParser, and no text is extracted.
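
When narrowing down a report like this, it can help to first confirm that the input really is a binary property list. The following is a minimal sketch using only the JDK, not Tika. It relies on the fact that binary plists (including Safari's .webarchive files) start with the ASCII bytes "bplist". The class name PlistSniffer and the method isBinaryPlist are hypothetical names chosen for illustration. Whether a particular Pages file is plist-based is not verified here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Tika.
public class PlistSniffer {
    // Binary property lists begin with "bplist", followed by a
    // two-character version (commonly "00").
    private static final byte[] MAGIC = "bplist".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the stream starts with the binary plist magic bytes.
    public static boolean isBinaryPlist(InputStream in) throws IOException {
        byte[] head = new byte[MAGIC.length];
        int read = 0;
        // read() may return fewer bytes than requested, so loop until
        // the header is full or the stream ends.
        while (read < head.length) {
            int n = in.read(head, read, head.length - read);
            if (n < 0) {
                break;
            }
            read += n;
        }
        return read == MAGIC.length && Arrays.equals(head, MAGIC);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java PlistSniffer <file>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(args[0] + " binary plist: " + isBinaryPlist(in));
        }
    }
}
```

If this reports true for a file that still yields no text, the problem is more likely in parser selection than in type detection.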

> No content extraction for application/x-webarchive format
> ---------------------------------------------------------
>
>                 Key: TIKA-2543
>                 URL: https://issues.apache.org/jira/browse/TIKA-2543
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.17
>         Environment: MacOS 10.13.2 JDK8 
>            Reporter: Rafael Ferreira
>            Priority: Minor
>         Attachments: Apache Tika – Configuring Tika.webarchive
>
>
> Steps to reproduce:
> # Using Safari, save any web page as a "webarchive".
> # Use Tika to extract the archive content, as in the example below.
> Expected result:
> I would expect Tika to extract the HTML contents from the webarchive.
> Actual result:
> Nothing is extracted, although the correct MIME type is identified.
> {code:java}
>  try (BufferedWriter writer = Files.newBufferedWriter(extractedContentPath, Charsets.UTF_8)) {
>       TesseractOCRConfig tesseractOCRConfig = new TesseractOCRConfig();
>       // this looks for content anywhere in the page independently of orientation
>       tesseractOCRConfig.setPageSegMode("11");
>       ParseContext context = new ParseContext();
>       context.set(Parser.class, tika.getParser());
>       context.set(TesseractOCRConfig.class, tesseractOCRConfig);
>       try (InputStream fd = Files.newInputStream(path)) {
>         tika.getParser().parse(fd, new WriteOutContentHandler(writer), new Metadata(), context);
>       } catch (SAXException e) {
>         throw new EngineError(e);
>       }
>  }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
