tika-dev mailing list archives

From "Rafael Ferreira (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2543) No content extraction for application/x-webarchive format
Date Sun, 07 Jan 2018 06:17:00 GMT
Rafael Ferreira created TIKA-2543:
-------------------------------------

             Summary: No content extraction for application/x-webarchive format
                 Key: TIKA-2543
                 URL: https://issues.apache.org/jira/browse/TIKA-2543
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.17
         Environment: macOS 10.13.2, JDK 8
            Reporter: Rafael Ferreira
            Priority: Minor


Steps to reproduce:
# Using Safari, save any web page in the "webarchive" format.
# Use Tika to extract the archive content, as in the example below.

Expected result:
I would expect Tika to extract the HTML content from the webarchive.

Actual result:
Nothing is extracted, although the correct MIME type is identified.


{code:java}
// tika, path, extractedContentPath and EngineError are defined elsewhere in the reporter's application
try (BufferedWriter writer = Files.newBufferedWriter(extractedContentPath, Charsets.UTF_8)) {
    TesseractOCRConfig tesseractOCRConfig = new TesseractOCRConfig();

    // this looks for content anywhere in the page independently of orientation
    tesseractOCRConfig.setPageSegMode("11");

    ParseContext context = new ParseContext();
    context.set(Parser.class, tika.getParser());
    context.set(TesseractOCRConfig.class, tesseractOCRConfig);

    try (InputStream fd = Files.newInputStream(path)) {
        tika.getParser().parse(fd, new WriteOutContentHandler(writer), new Metadata(), context);
    } catch (SAXException e) {
        throw new EngineError(e);
    }
}
{code}
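
When reproducing this, it may help to first confirm that the file saved in step 1 is what it is assumed to be. As far as I know, Safari writes a .webarchive as a binary property list, which begins with the ASCII bytes "bplist00". The following is a minimal sketch of that check using only the JDK. The class and method names are hypothetical and are not part of Tika, and the check says nothing about how Tika itself detects the type.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: checks for the binary plist magic bytes.
class WebArchiveCheck {
    // Assumption: a Safari webarchive is a binary property list starting with "bplist00".
    private static final byte[] BPLIST_MAGIC = "bplist00".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given bytes start with the binary plist magic. */
    static boolean hasBplistMagic(byte[] head) {
        return head.length >= BPLIST_MAGIC.length
                && Arrays.equals(Arrays.copyOf(head, BPLIST_MAGIC.length), BPLIST_MAGIC);
    }

    /** Reads up to the first eight bytes of the file and checks them against the magic. */
    static boolean fileHasBplistMagic(Path file) throws IOException {
        byte[] head = new byte[BPLIST_MAGIC.length];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
        }
        return hasBplistMagic(Arrays.copyOf(head, filled));
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        System.out.println(file + " starts with bplist00: " + fileHasBplistMagic(file));
    }
}
```

If the check passes, the input is a binary plist and the empty output is more likely a parsing gap than a bad test file.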





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
