tika-dev mailing list archives

From "Andrew Jackson (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-1154) Tika hangs on format detection of malformed HTML file.
Date Thu, 25 Jul 2013 09:32:00 GMT
Andrew Jackson created TIKA-1154:

             Summary: Tika hangs on format detection of malformed HTML file.
                 Key: TIKA-1154
                 URL: https://issues.apache.org/jira/browse/TIKA-1154
             Project: Tika
          Issue Type: Bug
          Components: mime
    Affects Versions: 1.4
            Reporter: Andrew Jackson
            Priority: Minor

We are running Tika over large web archives, which contain some malformed files.
In particular, we found an HTML file with binary characters in its DOCTYPE declaration.
Tika hangs during format detection on this file, both when embedded and when run from
the command line.

An example file is attached.
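
Until the hang itself is fixed, one possible workaround for batch jobs is to run each
detection call under a time limit so that a single malformed file cannot stall the whole
run. The following is only a minimal sketch using the standard java.util.concurrent
classes. The class name DetectionGuard and the method detectWithTimeout are hypothetical
and not part of Tika; the Callable stands in for whatever call performs the detection.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tika: bounds the time spent on one detection call.
public class DetectionGuard {

    // Runs the given detection task on a separate daemon thread and waits at most
    // timeoutMillis for its result. Returns an empty Optional on timeout or failure.
    public static Optional<String> detectWithTimeout(Callable<String> detection,
                                                     long timeoutMillis) {
        // Daemon thread: if the task never returns, it will not keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "detection-guard");
            t.setDaemon(true);
            return t;
        });
        Future<String> result = executor.submit(detection);
        try {
            return Optional.ofNullable(result.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Interrupts the worker. A loop that ignores interrupts keeps spinning
            // in the background, but the caller is free to move on to the next file.
            result.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            // The detection task itself threw an exception.
            return Optional.empty();
        } catch (InterruptedException e) {
            // Preserve the caller's interrupt status.
            Thread.currentThread().interrupt();
            return Optional.empty();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in tasks: one that returns promptly, one that simulates a hang.
        System.out.println(detectWithTimeout(() -> "text/html", 1000));
        System.out.println(detectWithTimeout(() -> {
            Thread.sleep(60_000);
            return "never";
        }, 100));
    }
}
```

Note that this only protects the caller. A detection loop that never checks for
interruption would continue to consume a CPU core, so running detection in a separate
process may be a safer choice for very large crawls.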
