tika-dev mailing list archives

From "Prashanth Ramaswamy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-245) Support of CHM Format
Date Sat, 01 Feb 2014 23:42:09 GMT

    [ https://issues.apache.org/jira/browse/TIKA-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13888778#comment-13888778 ]

Prashanth Ramaswamy commented on TIKA-245:

Hi, I still get an ArrayIndexOutOfBoundsException when trying to parse CHM files.

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Array index out of range:
	at java.util.ArrayList.elementData(ArrayList.java:382)
	at java.util.ArrayList.get(ArrayList.java:395)
	at org.apache.tika.parser.chm.core.ChmExtractor.<init>(ChmExtractor.java:178)

An older comment said this was fixed. Is that so, or is the bug still there?
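
For anyone triaging the trace above: the top two frames show ArrayList.get being called from the ChmExtractor constructor with an index that is outside the list. The sketch below is only a generic illustration of that failure mode and of a bounds guard. It is not the ChmExtractor code, and the names IndexGuardSketch, getOrNull and unguardedGetThrows are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexGuardSketch {

    // Hypothetical helper: returns the element at index,
    // or null when the index falls outside the list.
    static <T> T getOrNull(List<T> list, int index) {
        if (index < 0 || index >= list.size()) {
            return null;
        }
        return list.get(index);
    }

    // Returns true if an unguarded get() at this index throws.
    static boolean unguardedGetThrows(List<?> list, int index) {
        try {
            list.get(index);
            return false;
        } catch (IndexOutOfBoundsException e) {
            // ArrayIndexOutOfBoundsException is a subclass of
            // IndexOutOfBoundsException, so both are caught here.
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> entries = new ArrayList<>();
        entries.add("only-entry");

        // An index such as -1 (for example, the result of a failed lookup)
        // makes a plain get() throw, as in the reported stack trace.
        System.out.println(unguardedGetThrows(entries, -1));

        // The guarded access returns null instead of throwing.
        System.out.println(getOrNull(entries, -1));
        System.out.println(getOrNull(entries, 0));
    }
}
```

Whether returning null, skipping the entry, or throwing a more descriptive exception is the right response depends on what the constructor expects to find at that position, which the trace alone does not show.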

> Support of CHM Format
> ---------------------
>                 Key: TIKA-245
>                 URL: https://issues.apache.org/jira/browse/TIKA-245
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>         Environment: All
>            Reporter: Karl Heinz Marbaise
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.10
>         Attachments: TIKA-245.oleg.20110806.PATCH, TIKA-245.tikhonov.04082011.patch.txt, TIKA-245.tikhonov.20103107.patch.txt, TIKA-245.tikhonov.20112603.txt, TIKA-245.tikhonov.20112703.txt
>
> It might be a good idea to support the CHM file format of Windows. Some information is available at
> http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help#Extracting_to_HTML. The CHM format
> contains HTML files, which can be parsed by Tika. So the "only" problem is to extract the data
> from the CHM file.

This message was sent by Atlassian JIRA
