tika-dev mailing list archives

From "Oleg Tikhonov (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (TIKA-245) Support of CHM Format
Date Sun, 01 Aug 2010 09:37:18 GMT

    [ https://issues.apache.org/jira/browse/TIKA-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894362#action_12894362 ]

Oleg Tikhonov edited comment on TIKA-245 at 8/1/10 5:35 AM:

Hi, I've implemented a CHM parser. Please review it and share what you think.
Here is a link: https://issues.apache.org/jira/secure/ManageAttachments.jspa?id=12427752

There are some open issues I would like to discuss.
1. Metadata: a CHM file contains many different files, such as images, HTML, CSS, and JS.

2. Currently it does not support multithreaded execution.
3. jchm itself has bugs. I fixed one, an ArrayIndexOutOfBoundsException. The question is how
to integrate and publish the changes?

I've written to the author, Feng Yu (yfbio@hotmail.com), but have not received an answer yet.
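
Regarding issue 2, one possible stopgap until the parser itself is thread-safe would be to
serialize calls to it behind a lock. The following is only a minimal sketch under assumptions:
ChmTextExtractor and SerializedChmTextExtractor are hypothetical names used for illustration
and are not part of jchm, Tika, or the attached patch.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical interface standing in for any extractor that is not thread-safe.
interface ChmTextExtractor {
    String extract(byte[] chmBytes) throws Exception;
}

// Wrapper that lets only one thread at a time call the delegate.
final class SerializedChmTextExtractor implements ChmTextExtractor {
    private final ChmTextExtractor delegate;
    private final ReentrantLock lock = new ReentrantLock();

    SerializedChmTextExtractor(ChmTextExtractor delegate) {
        this.delegate = delegate;
    }

    @Override
    public String extract(byte[] chmBytes) throws Exception {
        lock.lock();
        try {
            // Only the thread holding the lock reaches the delegate.
            return delegate.extract(chmBytes);
        } finally {
            // Always release, even if the delegate throws.
            lock.unlock();
        }
    }
}
```

This trades throughput for safety: concurrent callers queue up instead of running in parallel,
so it would only be a workaround until the shared state is removed.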

I would like to get your feedback.

> Support of CHM Format
> ---------------------
>                 Key: TIKA-245
>                 URL: https://issues.apache.org/jira/browse/TIKA-245
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>         Environment: All
>            Reporter: Karl Heinz Marbaise
>            Priority: Minor
>         Attachments: TIKA-245.tikhonov.20103107.patch.txt
> It might be a good idea to support the Windows CHM file format. Some information about it:
> http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help#Extracting_to_HTML. The CHM format
> contains HTML files which can be parsed by Tika. So the "only" problem is to extract the data
> from the CHM file.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
