tika-dev mailing list archives

From "Oleg Tikhonov (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (TIKA-245) Support of CHM Format
Date Mon, 06 Dec 2010 06:54:12 GMT

    [ https://issues.apache.org/jira/browse/TIKA-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966721#action_12966721 ]

Oleg Tikhonov edited comment on TIKA-245 at 12/6/10 1:52 AM:
-------------------------------------------------------------

A couple of weeks ago I received the answer from SourceForge.net:
"My apologies for not passing this message on sooner, however the project admin has responded
that he is not willing to give up this project at this time. As such, we are not fulfilling
this takeover request."

The library as it stands today contains critical bugs. Because the project is abandoned,
I cannot fix them, so I would exclude it as an option.

Another option is 7-Zip-JBinding (http://sourceforge.net/projects/sevenzipjbind/develop/). I've
implemented a CHM parser using this library and it works pretty well; the throughput of HTML
extraction is about 5 MB/sec. However, it is licensed under the LGPL. I've asked Boris Brodski
(the developer of that library) whether he could re-license it for us. Here is a link to the
discussion between him and Igor Pavlov (the author of 7-Zip):
http://sourceforge.net/projects/sevenzip/forums/forum/45797/topic/3983892

What do you think?

BR,
Oleg 

  
> Support of CHM Format
> ---------------------
>
>                 Key: TIKA-245
>                 URL: https://issues.apache.org/jira/browse/TIKA-245
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>         Environment: All
>            Reporter: Karl Heinz Marbaise
>            Priority: Minor
>         Attachments: TIKA-245.tikhonov.20103107.patch.txt
>
>
> It might be a good idea to support the Windows CHM file format. Some information about it:
> http://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help#Extracting_to_HTML
> The CHM format contains HTML files, which Tika can already parse. So the "only" problem is
> to extract the data from the CHM file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
