tika-dev mailing list archives

From "Alexey Zhukov (Jira)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2310) Try to order chapters in epub correctly
Date Fri, 10 Jan 2020 09:08:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012593#comment-17012593 ]

Alexey Zhukov commented on TIKA-2310:
-------------------------------------

The OPF file is processed correctly, but the EpubParser implementation presumes that spine
contents are placed in htm and html files only (see EpubParser.java:282) and ignores entries
of any other type. However, it looks like the EPUB specification ([link|https://www.w3.org/publishing/epub3/epub-spec.html#dfn-epub-content-document])
does allow file extensions that differ from htm/html, so there may exist epub files (see
attached) that can't be parsed correctly.

[^Dzhordzh_Oruell_1984_en_.epub]
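
To illustrate the point above, here is a minimal sketch of one possible way to pick spine
entries by the media-type declared in the OPF manifest instead of by file extension. This
is not the EpubParser code: the class name SpineOrderSketch and the method spineHrefs are
hypothetical, and treating "application/xhtml+xml" as the only content type of interest is
an assumption made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika.
public class SpineOrderSketch {

    /** Returns the hrefs of XHTML spine items in reading order. */
    public static List<String> spineHrefs(String opfXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations so untrusted input cannot pull in external entities.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(opfXml.getBytes(StandardCharsets.UTF_8)));

        // Manifest: map item id -> href, keyed on the declared media-type,
        // not on the file extension (assumption: only XHTML items are wanted).
        Map<String, String> hrefById = new HashMap<>();
        NodeList items = doc.getElementsByTagNameNS("*", "item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            if ("application/xhtml+xml".equals(item.getAttribute("media-type"))) {
                hrefById.put(item.getAttribute("id"), item.getAttribute("href"));
            }
        }

        // Spine: itemref elements appear in reading order.
        List<String> ordered = new ArrayList<>();
        NodeList refs = doc.getElementsByTagNameNS("*", "itemref");
        for (int i = 0; i < refs.getLength(); i++) {
            String href = hrefById.get(((Element) refs.item(i)).getAttribute("idref"));
            if (href != null) {
                ordered.add(href);
            }
        }
        return ordered;
    }
}
```

With this approach a spine item whose href ends in, say, ".xml" is still picked up as long
as the manifest declares it as XHTML, while images and stylesheets are skipped.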

> Try to order chapters in epub correctly
> ---------------------------------------
>
>                 Key: TIKA-2310
>                 URL: https://issues.apache.org/jira/browse/TIKA-2310
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.21
>
>         Attachments: Dzhordzh_Oruell_1984_en_.epub
>
>
> [~johanvanderknijff] recently pointed out on twitter that our Epub parser doesn't handle
> chapters in the right order.  We should try to fix our parser so that the output is in the
> correct order.
> Epub is new to me, but it looks like we can scrape the order out of content.opf(?).
> This would require dumping the stream to a ZipFile for direct access to zip entries,
> but we require that of ooxml...
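
As a rough sketch of the direct zip access mentioned above: once the epub is available as
a file, java.util.zip.ZipFile can fetch a named entry without streaming through the whole
archive. The example below looks up the package document path through
META-INF/container.xml, which is where EPUB containers record it. The class name
OpfLocatorSketch and the method findOpfPath are hypothetical, and this is not how Tika
implements it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika.
public class OpfLocatorSketch {

    /** Returns the zip-internal path of the OPF package document. */
    public static String findOpfPath(Path epub) throws Exception {
        try (ZipFile zip = new ZipFile(epub.toFile())) {
            // Random access: look the entry up by name instead of scanning the stream.
            ZipEntry container = zip.getEntry("META-INF/container.xml");
            if (container == null) {
                throw new IOException("META-INF/container.xml not found");
            }
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations so untrusted input cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            try (InputStream in = zip.getInputStream(container)) {
                Document doc = factory.newDocumentBuilder().parse(in);
                NodeList rootfiles = doc.getElementsByTagNameNS("*", "rootfile");
                if (rootfiles.getLength() == 0) {
                    throw new IOException("no rootfile element in container.xml");
                }
                // The first rootfile is taken as the package document (assumption
                // made for this sketch; a container may list more than one).
                return ((Element) rootfiles.item(0)).getAttribute("full-path");
            }
        }
    }
}
```

The returned path can then be passed to zip.getEntry(...) in the same way to read the OPF
itself and, from its spine, the chapters in order.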



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
