tika-dev mailing list archives

From "Tim Allison (Jira)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2963) Tika hits an OOM error when extracting large .xlsx files
Date Fri, 11 Oct 2019 20:16:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949772#comment-16949772 ]

Tim Allison commented on TIKA-2963:
-----------------------------------

The xlsx parser uses SAX for the sheet parsing but relies on DOM/beans for the SharedStrings
table, which can be a, um, problem for large files, since the whole table is held in memory.
There’s been work in POI, not yet merged, to store that table in H2 for a lower memory footprint.
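To illustrate the SAX side of that contrast, here is a minimal sketch that streams a workbook's shared strings part and only keeps running totals, so nothing proportional to the table size is held in memory. It uses only JDK classes and is not how Tika or POI handle the table; the class name `SharedStringsScan`, the method `scan`, and the assumption that the part lives at `xl/sharedStrings.xml` are all illustrative.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika or POI.
class SharedStringsScan {

    /**
     * Streams the shared strings part with SAX.
     * Returns {number of <si> items, total characters seen inside <t> elements}.
     * Assumes the conventional part name xl/sharedStrings.xml.
     */
    static long[] scan(String xlsxPath)
            throws IOException, SAXException, ParserConfigurationException {
        final long[] stats = new long[2];

        DefaultHandler handler = new DefaultHandler() {
            private boolean inText;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("si".equals(localName)) {
                    stats[0]++;            // one shared string item
                } else if ("t".equals(localName)) {
                    inText = true;         // text run (includes phonetic runs, if any)
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("t".equals(localName)) {
                    inText = false;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    stats[1] += length;    // count only; the text itself is discarded
                }
            }
        };

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);   // so localName is populated

        try (ZipFile zip = new ZipFile(xlsxPath)) {
            ZipEntry entry = zip.getEntry("xl/sharedStrings.xml");
            if (entry == null) {
                return stats;              // workbook has no shared strings part
            }
            try (InputStream in = zip.getInputStream(entry)) {
                factory.newSAXParser().parse(in, handler);
            }
        }
        return stats;
    }
}
```

A real parser still has to resolve cell references into strings, which is why the table normally ends up in memory or in some external store; the sketch only shows that reading the part itself does not require a DOM. Hardening the parser against untrusted input is omitted here.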

If you unzip the file, how big is it? How big is its shared strings table?
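One way to answer both questions without unpacking anything to disk is to read the uncompressed sizes from the zip directory. The sketch below assumes the shared strings part uses the conventional name `xl/sharedStrings.xml`; the class name `XlsxSizeReport` and the method `measure` are made up for this example.

```java
import java.io.IOException;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of Tika or POI.
class XlsxSizeReport {

    /** Conventional part name; a workbook could in principle use another one. */
    static final String SHARED_STRINGS = "xl/sharedStrings.xml";

    /**
     * Returns {total uncompressed bytes of all entries,
     *          uncompressed bytes of the shared strings part (0 if absent)}.
     */
    static long[] measure(String xlsxPath) throws IOException {
        long total = 0;
        long shared = 0;
        try (ZipFile zip = new ZipFile(xlsxPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                long size = entry.getSize();   // -1 when the size is not recorded
                if (size < 0) {
                    continue;
                }
                total += size;
                if (SHARED_STRINGS.equals(entry.getName())) {
                    shared = size;
                }
            }
        }
        return new long[] {total, shared};
    }

    public static void main(String[] args) throws IOException {
        long[] sizes = measure(args[0]);
        System.out.printf("unzipped total: %d bytes%n", sizes[0]);
        System.out.printf("shared strings: %d bytes%n", sizes[1]);
    }
}
```

Run it with the path to the workbook as the only argument. The numbers are on-disk XML sizes; the in-memory footprint of a DOM or bean representation is typically several times larger.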

Obviously, there could be other areas for improvement...

> Tika hits an OOM error when extracting large .xlsx files
> ---------------------------------------------------------
>
>                 Key: TIKA-2963
>                 URL: https://issues.apache.org/jira/browse/TIKA-2963
>             Project: Tika
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.20
>            Reporter: Feng Jiao Jiang
>            Priority: Major
>         Attachments: demo.java
>
>
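> For docx and pptx files, Tika can be configured to use a SAX parser to improve extraction
> performance. However, Tika still runs into OOM errors when extracting large .xlsx files.
> I have not yet found an official solution, so I am attaching my own code below, which is
> also a SAX-based solution. Its parameters can be tuned to fit the actual situation. It no
> doubt has many shortcomings, so criticism and corrections are welcome. Thank you.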



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
