tika-dev mailing list archives

From "Pascal Magnard (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2282) Paragraph numbering is not extracted from DOCX and ODT.
Date Wed, 01 Mar 2017 08:39:45 GMT
Pascal Magnard created TIKA-2282:
------------------------------------

             Summary: Paragraph numbering is not extracted from DOCX and ODT.
                 Key: TIKA-2282
                 URL: https://issues.apache.org/jira/browse/TIKA-2282
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.14
         Environment: Windows 10, MS Word 2016, LibreOffice 5.3
            Reporter: Pascal Magnard
            Priority: Minor


When text is extracted with AutoDetectParser, automatic paragraph numbering is missing from
the output for .docx and .odt files. For .doc files, the numbering is correctly extracted (or
perhaps I should say recomputed).
I'm working on a project where the numbering information in the original document is critical
for the users.
In detail, for the provided samples, sample.doc gives:
1 This is the first level
1.1 This is the second level
1.2 This is still second level
1.2.1 First repeat of third level
2 First repeat of first level
2.1 Fist Second
2.1.1 Second Third
2.2 Second Second
2.2.1 Third Third
-----------------------------------------------------------
which seems OK.
But sample.docx and sample.odt give:
This is the first level
This is the second level
This is still second level
First repeat of third level
First repeat of first level
Fist Second
Second Third
Second Second
Third Third
-----------------------------------------------------------
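For illustration, the numbering shown in the sample.doc output could in principle be recomputed from each paragraph's nesting level alone. The sketch below is only an assumption about how such a recomputation might look; it is not Tika's actual implementation, and the class name OutlineNumbering, the method labels, and the hard-coded level list are hypothetical. It assumes plain decimal numbering starting at 1 with "." as the separator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: rebuilds "1", "1.1", "1.2.1", ...
// from the 1-based nesting level of each numbered paragraph.
public class OutlineNumbering {

    static List<String> labels(int[] levels) {
        int maxDepth = 0;
        for (int level : levels) {
            maxDepth = Math.max(maxDepth, level);
        }
        // counters[d] holds the current count at depth d (index 0 is unused).
        int[] counters = new int[maxDepth + 1];
        List<String> result = new ArrayList<>();
        for (int level : levels) {
            counters[level]++;
            // A new item at this depth restarts every deeper counter.
            for (int d = level + 1; d <= maxDepth; d++) {
                counters[d] = 0;
            }
            StringBuilder label = new StringBuilder();
            for (int d = 1; d <= level; d++) {
                if (d > 1) {
                    label.append('.');
                }
                label.append(counters[d]);
            }
            result.add(label.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Nesting levels assumed for the nine paragraphs of the sample document.
        int[] levels = {1, 2, 2, 3, 1, 2, 3, 2, 3};
        String[] text = {
            "This is the first level", "This is the second level",
            "This is still second level", "First repeat of third level",
            "First repeat of first level", "Fist Second", "Second Third",
            "Second Second", "Third Third"
        };
        List<String> labels = labels(levels);
        for (int i = 0; i < text.length; i++) {
            System.out.println(labels.get(i) + " " + text[i]);
        }
    }
}
```

With the levels assumed above, this reproduces the labels from the sample.doc output. Real documents can also define other number formats, start values, and separators, which this sketch ignores.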



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)