tika-dev mailing list archives

From "Pascal Magnard (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-2282) Paragraph auto-numbering is not extracted from DOCX and ODT.
Date Wed, 01 Mar 2017 08:41:45 GMT

     [ https://issues.apache.org/jira/browse/TIKA-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pascal Magnard updated TIKA-2282:
---------------------------------
    Summary: Paragraph auto-numbering is not extracted from DOCX and ODT.  (was: Paragraph numbering is not extracted from DOCX and ODT.)

> Paragraph auto-numbering is not extracted from DOCX and ODT.
> ------------------------------------------------------------
>
>                 Key: TIKA-2282
>                 URL: https://issues.apache.org/jira/browse/TIKA-2282
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.14
>         Environment: Windows 10
> MS Word 2016
> LibreOffice 5.3
>            Reporter: Pascal Magnard
>            Priority: Minor
>         Attachments: sample.doc, sample.doc.tika.txt, sample.docx, sample.docx.tika.txt, sample.odt, sample.odt.tika.txt
>
>
> When extracting text with AutoDetectParser, paragraph auto-numbering is not extracted for .docx and .odt files. For .doc files, the numbering is correctly extracted (or perhaps I should say recomputed).
> I'm working on a project where the numbering information in the original document is critical for the users.
> In detail, for the provided samples, sample.doc gives:
> 1 This is the first level
> 1.1 This is the second level
> 1.2 This is still second level
> 1.2.1 First repeat of third level
> 2 First repeat of first level
> 2.1 Fist Second
> 2.1.1 Second Third
> 2.2 Second Second
> 2.2.1 Third Third
> -----------------------------------------------------------
> which seems OK.
> But sample.docx and sample.odt give:
> This is the first level
> This is the second level
> This is still second level
> First repeat of third level
> First repeat of first level
> Fist Second
> Second Third
> Second Second
> Third Third
> -----------------------------------------------------------
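
To make the expected output concrete, here is a minimal sketch of how multi-level labels like the ones in the sample.doc output could be recomputed from each paragraph's nesting level. It is not Tika's implementation: the class name OutlineNumbering, the method number, and the idea of feeding it a plain array of 1-based levels are all assumptions made for illustration. A real document would also need its numbering definitions (start values, formats, restarts) taken into account.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: rebuilds decimal outline labels
// ("1", "1.1", "1.2.1", ...) from a sequence of 1-based nesting levels.
final class OutlineNumbering {

    static List<String> number(int[] levels) {
        List<String> labels = new ArrayList<>();
        List<Integer> counters = new ArrayList<>(); // one counter per open level

        for (int level : levels) {
            if (level < 1) {
                throw new IllegalArgumentException("level must be >= 1: " + level);
            }
            // Returning to a shallower level discards the deeper counters,
            // so deeper numbering restarts the next time it is entered.
            while (counters.size() > level) {
                counters.remove(counters.size() - 1);
            }
            // Entering a deeper level opens new counters at 0.
            // Assumption of this sketch: a skipped intermediate level stays at 0.
            while (counters.size() < level) {
                counters.add(0);
            }
            // Advance the counter of the current level.
            counters.set(level - 1, counters.get(level - 1) + 1);

            // Join the counters with dots to form the label.
            StringBuilder label = new StringBuilder();
            for (int i = 0; i < counters.size(); i++) {
                if (i > 0) {
                    label.append('.');
                }
                label.append(counters.get(i));
            }
            labels.add(label.toString());
        }
        return labels;
    }
}
```

Calling OutlineNumbering.number with the levels 1, 2, 2, 3, 1, 2, 3, 2, 3 (the nesting of the nine sample paragraphs) yields the same labels as the sample.doc output above, from 1 through 2.2.1.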



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
