tika-dev mailing list archives

From Lu Sun <vistax...@gmail.com>
Subject Re: Parsing order issue
Date Mon, 06 Jan 2020 23:04:06 GMT
Dear PDFBox Dev Team,

After searching online
<https://stackoverflow.com/search?page=5&tab=Relevance&q=pdfbox%20order>, I
am certain that using setSortByPosition(true) would help. However, I am
struggling to get the config file right. Could you please provide any advice
on it?

Thanks so much in advance.

Regards,
Luke

On Fri, 20 Dec 2019 at 18:06, Lu Sun <vistaxjtu@gmail.com> wrote:

> Dear PDFBox Dev Team,
>
> Hope this message finds you well.
>
> Just wanted to raise this for your attention. Could you please suggest a
> solution to the parsing order issue? Attached are my config file, an
> example PDF file, and my parsing results.
>
> Thanks so much in advance. Wish you and your team a Merry Christmas and
> Happy New Year.
>
> Regards,
> Luke
>
> On Tue, 17 Dec 2019 at 12:34, Tim Allison <tallison@apache.org> wrote:
>
>> PDFBox Colleagues,
>>   Any recommendations?
>>
>> On Mon, Dec 16, 2019 at 7:05 AM Lu Sun <vistaxjtu@gmail.com> wrote:
>>
>>> Dear Tika Dev Team,
>>>
>>> Hope this email finds you well.
>>>
>>> I have been actively using Tika to read PDF files. One issue I found
>>> is the parsing order. As shown in the attached image, the parsing order of
>>> the PDF file is not based on the position of the text.
>>>
>>> As suggested in this GitHub link
>>> <https://github.com/chrismattmann/tika-python/issues/266>, I used a
>>> customized config file (see attached), hoping to solve the issue, but this
>>> has not worked. If possible, could you please review this issue and
>>> provide any insights or solutions?
>>>
>>> Thanks so much in advance.
>>>
>>> Regards,
>>> Luke
>>>
>>
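
For readers following the thread: the setSortByPosition(true) option mentioned above asks the text extractor to emit text in page-position order rather than in the order the text happens to be stored in the PDF. The sketch below only illustrates that idea with plain Java collections. It is not PDFBox's or Tika's implementation, and the Fragment class, its fields, and readInPositionOrder are hypothetical names made up for this example. It also assumes y grows downward and that fragments on the same line share exactly the same y, which real documents rarely guarantee.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class PositionSortSketch {

    /** Hypothetical stand-in for a positioned piece of text; not a PDFBox or Tika type. */
    public static final class Fragment {
        final String text;
        final double x; // distance from the left edge of the page
        final double y; // distance from the top edge (assumption: y grows downward)

        public Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns the fragment texts ordered top-to-bottom, then left-to-right. */
    public static List<String> readInPositionOrder(List<Fragment> storedOrder) {
        // Copy first so the caller's list is left untouched.
        List<Fragment> sorted = new ArrayList<>(storedOrder);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        List<String> texts = new ArrayList<>();
        for (Fragment f : sorted) {
            texts.add(f.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        // Fragments listed in "stored" order, which need not match the visual layout.
        List<Fragment> storedOrder = Arrays.asList(
                new Fragment("Footer", 50, 700),
                new Fragment("Title", 50, 40),
                new Fragment("right column", 300, 100),
                new Fragment("left column", 50, 100));
        // prints [Title, left column, right column, Footer]
        System.out.println(readInPositionOrder(storedOrder));
    }
}
```

Without the sort, the fragments would come out in stored order (Footer first), which is the kind of mismatch between extracted order and visual order described in the original report.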
