tika-dev mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Parsing order issue
Date Tue, 07 Jan 2020 03:46:25 GMT
As I understand it, to use sortByPosition in Tika you need a segment like
this in your config file:

...
         <parser class="org.apache.tika.parser.pdf.PDFParser">
             <params>
                 <param name="sortByPosition" type="bool">true</param>
             </params>
         </parser>
...

so your whole file would be like:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
   <parsers>
     <!-- Default parser for everything except PDF -->
     <parser class="org.apache.tika.parser.DefaultParser">
       <mime-exclude>application/pdf</mime-exclude>
     </parser>
     <!-- Use a different parser for PDF -->
     <parser class="org.apache.tika.parser.pdf.PDFParser">
       <mime>application/pdf</mime>
       <params>
         <param name="sortByPosition" type="bool">true</param>
       </params>
     </parser>
   </parsers>
</properties>


I just tried this file with tika-app. With the default settings the text
was not sorted; with this config it was. I passed the file by adding
"--config=config.xml" on the command line.
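
If the config still seems to have no effect, one quick check is whether the
file is well-formed and whether the parameter really sits under the PDFParser
entry. Below is a minimal sketch in Python using only the standard library.
The helper name sort_by_position_enabled is hypothetical (it is not part of
Tika or tika-python), and it only inspects the XML; it does not run Tika.

```python
import xml.etree.ElementTree as ET

PDF_PARSER = "org.apache.tika.parser.pdf.PDFParser"


def sort_by_position_enabled(config_xml: bytes) -> bool:
    """Return True if the PDFParser entry sets sortByPosition to true.

    Hypothetical helper for illustration; raises ET.ParseError if the
    XML is not well-formed.
    """
    root = ET.fromstring(config_xml)
    for parser in root.findall("./parsers/parser"):
        if parser.get("class") != PDF_PARSER:
            continue  # skip DefaultParser and any other entries
        for param in parser.findall("./params/param"):
            if param.get("name") == "sortByPosition":
                return (param.text or "").strip().lower() == "true"
    return False


# Same structure as the config file shown above.
CONFIG = b"""<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/pdf</mime-exclude>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <mime>application/pdf</mime>
      <params>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
"""

print(sort_by_position_enabled(CONFIG))
```

To check a file on disk instead, read it in binary mode and pass the bytes to
the same function.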

Tilman

On 07.01.2020 at 00:04, Lu Sun wrote:
> Dear PDFBox Dev Team,
>
> After searching online
> <https://stackoverflow.com/search?page=5&tab=Relevance&q=pdfbox%20order>,
> I am certain that using setSortByPosition(true) would help. However, I am
> struggling to get the config file right. Can you please provide any advice
> on it?
>
> Thanks so much in advance. Regards, Luke
>
> On Fri, 20 Dec 2019 at 18:06, Lu Sun <vistaxjtu@gmail.com> wrote:
>
>> Dear PDFBox Dev Team,
>>
>> Hope this message finds you well.
>>
>> Just wanted to raise this for your attention. Please can you provide any
>> solutions on the parsing order issue? Attached is my config file, an
>> example of pdf file and my parsing results.
>>
>> Thanks so much in advance. Wish you and your team a Merry Christmas and
>> Happy New Year.
>>
>> Regards,
>> Luke
>>
>> On Tue, 17 Dec 2019 at 12:34, Tim Allison <tallison@apache.org> wrote:
>>
>>> PDFBox Colleagues,
>>>    Any recommendations?
>>>
>>> On Mon, Dec 16, 2019 at 7:05 AM Lu Sun <vistaxjtu@gmail.com> wrote:
>>>
>>>> Dear Tika Dev Team,
>>>>
>>>>
>>>>
>>>> Hope this email finds you well.
>>>>
>>>>
>>>>
>>>> I have been actively using Tika to read PDF files. One issue I found
>>>> is the parsing order. As shown in the attached image, the parsing order of
>>>> the PDF file is not based on the position of the text.
>>>>
>>>>
>>>>
>>>> As suggested in this github link
>>>> <https://github.com/chrismattmann/tika-python/issues/266>, I used a
>>>> customized config file (see attached), hoping to solve the issue, but this
>>>> has not worked. If possible, could you please review this issue and
>>>> provide any insights or solutions?
>>>>
>>>>
>>>>
>>>> Thanks so much in advance.
>>>>
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Luke
>>>>

