tika-dev mailing list archives

From Chris Mattmann <mattm...@apache.org>
Subject Re: [EXTERNAL] Urgent!!! Tika-python
Date Mon, 19 Aug 2019 15:32:03 GMT


Why not just do an os.walk or os.listdir in Python, and then call Tika on each file? E.g.,


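import os
import json
from tika import parser

src = '/some/path'
# os.listdir returns bare names, so join them with the directory first
fs = [os.path.join(src, f) for f in os.listdir(src)]
fs = [f for f in fs if os.path.isfile(f) and f.endswith(('.pdf', '.doc'))]

for f in fs:
    parsed = parser.from_file(f)
    # save parsed to its own file; json.dump needs an open file object
    out = os.path.join('/some/other/path', os.path.basename(f) + '.json')
    with open(out, 'w') as fh:
        json.dump(parsed, fh)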
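
For nested directories, the os.walk route mentioned above might look roughly like the sketch below. It writes one .txt file per input document and mirrors the source tree under an output directory. The helper names (iter_documents, output_path, extract_all), the '.txt' suffix, and the parse argument are assumptions made for illustration and are not part of tika-python; the only Tika call relied on is parser.from_file, whose result is a dict with a 'content' entry that may be None.

```python
import os


def iter_documents(root, extensions=('.pdf', '.doc')):
    """Yield paths of files under root whose names end with one of extensions."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Compare case-insensitively so 'REPORT.PDF' is picked up too
            if name.lower().endswith(extensions):
                yield os.path.join(dirpath, name)


def output_path(src_path, root, out_root):
    """Mirror src_path (relative to root) under out_root, appending '.txt'."""
    rel = os.path.relpath(src_path, root)
    return os.path.join(out_root, rel + '.txt')


def extract_all(root, out_root, parse=None):
    """Write the extracted text of each document to its own file.

    parse is a hypothetical hook: any callable that takes a path and returns
    a dict with a 'content' key. It defaults to tika's parser.from_file.
    """
    if parse is None:
        # Imported lazily so the helpers above work without Tika installed
        from tika import parser
        parse = parser.from_file

    written = []
    for src in iter_documents(root):
        parsed = parse(src)
        # 'content' can be None when no text was extracted
        text = parsed.get('content') or ''
        dest = output_path(src, root, out_root)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        with open(dest, 'w', encoding='utf-8') as fh:
            fh.write(text)
        written.append(dest)
    return written
```

Calling extract_all('/some/path', '/some/other/path') would then leave one text file per document under /some/other/path, which can be fed to another program file by file.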







From: Victor Olaiya <vickolas433@gmail.com>
Date: Monday, August 19, 2019 at 8:28 AM
To: "Mattmann, Chris A (US 1761)" <chris.a.mattmann@jpl.nasa.gov>
Subject: [EXTERNAL] Urgent!!! Tika-python



I sent a mail to the mailing list and got no response, so I decided to mail you again.

I have been trying to extract text from all PDF, DOC, and similar files in a directory, but
that has been impossible because Tika-python does not allow parsing a directory, only individual files.

I was able to compress the files into a single zip file and extract from that. This worked, but the
extracted text was saved in a single file. I need the text saved in individual files
so I can use them as input to another program.


Please, what is the best method to go about this?

Thank you Chris Mattmann,

I await your reply.
