tika-dev mailing list archives

From Chris Mattmann <mattm...@apache.org>
Subject Re: [EXTERNAL] Urgent!!! Tika-python
Date Mon, 19 Aug 2019 15:32:03 GMT
Hi,

 

Why not just do an os.walk or os.listdir in Python, and then call Tika on each file, e.g.:


 

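import os
import json

from tika import parser

src = '/some/path'
dst = '/some/other/path'

# os.listdir returns bare names, so join each one with src before testing it
fs = [os.path.join(src, f) for f in os.listdir(src)]
fs = [f for f in fs if os.path.isfile(f) and f.endswith(('.pdf', '.doc'))]

for f in fs:
    parsed = parser.from_file(f)
    # save parsed to its own file, named after the input file
    out = os.path.join(dst, os.path.basename(f) + '.json')
    with open(out, 'w') as fh:
        json.dump(parsed, fh)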
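
For directories with subfolders, the os.walk route could look roughly like the sketch below. It is only an illustration under a few assumptions: the function name extract_tree, the parse_file parameter, and the ".txt" suffix on the output files are made up for this example, and it relies on parser.from_file returning a dict whose "content" key holds the extracted text (which may be None when nothing is extracted).

```python
import os


def extract_tree(src_dir, dst_dir, parse_file=None, exts=('.pdf', '.doc', '.docx')):
    """Walk src_dir and write one .txt file per matching document under dst_dir.

    extract_tree and parse_file are hypothetical names used for illustration.
    """
    if parse_file is None:
        # Imported lazily so the walking logic can be tried without Tika running.
        from tika import parser
        parse_file = parser.from_file

    written = []
    for root, _dirs, names in os.walk(src_dir):
        for name in sorted(names):
            # Case-insensitive extension check; everything else is skipped.
            if not name.lower().endswith(exts):
                continue
            src = os.path.join(root, name)
            parsed = parse_file(src)
            # "content" can be None when no text was extracted.
            text = parsed.get('content') or ''
            # Mirror the source subdirectory layout under dst_dir.
            rel = os.path.relpath(src, src_dir)
            out = os.path.join(dst_dir, rel + '.txt')
            os.makedirs(os.path.dirname(out), exist_ok=True)
            with open(out, 'w', encoding='utf-8') as fh:
                fh.write(text)
            written.append(out)
    return written
```

Calling extract_tree('/some/path', '/some/other/path') would then leave one text file per input document, which should suit feeding them individually to another program.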

 

Cheers,

Chris

 

 

 

From: Victor Olaiya <vickolas433@gmail.com>
Date: Monday, August 19, 2019 at 8:28 AM
To: "Mattmann, Chris A (US 1761)" <chris.a.mattmann@jpl.nasa.gov>
Subject: [EXTERNAL] Urgent!!! Tika-python

 

Hello, 

I sent a mail to the mailing list with no response, so I decided to mail you again.

I have been trying to extract text from all the PDF, DOC, and similar files in a directory, but
that has not been possible because Tika-python only parses individual files, not directories.

I was able to compress the files into a single zip file and extract that. This worked, but the extracted
text was saved in a single file. I need the text from each document saved in its own file
so I can use the files as input to another program.

 

Please, what is the best way to go about this?

Thank you Chris Mattmann,

I await your reply.

