nifi-users mailing list archives

From Russell Whitaker <russell.whita...@gmail.com>
Subject Re: Using Apache Nifi and Tika to extract content from pdf
Date Sat, 20 Feb 2016 19:18:24 GMT
Don't forget Clojure as well. 

Russell Whitaker
Sent from my iPhone

> On Feb 20, 2016, at 7:44 AM, Matt Burgess <mattyb149@gmail.com> wrote:
> 
> I have a blog post on how to do this with NiFi using a Groovy script in the ExecuteScript processor (new in 0.5.0), using PDFBox instead of Tika:
> 
> http://funnifi.blogspot.com/2016/02/executescript-extract-text-metadata.html?m=1
> 
> Jython is also supported but can't yet use Java libraries (it uses Jython scripts/modules instead). The other languages (Groovy, Lua, JavaScript, JRuby) can use Java libraries like Tika and PDFBox.
> 
> Regards,
> Matt
> 
> Sent from my iPhone
> 
>> On Feb 20, 2016, at 10:31 AM, Ralf Meier <news@cht3.com> wrote:
>> 
>> Hi Everybody, 
>> 
>> I’m new to NiFi and want to find out whether it is possible to extract content and metadata from PDFs using a library like Tika.
>> My first idea was to use the following processors:
>> - GetFile (Watch a specific Folder)
>> - IdentifyMimeType (Identify whether the file is of type application/pdf)
>> - RouteOnAttribute (If it is a PDF)
>> - ExecuteStreamCommand:
>> 	I changed the following settings.
>> 	Command Arguments: {flowfilw_contents}
>> 	Command Path: tika-python parse all
>> 	
>> I use the Python Tika wrapper tika-python (https://github.com/chrismattmann/tika-python).
>> 
>> But it is not working. 
>> Does anybody have an idea how to use Tika to extract the content and the metadata with NiFi, or what I’m doing wrong?
>> 
>> Thanks for your help.
>> BR 
>> Ralf
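
For the ExecuteStreamCommand route described in the original question, a minimal Python sketch might look like the following. It rests on a few assumptions: that ExecuteStreamCommand streams the FlowFile content to the command's standard input (its default behaviour unless Ignore STDIN is set), that tika-python is installed and exposes parser.from_buffer returning a dict with "metadata" and "content" keys, and that Java is available so tika-python can reach a Tika server. The script name extract_pdf.py and the --stdin flag are made up for illustration.

```python
import json
import sys


def extract(pdf_bytes, parse=None):
    """Return {'metadata': ..., 'content': ...} for one PDF passed as bytes.

    `parse` is injectable so the function can be exercised without Tika.
    """
    if parse is None:
        # Assumption: tika-python is installed and provides parser.from_buffer.
        from tika import parser
        parse = parser.from_buffer
    parsed = parse(pdf_bytes) or {}
    return {
        # Either key may be missing or None, so fall back to empty values.
        "metadata": parsed.get("metadata") or {},
        "content": (parsed.get("content") or "").strip(),
    }


def main():
    # Read the whole FlowFile content from stdin as raw bytes.
    pdf_bytes = sys.stdin.buffer.read()
    # Write one JSON document to stdout; it becomes the new FlowFile content.
    json.dump(extract(pdf_bytes), sys.stdout)


# Hypothetical --stdin flag: only read stdin when explicitly asked to.
if __name__ == "__main__" and "--stdin" in sys.argv[1:]:
    main()
```

With a script like this, Command Path would hold only the executable (for example python) and Command Arguments the script path and flag, separated by the processor's argument delimiter (a semicolon by default), e.g. extract_pdf.py;--stdin. The PDF itself would not appear in the arguments at all, since it arrives on stdin.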
