nifi-users mailing list archives

From Matt Burgess <mattyb...@gmail.com>
Subject Re: Using Apache Nifi and Tika to extract content from pdf
Date Sat, 20 Feb 2016 19:34:02 GMT
Clojure libraries (or any JARs) can be used by the supported scripting languages. However, Clojure
itself is not yet supported by the NiFi scripting processors. There were issues with the Clojure
ScriptEngine bridge, so it was left off the original list. If there is interest in adding Clojure,
I can write up an improvement Jira with the initial findings.

Regards,
Matt


> On Feb 20, 2016, at 2:18 PM, Russell Whitaker <russell.whitaker@gmail.com> wrote:
> 
> Don't forget Clojure as well. 
> 
> Russell Whitaker
> Sent from my iPhone
> 
>> On Feb 20, 2016, at 7:44 AM, Matt Burgess <mattyb149@gmail.com> wrote:
>> 
>> I have a blog post on how to do this with NiFi using a Groovy script in the ExecuteScript
>> (new in 0.5.0) processor using PDFBox instead of Tika:
>> 
>> http://funnifi.blogspot.com/2016/02/executescript-extract-text-metadata.html?m=1
>> 
>> Jython is also supported but can't yet use Java libraries (it uses Jython scripts/modules
>> instead). The other languages (Groovy, Lua, JavaScript, JRuby) can use Java libraries like
>> Tika and PDFBox.
>> 
>> Regards,
>> Matt
>> 
>> Sent from my iPhone
>> 
>>> On Feb 20, 2016, at 10:31 AM, Ralf Meier <news@cht3.com> wrote:
>>> 
>>> Hi Everybody, 
>>> 
>>> I’m new to NiFi and I want to find out whether it is possible to extract content
>>> and metadata from PDFs using a library like Tika.
>>> My first idea was to use the following processors:
>>> - GetFile (watch a specific folder)
>>> - IdentifyMimeType (identify whether the file is of type application/pdf)
>>> - RouteOnAttribute (if it is a PDF)
>>> - ExecuteStreamCommand:
>>> 	I changed the following settings.
>>> 	Command Arguments: {flowfilw_contents}
>>> 	Command Path: tika-python parse all
>>> 	
>>> I use the Python Tika wrapper (https://github.com/chrismattmann/tika-python).
>>> 
>>> But it is not working.
>>> Does anybody have an idea how to use Tika to extract the content and the metadata
>>> with NiFi, or what I’m doing wrong?
>>> 
>>> Thanks for your help.
>>> BR 
>>> Ralf
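
For readers following the ExecuteStreamCommand route described in the original question, here is a
minimal Python sketch, not a tested NiFi recipe. It assumes that ExecuteStreamCommand pipes the
FlowFile content to the command's stdin and takes the command's stdout as the outgoing content, and
that the tika-python package (which needs Java and a Tika server it can start or reach) is installed
on the NiFi host. The function names and the script name extract_pdf.py are hypothetical.

```python
"""Read a PDF from stdin and write its text and metadata to stdout as JSON."""
import json
import sys


def extract(pdf_bytes, parse):
    """Call `parse` on the raw bytes and keep only the metadata and text."""
    parsed = parse(pdf_bytes) or {}
    return {
        "metadata": parsed.get("metadata") or {},
        # Tika may return None when a document has no extractable text.
        "content": (parsed.get("content") or "").strip(),
    }


def main():
    # Imported lazily so extract() stays usable without tika installed.
    from tika import parser  # third-party package: pip install tika

    pdf_bytes = sys.stdin.buffer.read()  # assumption: FlowFile content arrives on stdin
    result = extract(pdf_bytes, parser.from_buffer)
    json.dump(result, sys.stdout)  # assumption: stdout becomes the new FlowFile content


# To run this as a script (for example from ExecuteStreamCommand), end the file with:
#     if __name__ == "__main__":
#         main()
```

With a script like this, Command Path would hold only the interpreter (for example python) and
Command Arguments the path to extract_pdf.py, rather than putting the whole command line into
Command Path. Treat those property values as an assumption to check against the processor
documentation for your NiFi version.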
