nifi-users mailing list archives

From Russell Whitaker <russell.whita...@gmail.com>
Subject Re: Using Apache Nifi and Tika to extract content from pdf
Date Sat, 20 Feb 2016 20:09:19 GMT
Yes! I, for one, will weigh in with my interest in Clojure support in
the scripting processors.

Russell

On Sat, Feb 20, 2016 at 11:34 AM, Matt Burgess <mattyb149@gmail.com> wrote:
> Clojure libraries (or any JARs) can be used by the supported scripting
> languages. However, Clojure itself is not yet supported by the NiFi scripting
> processors; there were issues with the Clojure ScriptEngine bridge, so it was
> left off the original list. If there is interest in adding Clojure, I can
> write up an improvement Jira with the initial findings.
>
> Regards,
> Matt
>
>
> On Feb 20, 2016, at 2:18 PM, Russell Whitaker <russell.whitaker@gmail.com>
> wrote:
>
> Don't forget Clojure as well.
>
> Russell Whitaker
> Sent from my iPhone
>
> On Feb 20, 2016, at 7:44 AM, Matt Burgess <mattyb149@gmail.com> wrote:
>
> I have a blog post on how to do this in NiFi with a Groovy script in the
> ExecuteScript processor (new in 0.5.0), using PDFBox instead of Tika:
>
> http://funnifi.blogspot.com/2016/02/executescript-extract-text-metadata.html?m=1
>
> Jython is also supported but can't yet use Java libraries (it uses Jython
> scripts/modules instead). The other languages (Groovy, Lua, JavaScript,
> JRuby) can use Java libraries like Tika and PDFBox.
>
> Regards,
> Matt
>
> Sent from my iPhone
>
> On Feb 20, 2016, at 10:31 AM, Ralf Meier <news@cht3.com> wrote:
>
> Hi Everybody,
>
> I’m new to NiFi and I want to find out if it is possible to extract content
> and metadata from PDFs using a library like Tika.
> My first idea was to use the following processors:
> - GetFile (watch a specific folder)
> - IdentifyMimeType (identify whether the file is of type application/pdf)
> - RouteOnAttribute (if it is a PDF)
> - ExecuteStreamCommand:
> I changed the following settings.
> Command Arguments: {flowfilw_contents}
> Command Path: tika-python parse all
> I use the Python Tika wrapper from
> (https://github.com/chrismattmann/tika-python)
>
> But it is not working.
> Does anybody have an idea how to use Tika to extract the content and the
> metadata using NiFi, or what I’m doing wrong?
>
> Thanks for your help.
> BR
> Ralf
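
For anyone trying the ExecuteStreamCommand route from the original question, here is a minimal sketch of a script it could call. It is not taken from the thread or the linked blog post. It assumes ExecuteStreamCommand pipes the flowfile content to the command's stdin and takes stdout as the new content, and that tika-python's parser.from_buffer accepts the raw PDF bytes and returns a dict with "metadata" and "content" keys. The file name extract_pdf.py and the function name extract are hypothetical.

```python
# extract_pdf.py (hypothetical name): read a PDF from stdin, write JSON to stdout.
import json
import sys


def extract(pdf_bytes, parse=None):
    """Return {'metadata': ..., 'content': ...} for the given PDF bytes.

    `parse` is any callable that takes bytes and returns a dict with
    'metadata' and 'content' keys. It defaults to tika-python's
    parser.from_buffer (assumption: it accepts raw bytes).
    """
    if parse is None:
        # Imported lazily so the module loads even without tika-python installed.
        from tika import parser
        parse = parser.from_buffer
    parsed = parse(pdf_bytes)
    return {
        # Either value may be missing or None when nothing could be extracted.
        "metadata": parsed.get("metadata") or {},
        "content": (parsed.get("content") or "").strip(),
    }


def main():
    # Assumption: the calling processor streams the flowfile content to stdin.
    pdf_bytes = sys.stdin.buffer.read()
    json.dump(extract(pdf_bytes), sys.stdout)


if __name__ == "__main__":
    main()
```

With this layout the Command Path would be the Python interpreter and the Command Arguments the path to the script, so the PDF arrives on stdin rather than as an argument. tika-python needs a Java runtime and a reachable Tika server, so treat this as a starting point rather than a tested flow.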



-- 
Russell Whitaker
http://twitter.com/OrthoNormalRuss
http://www.linkedin.com/pub/russell-whitaker/0/b86/329
