nifi-users mailing list archives

From Conrad Crampton <conrad.cramp...@SecData.com>
Subject Re: Nifi & Parsey McParseface! RegEx in a Processor...
Date Mon, 06 Jun 2016 11:30:59 GMT
Hi,
I’m not a NiFi expert by any stretch of the imagination, and there are others on this list far
better informed than me who can speak with authority on many of the questions you raise,
but I’ll have a go…

It is probably not necessary to create a custom processor to do the parsing (using PMPF);
your ExecuteScript processor is probably sufficient. The one reason this may not
be desirable is if the Parsey model initialisation is expensive, because repeating it for each
script invocation would cause a bottleneck in processing. If it isn’t, then using GetKafka
-> ExecuteScript (Parsey) -> PutKafka would do what you want, I would have thought (conceptually).
However, what is missing from this pipeline is the analysis of the Parsey output, as you
say. This may be something a custom processor would be suitable for: quite a simple
text processing one, using standard Java text processing / regexp, that then writes to a new flowfile
for putting back on the Kafka queue.

If, however, you feel that running Parsey via an ExecuteScript processor isn’t suitable, then
I guess there are a number of options available. To make it thread safe etc. and available
from each node in your NiFi cluster in a consistent way, I would be inclined to wrap Parsey
up in an HTTP service and invoke it via REST (as an idea): post in the data to parse and
receive the output. The service could even do the analysis to format the output appropriately
(as JSON perhaps) before returning it. It would be invoked via an HTTP processor such as
InvokeHTTP (GetHTTP only fetches and cannot post a body). This may all be doable
in a custom processor too, and that is probably the best option IF you can understand how to do
the Parsey model initialisation within the custom processor.
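
A minimal sketch of the "wrap Parsey in an HTTP service" idea, using only the Python standard library. Everything named here is an assumption made for illustration: `load_model`, `build_response`, `ParseHandler`, `serve` and port 8080 are hypothetical, and the parser is a placeholder because the real Parsey/SyntaxNet call is not shown in this thread. The point is only that the model is loaded once at startup rather than once per request.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

PARSER = None  # filled in once by serve(); shared by every request


def load_model():
    """Hypothetical stand-in for the expensive, one-time Parsey model load."""
    def parse(sentence):
        # Placeholder only: a real service would call into Parsey here.
        return "PLACEHOLDER PARSE OF: " + sentence
    return parse


def build_response(body, parser):
    """Turn a raw POST body (UTF-8 text) into a JSON reply as bytes."""
    sentence = body.decode("utf-8").strip()
    reply = {"input": sentence, "parse": parser(sentence)}
    return json.dumps(reply).encode("utf-8")


class ParseHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        reply = build_response(self.rfile.read(length), PARSER)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)


def serve(port=8080):
    """Load the model once, then answer requests until interrupted."""
    global PARSER
    PARSER = load_model()
    HTTPServer(("", port), ParseHandler).serve_forever()
```

Calling `serve()` starts the service. `HTTPServer` handles one request at a time, which sidesteps thread-safety questions at the cost of throughput; whether that is fast enough depends on how long a single parse takes.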

In any case, my advice (for what it’s worth) would be to turn to custom processors as a last
resort and try to leverage the built-in processors where possible. Whilst it is (fairly)
trivial (as you have found out) to write your own processor, it comes with its own overhead
over time in maintenance etc., whereas using the built-in ones comes with the reassurance that
they are well tried and tested.

Sorry I can’t be more specific on your (very interesting) use case.

Regards
Conrad

From: Pat Trainor <pat.trainor@gmail.com>
Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
Date: Monday, 6 June 2016 at 12:02
To: "users@nifi.apache.org" <users@nifi.apache.org>
Subject: Re: Nifi & Parsey McParseface! RegEx in a Processor...


Conrad,

Thanks for writing! You do get the gist of it. Last night I realized how easy it is to make
a custom processor. I was a little confused at first why I needed to pass on a new Flowfile
in my simple onTrigger function, but the error in the Nifi GUI about versions/timestamp made
it obvious. I guess I wasn't thinking and didn't check the nifi logs!

Anyway, if I am correct, I might be able to add an attribute to an existing Flowfile from
my little processor. As of late last night I could change one that was there already, but
today I will try to create one. If I can, then this should go well.

Unfortunately, and tell me if I am wrong, this new processor will still need to be loaded
each time a sentence needs to be analyzed by 'Parsey'. On a small scale, this is no big deal,
but normally people would be hammering it.

In looking for a clean, fast and [hopefully] elegant solution to accessing running services
from a processor, is it bad design to simply make my parser run as a service, and have it
listen to Kafka for text to parse? It could send it back as well via another topic...

But that is only 1/2 the problem. The other 1/2 is parsing out the output from Parsey, and
maybe that is what my processor should be for, rather than getting text sent to & returned
from Parsey... Because storing the output of Parsey (text) isn't a direct operation (see the
sample output text in prev/original email), its output needs to be analyzed first.

So let me know if this plan is viable:

  1.  Make the Parsey interaction via a java loop (daemon/service).
  2.  This daemon loads the Parsey model chosen once, then waits for Kafka messages to process,
outputting each on another Kafka topic. It expects to receive 3 things:

     *   Flowfile as text to parse.
     *   The Kafka Topic to listen to (processor can't configure this, but will reflect user's
choice).
     *   The Kafka Topic to send it back on (this I can send to the java daemon, and configure
each 'return' at runtime)

        *   This way, I am imagining many processors can send to Parsey via one fixed topic,
and they can each wait for the return data via a unique Topic for just that processor.
        *   I cannot see a way to adjust the listening Topic at runtime, so the user would
make one for all processors to use, then enter that as a processor attribute.

  3.  My simple processor sends a flowfile to it via the topic the user selects as a Processor
attribute "Send Topic".
  4.  The parser, well, parses. Then it sends back the reply on a Topic that is also set in the
processor, as the "Receive Topic".

     *   Is it better to just do the Kafka transfer in the processor, or hand it off to PutKafka
& GetKafka? My thinking is that this would be harder to do, and I would need to write
2 processors... Thoughts?

  5.  The custom processor I'm writing then has the parsed text, but not in a format that
will allow it to be put into a [graph] database. Knowing a word is an NNP isn't enough; you
must know which branch on the tree it was (how important it is).

     *   This is where the [X] extraction counts, or a better mechanism that I'm not thinking
of.

  6.  At this point, I am very tempted to keep going in this processor, but what if the user
wants HDFS, Titan, ? Best here is to stop & put the results in their own "relationship",
with the original text that was parsed in another, and perhaps even the 'raw parsed' tree-looking
text in another Relationship.

     *   So 4 relationships:

        *   Submitted
        *   Post Parsey
        *   Indexed
        *   Failure (of any of 2 or 3)
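
A rough sketch of the daemon loop from the plan above: one shared request topic, and a reply topic named inside each request so every sender gets its answers back separately. To keep it runnable without a broker, `queue.Queue` objects stand in for Kafka topics. That stand-in, the JSON message shape, and the names `run_daemon`, `make_request` and `STOP` are all assumptions made for illustration; the parser passed in represents a Parsey model that has already been loaded once.

```python
import json
import queue
from collections import defaultdict

STOP = object()  # sentinel used only by this sketch to end the loop


def make_request(text, reply_topic):
    """Bundle the text to parse with the topic the reply should go to."""
    return json.dumps({"text": text, "reply_topic": reply_topic})


def run_daemon(request_topic, reply_topics, parser):
    """Serve parse requests until STOP arrives; the model is already loaded."""
    while True:
        raw = request_topic.get()  # blocks, much like a consumer poll
        if raw is STOP:
            break
        msg = json.loads(raw)
        parsed = parser(msg["text"])
        # Each sender named its own reply topic, so replies never mix.
        reply_topics[msg["reply_topic"]].put(parsed)


# Usage: two "processors" share one request topic but read separate replies.
requests = queue.Queue()
replies = defaultdict(queue.Queue)
requests.put(make_request("first sentence", "processor-a"))
requests.put(make_request("second sentence", "processor-b"))
requests.put(STOP)
run_daemon(requests, replies, parser=str.upper)  # str.upper is a placeholder parser
```

Carrying the reply topic inside each message is one way around not being able to reconfigure the listening topic at runtime: the daemon listens on a single fixed topic and only the return path varies.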



I will make the (Indexed) output of this processor a standard, of sorts, which another processor
can change into a query for the DB of choice. The 'tree level' could be used for logic like:

  1.  NNP/NNPS at [1] is a vertex.
  2.  NN/NNS > [2] are destination vertices of the above.
  3.  VBG at ROOT is an edge.
  4.  ...
Would it be OK to leave cobbling together the query to INSERT into their DB of choice to
them? Once such a query is crafted, they can use any standard Nifi Put* processor, is my thinking...
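
The tree-level rules above could be sketched roughly as follows. This assumes the "Indexed" lines look like `[depth]word POS label`, reads "> [2]" as "depth greater than 2", and takes "at ROOT" to mean the dependency label is ROOT; the function name `classify` and the role names are hypothetical.

```python
import re

# One indexed line: "[depth]word POS label", e.g. "[1]It PRP nsubj".
TAGGED = re.compile(r"^\[(\d+)\](\S+) (\S+) (\S+)$")


def classify(tagged_line):
    """Return the parsed fields plus a graph role, or None if the line does not match."""
    m = TAGGED.match(tagged_line)
    if m is None:
        return None
    depth, word, pos, label = int(m.group(1)), m.group(2), m.group(3), m.group(4)
    if pos == "VBG" and label == "ROOT":
        role = "edge"
    elif pos in ("NNP", "NNPS") and depth == 1:
        role = "vertex"
    elif pos in ("NN", "NNS") and depth > 2:
        role = "destination_vertex"
    else:
        role = None  # no rule applies yet
    return {"word": word, "pos": pos, "label": label, "depth": depth, "role": role}
```

Returning plain dictionaries keeps this stage independent of any particular database, so a later step can turn each record into whatever query the chosen store needs.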

Your feedback appreciated!
On Jun 6, 2016 3:18 AM, "Conrad Crampton" <conrad.crampton@secdata.com>
wrote:
Hi,
This may be a long shot as I don’t know how many combinations of the column lengths with
| and + there are, but you could try using ReplaceTextWithMapping processor where you have
all combinations of +--| etc. in a text file with what they represent in terms of counts, e.g.
+--           [0]
|  +--       [1]
|      +--   [3]

etc. (tab separated)
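
On the question of how many combinations there are: assuming each tree level is drawn four columns wide, as either `|   ` or four spaces, and the branch itself as `+-- `, a prefix at depth d has d-1 slots with two choices each, so there are 2^(d-1) prefixes per depth. A small sketch that generates the tab-separated lines (the name `mapping_lines` and the `start` parameter are hypothetical, and whether ReplaceTextWithMapping accepts keys containing spaces and `|`/`+` characters has not been tested here):

```python
from itertools import product


def mapping_lines(max_depth, start=1):
    """List 'prefix<TAB>[count]' lines for every tree prefix up to max_depth."""
    lines = []
    for depth in range(1, max_depth + 1):
        # Each of the depth-1 leading slots is either a continued branch or blank.
        for slots in product(("|   ", "    "), repeat=depth - 1):
            prefix = "".join(slots) + "+-- "
            lines.append(f"{prefix}\t[{depth - 1 + start}]")
    return lines
```

With `start=1` a bare `+-- ` maps to `[1]`, leaving `[0]` for the root line; `start=0` gives the numbering used in the example above. The list doubles with every extra level, so a fixed mapping file only makes sense if the parse trees stay shallow.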

Also, I’m not particularly experienced in the area of sed, awk etc., but I’m guessing
some bash guru would be able to come up with some sort of script that does this and that could
be called from an ExecuteScript processor.

Regards
Conrad

From: Pat Trainor <pat.trainor@gmail.com>
Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
Date: Sunday, 5 June 2016 at 18:33
To: "users@nifi.apache.org" <users@nifi.apache.org>
Subject: Nifi & Parsey McParseface! RegEx in a Processor...

I have had success with using ReplaceText processor out of the box to modify the output of
a nifi-called script. I'm applying nifi to running the parsey mcparseface system (Syntaxnet)
from google. The output of the application looks like this:

---
Input: It is to two English scholars , father and son , Edward Pococke , senior and junior
, that the world is indebted for the knowledge of one of the most charming productions Arabian
philosophy can boast of .
Parse:
is VBZ ROOT
+-- It PRP nsubj
+-- to IN prep
|   +-- scholars NNS pobj
|       +-- two CD num
|       +-- English JJ amod
|       +-- , , punct
|       +-- father NN conj
|       |   +-- and CC cc
|       |   +-- son NN conj
|       +-- Pococke NNP appos
[...]
---

As you can see, my ExecuteStreamCommand is working fine. But there is a bit of importance
that needs to be taken from this text. My ReplaceText Processor used (the first one) is shown
in the attached. It only removes characters.

How many 'spaces' over each of the '+' signs sits is important. Simply removing leading spaces, +
and | characters moves the first word in each line to the first column, without telling you
how many columns over the words started in the original input.

What is needed is a way to count the number of columns at the beginning of each line that
precede the first alphanumeric. It doesn't matter whether the same processor can also clean
things up the way my present efforts do:

Input: It is to two English scholars , father and son , Edward Pococke , senior and junior
, that the world is indebted for the knowledge of one of the most charming productions Arabian
philosophy can boast of .
Parse:
is VBZ ROOT
It PRP nsubj
to IN prep
[...]

I am hoping to somehow use the expressions (a la ${line:blah...) in Nifi, or another mechanism
I'm not aware of, to gather the column count, making it available for later processing/storage.

[0]is VBZ ROOT
[1]It PRP nsubj
[1]to IN prep
[2] ...

With the [X] being the # of columns over from the left that the alpha-numeric character was.

The reasoning for this is that the position signifies how 'important' that attribute is in
the sentence. It looks like a tree, but the number (indentation) is the length of the branch
the word is on.
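
One way to get the [X] tags, sketched in Python (which could run as a script called from NiFi). It assumes each tree level is exactly four columns wide, drawn as `|   ` or four spaces, with `+-- ` marking the branch itself, so that [X] is the branch depth (prefix width divided by four), matching the [0]/[1]/[2] example above. The name `tag_depth` and the `indent_width` parameter are hypothetical.

```python
import re

# Zero or more 4-column slots ("|   " or "    "), then an optional "+-- " marker.
PREFIX = re.compile(r"^(?:[| ] {3})*(?:\+-- )?")


def tag_depth(line, indent_width=4):
    """Replace the tree-drawing prefix of one parse line with a [depth] tag."""
    prefix = PREFIX.match(line).group(0)
    depth = len(prefix) // indent_width
    return f"[{depth}]{line[len(prefix):]}"


sample = [
    "is VBZ ROOT",
    "+-- It PRP nsubj",
    "+-- to IN prep",
    "|   +-- scholars NNS pobj",
    "|       +-- , , punct",
]
tagged = [tag_depth(line) for line in sample]
```

Matching whole four-column slots, rather than stripping every `|`, `+`, `-` and space, keeps punctuation tokens such as `,` intact. The function should only be applied to the lines that follow `Parse:`; the `Input:` line would otherwise be tagged `[0]` as well.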

Is there a clever way to accomplish most/all of this, either with () regex or named attributes,
in Nifi?

Thanks!
pat<http://about.me/PatTrainor>
( ͡° ͜ʖ ͡°)

"A wise man can learn more from a foolish question than a fool can learn from a wise answer".
~ Bruce Lee.

