nifi-users mailing list archives

From Conrad Crampton <conrad.cramp...@SecData.com>
Subject Re: Nifi & Parsey McParseface! RegEx in a Processor...
Date Mon, 06 Jun 2016 07:18:16 GMT
Hi,
This may be a long shot, as I don’t know how many combinations of column lengths with
| and + there are, but you could try the ReplaceTextWithMapping processor, with all
combinations of +--| etc. in a text file alongside what they represent in terms of counts, e.g.
+--           [0]
|  +--       [1]
|      +--   [3]

etc. (tab separated)
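
A minimal sketch of how such a mapping file could be generated rather than typed by hand. It assumes each tree level in the parse output is four characters wide (either "|" plus three spaces, or four spaces), that a branch ends in "+-- ", and that the count wanted is the nesting depth, as in the [0]/[1]/[2] example in the quoted question. The function names are hypothetical, and whether ReplaceTextWithMapping accepts keys containing spaces and "|" has not been verified here.

```python
from itertools import product

UNIT = 4  # assumed width of one indentation step in the parse output


def build_mapping(max_depth):
    """Return (prefix, label) pairs for every tree prefix up to max_depth."""
    bar = "|" + " " * (UNIT - 1)   # an ancestor that still has children below
    gap = " " * UNIT               # an ancestor that is already finished
    rows = []
    for depth in range(1, max_depth + 1):
        # Every level above the branch marker is drawn as either bar or gap.
        for levels in product((bar, gap), repeat=depth - 1):
            rows.append(("".join(levels) + "+-- ", "[%d]" % depth))
    return rows


def mapping_text(max_depth):
    """Render the pairs as tab-separated lines, one mapping per line."""
    return "".join(prefix + "\t" + label + "\n"
                   for prefix, label in build_mapping(max_depth))
```

The number of prefixes doubles with each extra level (1, 2, 4, ...), so the file stays small only for shallow trees. The root line has no prefix at all and would need separate handling.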

Also, I’m not particularly experienced in the area of sed, awk etc., but I’m guessing
some bash guru would be able to come up with a script that does this and that could
be called from the ExecuteScript processor.
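
A minimal sketch of such a script, written in Python rather than sed or awk. It assumes each tree level is four characters wide ("|" plus three spaces, or four spaces) followed by the "+-- " branch marker, as in the sample output in the quoted question, and it reports the nesting depth rather than the raw column number. The function names are hypothetical, and wiring the script into a NiFi processor is not shown.

```python
import re

# Zero or more 4-character indentation units, then the "+-- " branch marker.
BRANCH = re.compile(r"^((?:[| ] {3})*)\+-- (.*)$")


def annotate(line):
    """Replace the tree prefix with [depth]; lines without a marker get [0]."""
    m = BRANCH.match(line)
    if m is None:
        return "[0]" + line
    depth = len(m.group(1)) // 4 + 1
    return "[%d]%s" % (depth, m.group(2))


def annotate_parse(lines):
    """Annotate only the lines that follow a "Parse:" header."""
    out = []
    in_parse = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("Input:"):
            in_parse = False          # a new sentence starts here
        if in_parse and line.strip():
            out.append(annotate(line))
        else:
            out.append(line)
        if line.strip() == "Parse:":
            in_parse = True
    return out
```

Matching the prefix by its shape, instead of searching for the first alphanumeric character, keeps punctuation tokens such as "," intact. Dividing the depth out is optional: 4 * depth gives back the column at which the word started.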

Regards
Conrad

From: Pat Trainor <pat.trainor@gmail.com>
Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
Date: Sunday, 5 June 2016 at 18:33
To: "users@nifi.apache.org" <users@nifi.apache.org>
Subject: Nifi & Parsey McParseface! RegEx in a Processor...

I have had success using the ReplaceText processor out of the box to modify the output of
a NiFi-called script. I'm applying NiFi to running the Parsey McParseface system (SyntaxNet)
from Google. The output of the application looks like this:

---
Input: It is to two English scholars , father and son , Edward Pococke , senior and junior
, that the world is indebted for the knowledge of one of the most charming productions Arabian
philosophy can boast of .
Parse:
is VBZ ROOT
+-- It PRP nsubj
+-- to IN prep
|   +-- scholars NNS pobj
|       +-- two CD num
|       +-- English JJ amod
|       +-- , , punct
|       +-- father NN conj
|       |   +-- and CC cc
|       |   +-- son NN conj
|       +-- Pococke NNP appos
[...]
---

As you can see, my ExecuteProcessorStream is working fine. But there is an important piece
of information that still needs to be extracted from this text. The ReplaceText processor I
used (the first one) is shown in the attachment. It only removes characters.

How many 'spaces' over each of the '+' signs sits is important. Simply removing leading
spaces, + and | characters moves the first word in each line to the first column, without
telling you how many columns over the word started in the original input.

What is needed is a way to count the number of columns at the beginning of each line that
precede the first alphanumeric character. It doesn't matter whether the same processor can
also clean things up the way my present efforts do:

Input: It is to two English scholars , father and son , Edward Pococke , senior and junior
, that the world is indebted for the knowledge of one of the most charming productions Arabian
philosophy can boast of .
Parse:
is VBZ ROOT
It PRP nsubj
to IN prep
[...]

I am hoping to somehow use the expressions (a la ${line:blah...) in Nifi, or another mechanism
I'm not aware of, to gather the column count, making it available for later processing/storage.

[0]is VBZ ROOT
[1]It PRP nsubj
[1]to IN prep
[2] ...

Here [X] is the number of columns over from the left at which the first alphanumeric character appeared.

The reasoning for this is that the position signifies how 'important' that attribute is in
the sentence. It looks like a tree, but the number (indentation) is the length of the branch
the word is on.

Is there a clever way to accomplish most/all of this, either with () regex or named attributes,
in Nifi?

Thanks!
pat<http://about.me/PatTrainor>
( ͡° ͜ʖ ͡°)

"A wise man can learn more from a foolish question than a fool can learn from a wise answer".
~ Bruce Lee.

