nifi-users mailing list archives

From Bryan Bende <bbe...@gmail.com>
Subject Re: ExtractText usage
Date Tue, 08 Sep 2015 21:02:54 GMT
Chris,

Thanks for the detailed explanation. It sounds like you are headed down the
right path and looking at the right processors.

I attempted to create a flow that routes certain lines of your example data
down a "high priority" route based on the content, and all the others to a
different route where they could be stored somewhere for later. I put a
template for the flow on our Wiki page [1].
You could think of the UpdateAttribute processors at the end as being
replaced by the things you mentioned, such as processors to store in Hive
or put to Kafka.
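
In rough terms, the flow is along these lines (a sketch based on the
description above; the template itself is the authoritative version):

  GetFile -> SplitText (1 line per FlowFile) -> RouteOnContent
     -> matched   -> UpdateAttribute ("high priority" route)
     -> unmatched -> UpdateAttribute (stored for later)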

If you haven't used a template before, the user guide has some good
information here [2] and here [3].

Let us know if there is anything else we can do to help.

-Bryan

[1]
https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates
[2]
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Import_Template
[3]
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#instantiating-a-template

On Tue, Sep 8, 2015 at 4:17 PM, Christopher Wilson <wilsoncj1@gmail.com>
wrote:

>
> This [toy] example was to learn how the system works, to be honest.  I'm
> glad I used it, because the multi-line regex caught me off guard.
> Developing these is always a combination of regexpal, grep, and some other
> hackiness that's painful.  Details about the regex engine would be nice to
> have in the docs, since knowing them going in would have helped; longer
> term, if NiFi is going to require high-fidelity content matches, then
> pulling in regexpal/regexr would be a good thing.
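>
> (A quick way to sanity-check patterns against the same engine NiFi
> uses, java.util.regex, before wiring them into ExtractText; a minimal,
> illustrative sketch:
>
>     import java.util.regex.Matcher;
>     import java.util.regex.Pattern;
>
>     public class RegexCheck {
>         public static void main(String[] args) {
>             // MULTILINE corresponds to ExtractText's "Enable Multiline"
>             Pattern p = Pattern.compile("^(\"R.*)$", Pattern.MULTILINE);
>             Matcher m = p.matcher("\"H\",\"x\"\n\"R\",\"1\"\n\"R\",\"2\"");
>             while (m.find()) {
>                 System.out.println(m.group(1)); // prints each "R" line
>             }
>         }
>     }
>
> It's no substitute for documentation on the engine flags, but it takes
> some of the guesswork out.)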
>
> The [real] case(s) I'm going to be working on once I understand this
> better deal with a variety of log data - not all of it created equal -
> which is why I started here.  The current case is log data where multiple
> applications *may* write to the same log file (syslog) with different
> payloads (strings, JSON, etc.).  I'd rather not build the routing
> functions outside of NiFi in code (rsyslog/Python/Kafka/Spark/etc.); I
> want to use the security/provenance mechanisms that are part of this
> system and pipeline that data out to other places - whether they be
> files, Hive, HBase, websockets, etc.
>
> This should be simple to implement since it's just stdout and not too
> dissimilar to how you read from a Kafka topic.  In fact, that's probably
> the path I'll go down initially - but I'd like some solution for those
> files that don't fit this model.
>
> What I'd like to do is provide a list of regexes to route on match and
> perform some action like insert into Hive/HBase, send to Spark/Kafka, or
> other processors.  Imagine a critical syslog alert that MUST go to a
> critical service desk queue as opposed to *.info which might just pipe into
> a Hive table for exploratory analysis later.
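>
> Something like this, if RouteOnContent's dynamic properties work the way
> I think they do (property names and regexes illustrative):
>
>     Match Requirement = content must contain match
>     critical          = regex matching the must-route alerts
>     info              = regex matching the *.info traffic
>
> with each property becoming its own relationship to route on.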
>
> I know RouteOnContent has this capability, and I will most likely pipe
> syslog/Kafka data initially - but as I said, not all log files are equal,
> and there may be some that just get read in a la carte, where
> discriminating between line 10 and line 100 may be important.  I also
> think that adding an attribute and then sending the FlowFile along could
> be short-circuited with a simple "match and forward" mechanism, rather
> than copying content into attributes, which again goes back to the
> hit-or-miss regex machine.
>
> I don't know about the impact of multiple FlowFiles, but is there an
> accumulator that will let me take N lines and combine them into a
> single FlowFile?
>
> -Chris
>
>
> On Tue, Sep 8, 2015 at 3:00 PM, Bryan Bende <bbende@gmail.com> wrote:
>
>> Chris,
>>
>> After you extract the lines you are interested in, what do you want to do
>> with the data? Are you delivering it to another system? Performing more
>> processing on it?
>>
>> Just want to make sure we fully understand the scenario so we can offer
>> the best possible solution.
>>
>> Thanks,
>>
>> Bryan
>>
>> On Tue, Sep 8, 2015 at 2:43 PM, Christopher Wilson <wilsoncj1@gmail.com>
>> wrote:
>>
>>> I've moved the ball a bit closer to the goal - I enabled DOTALL Mode and
>>> increased the Capture Group Length to 4096.  That grabs everything from the
>>> first line beginning with "R" to some of the "S"'s.
>>>
>>> Having a bit of trouble terminating the regex though.
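>>>
>>> (I suspect the right shape is a reluctant quantifier plus a lookahead
>>> for the first "S" record; with DOTALL still on, something like
>>>
>>> ("R.*?)(?=\n"S")
>>>
>>> but I haven't verified that yet.)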
>>>
>>> Once I get that sorted I'll post the result, but I have to say that the
>>> capture group length could be problematic "in the wild".  In a perfect
>>> world you would know the length up front - but I can see plenty of
>>> cases where you won't.
>>>
>>> -Chris
>>>
>>> On Tue, Sep 8, 2015 at 2:05 PM, Mark Payne <markap14@hotmail.com> wrote:
>>>
>>>> Agreed. Bryan's suggestion will give you the ability to match each line
>>>> against the regex,
>>>> rather than trying to match the entire file. It would result in a new
>>>> FlowFile for each line of
>>>> text, though, as he said. But if you need to rebuild a single file,
>>>> those could potentially be
>>>> merged together using a MergeContent processor, as well.
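>>>>
>>>> For example (values illustrative; see the MergeContent documentation
>>>> for the full set of properties):
>>>>
>>>>     Merge Strategy            = Bin-Packing Algorithm
>>>>     Minimum Number of Entries = 100
>>>>     Delimiter Strategy        = Text
>>>>     Demarcator                = (a newline)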
>>>>
>>>> ________________________________
>>>> > Date: Tue, 8 Sep 2015 13:03:08 -0400
>>>> > Subject: Re: ExtractText usage
>>>> > From: bbende@gmail.com
>>>> > To: users@nifi.apache.org
>>>> >
>>>> > Chris,
>>>> >
>>>> > I think the issue is that ExtractText is not reading the file line by
>>>> > line and then applying your pattern to each line. It is applying the
>>>> > pattern to the whole content of the file, so you would need a regex
>>>> > that repeated the pattern you were looking for so that it captured
>>>> > multiple times.
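>>>> >
>>>> > For example, to capture three consecutive "R" lines you would have to
>>>> > repeat the group explicitly (illustrative only):
>>>> >
>>>> > ^("R.*)\n("R.*)\n("R.*)
>>>> >
>>>> > which gets unwieldy fast, hence the alternative below.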
>>>> >
>>>> > When I tested your example, it was actually extracting the first match
>>>> > 3 times, which I think is because of the following...
>>>> > - It always puts the first match in the property base name, in this
>>>> > case "regex",
>>>> > - then it puts the entire match in index 0, in this case regex.0, and
>>>> > in this case it is only matching the first occurrence,
>>>> > - and then all of the matches would be in order after that, starting
>>>> > with index 1; in this case there is only 1 match, so it is just
>>>> > regex.1.
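>>>> >
>>>> > So for your sample you would end up with attributes along these lines
>>>> > (values abbreviated, illustrative only):
>>>> >
>>>> > regex   = "R","1","TB","CLM",...
>>>> > regex.0 = "R","1","TB","CLM",...
>>>> > regex.1 = "R","1","TB","CLM",...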
>>>> >
>>>> > Another solution that might be simpler is to put a SplitText processor
>>>> > between GetFile and ExtractText, and set the Line Split Count to 1.
>>>> > This will send 1 line at a time to your ExtractText processor, which
>>>> > would then match only the lines starting with 'R'.
>>>> > The downside is that all of the lines with 'R' would be in different
>>>> > FlowFiles, but this may or may not matter depending on what you want
>>>> > to do with them after.
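>>>> >
>>>> > That is, roughly (a sketch, not the exact flow):
>>>> >
>>>> > GetFile -> SplitText (Line Split Count = 1) -> ExtractText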
>>>> >
>>>> > -Bryan
>>>> >
>>>> >
>>>> > On Tue, Sep 8, 2015 at 12:12 PM, Christopher Wilson
>>>> > <wilsoncj1@gmail.com> wrote:
>>>> > I'm trying to read a directory of .csv files which have 3 different
>>>> > schemas/list types (not my idea). The descriptor is in the first
>>>> > column of the csv file. I'm reading the files in using GetFile and
>>>> > passing them into ExtractText, but I'm only getting the first 3 (of 8)
>>>> > lines matching my first regex. What I want to do is grab all the lines
>>>> > beginning with "R" and dump them off to a file (for now). My end goal
>>>> > would be to loop through these, grabbing lines, or blocks of lines, by
>>>> > regex, and route them downstream based on that regex.
>>>> >
>>>> > Details and first 11 lines of a sample file below.
>>>> >
>>>> > Thanks in advance.
>>>> >
>>>> > -Chris
>>>> >
>>>> > NiFi version: 0.2.1
>>>> > OS: Ubuntu 14.01
>>>> > JVM: java-1.7.0-openjdk-amd64
>>>> >
>>>> > ExtractText:
>>>> >
>>>> > Enable Multiline = True
>>>> > Enable Unix Lines Mode = True
>>>> > regex = ^("R.*)$
>>>> >
>>>> >
>>>> > "H","USA","BP","20140502","9","D","BP"
>>>> > "R","1","TB","CLM"," "," ","3U"," ","47000","0","47000","0"," ","0","
>>>> > ","0"," ","0"," ","0"," ","0"," ","0"," ","0","25000","25000","
>>>> > ","650","F","D","D","6"," "," "," ","1:20PM ","1:51PM ","0122","
>>>> ","Clm
>>>> > 25000","Fast","","16","87","
>>>> > ","","","64","117.39","2266","4648","11129","0","0","
>>>> > ","","112089","Good","Cloudy","","","Y"
>>>> > "R","2","TB","CLM"," ","B","3U"," ","34000","0","34000","0"," ","0","
>>>> > ","0"," ","0"," ","0"," ","0"," ","0"," ","0","25000","25000","
>>>> > ","600","F","D","D","7"," "," "," ","1:51PM ","2:22PM ","0151","
>>>> ","Clm
>>>> > 25000N2L","Fast","","16","79","
>>>> > ","","","64","112.36","2444","4803","10003","0","0","
>>>> > ","","261868","Poor","Cloudy","","","Y"
>>>> > "R","3","TB","STK","S"," ","3U","
>>>> > ","100000","0","100000","0","A","100000"," ","0"," ","0"," ","0","
>>>> > ","0"," ","0"," ","0","0","0"," ","600","F","D","D","6"," ","Affirmed
>>>> > Success S.","AfrmdScsB","2:22PM ","2:53PM ","0222","
>>>> > ","AfrmdScsB100k","Fast","","16","88","
>>>> > ","","","64","110.54","2323","4618","5810","0","0","
>>>> > ","","259015","5","Clear","","","Y"
>>>> > "R","4","TB","MCL"," "," ","3U"," ","49200","0","49200","0"," ","0","
>>>> > ","0"," ","0"," ","0"," ","0"," ","0"," ","0","40000","40000","
>>>> > ","850","F","D","D","8"," "," "," ","2:53PM ","3:24PM ","0253"," ","Md
>>>> > 40000","Fast","Y","30","72","
>>>> > ","","","64","145.58","2425","4829","11358","13909","0","
>>>> > ","","260343","9","Clear","0","","Y"
>>>> > "R","5","TB","ALW"," "," ","3U"," ","77000","0","77000","0"," ","0","
>>>> > ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","0","
>>>> > ","900","F","D","D","7"," "," "," ","3:24PM ","3:55PM ","0325","
>>>> ","Alw
>>>> > 77000N1X","Fast","Y","30","74","
>>>> > ","","","64","151.69","2330","4643","11156","13832","0","
>>>> > ","","302065","Good","Clear","","","Y"
>>>> > "R","6","TB","MSW","S","B","3U"," ","60000","1200","60000","0","
>>>> > ","0"," ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","0","
>>>> > ","800","F","D","D","5"," "," "," ","3:55PM ","4:26PM ","0355"," ","Md
>>>> > Sp Wt 58k","Fast","","30","61","
>>>> > ","","","64","140.64","2481","4931","11477","0","0","
>>>> > ","","161404","Good","Clear","","","Y"
>>>> > "R","7","TB","CLM"," ","B","3U"," ","40000","0","40000","0"," ","0","
>>>> > ","0"," ","0"," ","0"," ","0"," ","0"," ","0","20000","20000","
>>>> > ","800","F","D","D","6"," "," "," ","4:26PM ","4:57PM ","0427","
>>>> ","Clm
>>>> > 20000","Fast","","30","68","
>>>> > ","","","64","139.31","2337","4770","11402","0","0","
>>>> > ","","344306","Good","Clear","","","Y"
>>>> > "R","8","TB","ALW"," ","B","3U"," ","77000","0","77000","0"," ","0","
>>>> > ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","0","
>>>> > ","850","F","D","D","7"," "," "," ","4:57PM ","5:28PM ","0457","
>>>> ","Alw
>>>> > 77000N1X","Fast","","30","76","
>>>> > ","","","64","144.76","2416","4847","11365","13836","0","
>>>> > ","","213021","Good","Clear","","","Y"
>>>> > "R","9","TB","STR"," "," ","3U"," ","60000","0","60000","0"," ","0","
>>>> > ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","40000","
>>>> > ","700","F","D","D","8"," "," "," ","5:28PM "," ","0528"," ","Alw
>>>> > 40000s","Fast","Y","16","81","
>>>> > ","","","64","124.66","2339","4740","11211","0","0","
>>>> > ","","332649","6,8","Clear","0","","Y"
>>>> >
>>>> "S","1","000008813341TB","Coolusive","20100124","KY","TB","Colt","Bay","Ice
>>>> > Cool Kitty","2003","TB","Elusive Quality","1993","TB","Tomorrows
>>>> > Cat","1995","TB","Gone
>>>> >
>>>> West","1984","TB","122","0","L","","28200","Velasquez","Cornelio","H.","
>>>> > ","Jacobson","David"," ","Drawing Away Stable and Jacobson, David","
>>>> > "," ","265","N","
>>>> >
>>>> ","0","N","5","5","3","3","4","0","0","1","1","1","10","200","0","0","100","75","510","320","0","0","0","0","N","25000","4w
>>>> > into lane, held","chase 2o turn, bid 4w turning for home,took over,
>>>> > held
>>>> >
>>>> sway","7.30","3.80","2.70","Y","000000002103TE","TE","Barbara","Robert","
>>>> > ","000001976480O6","O6","Averill","Bradley","E.","
>>>> > ","N","0","N","","0","","87","Lansdon B. Robbins & Kevin
>>>> > Callahan","000000257611TE","000000002695JE"
>>>> >
>>>> >
>>>>
>>>>
>>>
>>>
>>
>
