nifi-users mailing list archives

From Christopher Wilson <wilson...@gmail.com>
Subject Re: ExtractText usage
Date Wed, 09 Sep 2015 12:55:44 GMT
Bryan, thank you for the template, I'll look through that today and see if
that will do the trick.  The multi-line regex for capturing all lines
beginning with "R, up to (but not including) the next line beginning with
"S", is below.

# multi-line
# turn off greedy matches
# lookahead for lines beginning with "S"

(?m)(\"R\"\,.*?)(?=^\"S\"\,)
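A quick way to sanity-check a pattern like this outside NiFi is a small Python sketch (the sample lines are abbreviated stand-ins for the real CSV; the `\"` and `\,` escapes from the original are unnecessary and dropped here, and the `s` flag stands in for the processor's DOTALL setting):

```python
import re

# (?m) -> ^ anchors at each line start; (?s) lets the lazy .*? cross
# newlines; the lookahead stops the capture just before the "S" line.
pattern = re.compile(r'(?ms)("R",.*?)(?=^"S",)')

sample = (
    '"H","USA","BP"\n'
    '"R","1","TB","CLM"\n'
    '"R","2","TB","CLM"\n'
    '"S","1","000008813341TB"\n'
)

match = pattern.search(sample)
print(match.group(1))  # both "R" lines, not including the "S" line
```

The lazy `.*?` plus the lookahead is what keeps the capture from running past the first "S" line.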

There's some weirdness with PutFile in that when I replace the text with
8-10 lines (depending on input) I get duplicate lines in the output file
(from 8 lines to 3800 lines).  Very strange.

Mark, that's what I was thinking when I first pulled down ExtractText.  I
think the processor you describe would be incredibly useful for these kinds
of cases and maybe we can just extend Aldrin's JIRA (
https://issues.apache.org/jira/browse/NIFI-921) to fully flesh that out?
I'm happy to help with the JIRA or testing as needed.

My two cents on NiFi thus far: I would have expected things like text
parsing and simple file IO to be there already.  Like the issue with
PutFile, my expectation with that processor is that it would "put a new
file" and clobber the contents with the new attribute passed in, or append
if specified.  I could do a lot with access to stdin/stdout, and it seems
like I have to extend that functionality with additional scripting.

Again, I'm just getting up to speed and thank you all for the help.

-Chris

On Tue, Sep 8, 2015 at 11:00 PM, Aldrin Piri <aldrinpiri@gmail.com> wrote:

> Mark,
>
> The need is certainly there. This core functionality is quite similar to
> what NIFI-921 (sorry for no link, doing mobile) is after. The issue that
> both of these are addressing is that there isn't a simple way to promote
> "simple" data to an attribute. There are some important distinctions
> between this functionality and those in the ticket but the core need is the
> same. They may not end up smashed into one processor, but if this
> generates a ticket, please link it so both are considered in the process.
> On Tue, Sep 8, 2015 at 20:25 Mark Payne <markap14@hotmail.com> wrote:
>
>> I'm wondering if we should perhaps look into building a RouteText
>> processor.
>>
>> The way that I could see this working is to have a few different
>> properties:
>>
>> Routing Strategy:
>> - Route matching lines to 'matched' if all match
>> - Route matching lines to 'matched' if any match
>> - Route each line to matching Property Name
>> - Route FlowFile to 'matched' if all lines match
>> - Route FlowFile to 'matched' if any line matches
>>
>> A Match Strategy:
>> - Starts With
>> - Ends With
>> - Contains
>> - Matches Regular Expression
>> - Equals
>>
>> And then user-defined properties to search for.
>>
>> So to find lines that begin with "R
>>
>> You would simply add a property named "Begins with R" and set the value
>> to: "R
>> Then set the Match Strategy to "Starts With"
>> and the Routing Strategy to "Route each line to matching Property Name".
>>
>> Then, any line that begins with "R will be routed to the Begins with R
>> relationship.
>> This would be a simple way to pull out any particular lines of interest
>> in a text file.
>>
>> I can see this being very useful for processing log files, CSV, etc.
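A rough sketch of the RouteText idea above, in Python (the name `route_lines` is hypothetical; only the "Starts With" strategy with "Route each line to matching Property Name" is modeled):

```python
# Hypothetical model of the proposed processor: user-defined properties
# are (name, value) pairs; each line goes to the first property whose
# value it starts with, or to 'unmatched' otherwise.
def route_lines(text, properties):
    routes = {name: [] for name in properties}
    routes["unmatched"] = []
    for line in text.splitlines():
        for name, prefix in properties.items():
            if line.startswith(prefix):  # "Starts With" match strategy
                routes[name].append(line)
                break
        else:
            routes["unmatched"].append(line)
    return routes

routes = route_lines(
    '"H","USA"\n"R","1"\n"R","2"\n"S","1"\n',
    {"Begins with R": '"R'},
)
```

Here `routes["Begins with R"]` holds the two "R" lines, mirroring lines routed to the "Begins with R" relationship, while the "H" and "S" lines land in `routes["unmatched"]`.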
>>
>>
>>
>>
>> ________________________________
>> > Date: Tue, 8 Sep 2015 17:02:54 -0400
>> > Subject: Re: ExtractText usage
>> > From: bbende@gmail.com
>> > To: users@nifi.apache.org
>> >
>> > Chris,
>> >
>> > Thanks for the detailed explanation. It sounds like you are headed down
>> > the right path and looking at the right processors.
>> >
>> > I attempted to create a flow that routes certain lines of your example
>> > data down a "high priority" route based on the content, and all the
>> > others to a different route where they could be stored somewhere for
>> > later. I put a template for the flow on our Wiki page [1].
>> > You could think of the UpdateAttribute processors at the end as being
>> > replaced by the things you mentioned, such as processors to store in
>> > Hive or put to Kafka.
>> >
>> > If you haven't used a template before, the user guide has some good
>> > information here [2] and here [3].
>> >
>> > Let us know if there is anything else we can do to help.
>> >
>> > -Bryan
>> >
>> > [1]
>> >
>> https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates
>> > [2]
>> >
>> https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Import_Template
>> > [3]
>> https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#instantiating-a-template
>> >
>> > On Tue, Sep 8, 2015 at 4:17 PM, Christopher Wilson
>> > <wilsoncj1@gmail.com> wrote:
>> >
>> > This [toy] example was to learn how this system works, to be honest.
>> > I'm glad I used it because the multi-line regex caught me off guard.
>> > Developing these is always a combination of regexpal, grep, and some
>> > other hackiness that's painful. Knowing more about the regex engine
>> > going in would have helped; that would be nice to have in the docs.
>> > Long term, if NiFi is going to require high-fidelity content matches,
>> > then pulling in regexpal/regexr would be a good thing.
>> >
>> > The [real] case(s) I'm going to be working on once I understand this
>> > better deal with a variety of log data - not all of it created equal -
>> > which is why I started here. The current case is log data where
>> > multiple applications *may* write to the same log file (syslog) with
>> > different payloads (strings, json, etc). I'd rather not build the
>> > routing functions outside of NiFi in code
>> > (rsyslog/Python/Kafka/Spark/etc) and use the security/provenance
>> > mechanisms that are part of this system and pipeline that data out to
>> > other places - whether they be files, Hive, HBase, websockets, etc.
>> >
>> > This should be simple to implement since it's just stdout and not too
>> > dissimilar to how you read from a Kafka topic. In fact, that's
>> > probably the path I'll go down initially - but would like some solution
>> > for those files that don't fit this model.
>> >
>> > What I'd like to do is provide a list of regexes and RouteOnMatch and
>> > perform some action like insert into Hive/HBase, send to Spark/Kafka,
>> > or other processors. Imagine a critical syslog alert that MUST go to a
>> > critical service desk queue as opposed to *.info which might just pipe
>> > into a Hive table for exploratory analysis later.
>> >
>> > I know RouteOnContent has this capability and I will most likely pipe
>> > syslog/Kafka data initially - but as I said, not all log files are
>> > equal and there may be some that just get read in a la carte and
>> > discriminating between line 10 and 100 may be important. I also think
>> > that adding an attribute and then sending it along could be
>> > short-circuited with just a "match and forward" mechanism rather than
>> > copying content into attributes which again goes back into the
>> > hit-or-miss regex machine.
>> >
>> > I don't know about the impact of multiple FlowFiles but is there an
>> > accumulator that will allow me to take N lines and accumulate them into
>> > a single flow file?
>> >
>> > -Chris
>> >
>> >
>> > On Tue, Sep 8, 2015 at 3:00 PM, Bryan Bende
>> > <bbende@gmail.com> wrote:
>> > Chris,
>> >
>> > After you extract the lines you are interested in, what do you want to
>> > do with the data after that? are you delivering to another system?
>> > performing more processing on the data?
>> >
>> > Just want to make sure we fully understand the scenario so we can offer
>> > the best possible solution.
>> >
>> > Thanks,
>> >
>> > Bryan
>> >
>> > On Tue, Sep 8, 2015 at 2:43 PM, Christopher Wilson
>> > <wilsoncj1@gmail.com> wrote:
>> > I've moved the ball a bit closer to the goal - I enabled DOTALL Mode
>> > and increased the Capture Group Length to 4096. That grabs everything
>> > from the first line beginning with "R" to some of the "S"'s.
>> >
>> > Having a bit of trouble terminating the regex though.
>> >
>> > Once I get that sorted I'll post the result, but I have to say that the
>> > capture group length could be problematic "in the wild". In a perfect
>> > world you would know the length up front - but I can see plenty of
>> > cases where that's not going to be the case.
>> >
>> > -Chris
>> >
>> > On Tue, Sep 8, 2015 at 2:05 PM, Mark Payne
>> > <markap14@hotmail.com> wrote:
>> > Agreed. Bryan's suggestion will give you the ability to match each line
>> > against the regex,
>> > rather than trying to match the entire file. It would result in a new
>> > FlowFile for each line of
>> > text, though, as he said. But if you need to rebuild a single file,
>> > those could potentially be
>> > merged together using a MergeContent processor, as well.
>> >
>> > ________________________________
>> >> Date: Tue, 8 Sep 2015 13:03:08 -0400
>> >> Subject: Re: ExtractText usage
>> >> From: bbende@gmail.com
>> >> To: users@nifi.apache.org
>> >>
>> >> Chris,
>> >>
>> >> I think the issue is that ExtractText is not reading the file line by
>> >> line, and then applying your pattern to each line. It is applying the
>> >> pattern to the whole content of the file so you would need a regex that
>> >> repeated the pattern you were looking for so that it captured multiple
>> >> times.
>> >>
>> >> When I tested your example, it was actually extracting the first match
>> >> 3 times, which I think is because of the following:
>> >> - It always puts the first match in the property base name, in this
>> >> case "regex",
>> >> - then it puts the entire match in index 0, in this case regex.0, and
>> >> in this case it is only matching the first occurrence,
>> >> - and then all of the matches would be in order after that starting
>> >> with index 1; in this case there is only 1 match so it is just regex.1.
>> >>
>> >> Another solution that might be simpler is to put a SplitText processor
>> >> between GetFile and ExtractText, and set the Line Split Count to 1.
>> >> This will send 1 line at a time to your ExtractText processor, which
>> >> would then match only the lines starting with 'R'.
>> >> The downside is that all of the lines with 'R' would be in different
>> >> FlowFiles, but this may or may not matter depending what you wanted to
>> >> do with them after.
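Bryan's distinction (one match against the whole content vs. matching line by line) can be seen with a quick Python comparison; Python's `re` is close enough to the Java engine for this illustration, and the sample lines are abbreviated stand-ins for the real data:

```python
import re

text = '"H","x"\n"R","1"\n"R","2"\n"R","3"\n'
pattern = re.compile(r'^("R.*)$', re.MULTILINE)

# A single match against the whole content - effectively what
# ExtractText does per capture group - only yields the first "R" line:
first = pattern.search(text).group(1)

# Matching repeatedly - the behavior that SplitText (Line Split Count
# = 1) followed by ExtractText simulates - finds all three:
all_matches = pattern.findall(text)
```

Here `first` is only `'"R","1"'`, while `all_matches` contains all three "R" lines.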
>> >>
>> >> -Bryan
>> >>
>> >>
>> >> On Tue, Sep 8, 2015 at 12:12 PM, Christopher Wilson
>> >> <wilsoncj1@gmail.com> wrote:
>> >> I'm trying to read a directory of .csv files which have 3 different
>> >> schemas/list types (not my idea). The descriptor is in the first
>> >> column of the csv file. I'm reading the files in using GetFile and
>> >> passing them into ExtractText, but I'm only getting the first 3 (of 8)
>> >> lines matching my first regex. What I want to do is grab all the lines
>> >> beginning with "R" and dump them off to a file (for now). My end goal
>> >> would be to loop through these and grab lines, or blocks of lines, by regex
>> >> and route them downstream based on that regex.
>> >>
>> >> Details and first 11 lines of a sample file below.
>> >>
>> >> Thanks in advance.
>> >>
>> >> -Chris
>> >>
>> >> NiFi version: 0.2.1
>> >> OS: Ubuntu 14.01
>> >> JVM: java-1.7.0-openjdk-amd64
>> >>
>> >> ExtractText:
>> >>
>> >> Enable Multiline = True
>> >> Enable Unix Lines Mode = True
>> >> regex = ^("R.*)$
>> >>
>> >>
>> >> "H","USA","BP","20140502","9","D","BP"
>> >> "R","1","TB","CLM"," "," ","3U"," ","47000","0","47000","0"," ","0","
>> >> ","0"," ","0"," ","0"," ","0"," ","0"," ","0","25000","25000","
>> >> ","650","F","D","D","6"," "," "," ","1:20PM ","1:51PM ","0122"," ","Clm
>> >> 25000","Fast","","16","87","
>> >> ","","","64","117.39","2266","4648","11129","0","0","
>> >> ","","112089","Good","Cloudy","","","Y"
>> >> "R","2","TB","CLM"," ","B","3U"," ","34000","0","34000","0"," ","0","
>> >> ","0"," ","0"," ","0"," ","0"," ","0"," ","0","25000","25000","
>> >> ","600","F","D","D","7"," "," "," ","1:51PM ","2:22PM ","0151"," ","Clm
>> >> 25000N2L","Fast","","16","79","
>> >> ","","","64","112.36","2444","4803","10003","0","0","
>> >> ","","261868","Poor","Cloudy","","","Y"
>> >> "R","3","TB","STK","S"," ","3U","
>> >> ","100000","0","100000","0","A","100000"," ","0"," ","0"," ","0","
>> >> ","0"," ","0"," ","0","0","0"," ","600","F","D","D","6"," ","Affirmed
>> >> Success S.","AfrmdScsB","2:22PM ","2:53PM ","0222","
>> >> ","AfrmdScsB100k","Fast","","16","88","
>> >> ","","","64","110.54","2323","4618","5810","0","0","
>> >> ","","259015","5","Clear","","","Y"
>> >> "R","4","TB","MCL"," "," ","3U"," ","49200","0","49200","0"," ","0","
>> >> ","0"," ","0"," ","0"," ","0"," ","0"," ","0","40000","40000","
>> >> ","850","F","D","D","8"," "," "," ","2:53PM ","3:24PM ","0253"," ","Md
>> >> 40000","Fast","Y","30","72","
>> >> ","","","64","145.58","2425","4829","11358","13909","0","
>> >> ","","260343","9","Clear","0","","Y"
>> >> "R","5","TB","ALW"," "," ","3U"," ","77000","0","77000","0"," ","0","
>> >> ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","0","
>> >> ","900","F","D","D","7"," "," "," ","3:24PM ","3:55PM ","0325"," ","Alw
>> >> 77000N1X","Fast","Y","30","74","
>> >> ","","","64","151.69","2330","4643","11156","13832","0","
>> >> ","","302065","Good","Clear","","","Y"
>> >> "R","6","TB","MSW","S","B","3U"," ","60000","1200","60000","0","
>> >> ","0"," ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","0","
>> >> ","800","F","D","D","5"," "," "," ","3:55PM ","4:26PM ","0355"," ","Md
>> >> Sp Wt 58k","Fast","","30","61","
>> >> ","","","64","140.64","2481","4931","11477","0","0","
>> >> ","","161404","Good","Clear","","","Y"
>> >> "R","7","TB","CLM"," ","B","3U"," ","40000","0","40000","0"," ","0","
>> >> ","0"," ","0"," ","0"," ","0"," ","0"," ","0","20000","20000","
>> >> ","800","F","D","D","6"," "," "," ","4:26PM ","4:57PM ","0427"," ","Clm
>> >> 20000","Fast","","30","68","
>> >> ","","","64","139.31","2337","4770","11402","0","0","
>> >> ","","344306","Good","Clear","","","Y"
>> >> "R","8","TB","ALW"," ","B","3U"," ","77000","0","77000","0"," ","0","
>> >> ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","0","
>> >> ","850","F","D","D","7"," "," "," ","4:57PM ","5:28PM ","0457"," ","Alw
>> >> 77000N1X","Fast","","30","76","
>> >> ","","","64","144.76","2416","4847","11365","13836","0","
>> >> ","","213021","Good","Clear","","","Y"
>> >> "R","9","TB","STR"," "," ","3U"," ","60000","0","60000","0"," ","0","
>> >> ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","40000","
>> >> ","700","F","D","D","8"," "," "," ","5:28PM "," ","0528"," ","Alw
>> >> 40000s","Fast","Y","16","81","
>> >> ","","","64","124.66","2339","4740","11211","0","0","
>> >> ","","332649","6,8","Clear","0","","Y"
>> >>
>> "S","1","000008813341TB","Coolusive","20100124","KY","TB","Colt","Bay","Ice
>> >> Cool Kitty","2003","TB","Elusive Quality","1993","TB","Tomorrows
>> >> Cat","1995","TB","Gone
>> >>
>> West","1984","TB","122","0","L","","28200","Velasquez","Cornelio","H.","
>> >> ","Jacobson","David"," ","Drawing Away Stable and Jacobson, David","
>> >> "," ","265","N","
>> >>
>> >
>> ","0","N","5","5","3","3","4","0","0","1","1","1","10","200","0","0","100","75","510","320","0","0","0","0","N","25000","4w
>> >> into lane, held","chase 2o turn, bid 4w turning for home,took over,
>> >> held
>> >>
>> sway","7.30","3.80","2.70","Y","000000002103TE","TE","Barbara","Robert","
>> >> ","000001976480O6","O6","Averill","Bradley","E.","
>> >> ","N","0","N","","0","","87","Lansdon B. Robbins & Kevin
>> >> Callahan","000000257611TE","000000002695JE"
>> >>
>> >>
>> >
>> >
>> >
>> >
>> >
>>
>
>
