nifi-users mailing list archives

From Mark Payne <marka...@hotmail.com>
Subject RE: ExtractText usage
Date Wed, 09 Sep 2015 16:38:47 GMT
All,

I did create a ticket for this: https://issues.apache.org/jira/browse/NIFI-942

And I linked it as related to NIFI-921.

Thanks
-Mark

________________________________
> From: aldrinpiri@gmail.com 
> Date: Wed, 9 Sep 2015 03:00:18 +0000 
> Subject: Re: ExtractText usage 
> To: users@nifi.apache.org 
> 
> Mark, 
> 
> The need is certainly there. This core functionality is quite similar 
> to what NIFI-921 (sorry for no link, doing mobile) is after. The issue 
> that both of these are addressing is that there isn't a simple way to 
> promote "simple" data to an attribute. There are some important 
> distinctions between this functionality and those in the ticket but the 
> core need is the same. They may not be smashed into one processor, but 
> if this generates a ticket, please link it so both are considered in 
> the process. 
> On Tue, Sep 8, 2015 at 20:25 Mark Payne 
> <markap14@hotmail.com> wrote: 
> I'm wondering if we should perhaps look into building a RouteText processor. 
> 
> The way that I could see this working is to have a few different properties: 
> 
> Routing Strategy: 
> - Route matching lines to 'matched' if all match 
> - Route matching lines to 'matched' if any match 
> - Route each line to matching Property Name 
> - Route FlowFile to 'matched' if all lines match 
> - Route FlowFile to 'matched' if any line matches 
> 
> A Match Strategy: 
> - Starts With 
> - Ends With 
> - Contains 
> - Matches Regular Expression 
> - Equals 
> 
> And then user-defined properties to search for. 
> 
> So, to find lines that begin with "R: 
> 
> You would simply add a property named "Begins with R" and set the value 
> to: "R 
> Then set the Match Strategy to "Starts With" 
> And Routing Strategy to "Route each line to matching Property Name" 
> 
> Then, any line that begins with "R will be routed to the Begins with R 
> relationship. 
> This would be a simple way to pull out any particular lines of interest 
> in a text file. 
> 
> I can see this being very useful for processing log files, CSV, etc. 
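For anyone following along, the "Route each line to matching Property Name" strategy could be sketched roughly like this. This is plain Python and purely illustrative - the `route_lines` helper and the strategy spellings mirror the proposal above, not any actual NiFi API:

```python
# Illustrative sketch of the proposed RouteText behavior (not NiFi code).
# Each line is routed to the first user-defined property whose rule
# matches it; unmatched lines fall through to an 'unmatched' bucket.

def route_lines(text, properties, match_strategy):
    """Return a dict mapping relationship name -> list of lines."""
    strategies = {
        "Starts With": str.startswith,
        "Ends With": str.endswith,
        "Contains": lambda line, value: value in line,
        "Equals": lambda line, value: line == value,
    }
    match = strategies[match_strategy]
    routed = {}
    for line in text.splitlines():
        target = "unmatched"
        for name, value in properties.items():
            if match(line, value):
                target = name
                break
        routed.setdefault(target, []).append(line)
    return routed

sample = '"R","1","TB"\n"H","USA","BP"\n"R","2","TB"'
result = route_lines(sample, {"Begins with R": '"R'}, "Starts With")
# Lines beginning with "R land in the 'Begins with R' relationship;
# the "H" header line falls through to 'unmatched'.
```

The whole-FlowFile routing strategies would then just be an any/all over the per-line results.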
> 
> 
> 
> 
> ________________________________ 
>> Date: Tue, 8 Sep 2015 17:02:54 -0400 
>> Subject: Re: ExtractText usage 
>> From: bbende@gmail.com 
>> To: users@nifi.apache.org 
>> 
>> Chris, 
>> 
>> Thanks for the detailed explanation. It sounds like you are headed down 
>> the right path and looking at the right processors. 
>> 
>> I attempted to create a flow that routes certain lines of your example 
>> data down a "high priority" route based on the content, and all the 
>> others to a different route where they could be stored somewhere for 
>> later. I put a template for the flow on our Wiki page [1]. 
>> You could think of the UpdateAttribute processors at the end as being 
>> replaced by the things you mentioned, such as processors to store in 
>> Hive or put to Kafka. 
>> 
>> If you haven't used a template before, the user guide has some good 
>> information here [2] and here [3]. 
>> 
>> Let us know if there is anything else we can do to help. 
>> 
>> -Bryan 
>> 
>> [1] 
>> https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates 
>> [2] 
>> https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Import_Template 
>> [3] 
>> https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#instantiating-a-template 
>> 
>> On Tue, Sep 8, 2015 at 4:17 PM, Christopher Wilson 
>> <wilsoncj1@gmail.com> wrote: 
>> 
>> This [toy] example was to learn how this system works, to be honest. 
>> I'm glad I used it, because the multi-line regex caught me off guard. 
>> Developing these is always a painful combination of regexpal, grep, 
>> and other hackery. Details about the regex engine would be nice to 
>> have in the docs, and longer term, if NiFi is going to require 
>> high-fidelity content matches, then pulling in regexpal/regexr would 
>> be a good thing. 
>> 
>> The [real] cases I'm going to be working on once I understand this 
>> better deal with a variety of log data - not all of it created equal - 
>> which is why I started here. The current case is log data where 
>> multiple applications *may* write to the same log file (syslog) with 
>> different payloads (strings, json, etc). I'd rather not build the 
>> routing functions outside of NiFi in code 
>> (rsyslog/Python/Kafka/Spark/etc) and use the security/provenance 
>> mechanisms that are part of this system and pipeline that data out to 
>> other places - whether they be files, Hive, HBase, websockets, etc. 
>> 
>> This should be simple to implement since it's just stdout and not too 
>> dissimilar to how you read from a Kafka topic. In fact, that's 
>> probably the path I'll go down initially - but would like some solution 
>> for those files that don't fit this model. 
>> 
>> What I'd like to do is provide a list of regex's and RouteOnMatch and 
>> perform some action like insert into Hive/HBase, send to Spark/Kafka, 
>> or other processors. Imagine a critical syslog alert that MUST go to a 
>> critical service desk queue as opposed to *.info which might just pipe 
>> into a Hive table for exploratory analysis later. 
>> 
>> I know RouteOnContent has this capability and I will most likely pipe 
>> syslog/Kafka data initially - but as I said, not all log files are 
>> equal and there may be some that just get read in a la carte and 
>> discriminating between line 10 and 100 may be important. I also think 
>> that adding an attribute and then sending it along could be 
>> short-circuited with just a "match and forward" mechanism rather than 
>> copying content into attributes which again goes back into the 
>> hit-or-miss regex machine. 
>> 
>> I don't know about the impact of multiple FlowFiles but is there an 
>> accumulator that will allow me to take N lines and accumulate them into 
>> a single flow file? 
>> 
>> -Chris 
>> 
>> 
>> On Tue, Sep 8, 2015 at 3:00 PM, Bryan Bende 
>> <bbende@gmail.com> wrote: 
>> Chris, 
>> 
>> After you extract the lines you are interested in, what do you want to 
>> do with the data after that? are you delivering to another system? 
>> performing more processing on the data? 
>> 
>> Just want to make sure we fully understand the scenario so we can offer 
>> the best possible solution. 
>> 
>> Thanks, 
>> 
>> Bryan 
>> 
>> On Tue, Sep 8, 2015 at 2:43 PM, Christopher Wilson 
>> <wilsoncj1@gmail.com> wrote: 
>> I've moved the ball a bit closer to the goal - I enabled DOTALL Mode 
>> and increased the Capture Group Length to 4096. That grabs everything 
>> from the first line beginning with "R" to some of the "S"'s. 
>> 
>> Having a bit of trouble terminating the regex though. 
>> 
>> Once I get that sorted I'll post the result, but I have to say that the 
>> capture group length could be problematic "in the wild". In a perfect 
>> world you would know the length up front - but I can see plenty of 
>> cases where that's not going to be the case. 
>> 
>> -Chris 
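The over-capture Chris describes is inherent to DOTALL: once `.` matches newlines, a greedy `.*` runs to the end of the content, and terminating it means knowing what comes next. A small Python illustration (ExtractText itself uses Java regex, and the sample text here is invented, but the flag semantics are the same):

```python
import re

text = '"R","1"\nnoise\n"R","2"\n"S","1"'

# With DOTALL, '.' also matches newlines, so ("R.*) greedily swallows
# everything from the first "R line to the end of the content,
# including the trailing "S" record.
greedy = re.search(r'("R.*)', text, re.DOTALL)

# A lazy quantifier plus an explicit terminator (here, a lookahead for
# the next "S" line) stops the capture, but it only works if you know
# the delimiter up front - the "terminating the regex" problem.
lazy = re.search(r'("R.*?)(?=\n"S")', text, re.DOTALL)
```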
>> 
>> On Tue, Sep 8, 2015 at 2:05 PM, Mark Payne 
>> <markap14@hotmail.com> wrote: 
>> Agreed. Bryan's suggestion will give you the ability to match each line 
>> against the regex, 
>> rather than trying to match the entire file. It would result in a new 
>> FlowFile for each line of 
>> text, though, as he said. But if you need to rebuild a single file, 
>> those could potentially be 
>> merged together using a MergeContent processor, as well. 
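In plain Python, with made-up sample data, the SplitText, per-line match, and MergeContent flow Mark describes amounts to something like:

```python
import re

def split_match_merge(content, pattern):
    """Split into lines, keep the matches, merge back into one body."""
    lines = content.splitlines()        # SplitText, Line Split Count = 1
    matched = [ln for ln in lines       # per-line regex match
               if re.search(pattern, ln)]
    return "\n".join(matched)           # MergeContent rebuilds one file

text = '"H","USA"\n"R","1"\n"R","2"\n"S","end"'
rebuilt = split_match_merge(text, r'^"R')
# rebuilt now holds only the two "R lines, newline-joined.
```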
>> 
>> ________________________________ 
>>> Date: Tue, 8 Sep 2015 13:03:08 -0400 
>>> Subject: Re: ExtractText usage 
>>> From: bbende@gmail.com 
>>> To: users@nifi.apache.org 

>>> 
>>> Chris, 
>>> 
>>> I think the issue is that ExtractText is not reading the file line by 
>>> line, and then applying your pattern to each line. It is applying the 
>>> pattern to the whole content of the file so you would need a regex that 
>>> repeated the pattern you were looking for so that it captured multiple 
>>> times. 
>>> 
>>> When I tested your example, it was actually extracting the first match 
>>> 3 times which I think is because of the following... 
>>> - It always puts the first match in the property base name, in this 
>>> case "regex", 
>>> - then it puts the entire match in index 0, in this case regex.0, and 
>>> in this case it is only matching the first occurrence 
>>> - and then all of the matches would be in order after that, starting 
>>> with index 1; in this case there is only 1 match so it is just regex.1 
>>> 
>>> Another solution that might be simpler is to put a SplitText processor 
>>> between GetFile and ExtractText, and set the Line Split Count to 1. 
>>> This will send 1 line at a time to your ExtractText processor, which 
>>> would then match only the lines starting with 'R'. 
>>> The downside is that all of the lines with 'R' would be in different 
>>> FlowFiles, but this may or may not matter depending what you wanted to 
>>> do with them after. 
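Bryan's point - a single pass over the whole content yields only the first occurrence, while per-line (or repeated) matching yields all of them - can be reproduced with any regex engine. A Python sketch with invented sample data (ExtractText itself uses Java regex):

```python
import re

content = '"H","USA"\n"R","1"\n"R","2"\n"S","end"'
pattern = r'^("R.*)$'

# A single search over the whole content captures only the first
# occurrence - roughly what ExtractText does with this pattern.
first = re.search(pattern, content, re.MULTILINE)

# Repeating the match over the content (findall) captures every
# occurrence, which is what per-line splitting achieves in the flow.
every = re.findall(pattern, content, re.MULTILINE)
```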
>>> 
>>> -Bryan 
>>> 
>>> 
>>> On Tue, Sep 8, 2015 at 12:12 PM, Christopher Wilson 
>>> <wilsoncj1@gmail.com> wrote: 
>>> I'm trying to read a directory of .csv files which have 3 different 
>>> schemas/list types (not my idea). The descriptor is in the first 
>>> column of the csv file. I'm reading the files in using GetFile and 
>>> passing them into ExtractText, but I'm only getting the first 3 (of 8) 
>>> lines matching my first regex. What I want to do is grab all the lines 
>>> beginning with "R" and dump them off to a file (for now). My end goal 
>>> would be to loop through these grab lines, or blocks of lines, by regex 
>>> and route them downstream based on that regex. 
>>> 
>>> Details and first 11 lines of a sample file below. 
>>> 
>>> Thanks in advance. 
>>> 
>>> -Chris 
>>> 
>>> NiFi version: 0.2.1 
>>> OS: Ubuntu 14.01 
>>> JVM: java-1.7.0-openjdk-amd64 
>>> 
>>> ExtractText: 
>>> 
>>> Enable Multiline = True 
>>> Enable Unix Lines Mode = True 
>>> regex = ^("R.*)$ 
>>> 
>>> 
>>> "H","USA","BP","20140502","9","D","BP" 
>>> "R","1","TB","CLM"," "," ","3U"," ","47000","0","47000","0"," ","0"," 
>>> ","0"," ","0"," ","0"," ","0"," ","0"," ","0","25000","25000"," 
>>> ","650","F","D","D","6"," "," "," ","1:20PM ","1:51PM ","0122"," ","Clm 
>>> 25000","Fast","","16","87"," 
>>> ","","","64","117.39","2266","4648","11129","0","0"," 
>>> ","","112089","Good","Cloudy","","","Y" 
>>> "R","2","TB","CLM"," ","B","3U"," ","34000","0","34000","0"," ","0"," 
>>> ","0"," ","0"," ","0"," ","0"," ","0"," ","0","25000","25000"," 
>>> ","600","F","D","D","7"," "," "," ","1:51PM ","2:22PM ","0151"," ","Clm 
>>> 25000N2L","Fast","","16","79"," 
>>> ","","","64","112.36","2444","4803","10003","0","0"," 
>>> ","","261868","Poor","Cloudy","","","Y" 
>>> "R","3","TB","STK","S"," ","3U"," 
>>> ","100000","0","100000","0","A","100000"," ","0"," ","0"," ","0"," 
>>> ","0"," ","0"," ","0","0","0"," ","600","F","D","D","6"," ","Affirmed 
>>> Success S.","AfrmdScsB","2:22PM ","2:53PM ","0222"," 
>>> ","AfrmdScsB100k","Fast","","16","88"," 
>>> ","","","64","110.54","2323","4618","5810","0","0"," 
>>> ","","259015","5","Clear","","","Y" 
>>> "R","4","TB","MCL"," "," ","3U"," ","49200","0","49200","0"," ","0"," 
>>> ","0"," ","0"," ","0"," ","0"," ","0"," ","0","40000","40000"," 
>>> ","850","F","D","D","8"," "," "," ","2:53PM ","3:24PM ","0253"," ","Md 
>>> 40000","Fast","Y","30","72"," 
>>> ","","","64","145.58","2425","4829","11358","13909","0"," 
>>> ","","260343","9","Clear","0","","Y" 
>>> "R","5","TB","ALW"," "," ","3U"," ","77000","0","77000","0"," ","0"," 
>>> ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","0"," 
>>> ","900","F","D","D","7"," "," "," ","3:24PM ","3:55PM ","0325"," ","Alw 
>>> 77000N1X","Fast","Y","30","74"," 
>>> ","","","64","151.69","2330","4643","11156","13832","0"," 
>>> ","","302065","Good","Clear","","","Y" 
>>> "R","6","TB","MSW","S","B","3U"," ","60000","1200","60000","0"," 
>>> ","0"," ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","0"," 
>>> ","800","F","D","D","5"," "," "," ","3:55PM ","4:26PM ","0355"," ","Md 
>>> Sp Wt 58k","Fast","","30","61"," 
>>> ","","","64","140.64","2481","4931","11477","0","0"," 
>>> ","","161404","Good","Clear","","","Y" 
>>> "R","7","TB","CLM"," ","B","3U"," ","40000","0","40000","0"," ","0"," 
>>> ","0"," ","0"," ","0"," ","0"," ","0"," ","0","20000","20000"," 
>>> ","800","F","D","D","6"," "," "," ","4:26PM ","4:57PM ","0427"," ","Clm 
>>> 20000","Fast","","30","68"," 
>>> ","","","64","139.31","2337","4770","11402","0","0"," 
>>> ","","344306","Good","Clear","","","Y" 
>>> "R","8","TB","ALW"," ","B","3U"," ","77000","0","77000","0"," ","0"," 
>>> ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","0"," 
>>> ","850","F","D","D","7"," "," "," ","4:57PM ","5:28PM ","0457"," ","Alw 
>>> 77000N1X","Fast","","30","76"," 
>>> ","","","64","144.76","2416","4847","11365","13836","0"," 
>>> ","","213021","Good","Clear","","","Y" 
>>> "R","9","TB","STR"," "," ","3U"," ","60000","0","60000","0"," ","0"," 
>>> ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","40000"," 
>>> ","700","F","D","D","8"," "," "," ","5:28PM "," ","0528"," ","Alw 
>>> 40000s","Fast","Y","16","81"," 
>>> ","","","64","124.66","2339","4740","11211","0","0"," 
>>> ","","332649","6,8","Clear","0","","Y" 
>>> 
>>> "S","1","000008813341TB","Coolusive","20100124","KY","TB","Colt","Bay","Ice 
>>> Cool Kitty","2003","TB","Elusive Quality","1993","TB","Tomorrows 
>>> Cat","1995","TB","Gone 
>>> West","1984","TB","122","0","L","","28200","Velasquez","Cornelio","H."," 
>>> ","Jacobson","David"," ","Drawing Away Stable and Jacobson, David"," 
>>> "," ","265","N"," 
>>> ","0","N","5","5","3","3","4","0","0","1","1","1","10","200","0","0","100","75","510","320","0","0","0","0","N","25000","4w 
>>> into lane, held","chase 2o turn, bid 4w turning for home,took over, 
>>> held 
>>> sway","7.30","3.80","2.70","Y","000000002103TE","TE","Barbara","Robert"," 
>>> ","000001976480O6","O6","Averill","Bradley","E."," 
>>> ","N","0","N","","0","","87","Lansdon B. Robbins & Kevin 
>>> Callahan","000000257611TE","000000002695JE" 
>>> 
>>> 