nifi-users mailing list archives

From Mark Payne <marka...@hotmail.com>
Subject RE: ExtractText usage
Date Wed, 09 Sep 2015 14:52:28 GMT
Chris,

I would venture to guess that the duplicate lines are actually the result of the ReplaceText
processor that you are using. Regexes can be very powerful, but they can certainly result in
a headache, especially when using things like .*

This is because .* can match 0 characters: it will match the text that you want, and then
match again on the 0 characters following it, resulting in duplicate text.
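The zero-width match is easy to demonstrate. A minimal sketch in Python's re module; NiFi's ReplaceText uses Java's java.util.regex, which behaves the same way for this case:

```python
import re

# ".*" matches the whole string, then matches again as a zero-length
# string at the end, so the replacement is emitted twice.
result = re.sub(r".*", "REPLACED", "some,matched,line")
print(result)  # "REPLACEDREPLACED"

# Requiring at least one character (and anchoring) avoids the
# second, zero-length match:
anchored = re.sub(r"^.+$", "REPLACED", "some,matched,line")
print(anchored)  # "REPLACED"
```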

You should be able to easily prove or disprove my theory by looking at the Data Provenance
to see what
was sent to the PutFile processor.

Thanks
-Mark

________________________________
> Date: Wed, 9 Sep 2015 08:55:44 -0400 
> Subject: Re: ExtractText usage 
> From: wilsoncj1@gmail.com 
> To: users@nifi.apache.org 
> 
> Bryan, thank you for the template, I'll look through that today and see 
> if that will do the trick. The multi-line regex for capturing all 
> lines beginning with "R and ending with the next line which begins with 
> "S" is below. 
> 
> # multi-line 
> # turn off greedy matches 
> # lookahead for lines beginning with "S" 
> 
> (?m)(\"R\"\,.*?)(?=^\"S\"\,) 
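For anyone testing this outside NiFi: the pattern also needs DOTALL for .*? to cross line boundaries (Chris mentions enabling DOTALL Mode in ExtractText elsewhere in the thread). A minimal sketch in Python's re module, whose flag semantics match Java's here; the field values below are made up:

```python
import re

# Toy records: lines beginning with "R, then a line beginning with "S.
text = '"R","1","TB"\n"R","2","TB"\n"S","1","000008813341TB"\n'

# (?m) lets ^ match at each line start; (?s) is DOTALL, letting . cross
# newlines. The lazy .*? stops at the lookahead, so the "S" line itself
# is not captured.
m = re.search(r'(?ms)("R",.*?)(?=^"S",)', text)
print(m.group(1))  # both "R" lines, including trailing newline
```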
> 
> There's some weirdness with PutFile in that when I replace the text 
> with 8-10 lines (depending on input) I get duplicate lines in the 
> output file (from 8 lines to 3800 lines)? Very strange. 
> 
> Mark, that's what I was thinking when I first pulled down ExtractText. 
> I think the processor you describe would be incredibly useful for these 
> kinds of cases and maybe we can just extend Aldrin's JIRA 
> (https://issues.apache.org/jira/browse/NIFI-921) to fully flesh that 
> out? I'm happy to help with the JIRA or testing as needed. 
> 
> My two cents with NiFi thus far is I would have expected things like 
> text parsing and simple file IO to be there. Like the issue with 
> PutFile, my expectation with that processor would be that it would "put a 
> new file" and clobber the contents with the new attribute passed in or 
> append if specified. I could do a lot with access to stdin/stdout and 
> it seems like I have to extend that functionality with additional 
> scripting. 
> 
> Again, I'm just getting up to speed and thank you all for the help. 
> 
> -Chris 
> 
> On Tue, Sep 8, 2015 at 11:00 PM, Aldrin Piri 
> <aldrinpiri@gmail.com> wrote: 
> Mark, 
> 
> The need is certainly there. This core functionality is quite similar 
> to what NIFI-921 (sorry for no link, doing mobile) is after. The issue 
> that both of these are addressing is that there isn't a simple way to 
> promote "simple" data to an attribute. There are some important 
> distinctions between this functionality and those in the ticket but the 
> core need is the same. They may not be smashed into one processor, but if 
> this generates a ticket, please link the two so both are considered in the process. 
> On Tue, Sep 8, 2015 at 20:25 Mark Payne 
> <markap14@hotmail.com> wrote: 
> I'm wondering if we should perhaps look into building a RouteText processor. 
> 
> The way that I could see this working is to have a few different properties: 
> 
> Routing Strategy: 
> - Route matching lines to 'matched' if all match 
> - Route matching lines to 'matched' if any match 
> - Route each line to matching Property Name 
> - Route FlowFile to 'matched' if all lines match 
> - Route FlowFile to 'matched' if any line matches 
> 
> A Match Strategy: 
> - Starts With 
> - Ends With 
> - Contains 
> - Matches Regular Expression 
> - Equals 
> 
> And then user-defined properties to search for. 
> 
> So to find lines that begin with "R 
> 
> You would simply add a property named "Begins with R" and set the value 
> to: "R 
> Then set the Match Strategy to Starts With 
> And Routing Strategy to "Route each line to matching Property Name" 
> 
> Then, any line that begins with "R will be routed to the Begins with R 
> relationship. 
> This would be a simple way to pull out any particular lines of interest 
> in a text file. 
> 
> I can see this being very useful for processing log files, CSV, etc. 
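A rough sketch of the proposed routing behavior (hypothetical function and property names; this is not an actual RouteText implementation):

```python
import re

def route_lines(text, properties, match_strategy):
    """Route each line to the name of the first user-defined property
    it matches ('Route each line to matching Property Name').
    Unmatched lines go to 'unmatched'. All names are hypothetical."""
    strategies = {
        "Starts With": lambda line, val: line.startswith(val),
        "Ends With":   lambda line, val: line.endswith(val),
        "Contains":    lambda line, val: val in line,
        "Equals":      lambda line, val: line == val,
        "Matches Regular Expression":
                       lambda line, val: re.fullmatch(val, line) is not None,
    }
    match = strategies[match_strategy]
    routed = {}
    for line in text.splitlines():
        dest = next((name for name, val in properties.items()
                     if match(line, val)), "unmatched")
        routed.setdefault(dest, []).append(line)
    return routed

# Find lines that begin with "R (the leading quote is part of the data):
routed = route_lines('"R","1"\n"S","1"\n"R","2"\n',
                     {'Begins with R': '"R'}, "Starts With")
print(routed["Begins with R"])  # the two "R" lines
```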
> 
> 
> 
> 
> ________________________________ 
>> Date: Tue, 8 Sep 2015 17:02:54 -0400 
>> Subject: Re: ExtractText usage 
>> From: bbende@gmail.com 
>> To: users@nifi.apache.org 
>> 
>> Chris, 
>> 
>> Thanks for the detailed explanation. It sounds like you are headed down 
>> the right path and looking at the right processors. 
>> 
>> I attempted to create a flow that routes certain lines of your example 
>> data down a "high priority" route based on the content, and all the 
>> others to a different route where they could be stored somewhere for 
>> later. I put a template for the flow on our Wiki page [1]. 
>> You could think of the UpdateAttribute processors at the end as being 
>> replaced by the things you mentioned, such as processors to store in 
>> Hive or put to Kafka. 
>> 
>> If you haven't used a template before, the user guide has some good 
>> information here [2] and here [3]. 
>> 
>> Let us know if there is anything else we can do to help. 
>> 
>> -Bryan 
>> 
>> [1] 
>> https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates 
>> [2] 
>> https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Import_Template 
>> [3] 
>> https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#instantiating-a-template 
>> 
>> On Tue, Sep 8, 2015 at 4:17 PM, Christopher Wilson 
>> 
>> <wilsoncj1@gmail.com> wrote: 
>> 
>> This [toy] example was to learn how this system works, to be honest. 
>> I'm glad I used it because the multi-line regex caught me off guard. 
>> Developing these is always a combination of regexpal, 
>> grep, and some other hacky-ness that's painful. If I knew more about 
>> the regex engine going in that'd be nice to have in the docs but long 
>> term if NiFi is going to require high-fidelity content matches then 
>> pulling in regexpal/regexr would be a good thing. 
>> 
>> The [real] case(s) I'm going to be working on once I understand this 
>> better deals with a variety of log data - not all of it created equal - 
>> which is why I started here. The current case is log data where 
>> multiple applications *may* write to the same log file (syslog) with 
>> different payloads (strings, json, etc). I'd rather not build the 
>> routing functions outside of NiFi in code 
>> (rsyslog/Python/Kafka/Spark/etc) and use the security/provenance 
>> mechanisms that are part of this system and pipeline that data out to 
>> other places - whether they be files, Hive, HBase, websockets, etc. 
>> 
>> This should be simple to implement since it's just stdout and not too 
>> dissimilar to how you read from a Kafka topic. In fact, that's 
>> probably the path I'll go down initially - but would like some solution 
>> for those files that don't fit this model. 
>> 
>> What I'd like to do is provide a list of regex's and RouteOnMatch and 
>> perform some action like insert into Hive/HBase, send to Spark/Kafka, 
>> or other processors. Imagine a critical syslog alert that MUST go to a 
>> critical service desk queue as opposed to *.info which might just pipe 
>> into a Hive table for exploratory analysis later. 
>> 
>> I know RouteOnContent has this capability and I will most likely pipe 
>> syslog/Kafka data initially - but as I said, not all log files are 
>> equal and there may be some that just get read in a la carte and 
>> discriminating between line 10 and 100 may be important. I also think 
>> that adding an attribute and then sending it along could be 
>> short-circuited with just a "match and forward" mechanism rather than 
>> copying content into attributes which again goes back into the 
>> hit-or-miss regex machine. 
>> 
>> I don't know about the impact of multiple FlowFiles but is there an 
>> accumulator that will allow me to take N lines and accumulate them into 
>> a single flow file? 
>> 
>> -Chris 
>> 
>> 
>> On Tue, Sep 8, 2015 at 3:00 PM, Bryan Bende 
>> 
>> <bbende@gmail.com> wrote: 
>> Chris, 
>> 
>> After you extract the lines you are interested in, what do you want to 
>> do with the data after that? are you delivering to another system? 
>> performing more processing on the data? 
>> 
>> Just want to make sure we fully understand the scenario so we can offer 
>> the best possible solution. 
>> 
>> Thanks, 
>> 
>> Bryan 
>> 
>> On Tue, Sep 8, 2015 at 2:43 PM, Christopher Wilson 
>> 
>> <wilsoncj1@gmail.com> wrote: 
>> I've moved the ball a bit closer to the goal - I enabled DOTALL Mode 
>> and increased the Capture Group Length to 4096. That grabs everything 
>> from the first line beginning with "R" to some of the "S"'s. 
>> 
>> Having a bit of trouble terminating the regex though. 
>> 
>> Once I get that sorted I'll post the result, but I have to say that the 
>> capture group length could be problematic "in the wild". In a perfect 
>> world you would know the length up front - but I can see plenty of 
>> cases where that's not going to be the case. 
>> 
>> -Chris 
>> 
>> On Tue, Sep 8, 2015 at 2:05 PM, Mark Payne 
>> 
>> <markap14@hotmail.com> wrote: 
>> Agreed. Bryan's suggestion will give you the ability to match each line 
>> against the regex, 
>> rather than trying to match the entire file. It would result in a new 
>> FlowFile for each line of 
>> text, though, as he said. But if you need to rebuild a single file, 
>> those could potentially be 
>> merged together using a MergeContent processor, as well. 
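Conceptually, the SplitText → match → MergeContent pipeline amounts to something like this sketch (plain Python standing in for the processors; sample data abbreviated):

```python
import re

text = '"H","USA"\n"R","1"\n"S","1"\n"R","2"\n'

# SplitText (Line Split Count = 1): one "FlowFile" per line.
flowfiles = text.splitlines()

# ExtractText / RouteOnContent: keep only lines matching the pattern.
matched = [ff for ff in flowfiles if re.match(r'^"R', ff)]

# MergeContent: rebuild a single file from the matched FlowFiles.
merged = "\n".join(matched) + "\n"
print(merged)  # only the "R" lines
```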
>> 
>> ________________________________ 
>>> Date: Tue, 8 Sep 2015 13:03:08 -0400 
>>> Subject: Re: ExtractText usage 
>>> From: bbende@gmail.com 
>>> To: users@nifi.apache.org 
>>> 
>>> Chris, 
>>> 
>>> I think the issue is that ExtractText is not reading the file line by 
>>> line, and then applying your pattern to each line. It is applying the 
>>> pattern to the whole content of the file so you would need a regex that 
>>> repeated the pattern you were looking for so that it captured multiple 
>>> times. 
>>> 
>>> When I tested your example, it was actually extracting the first match 
>>> 3 times which I think is because of the following... 
>>> - It always puts the first match in the property base name, in this 
>>> case "regex", 
>>> - then it puts the entire match in index 0, in this case regex.0, and 
>>> in this case it is only matching the first occurrence 
>>> - and then all of the matches would be in order after that starting with 
>>> index 1, which in this case there is only 1 match so it is just regex.1 
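A sketch of the attribute layout Bryan describes, with Python's re standing in for ExtractText's Java regex engine (attribute names per his description):

```python
import re

content = '"H","x"\n"R","1"\n"R","2"\n'

# ExtractText applies the pattern to the whole content, not per line,
# so without a repeating pattern only the first occurrence is found.
m = re.search(r'(?m)^("R.*)$', content)

# Attribute layout as described: base name, then .0 for the entire
# match, then .1..N for the capture groups of that first occurrence.
attributes = {
    "regex":   m.group(1),
    "regex.0": m.group(0),
    "regex.1": m.group(1),
}
print(attributes)  # all three hold the first "R" line only
```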
>>> 
>>> Another solution that might be simpler is to put a SplitText processor 
>>> between GetFile and ExtractText, and set the Line Split Count to 1. 
>>> This will send 1 line at a time to your ExtractTextProcessor which 
>>> would then match only the lines starting with 'R'. 
>>> The downside is that all of the lines with 'R' would be in different 
>>> FlowFiles, but this may or may not matter depending what you wanted to 
>>> do with them after. 
>>> 
>>> -Bryan 
>>> 
>>> 
>>> On Tue, Sep 8, 2015 at 12:12 PM, Christopher Wilson 
>>> <wilsoncj1@gmail.com> wrote: 
>>> I'm trying to read a directory of .csv files which have 3 different 
>>> schemas/list types (not my idea). The descriptor is in the first 
>>> column of the csv file. I'm reading the files in using GetFile and 
>>> passing them into ExtractText, but I'm only getting the first 3 (of 8) 
>>> lines matching my first regex. What I want to do is grab all the lines 
>>> beginning with "R" and dump them off to a file (for now). My end goal 
>>> would be to loop through these grab lines, or blocks of lines, by regex 
>>> and route them downstream based on that regex. 
>>> 
>>> Details and first 11 lines of a sample file below. 
>>> 
>>> Thanks in advance. 
>>> 
>>> -Chris 
>>> 
>>> NiFi version: 0.2.1 
>>> OS: Ubuntu 14.01 
>>> JVM: java-1.7.0-openjdk-amd64 
>>> 
>>> ExtractText: 
>>> 
>>> Enable Multiline = True 
>>> Enable Unix Lines Mode = True 
>>> regex = ^("R.*)$ 
>>> 
>>> 
>>> "H","USA","BP","20140502","9","D","BP" 
>>> "R","1","TB","CLM"," "," ","3U"," ","47000","0","47000","0"," ","0"," 
>>> ","0"," ","0"," ","0"," ","0"," ","0"," ","0","25000","25000"," 
>>> ","650","F","D","D","6"," "," "," ","1:20PM ","1:51PM ","0122"," ","Clm 
>>> 25000","Fast","","16","87"," 
>>> ","","","64","117.39","2266","4648","11129","0","0"," 
>>> ","","112089","Good","Cloudy","","","Y" 
>>> "R","2","TB","CLM"," ","B","3U"," ","34000","0","34000","0"," ","0"," 
>>> ","0"," ","0"," ","0"," ","0"," ","0"," ","0","25000","25000"," 
>>> ","600","F","D","D","7"," "," "," ","1:51PM ","2:22PM ","0151"," ","Clm 
>>> 25000N2L","Fast","","16","79"," 
>>> ","","","64","112.36","2444","4803","10003","0","0"," 
>>> ","","261868","Poor","Cloudy","","","Y" 
>>> "R","3","TB","STK","S"," ","3U"," 
>>> ","100000","0","100000","0","A","100000"," ","0"," ","0"," ","0"," 
>>> ","0"," ","0"," ","0","0","0"," ","600","F","D","D","6"," ","Affirmed 
>>> Success S.","AfrmdScsB","2:22PM ","2:53PM ","0222"," 
>>> ","AfrmdScsB100k","Fast","","16","88"," 
>>> ","","","64","110.54","2323","4618","5810","0","0"," 
>>> ","","259015","5","Clear","","","Y" 
>>> "R","4","TB","MCL"," "," ","3U"," ","49200","0","49200","0"," ","0"," 
>>> ","0"," ","0"," ","0"," ","0"," ","0"," ","0","40000","40000"," 
>>> ","850","F","D","D","8"," "," "," ","2:53PM ","3:24PM ","0253"," ","Md 
>>> 40000","Fast","Y","30","72"," 
>>> ","","","64","145.58","2425","4829","11358","13909","0"," 
>>> ","","260343","9","Clear","0","","Y" 
>>> "R","5","TB","ALW"," "," ","3U"," ","77000","0","77000","0"," ","0"," 
>>> ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","0"," 
>>> ","900","F","D","D","7"," "," "," ","3:24PM ","3:55PM ","0325"," ","Alw 
>>> 77000N1X","Fast","Y","30","74"," 
>>> ","","","64","151.69","2330","4643","11156","13832","0"," 
>>> ","","302065","Good","Clear","","","Y" 
>>> "R","6","TB","MSW","S","B","3U"," ","60000","1200","60000","0"," 
>>> ","0"," ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","0"," 
>>> ","800","F","D","D","5"," "," "," ","3:55PM ","4:26PM ","0355"," ","Md 
>>> Sp Wt 58k","Fast","","30","61"," 
>>> ","","","64","140.64","2481","4931","11477","0","0"," 
>>> ","","161404","Good","Clear","","","Y" 
>>> "R","7","TB","CLM"," ","B","3U"," ","40000","0","40000","0"," ","0"," 
>>> ","0"," ","0"," ","0"," ","0"," ","0"," ","0","20000","20000"," 
>>> ","800","F","D","D","6"," "," "," ","4:26PM ","4:57PM ","0427"," ","Clm 
>>> 20000","Fast","","30","68"," 
>>> ","","","64","139.31","2337","4770","11402","0","0"," 
>>> ","","344306","Good","Clear","","","Y" 
>>> "R","8","TB","ALW"," ","B","3U"," ","77000","0","77000","0"," ","0"," 
>>> ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","0"," 
>>> ","850","F","D","D","7"," "," "," ","4:57PM ","5:28PM ","0457"," ","Alw 
>>> 77000N1X","Fast","","30","76"," 
>>> ","","","64","144.76","2416","4847","11365","13836","0"," 
>>> ","","213021","Good","Clear","","","Y" 
>>> "R","9","TB","STR"," "," ","3U"," ","60000","0","60000","0"," ","0"," 
>>> ","0"," ","0"," ","0"," ","0"," ","0"," ","0","0","40000"," 
>>> ","700","F","D","D","8"," "," "," ","5:28PM "," ","0528"," ","Alw 
>>> 40000s","Fast","Y","16","81"," 
>>> ","","","64","124.66","2339","4740","11211","0","0"," 
>>> ","","332649","6,8","Clear","0","","Y" 
>>> "S","1","000008813341TB","Coolusive","20100124","KY","TB","Colt","Bay","Ice 
>>> Cool Kitty","2003","TB","Elusive Quality","1993","TB","Tomorrows 
>>> Cat","1995","TB","Gone 
>>> West","1984","TB","122","0","L","","28200","Velasquez","Cornelio","H."," 
>>> ","Jacobson","David"," ","Drawing Away Stable and Jacobson, David"," 
>>> "," ","265","N"," 
>>> ","0","N","5","5","3","3","4","0","0","1","1","1","10","200","0","0","100","75","510","320","0","0","0","0","N","25000","4w 
>>> into lane, held","chase 2o turn, bid 4w turning for home,took over, 
>>> held 
>>> sway","7.30","3.80","2.70","Y","000000002103TE","TE","Barbara","Robert"," 
>>> ","000001976480O6","O6","Averill","Bradley","E."," 
>>> ","N","0","N","","0","","87","Lansdon B. Robbins & Kevin 
>>> Callahan","000000257611TE","000000002695JE" 
>>> 