nifi-users mailing list archives

From Juan Jose Escobar <juanjose.esco...@gmail.com>
Subject Re: Suggestion on how to parse field out of filename
Date Wed, 04 Nov 2015 11:52:59 GMT
Hello, Mark,

I think it should be possible to do this with UpdateAttribute in advanced
mode: define a condition for each of the different formats and, once the
particular format is identified, copy the appropriate substring into a
new attribute, or into the filename attribute if you want to normalize
naming. If I remember correctly, the NiFi Expression Language in 0.3.0 has
no support for extracting regex capture groups.
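
To make the per-format idea concrete, here is a rough sketch of the logic in
Python rather than NiFi Expression Language. It only illustrates the three
conditions and is not an UpdateAttribute configuration. The function name, the
regex patterns, and the choice to interpret the Unix time as UTC are my
assumptions, based on the three example filenames in your message.

```python
import re
from datetime import datetime, timezone

OUT_FORMAT = "%Y%m%d%H%M%S"  # target layout: yyyyMMddHHmmss


def _already_normalized(match):
    # Priority_002_20151104123456_00.csv: the field is already yyyyMMddHHmmss.
    return match.group(1)


def _from_epoch_millis(match):
    # ABC_02_1447586912344.csv: Unix time in milliseconds.
    # Assumption: the value is interpreted as UTC.
    seconds = int(match.group(1)) // 1000
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(OUT_FORMAT)


def _from_date_and_hhmm(match):
    # XYZ_20151104_1234.csv: yyyyMMdd_HHmm, so the seconds default to 00.
    parsed = datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H%M")
    return parsed.strftime(OUT_FORMAT)


# One (condition, action) pair per filename format, tried in order.
RULES = [
    (re.compile(r"_(\d{14})_\d+\.csv$"), _already_normalized),
    (re.compile(r"_(\d{13})\.csv$"), _from_epoch_millis),
    (re.compile(r"_(\d{8})_(\d{4})\.csv$"), _from_date_and_hhmm),
]


def normalize_timestamp(filename):
    """Return the timestamp as yyyyMMddHHmmss, or None if no rule matches."""
    for pattern, convert in RULES:
        match = pattern.search(filename)
        if match:
            return convert(match)
    return None
```

Each entry in RULES plays the role of one rule in advanced mode: the pattern
is the condition and the converter is the action that fills the new attribute.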

Hope this helps

J

On Wed, Nov 4, 2015 at 7:04 AM, Mark Petronic <markpetronic@gmail.com>
wrote:

> Looking for some help on best way to extract a field from a filename. I
> need to parse out the date from the core filename attribute set by the
> UnpackContent processor. I am unzipping files that contain many CSV files
> and these CSV file names vary in format but each has a timestamp included
> in the filename. Example formats are:
>
> Priority_002_20151104123456_00.csv  (20151104123456 is yyyyMMddHHmmss)
> ABC_02_1447586912344.csv (1447586912344 is Unix time in ms)
> XYZ_20151104_1234.csv (20151104_1234 is yyyyMMdd_HHmm)
>
> So, there are various forms to deal with. I need to normalize these into
> yyyyMMddHHmmss. A regex with capture groups would be perfect but I cannot
> quite figure out how to do it. ExtractText does regex with capture groups
> but only against flowfile contents and these are attributes.
> UpdateAttribute only supports the expression language, which does not have
> regex-based extraction of capture groups.
>
> In Python, I would just do something like:
>
> date, time = re.search(r"XYZ_(\d+)_(\d+)\.csv",
> "XYZ_20151104_1234.csv").groups()
>
> Then I could use the expression language format or toDate functions to
> normalize the dates.
>
> I know I could use a utility script with ExecuteStreamCommand, calling it
> with the filepath and getting back the tokens, but I was looking for an
> internal way to do it without forking a process. There are a lot of
> archives in each zip, and forking would add latency under heavy load.
>
> Any thoughts?
>
> Thanks!
>
>
