nifi-users mailing list archives

From Joe Witt <>
Subject Re: Nifi partition data by date
Date Thu, 03 Nov 2016 01:10:28 GMT
I agree with James.  The general pattern here is:

Split with Grouping:
  Take a look at RouteText.  It lets you efficiently split
line-oriented data into groups based on matching values, unlike
SplitText, which splits line by line.

Merge Grouped Data:
  The MergeContent processor will do the trick.  Use its correlation
attribute feature to merge only the flowfiles that come from the same
group/pattern.

Write to destination:
  You can write directly to HDFS using PutHDFS or you can prepare the
data and write to Hive.
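
To make the group/merge/write pattern concrete, here is a minimal
sketch in plain Python.  It is not NiFi configuration; it only mimics
what the flow does conceptually.  The function name, the regex, and
the assumption that each line starts with a yyyy-MM-dd date are
hypothetical.  The base directory and the dt= partition naming are
taken from the examples elsewhere in this thread.

```python
import re
from collections import defaultdict

# Assumption: every record line begins with an ISO date such as 2016-01-01.
DATE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})")


def partition_by_date(lines, base_dir="/tmp/hive/records"):
    """Group lines by leading date and return {target_directory: merged_text}."""
    groups = defaultdict(list)
    for line in lines:
        match = DATE_RE.match(line)
        # Lines without a recognizable date go to a separate group,
        # similar to an "unmatched" relationship in a routing step.
        key = match.group(1) if match else None
        groups[key].append(line)

    merged = {}
    for key, members in groups.items():
        # Hive-style partition directory for dated groups.
        directory = f"{base_dir}/dt={key}" if key else f"{base_dir}/unmatched"
        # Merge step: concatenate all lines that share the same key.
        merged[directory] = "\n".join(members) + "\n"
    return merged
```

In the real flow, grouping, merging, and writing are separate
processors, and the merge step also has to decide when a group is
"full" enough to be written out.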


On Wed, Nov 2, 2016 at 9:01 PM, James Wing <> wrote:
> This is absolutely possible.  A sample sequence of processors might include:
> 1. UpdateAttribute - to extract a record date from the flowfile content into
> an attribute, 'recordgroup' for example
> 2. MergeContent - to group related records together, setting the Correlation
> Attribute Name property to use 'recordgroup'
> 3. UpdateAttribute - (optional) to apply the 'recordgroup' attribute to the
> 'path' and/or 'filename' attributes, depending on how you do #4.  May be
> useful to get customized filenames with extensions.
> 4. Put* - to write the grouped file to storage (PutFile, PutHDFS,
> PutS3Object, etc.).  With PutHDFS for example, use Expression Language in
> the Directory property to apply your grouping - like
> '/tmp/hive/records/${recordgroup}' to get '/tmp/hive/records/2016-01-01'.
> In concept, it's that simple.  The #2 MergeContent step can be more
> complicated as you consider how many files should be output from the stream,
> how big they should be, how frequently, and how many bins are likely to be
> open collecting files at any one time.  You might also consider compressing
> the files.
> Thanks,
> James
> On Wed, Nov 2, 2016 at 5:34 PM, Santiago Ciciliani
> <> wrote:
>> I'm trying to split a stream of data into multiple different files based
>> on the content date.
>> So imagine that you are receiving streams of logs and you want to save as
>> a Hive partitioned table so for example all records with date 2016-01-01
>> into directory dt=2016-01-01.
>> Is this even possible?
>> Thanks
