nifi-users mailing list archives

From Bryan Bende <bbe...@gmail.com>
Subject Re: NIFI Usage for Data Transformation
Date Thu, 01 Nov 2018 17:39:58 GMT
How big are the initial CSV files?

If they are large, like millions of lines, or even hundreds of
thousands, then it is best to avoid the line-by-line split and
instead process the lines in place.

This is one of the benefits of the record processors. For example,
with UpdateRecord you can read in a large CSV line by line, apply an
update to each line, and write it back out. So you only ever have one
flow file.
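
To make that concrete, here is a rough sketch of what an UpdateRecord
configuration could look like. The /account_id field and the RecordPath
function shown are hypothetical examples, not something from this thread;
with a CSVReader and CSVRecordSetWriter configured, each dynamic property
names a field to update in place:

```
Record Reader:               CSVReader
Record Writer:               CSVRecordSetWriter
Replacement Value Strategy:  Record Path Value

# dynamic property: rewrite the (hypothetical) /account_id field in place
/account_id  =>  substringAfter(/account_id, ':')
```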

It sounds like you may have a significant amount of custom logic, so
you may need a custom processor, but you can still take this approach
of reading a single flow file line by line and writing out the results
line by line (try to avoid reading the entire content into memory at
one time).
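
The same streaming shape also works outside of NiFi. Stripped of the NiFi
API, the body of such a callback is just plain java.io; in this sketch the
transformLine method is a hypothetical stand-in for your framework's
per-line logic:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class LineTransformer {

    // Hypothetical stand-in for your framework's per-line transformation.
    static String transformLine(String line) {
        return line.toUpperCase();
    }

    // Streams input to output one line at a time, so memory use stays
    // constant no matter how large the file is.
    public static void transform(InputStream in, OutputStream out) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
        BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(out, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(transformLine(line));
            writer.write('\n');
        }
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "a,b\nc,d\n".getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transform(in, out);
        System.out.print(out.toString("UTF-8"));
    }
}
```

Because only one line is buffered at a time, heap usage stays flat even
for multi-million-line files.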


On Thu, Nov 1, 2018 at 1:22 PM Ameer Mawia <ameer.mawia@gmail.com> wrote:
>
> Thanks for the input folks.
>
> I had this impression that for actual processing of the data :
>
> we may have to put in place a custom processor which will have the transformation framework logic in it.
> Or we can use the ExecuteProcess processor to trigger an external process (which will be this transformation logic) and route the output back into NiFi.
>
> Our flow inside the framework generally looks like this:
>
> Split the CSV file line by line.
> For each line, split it into an array of strings.
> For each record in the array, determine and invoke its transformation method.
> The transformation method contains the transformation logic. This logic can be pretty intensive, like:
>
> searching for hundreds of different patterns.
> lookups against hundreds of configured string constants.
> Appending/Prepending/Trimming/Padding...
>
> Finally, map each record into an output CSV format.
>
> So far we have been trying to see if SplitRecord, UpdateRecord, ExtractText, etc. can come in handy.
>
> Thanks,
>
> On Thu, Nov 1, 2018 at 12:39 PM Mike Thomsen <mikerthomsen@gmail.com> wrote:
>>
>> Ameer,
>>
>> Depending on how you implemented the custom framework, you may be able to easily drop it in place into a custom NiFi processor. Without knowing much about your implementation details, if you can act on Java streams, Strings, byte arrays, and things like that, it will probably be very straightforward to drop in place.
>>
>> This is a really simple example of how you could bring it in, depending on how encapsulated your business logic is:
>>
>> @Override
>> public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
>>     FlowFile input = session.get();
>>     if (input == null) {
>>         return;
>>     }
>>
>>     FlowFile output = session.create(input);
>>     try {
>>         try (InputStream is = session.read(input);
>>              OutputStream os = session.write(output)) {
>>             // Stream the content through the transformer; try-with-resources
>>             // closes both streams before the flow files are transferred.
>>             transformerPojo.transform(is, os);
>>         }
>>
>>         session.transfer(input, REL_ORIGINAL); // If you created an "original" relationship
>>         session.transfer(output, REL_SUCCESS);
>>     } catch (Exception ex) {
>>         session.remove(output);
>>         session.transfer(input, REL_FAILURE);
>>     }
>> }
>>
>> That's the general idea, and that approach can scale to your disk space limits. Hope that helps put it into perspective.
>>
>> Mike
>>
>> On Thu, Nov 1, 2018 at 10:16 AM Nathan Gough <thenatog@gmail.com> wrote:
>>>
>>> Hi Ameer,
>>>
>>> This blog by Mark Payne describes how to manipulate record-based data like CSV using schemas: https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi. This would probably be the most efficient method. And another here: https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries.
>>>
>>> An alternative option would be to port your custom Java code into your own NiFi processor:
>>> https://medium.com/hashmapinc/creating-custom-processors-and-controllers-in-apache-nifi-e14148740ea (under 'Steps for Creating a Custom Apache NiFi Processor')
>>> https://nifi.apache.org/developer-guide.html
>>> https://nifi.apache.org/developer-guide.html
>>>
>>> Nathan
>>>
>>>     On 10/31/18, 5:02 PM, "Ameer Mawia" <ameer.mawia@gmail.com> wrote:
>>>
>>>     We have a use case where we take data from a source (text data in CSV
>>>     format), do transformation and manipulation of textual records, and output
>>>     the data in another (CSV) format. This is being done by a Java-based custom
>>>     framework, written specifically for this *transformation* piece.
>>>
>>>     Recently, as Apache NIFI is being adopted at the enterprise level by the
>>>     organisation, we have been asked to try *Apache NIFI* and see if we can
>>>     use it as a replacement for this custom tool.
>>>
>>>     *My question is*:
>>>
>>>        - How much leverage does *Apache NIFI* provide over flowfile *content*
>>>        manipulation?
>>>
>>>     I understand *NIFI* is good for creating data flow pipelines, but is it
>>>     good for *extensive TEXT transformation* as well? So far I have not found
>>>     an obvious way to achieve that.
>>>
>>>     Appreciate the feedback.
>>>
>>>     Thanks,
>>>
>>>     --
>>>     http://ca.linkedin.com/in/ameermawia
>>>     Toronto, ON
>>>
>>>
>>>
>
>
> --
> http://ca.linkedin.com/in/ameermawia
> Toronto, ON
>
