nifi-users mailing list archives

From Mark Payne <marka...@hotmail.com>
Subject Re: ConvertCSVToAvro taking a lot of time when passing single record as an input.
Date Mon, 02 Apr 2018 13:00:16 GMT
Mohit,

I agree that 45-50 records per second is quite slow. I'm not very familiar with
the implementation of ConvertCSVToAvro, but it may well be that it must perform
some sort of initialization for each FlowFile that it receives. That would
explain why it's fast for a single incoming FlowFile and slow for a large
number of them.

Additionally, when you start splitting the data like that, you're generating a
lot more FlowFiles, which means a lot more updates to both the FlowFile
Repository and the Provenance Repository. As a result, you're basically taxing
the NiFi framework far more than if you keep the data as a single FlowFile. On
my laptop, though, I would expect more than 45-50 FlowFiles per second through
most processors, but I don't know what kind of hardware you are running on.
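
To put rough numbers on that, here is a back-of-the-envelope sketch using only
the figures from the original message (64000 records, chunks of 10000, 45-50
records per second). It assumes every CSV line becomes one FlowFile and ignores
any header line, so treat the results as estimates rather than measurements.

```python
import math

# Figures taken from the original message.
records = 64_000               # rows in the 10 MB file
chunk_size = 10_000            # first SplitText setting
rate_low, rate_high = 45, 50   # observed records per second

# FlowFiles created by the two SplitText stages (assumes one FlowFile per line).
chunk_flowfiles = math.ceil(records / chunk_size)
record_flowfiles = records
total_flowfiles = chunk_flowfiles + record_flowfiles

# Time to push every single-record FlowFile through ConvertCSVToAvro.
slowest_minutes = records / rate_low / 60
fastest_minutes = records / rate_high / 60

print(f"FlowFiles created by splitting: {total_flowfiles}")
print(f"Estimated conversion time: {fastest_minutes:.1f} to {slowest_minutes:.1f} minutes")
```

Under those assumptions the split flow creates about 64007 FlowFiles instead of
one, and the conversion step alone takes roughly 21 to 24 minutes.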

In general, though, it is best to keep data together instead of splitting it
apart. Since ConvertCSVToAvro can handle many CSV records, is there a reason to
split the data to begin with? Also, I would recommend you look at using the
Record-based processors [1][2], such as ConvertRecord, instead of the
ConvertABCtoXYZ processors. Those are older processors and often don't work as
well. The Record-oriented processors often allow you to keep data together as a
single FlowFile throughout your entire flow, which makes the performance far
better and makes the flow much easier to design.
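
As an illustration of the "keep the data together" idea, here is a minimal
Python sketch that separates good and bad rows in a single pass over one input,
instead of creating one unit of work per record. This is not NiFi code and not
how any NiFi processor is implemented. The function name partition_rows and the
column-count check are hypothetical stand-ins for whatever validation the real
flow needs.

```python
import csv
import io


def partition_rows(csv_text, expected_columns):
    """Split CSV rows into (good, bad) lists in one pass over the input.

    A row counts as "bad" here only if its column count is wrong; that rule
    is an assumption made for the sake of the example.
    """
    good, bad = [], []
    reader = csv.reader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=1):
        if len(row) == expected_columns:
            good.append(row)
        else:
            # Keep the line number so the bad record can be traced later.
            bad.append((line_number, row))
    return good, bad


sample = "1,alice,10\n2,bob\n3,carol,30\n"
good, bad = partition_rows(sample, expected_columns=3)
print(len(good), "good rows;", len(bad), "bad rows")
```

The point of the sketch is only that the whole input is read once and the bad
rows fall out as a side product, so there is no per-record overhead.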

Thanks
-Mark



[1] https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
[2] https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries


On Apr 2, 2018, at 8:49 AM, Mohit <mohit.jain@open-insights.co.in> wrote:

Hi,

I’m trying to capture bad records from the ConvertCSVToAvro processor. For that,
I’m using two SplitText processors in a row: the first creates chunks, and the
second puts each record in its own flow file.

My flow is: ListFile -> FetchFile -> SplitText(10000 records) -> SplitText(1 record)
-> ConvertCSVToAvro -> (further processing)

I have a 10 MB file with 15 columns per row and 64000 records. The normal flow
(without SplitText) completes in a few seconds. But when I use the above flow,
the ConvertCSVToAvro processor is drastically slower (45-50 records/sec).
I can’t work out what I’m doing wrong in the flow.

I’m using NiFi 1.5.0.

Any quick input would be appreciated.



Thanks,
Mohit
