nifi-users mailing list archives

From Venkat Williams <venkat.willi...@gmail.com>
Subject Create a CovertCSVToJSON Processor
Date Tue, 06 Jun 2017 07:12:04 GMT
Hi



I would like to contribute the following processor implementation to the
NiFi project.



*Requirements:*



1)     Convert CSV files to a standard/canonical JSON format

a.       One JSON object/document per row in the input CSV

b.      Format should encode the data as JSON fields and values

c.       JSON field names should be the original column headers, with any
invalid characters handled properly.

d.      Values should be kept unaltered

2)     Optionally, be able to specify an expected header used to
validate/reject input CSVs

3)     Support both tab and comma delimited files

a.     Auto-detecting the delimiter from the header row is straightforward

b.    Allow the operator to specify the delimiter, overriding the
auto-detect logic

4)     Handle arbitrarily large files...

a.       Should handle CSV files of any length (achieved through
batching)

5)     Handle errors gracefully

a.       File failures

b.      Row failures

6)     Support RFC 4180 <https://tools.ietf.org/html/rfc4180> formatted
CSV files, including edge cases such as embedded newlines in a field value
and escaped double quotes
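
To make requirements 2 and 3 concrete, here is a minimal sketch in Python
(for illustration only; the processor itself is Java). The function names
detect_delimiter and read_header are hypothetical, and the sketch assumes
the header sits on a single physical line. Auto-detection simply picks
whichever candidate delimiter splits the header into the most columns.

```python
import csv


def detect_delimiter(header_line, candidates=(",", "\t")):
    """Return the candidate delimiter that yields the most header columns."""
    def column_count(delim):
        # Parse with the csv module so quoted delimiters are not counted.
        return len(next(csv.reader([header_line], delimiter=delim)))
    return max(candidates, key=column_count)


def read_header(text, delimiter=None, expected_header=None):
    """Return (delimiter, header); raise ValueError to reject the file."""
    lines = text.splitlines()
    if not lines:
        raise ValueError("empty input: no header row")
    # An operator-supplied delimiter overrides auto-detection.
    delim = delimiter or detect_delimiter(lines[0])
    header = next(csv.reader([lines[0]], delimiter=delim))
    if expected_header is not None and header != list(expected_header):
        raise ValueError("unexpected header: %r" % (header,))
    return delim, header
```

A file whose header does not match expected_header would be rejected as a
whole, which corresponds to the "file failure" case in requirement 5.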



Example:

Input CSV:

user,source_ip,source_country,destination_ip,url,timestamp

Venkat,192.168.0.1,IN,23.246.97.82,http://www.google.com,2017-02-22T14:46:24-05:00



Desired output JSON:

{"user":"Venkat","source_ip":"192.168.0.1","source_country":"IN","destination_ip":"23.246.97.82","url":"http://www.google.com","timestamp":"2017-02-22T14:46:24-05:00"}
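
The row-to-document mapping in the example can be sketched as follows. This
is an illustrative Python version, not the processor code: the names
sanitize_field_name and csv_to_json_lines are hypothetical, and replacing
invalid header characters with underscores is only one assumed policy for
requirement 1c. Python's csv module handles quoted fields with embedded
newlines and doubled quotes, so the RFC 4180 edge cases are covered here.

```python
import csv
import io
import json
import re


def sanitize_field_name(name):
    """Replace characters outside [A-Za-z0-9_] with "_" (assumed policy)."""
    return re.sub(r"[^A-Za-z0-9_]", "_", name.strip())


def csv_to_json_lines(text, delimiter=","):
    """Yield one compact JSON document per CSV row; values stay strings."""
    # newline="" keeps embedded newlines inside quoted fields intact.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    header = [sanitize_field_name(h) for h in next(reader)]
    for row in reader:
        if not row:
            continue  # skip completely blank lines
        yield json.dumps(dict(zip(header, row)), separators=(",", ":"))
```

Usage: "\n".join(csv_to_json_lines(text)) gives one JSON object per line,
with every value kept as the original string.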



*Implementation:*

1)      I reviewed the existing CSV libraries that can transform a CSV
record into a JSON document while supporting the RFC 4180
<https://tools.ietf.org/html/rfc4180> standard, including embedded newlines
in field values and escaped quotes. I found that the OpenCSV, FastCSV, and
Univocity CSV libraries do this job most effectively.

2)      I selected the Univocity CSV library because it alone covers most
of the validations in my requirements. In performance tests with large
5 GB and 10 GB files, it gave better results than the others.

3)      Processed CSV records are emitted as processing proceeds rather
than after the complete file has been processed. A configurable number in
the processor sets how many records to accumulate before emitting. With this
approach I could process 5 GB of CSV data using 1 GB of NiFi RAM, which is
the most attractive feature of this implementation for handling large files.
(This is a common limitation in processors like SplitText, SplitXML, etc.:
they wait until the whole file is processed and store the resulting
FlowFiles in an ArrayList within the processor, which causes heap size /
OutOfMemory issues.)

4)      File errors and record errors are handled gracefully using
user-defined configurations and processor routes.
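
The batching and error handling in items 3 and 4 can be sketched roughly as
below. This is an illustrative Python version under stated assumptions, not
the Java/Univocity code: convert_in_batches is a hypothetical name, the
"success" and "failure" labels stand in for processor routes, and a row is
treated as failed when its field count differs from the header's.

```python
import csv
import json


def convert_in_batches(lines, batch_size=1000, delimiter=","):
    """Yield ("success", [json, ...]) or ("failure", [(line, reason), ...]).

    `lines` is any iterable of text lines, e.g. a file opened with
    newline="". At most batch_size records per route are held in memory.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        # File-level failure: nothing can be converted without a header.
        raise ValueError("empty input: no header row")
    good, bad = [], []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if len(row) == len(header):
            good.append(json.dumps(dict(zip(header, row))))
            if len(good) >= batch_size:
                yield "success", good
                good = []
        else:
            # Row-level failure: record where it ended and why.
            reason = "expected %d fields, got %d" % (len(header), len(row))
            bad.append((reader.line_num, reason))
            if len(bad) >= batch_size:
                yield "failure", bad
                bad = []
    if good:
        yield "success", good
    if bad:
        yield "failure", bad
```

Because the function is a generator, each batch can be handed downstream
before the next rows are read, so memory use depends on batch_size rather
than on the size of the input file.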

Can anyone suggest how to proceed? Should I open a new issue, or use an
existing one? (I could not find any that matches this requirement.)
