nifi-dev mailing list archives

From Dmitry Goldenberg <dgoldenb...@hexastax.com>
Subject Re: Filtering large CSV files
Date Tue, 05 Apr 2016 11:00:51 GMT
Hi Eric,

Thinking about exactly these use cases, I filed the following JIRA ticket:
NIFI-1716 <https://issues.apache.org/jira/browse/NIFI-1716>. It asks for a
SplitCSV processor, and also for a GetCSV ingress that would address the
issue of reading from a large CSV file, treating it as a "data source". I
was thinking of implementing both and committing them.

NIFI-1280 <https://issues.apache.org/jira/browse/NIFI-1280> asks for a way
to filter the CSV columns. I believe this is best done while the CSV is
being parsed, in other words, in GetCSV/SplitCSV, and not as a separate
step.

I'm not sure that SplitText is the best way to process CSV data to begin
with, because a given CSV cell may spill over into multiple lines, as is
the case with embedded newlines inside a single quoted cell. I don't think
SplitText handles that, which is one reason to implement GetCSV/SplitCSV
with proper CSV parsing semantics; the other reason is efficiency of
reading.
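To make the embedded-newline point concrete, here is a minimal sketch
using Apache Commons CSV (just my pick of parser for illustration, not
necessarily what GetCSV/SplitCSV would be built on): naive line splitting
tears the quoted cell apart, while a CSV-aware parser keeps the record
intact.

import java.io.IOException;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class EmbeddedNewlineDemo {
    public static void main(String[] args) throws IOException {
        // Two logical records; the quoted middle cell of the first
        // record contains an embedded newline.
        String csv = "1,\"line one\nline two\",ok\n2,plain,ok\n";

        // Naive line splitting (what SplitText effectively does)
        // sees 3 "lines", breaking the first record in half.
        System.out.println("naive lines: " + csv.split("\n").length);

        // A CSV-aware parser honors the quotes and yields 2 records.
        try (CSVParser parser = CSVParser.parse(csv, CSVFormat.DEFAULT)) {
            for (CSVRecord record : parser) {
                System.out.println(record.size() + " cells: " + record.get(1));
            }
        }
    }
}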

As for the limit on capturing groups, that seems arbitrary. I think that
in GetCSV/SplitCSV, a way to identify the filtered-out columns by their
number (index) would go a long way; a regex might also be a good option. I
know it may seem that filtering should be a separate step in a given
dataflow, but from the point of view of efficiency, I believe it belongs
right in the GetCSV/SplitCSV processors, as the CSV records are being read
and processed (see the sketch below).
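For instance, a sketch of filter-while-parsing, again using Commons CSV
purely for illustration; the column indexes and the match value here are
hypothetical (in your case the filter column might be index 146 of 233):

import java.io.IOException;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class FilterWhileParsing {
    public static void main(String[] args) throws IOException {
        // Tiny stand-in for a large file.
        String csv = "a,x,keep\nb,y,drop\nc,z,keep\n";
        int[] keep = {0, 2};          // hypothetical column projection
        int matchColumn = 2;          // hypothetical filter column
        String matchValue = "keep";   // hypothetical filter value

        try (CSVParser parser = CSVParser.parse(csv, CSVFormat.DEFAULT)) {
            for (CSVRecord record : parser) {   // streams record by record
                if (!matchValue.equals(record.get(matchColumn))) {
                    continue;                   // filter on the value as we read
                }
                StringBuilder out = new StringBuilder();
                for (int i : keep) {            // scale down to the kept columns
                    if (out.length() > 0) out.append(',');
                    out.append(record.get(i));
                }
                System.out.println(out);        // prints "a,keep" and "c,keep"
            }
        }
    }
}

Doing both the value match and the column projection in a single streaming
pass is what lets this avoid a per-line flow file.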

- Dmitry

On Tue, Apr 5, 2016 at 6:36 AM, Eric FALK <eric.falk@uni.lu> wrote:

> Dear all,
>
> I need to filter large CSV files in a data flow. By filtering I mean:
> scaling the file down in terms of columns, and looking for a particular
> value to match a parameter. I looked into the CSV-to-JSON example. I do
> have a couple of questions:
>
> - First, I use a SplitText processor to get each line of the file. It
> makes things slow, as it seems to generate a flow file for each line. Do I
> have to proceed this way, or is there an alternative? My CSV files are
> really large and can have millions of lines.
>
> - In a second step, I extract the values with the (.+),(.+),….,(.+)
> technique, before using a processor to check for a match, on ${csv.146}
> for instance. Now I have a problem: my CSV has 233 fields, so I am getting
> the message: “RegEx is required to have between 1 and 40 capturing groups
> but has 233”. Again, is there another way to proceed, or am I missing
> something?
>
> Best regards,
> Eric
