nifi-dev mailing list archives

From "Uwe Geercken" <uwe.geerc...@web.de>
Subject Re: Filtering large CSV files
Date Tue, 05 Apr 2016 17:39:51 GMT
Dmitry,

I was working on a processor for CSV files, and one remark that came up was
that we might want to use the opencsv library for parsing the file.

Here is the link: http://opencsv.sourceforge.net/
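For illustration, a minimal sketch of reading a file with opencsv (the file
name is just a placeholder):

    import com.opencsv.CSVReader;
    import java.io.FileReader;

    public class OpenCsvExample {
        public static void main(String[] args) throws Exception {
            // Each call to readNext() returns one record as a String[] of
            // cell values; "data.csv" is a placeholder path.
            try (CSVReader reader = new CSVReader(new FileReader("data.csv"))) {
                String[] record;
                while ((record = reader.readNext()) != null) {
                    System.out.println(String.join("|", record));
                }
            }
        }
    }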

Greetings,

Uwe

> Sent: Tuesday, 05 April 2016 at 13:00
> From: "Dmitry Goldenberg" <dgoldenberg@hexastax.com>
> To: dev@nifi.apache.org
> Subject: Re: Filtering large CSV files
>
> Hi Eric,
> 
> Thinking about exactly these use cases, I filed the following JIRA ticket:
> NIFI-1716 <https://issues.apache.org/jira/browse/NIFI-1716>. It asks for a
> SplitCSV processor, and also for a GetCSV ingress processor, which would
> address the issue of reading out of a large CSV, treating it as a "data
> source". I was thinking of implementing both and committing them.
> 
> NIFI-1280 <https://issues.apache.org/jira/browse/NIFI-1280> asks for a
> way to filter the CSV columns.  I believe this is best achieved as the CSV
> is being parsed, in other words, in GetCSV/SplitCSV, and not as a
> separate step.
> 
> I'm not sure that SplitText is the best way to process CSV data to begin
> with, because with a CSV, there's a chance that a given cell may spill over
> into multiple lines. Such would be the case with embedded newlines within a
> single, quoted cell. I don't think SplitText addresses that, which would be
> one reason to implement GetCSV/SplitCSV with proper CSV parsing semantics;
> the other reason is efficiency of reading.
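>
> A quick illustration of the embedded-newline case (a plain opencsv sketch,
> not actual processor code; the sample data is made up). A quoted cell
> containing a newline comes back as a single record, where a naive
> line-based split would yield two fragments:
>
>     import com.opencsv.CSVReader;
>     import java.io.StringReader;
>
>     public class EmbeddedNewlineExample {
>         public static void main(String[] args) throws Exception {
>             // One header record, then one data record whose middle cell
>             // contains an embedded newline inside the quotes.
>             String csv = "id,comment,amount\n1,\"line one\nline two\",42\n";
>             try (CSVReader reader = new CSVReader(new StringReader(csv))) {
>                 String[] record;
>                 while ((record = reader.readNext()) != null) {
>                     // Prints "3 cells" for both records; a line-based
>                     // split would have cut the quoted cell in two.
>                     System.out.println(record.length + " cells");
>                 }
>             }
>         }
>     }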
> 
> As for the limit on capturing groups, that seems arbitrary. I think that
> on GetCSV/SplitCSV, having a way to identify the filtered-out columns by
> their number (index) should go a long way; perhaps a regex is also a good
> option.  I know it may seem that filtering should be a separate step in a
> given dataflow, but from the point of view of efficiency, I believe it
> belongs right in the GetCSV/SplitCSV processors, as the CSV records are
> being read and processed.
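>
> To sketch the kind of in-parse filtering I mean (hypothetical code, not
> the actual GetCSV/SplitCSV implementation; the file name, column indexes,
> and match value are all made up):
>
>     import com.opencsv.CSVReader;
>     import java.io.FileReader;
>
>     public class FilterCsvExample {
>         public static void main(String[] args) throws Exception {
>             int matchColumn = 146;           // hypothetical: column to test
>             String matchValue = "somevalue"; // hypothetical: value to match
>             int[] keepColumns = {0, 5, 146}; // hypothetical: columns to keep
>
>             try (CSVReader reader = new CSVReader(new FileReader("large.csv"))) {
>                 String[] record;
>                 while ((record = reader.readNext()) != null) {
>                     // Drop non-matching records as they stream by, instead
>                     // of filtering in a separate downstream step.
>                     if (record.length <= matchColumn
>                             || !matchValue.equals(record[matchColumn])) {
>                         continue;
>                     }
>                     // Emit only the selected columns.
>                     StringBuilder out = new StringBuilder();
>                     for (int i : keepColumns) {
>                         if (out.length() > 0) out.append(',');
>                         out.append(record[i]);
>                     }
>                     System.out.println(out);
>                 }
>             }
>         }
>     }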
> 
> - Dmitry
> 
> 
> 
> 
> On Tue, Apr 5, 2016 at 6:36 AM, Eric FALK <eric.falk@uni.lu> wrote:
> 
> > Dear all,
> >
> > I need to filter large CSV files in a data flow. By filtering I mean:
> > scaling the file down in terms of columns, and looking for a particular
> > value that matches a parameter. I looked into the CSV-to-JSON example. I
> > do have a couple of questions:
> >
> > -First, I use a SplitText processor to get each line of the file. It
> > makes things slow, as it seems to generate a flow file for each line. Do
> > I have to proceed this way, or is there an alternative? My CSV files are
> > really large and can have millions of lines.
> >
> > -In a second step, I am extracting the values with the (.+),(.+),….,(.+)
> > technique, before using a processor to check for a match, on ${csv.146}
> > for instance. Now I have a problem: my CSV has 233 fields, so I am
> > getting the message: “RegEx is required to have between 1 and 40
> > capturing groups but has 233”. Again, is there another way to proceed?
> > Am I missing something?
> >
> > Best regards,
> > Eric
>
