nifi-dev mailing list archives

From Adam Taft <a...@adamtaft.com>
Subject Re: Re: Processor: User friendly vs system friendly design
Date Fri, 18 Mar 2016 21:41:59 GMT
Uwe,

I'll take a look at your code sometime soon.  However, just to point you in
the right direction, I'd suggest extracting your single-line CSV data into
flowfile attributes named as you've demonstrated, i.e. create a processor
which reads each CSV column into a flowfile attribute, using a configured
naming convention.

For example, using "column" as your prefix with your example input, you'd
end up with a single flowfile with attributes like:

column0 = Peterson
column1 = Jenny
column2 = New York
column3 = USA
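
A rough sketch of that column extraction in plain Java, outside of any NiFi
API, might look like the following.  The class name CsvColumns and the method
toAttributes are hypothetical, and the split is naive: it assumes fields never
contain the separator inside quotes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class CsvColumns {

    // Hypothetical helper: maps each field of one CSV line to the key
    // prefix + index (column0, column1, ...). Quoted fields that contain
    // the separator are NOT handled by this naive split.
    public static Map<String, String> toAttributes(String line, String separator, String prefix) {
        Map<String, String> attributes = new LinkedHashMap<>();
        if (line == null || line.isEmpty()) {
            return attributes;
        }
        // Pattern.quote makes the separator literal; the -1 limit keeps
        // trailing empty fields instead of silently dropping them.
        String[] fields = line.split(Pattern.quote(separator), -1);
        for (int i = 0; i < fields.length; i++) {
            attributes.put(prefix + i, fields[i].trim());
        }
        return attributes;
    }

    public static void main(String[] args) {
        Map<String, String> attrs =
                toAttributes("Peterson, Jenny, New York, USA", ",", "column");
        // Prints one "name = value" line per column, in column order.
        attrs.forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

In a real processor the resulting map would be written onto the flowfile as
attributes rather than printed.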

Flowfile attributes are effectively a Map<String,String>.  So in your
Velocity processor, you would pass the Map of flowfile attributes to the
template engine and record the results to the flowfile content.
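
To make the "Map in, text out" idea concrete, here is a minimal sketch in
plain Java.  It is NOT the Velocity API: the class TemplateMerge and its merge
method are hypothetical stand-ins that only substitute simple $name
placeholders, assuming the template uses nothing more than that.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateMerge {

    // Matches simple references such as $column0 (no directives, no logic).
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$([A-Za-z_][A-Za-z0-9_]*)");

    // Hypothetical stand-in for a template engine: replaces each $name with
    // the attribute of the same name; unknown names are left untouched.
    public static String merge(String template, Map<String, String> attributes) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = attributes.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement keeps '$' and '\' in values from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("column0", "Peterson");
        attributes.put("column1", "Jenny");
        String template = "{ \"name\": \"$column0\", \"first\": \"$column1\" }";
        System.out.println(merge(template, attributes));
    }
}
```

The point of the sketch is only that the transform step needs nothing but the
attribute Map, so it does not care whether the attributes came from CSV or
from any other processor.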

Using SplitText seems correct up front (though like you said, you lose the
CSV header line).  You'd need two additional processors, from my
perspective:

(input) -> SplitText -> ExtractCSVColumns -> ApplyVelocityTemplate ->
(output)

It's the "split row into fields and merge with template" that we would
want to separate into two processors instead of one.

You're very much on the right track, I believe.  If the above doesn't help,
I'll try and jump in on a code example when I can.

Adam


On Fri, Mar 18, 2016 at 5:04 PM, Uwe Geercken <uwe.geercken@web.de> wrote:

> Adam,
>
> I don't see an obvious way for your suggestion of "Read columns from a
> single CSV line into flowfile attributes." - I would need your advice how I
> can achieve it.
>
> Thinking about it in more detail, I have the following issues:
> - the incoming flowfile may have many columns, so adding the columns
> manually as attributes with UpdateAttribute is not feasible
> - I have setup a flow where I use SplitText to divide the flowfile into
> multiple flowfiles, so there won't be a header row I can use to get the
> column names. So I think I can only use abstract column names plus a
> running number. e.g. column0, column1, etc.
>
> So for the moment I have coded the processor as described below. At the
> moment I am still "thinking in CSV" but I will check it with other formats
> later. The user can control the following settings: path where the template
> is stored, name of the template file, the label for the columns (I call it
> prefix) and the separator based on which the split of the row is done.
>
> Example Flowfile content (user has chosen "comma" as separator):
>
> Peterson, Jenny, New York, USA
>
> Example template (user has chosen "column" as the prefix):
>
> {
>         "name": "$column0",
>         "first": "$column1",
>         "city": "$column2",
>         "country": "$column3"
> }
>
> Example flow:
>
> GetFile: Get CSV File >> SplitText: split into multiple flowfiles, one
> per row >> TemplateProcessor: split row into fields and merge with
> template >> MergeContent: merge flowfiles into one >> PutFile: put the
> file to the filesystem
>
> Example result:
>
> {
>         "name": "Peterson",
>         "first": "Jenny",
>         "city": "New York",
>         "country": "USA"
>  }
>
> I will test the processor now for larger files, empty files and other
> exceptions. If you are interested the code is here:
>
> https://github.com/uwegeercken/nifi_processors
>
> Greetings,
>
> Uwe
>
>
>
> > Sent: Friday, 18 March 2016 at 18:58
> > From: "Adam Taft" <adam@adamtaft.com>
> > To: dev@nifi.apache.org
> > Subject: Re: Processor: User friendly vs system friendly design
> >
> > Uwe,
> >
> > The Developer Guide[1] and Contributor Guide[2] are pretty solid.  The
> > Developer Guide has a section dealing with reading & writing flowfile
> > attributes.  Please check these out, and then if you have any specific
> > questions, please feel free to reply.
> >
> > For inclusion in NIFI directly, you'd want to create a NIFI Jira ticket
> > mentioning the new feature, and then fork the NIFI project in Github and
> > send a Pull Request referencing the ticket.  However, if you just want some
> > feedback on suitability and consideration for inclusion, using your own
> > personal Github project and sending a link would be fine.
> >
> > Having a template conversion processor would be a nice addition.  Making it
> > generic to support Velocity, FreeMarker, and others might be really nice.
> > Extra bonus points for Markdown or Asciidoc transforms as well (but these
> > might be too separate of a use case).
> >
> > Hope this helps.
> >
> > Adam
> >
> > [1]  http://nifi.apache.org/developer-guide.html
> >
> > [2]  https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide
> >
> >
> >
> >
> > On Fri, Mar 18, 2016 at 1:36 PM, Uwe Geercken <uwe.geercken@web.de> wrote:
> >
> > > Adam,
> > >
> > > interesting and I agree. that sounds very good.
> > >
> > > can you give me a short tip on how to access attributes from code?
> > >
> > > once I have something usable or for testing where would I publish it? just
> > > on my github site? or is there a place for sharing?
> > >
> > > greetings
> > >
> > > Uwe
> > >
> > >
> > >
> > > Sent: Friday, 18 March 2016 at 18:03
> > > From: "Adam Taft" <adam@adamtaft.com>
> > > To: dev@nifi.apache.org
> > > Subject: Re: Processor: User friendly vs system friendly design
> > > I'm probably on the far end of favoring composability and processor reuse.
> > > In this case, I would even go one step further and suggest that you're
> > > talking about three separate operations:
> > >
> > > 1. Split a multi-line CSV input file into individual single line flowfiles.
> > > 2. Read columns from a single CSV line into flowfile attributes.
> > > 3. Pass flowfile attributes into the Velocity transform processor.
> > >
> > > The point here: have you considered driving your Velocity template
> > > transform using flowfile attributes as opposed to CSV? Flowfile attributes
> > > are NIFI's lowest common data representation; many processors create
> > > attributes, which would enable your Velocity processor to be used by more
> > > than just CSV input.
> > >
> > > Adam
> > >
> > >
> > >
> > > On Fri, Mar 18, 2016 at 11:06 AM, Uwe Geercken <uwe.geercken@web.de> wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > my first mailing here. I am a Java developer, using Apache Velocity,
> > > > Drill, Tomcat, Ant, Pentaho ETL, MongoDb, Mysql and more and I am very
> > > > much a data guy.
> > > >
> > > > I have used Nifi for a while now and started coding my first processor
> > > > yesterday. I basically do it to widen my knowledge and learn something
> > > > new.
> > > >
> > > > I started with the idea of combining Apache Velocity - a template engine -
> > > > with Nifi. So in comes a CSV file, it gets merged with a template
> > > > containing formatting information and some placeholders (and some limited
> > > > logic maybe) and out comes a new set of data, formatted differently. So it
> > > > separates the processing logic from the formatting. One could create HTML,
> > > > XML, Json or other text based formats from it. Easy to use and very
> > > > efficient.
> > > >
> > > > Now my question is: Should I rather implement the logic in such a way that
> > > > I process a whole CSV file - which usually has multiple lines? That would
> > > > be good for the user as he or she has to deal with only one processor
> > > > doing the work. But the logic would be more specialized.
> > > >
> > > > The other way around, I could code the processor to handle one row of the
> > > > CSV file and the user will have to come up with a flow that divides the
> > > > CSV file into multiple flowfiles before my processor can be used. That is
> > > > not so specialized but it requires more preparation work from the user.
> > > >
> > > > I tend to go the second way, also because there is already a processor
> > > > that will split a file into multiple flowfiles. But I wanted to hear your
> > > > opinion of what is the best way to go. Do you have a recommendation for
> > > > me? (Maybe the answer is to do both?!)
> > > >
> > > > Thanks for sharing your thoughts.
> > > >
> > > > Uwe
> > > >
> > >
> >
>
