nifi-dev mailing list archives

From Adam Taft <a...@adamtaft.com>
Subject Re: Common data exchange formats and tabular data
Date Mon, 02 Nov 2015 17:47:23 GMT
CSV (and friends like TSV, PSV, etc.) is naturally suited to representing
tabular data.  I don't know that there would be a lot of value in
using/inventing a JSON or Avro format in place of CSV for tabular data.

The only slight advantage might be maintaining type information, which JSON
or Avro could carry (since CSV is basically all strings).  But the type
information alone might make dealing with data in follow-on processors a
bit more difficult.
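
To make the type question concrete, here is a minimal Python sketch (the
sample row and its field names are made up for illustration, not taken from
any NiFi processor). It round-trips one row through CSV and through JSON:
the CSV values all come back as strings, while JSON keeps numbers and
booleans.

```python
import csv
import io
import json

# Hypothetical sample row; the field names are illustrative only.
row = {"id": 42, "price": 9.99, "active": True}

# Round trip through CSV: every value is read back as a string.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
buf.seek(0)
csv_row = next(csv.DictReader(buf))

# Round trip through JSON: numbers and booleans keep their types.
json_row = json.loads(json.dumps(row))

print(csv_row)
print(json_row)
```

A follow-on step reading the CSV form would have to decide for itself
whether "42" is a number or text, which is the trade-off described above.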

That being said, I do like the concept of processing the tabular data payload
separately from the original remote "fetch" call.  This follows the Unix
philosophy better by favoring data flow composition.  A lot of data can be
converted to a string-based tabular form, regardless of the original source,
which could enable interesting possibilities for many data flows.
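
As a rough sketch of that composition idea in Python (the function names
rows_to_csv and csv_to_records and the two sample "sources" are
hypothetical, not NiFi APIs), the fetch side only has to emit CSV text, and
the downstream side only has to read CSV text:

```python
import csv
import io
from typing import Iterable, Iterator, Sequence


def rows_to_csv(header: Sequence[str], rows: Iterable[Sequence[object]]) -> str:
    """Serialize rows from any source into CSV text with CRLF line endings."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\r\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def csv_to_records(text: str) -> Iterator[dict]:
    """Downstream step: parses CSV text without knowing where it came from."""
    yield from csv.DictReader(io.StringIO(text, newline=""))


# Stand-ins for two unrelated sources (e.g. a SQL result and an API response).
sql_rows = [(1, "alice"), (2, "bob")]
api_rows = ({"id": 3, "name": "carol, jr."},)

payload_a = rows_to_csv(["id", "name"], sql_rows)
payload_b = rows_to_csv(["id", "name"], ([r["id"], r["name"]] for r in api_rows))

# The same downstream code handles both payloads.
records = list(csv_to_records(payload_a)) + list(csv_to_records(payload_b))
print(records)
```

The downstream function never sees the original source, only the string
based tabular payload, so the two halves can be swapped independently.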

Adam

On Mon, Nov 2, 2015 at 9:06 AM, Matthew Burgess <mattyb149@gmail.com> wrote:

> Hello all,
>
> I am new to the NiFi community but I have a good amount of experience with
> ETL tools and applications that process lots of tabular data. In my
> experience, JSON is only useful as the common format for tabular data if it
> has a "flat" schema, in which case there aren't any advantages for JSON
> over other formats such as CSV. However, I've seen lots of "CSV" files
> that don't seem to adhere to any standard, so I would presume NiFi would
> need a rigid
> schema such as RFC-4180 (http://www.rfc-base.org/txt/rfc-4180.txt).
>
> However, CSV isn't a natural way to express the schema of the rows, so JSON
> or YAML is probably a better choice. There's a format called Tabular Data
> Package that combines CSV and JSON for tabular data serialization:
> http://dataprotocols.org/tabular-data-package/
>
> Avro is similar, but the schema must always be provided with the data. In
> the case of NiFi DataFlows, it's likely more efficient to send the schema
> once as an initialization packet (I can't remember the real term in NiFi),
> then the rows can be streamed individually, in batches of user-defined
> size, sampled, etc.
>
> Having said all that, there are projects like Apache Drill that can handle
> non-flat JSON files and still present them in tabular format. They have
> functions like KVGEN and FLATTEN to transform the document(s) into tabular
> format. In the use cases you present below, you already know the data is
> tabular and as such, the extra data model transformation is not needed.  If
> this is desired, it should be apparent that a streaming JSON processor would
> be necessary; otherwise, for large tabular datasets you'd have to read the
> whole JSON file into memory to parse individual rows.
>
> Regards,
> Matt
>
> From:  Toivo Adams <toivo.adams@gmail.com>
> Reply-To:  <dev@nifi.apache.org>
> Date:  Monday, November 2, 2015 at 5:12 AM
> To:  <dev@nifi.apache.org>
> Subject:  Common data exchange formats and tabular data
>
> All,
> Some processors get/put data in tabular form (PutSQL, ExecuteSQL, soon
> Cassandra).
> It would be very nice to be able to use such processors in a pipeline, where
> the previous processor's output is the next processor's input. To achieve
> this, processors should use a common data exchange format.
>
> JSON is the most widely used; it's simple and readable. But JSON lacks a
> schema. A schema can be very useful to automate data insert/update.
>
> Avro has a schema, but is somewhat more complicated and not widely used
> (yet?).
>
> Please see also:
>
> https://issues.apache.org/jira/browse/NIFI-978
>
> https://issues.apache.org/jira/browse/NIFI-901
>
> Opinions?
>
> Thanks
> Toivo
