nifi-users mailing list archives

From Mike Thomsen <>
Subject Re: How to use delta storage format
Date Thu, 02 Apr 2020 18:28:14 GMT
Interesting. You've given me a lot to think about in terms of designing

On Thu, Apr 2, 2020 at 10:43 AM Paul Parker <> wrote:

> @Mike I'd appreciate some feedback.
> Paul Parker <> wrote on Tue., Mar 31, 2020, 17:07:
>> Let me share answers from the delta community:
>> Answer to Q1:
>> Structured streaming queries can commit every minute, even every
>> 20-30 seconds. This definitely creates small files, but that is okay,
>> because it is expected that people will periodically compact them. The
>> same timing should work fine for NiFi and any other streaming engine. It
>> does create 1000-ish versions per day, but that is okay.
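The version counts quoted above are easy to sanity-check with a quick back-of-the-envelope calculation. Nothing below is Delta-specific; it is just the arithmetic behind "1000-ish versions per day":

```python
# Rough arithmetic: how many commits (table versions) a streaming sink
# produces per day at a given commit cadence. Purely illustrative.

SECONDS_PER_DAY = 24 * 60 * 60  # 86400

def versions_per_day(commit_interval_seconds: int) -> int:
    """Number of commits produced in one day at a fixed interval."""
    return SECONDS_PER_DAY // commit_interval_seconds

print(versions_per_day(60))  # one commit per minute -> 1440 versions/day
print(versions_per_day(30))  # every 30 seconds      -> 2880 versions/day
print(versions_per_day(20))  # every 20 seconds      -> 4320 versions/day
```

So a per-minute cadence already lands in the "1000-ish per day" range, and a 20-30 second cadence produces a few thousand, which is why periodic compaction is expected.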
>> Answer to Q2:
>> That is up to the sink implementation. Both are okay. In fact, it
>> can probably be a combination of both, as long as we don't commit every
>> second. That may not scale well.
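A sink that combines both triggers might look like the sketch below: it commits when either a record-count threshold or a time budget is reached, so it never commits per record. All names here are hypothetical illustrations, not part of any NiFi or Delta API:

```python
import time

class BatchingCommitter:
    """Hypothetical sink buffer that commits when either enough records
    have accumulated or enough time has passed since the last commit."""

    def __init__(self, commit_fn, max_records=10_000, max_age_seconds=60.0):
        self.commit_fn = commit_fn  # called with the buffered records
        self.max_records = max_records
        self.max_age_seconds = max_age_seconds
        self.buffer = []
        self.last_commit = time.monotonic()

    def add(self, record):
        self.buffer.append(record)
        if self._should_commit():
            self.flush()

    def _should_commit(self):
        age = time.monotonic() - self.last_commit
        return len(self.buffer) >= self.max_records or age >= self.max_age_seconds

    def flush(self):
        # Commit whatever is buffered and reset the timer, even if the
        # flush was triggered externally (e.g. on shutdown).
        if self.buffer:
            self.commit_fn(self.buffer)
            self.buffer = []
        self.last_commit = time.monotonic()
```

With `max_records=10_000` and `max_age_seconds=60`, a quiet stream commits once a minute and a busy one commits by size, which matches the "minute-ish, never per-second" guidance above.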
>> Answer to Q3:
>> You need a primary node which is responsible for managing the Delta
>> table. That node would be responsible for reading the log, parsing it,
>> updating it, etc. Unfortunately, we have no good non-Spark way to read the
>> log, and much less to write to it.
>> There is an experimental uber JAR that tries to package Delta + Spark
>> together into a single JAR, which you could use to read the log. It is
>> available here -
>> What it basically does is run a local-mode (in-process) Spark to read
>> the log. This is what we are using to build a Hive connector that will allow
>> Hive to read Delta files. Now, the goal of that was only to read; your goal
>> is to write, which is definitely more complicated, because for that you have
>> to do much more. That uber JAR has all the necessary code to do the
>> writing, which you could use, but there has to be a driver node which
>> collects all the Parquet files written by other nodes and atomically
>> commits those Parquet files to the Delta log to make them visible to all
>> readers.
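At the protocol level, "atomically commit to the Delta log" means writing a JSON file named after the next version number into the table's `_delta_log/` directory with create-if-absent semantics; whichever writer creates that file first wins, and everyone else must re-read the log and retry. The sketch below simulates only that mutual-exclusion idea on a local filesystem. It is not a real Delta writer: a real one must also emit the correct action records, handle retries, and rely on a store with an atomic put-if-absent primitive (which plain S3 historically lacked):

```python
import json
import os

def try_commit(table_path: str, version: int, actions: list) -> bool:
    """Attempt to commit `actions` as `version` of the table log.
    Returns True on success, False if that version already exists
    (another writer won the race; re-read the log and retry)."""
    log_dir = os.path.join(table_path, "_delta_log")
    os.makedirs(log_dir, exist_ok=True)
    # Delta log files are zero-padded to 20 digits,
    # e.g. 00000000000000000000.json for version 0.
    commit_file = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_CREAT | O_EXCL is an atomic create-if-absent on a local filesystem.
        fd = os.open(commit_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        for action in actions:  # one JSON action per line
            f.write(json.dumps(action) + "\n")
    return True
```

This is exactly where the "driver node" requirement comes from: the node that gathers the Parquet files written elsewhere is the one that performs this single winner-takes-all write.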
>> Does this initial orientation help?
>> Mike Thomsen <> wrote on Sun., Mar 29, 2020,
>> 20:29:
>>> It looks like a lot of their connectors rely on external management by
>>> Spark. That is true of Hive, and also of Athena/Presto, unless I misread the
>>> documentation. Read some of the fine print near the bottom of this to get
>>> an idea of what I mean:
>>> Hypothetically, we could build a connector for NiFi, but there are some
>>> things about the design of Delta Lake that I am not sure about, based on my
>>> own research and whatnot. Roughly, they are the following:
>>> 1. What is a good timing strategy for doing commits to Delta Lake?
>>> 2. What would trigger a commit in the first place?
>>> 3. Is there a good way to trigger commits that would work within the
>>> masterless cluster design of clustered NiFi instead of requiring a special
>>> "primary node only" processor for executing commits?
>>> Based on my experimentation, one of the biggest questions around the
>>> first point is whether you really want potentially thousands or tens of
>>> thousands of time-shift events to be created throughout the day. A record
>>> processor that reads in a ton of small record sets and injects them into
>>> the Delta Lake would create a ton of these checkpoints, and they'd be
>>> largely meaningless to people trying to use them to go back and forth
>>> in time between versions.
>>> Do we trigger a commit per record set or set a timer?
>>> Most of us on the NiFi dev side have no real experience here. It would
>>> be helpful for us to get some ideas from the community to form use cases,
>>> because there are some big gaps in how we'd even start to shape the
>>> requirements.
>>> On Sun, Mar 29, 2020 at 1:28 PM Paul Parker <>
>>> wrote:
>>>> Hi Mike,
>>>> your alternate suggestion sounds good. But how does it work if I want
>>>> to keep this running continuously? In other words, the Delta table should
>>>> be continuously updated. After all, this is one of the biggest advantages of
>>>> Delta: you can ingest batch and streaming data into one table.
>>>> I am also thinking about workarounds (using Athena, Presto or Redshift
>>>> with NiFi):
>>>> "Here is the list of integrations that enable you to access Delta
>>>> tables from external data processing engines.
>>>>    - Presto and Athena to Delta Lake Integration
>>>>    <>
>>>>    - Redshift Spectrum to Delta Lake Integration
>>>>    <>
>>>>    - Snowflake to Delta Lake Integration
>>>>    <>
>>>>    - Apache Hive to Delta Lake Integration
>>>>    <>"
>>>> Source:
>>>> I am looking forward to further ideas from the community.
>>>> Mike Thomsen <> wrote on Sun., Mar 29, 2020,
>>>> 17:23:
>>>>> I think there is a connector for Hive that works with Delta. You could
>>>>> try setting up Hive to work with Delta and then using NiFi to feed Hive.
>>>>> Alternatively, you can simply convert the results of the JDBC query into a
>>>>> Parquet file, push it to the desired location, and run a Spark job to
>>>>> convert from Parquet to Delta (that should be pretty fast because Delta is
>>>>> basically a fork of Parquet).
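That Parquet-then-convert pipeline can be sketched in PySpark. This is only a sketch: it assumes a `SparkSession` already configured with the Delta Lake (delta-core) package on its classpath, and both paths are placeholders:

```python
def convert_parquet_to_delta(spark, parquet_path: str, delta_path: str) -> None:
    """Sketch: read the Parquet files NiFi dropped at `parquet_path` and
    rewrite them as a Delta table at `delta_path`. Requires a SparkSession
    with the Delta Lake package available; paths are placeholders."""
    df = spark.read.parquet(parquet_path)
    df.write.format("delta").mode("append").save(delta_path)
```

Run as a small scheduled Spark job, this covers the "push Parquet, then convert" step; `mode("append")` lets repeated NiFi drops accumulate into the same table.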
>>>>> On Fri, Mar 27, 2020 at 2:13 PM Paul Parker <>
>>>>> wrote:
>>>>>> We read data via JDBC from a database and want to save the results to
>>>>>> a Delta table and then read them again. How can I realize this with NiFi
>>>>>> and Hive or Glue Metastore?
