spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Amrit Jangid <>
Subject Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?
Date Tue, 03 Jan 2017 11:25:49 GMT
You can try out *debezium* : it reads data
from bin-logs, provides structure and stream into Kafka.

Now Kafka can be your new source for streaming.

On Tue, Jan 3, 2017 at 4:36 PM, Yuanzhe Yang <> wrote:

> Hi Hongdi,
> Thanks a lot for your suggestion. The data is truely immutable and the
> table is append-only. But actually there are different databases involved,
> so the only feature they share in common and I can depend on is jdbc...
> Best regards,
> Yang
> 2016-12-30 6:45 GMT+01:00 任弘迪 <>:
>> why not sync binlog of mysql(hopefully the data is immutable and the
>> table is append-only), send the log through kafka and then consume it by
>> spark streaming?
>> On Fri, Dec 30, 2016 at 9:01 AM, Michael Armbrust <
>> > wrote:
>>> We don't support this yet, but I've opened this JIRA as it sounds
>>> generally useful:
>>> In the mean time you could try implementing your own Source, but that is
>>> pretty low level and is not yet a stable API.
>>> On Thu, Dec 29, 2016 at 4:05 AM, "Yuanzhe Yang (杨远哲)" <
>>> > wrote:
>>>> Hi all,
>>>> Thanks a lot for your contributions to bring us new technologies.
>>>> I don't want to waste your time, so before I write to you, I googled,
>>>> checked stackoverflow and mailing list archive with keywords "streaming"
>>>> and "jdbc". But I was not able to get any solution to my use case. I hope
>>>> can get some clarification from you.
>>>> The use case is quite straightforward, I need to harvest a relational
>>>> database via jdbc, do something with data, and store result into Kafka. I
>>>> am stuck at the first step, and the difficulty is as follows:
>>>> 1. The database is too large to ingest with one thread.
>>>> 2. The database is dynamic and time series data comes in constantly.
>>>> Then an ideal workflow is that multiple workers process partitions of
>>>> data incrementally according to a time window. For example, the processing
>>>> starts from the earliest data with each batch containing data for one hour.
>>>> If data ingestion speed is faster than data production speed, then
>>>> eventually the entire database will be harvested and those workers will
>>>> start to "tail" the database for new data streams and the processing
>>>> becomes real time.
>>>> With Spark SQL I can ingest data from a JDBC source with partitions
>>>> divided by time windows, but how can I dynamically increment the time
>>>> windows during execution? Assume that there are two workers ingesting data
>>>> of 2017-01-01 and 2017-01-02, the one which finishes quicker gets next task
>>>> for 2017-01-03. But I am not able to find out how to increment those values
>>>> during execution.
>>>> Then I looked into Structured Streaming. It looks much more promising
>>>> because window operations based on event time are considered during
>>>> streaming, which could be the solution to my use case. However, from
>>>> documentation and code example I did not find anything related to streaming
>>>> data from a growing database. Is there anything I can read to achieve my
>>>> goal?
>>>> Any suggestion is highly appreciated. Thank you very much and have a
>>>> nice day.
>>>> Best regards,
>>>> Yang
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail:


Data Team

View raw message