spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yuanzhe Yang <>
Subject Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?
Date Tue, 03 Jan 2017 11:03:30 GMT
Hi Michael,

Thanks a lot for your ticket. At least it is the first step.

Best regards,

2016-12-30 2:01 GMT+01:00 Michael Armbrust <>:

> We don't support this yet, but I've opened this JIRA as it sounds
> generally useful:
> In the mean time you could try implementing your own Source, but that is
> pretty low level and is not yet a stable API.
> On Thu, Dec 29, 2016 at 4:05 AM, "Yuanzhe Yang (杨远哲)" <>
> wrote:
>> Hi all,
>> Thanks a lot for your contributions to bring us new technologies.
>> I don't want to waste your time, so before I write to you, I googled,
>> checked stackoverflow and mailing list archive with keywords "streaming"
>> and "jdbc". But I was not able to get any solution to my use case. I hope I
>> can get some clarification from you.
>> The use case is quite straightforward, I need to harvest a relational
>> database via jdbc, do something with data, and store result into Kafka. I
>> am stuck at the first step, and the difficulty is as follows:
>> 1. The database is too large to ingest with one thread.
>> 2. The database is dynamic and time series data comes in constantly.
>> Then an ideal workflow is that multiple workers process partitions of
>> data incrementally according to a time window. For example, the processing
>> starts from the earliest data with each batch containing data for one hour.
>> If data ingestion speed is faster than data production speed, then
>> eventually the entire database will be harvested and those workers will
>> start to "tail" the database for new data streams and the processing
>> becomes real time.
>> With Spark SQL I can ingest data from a JDBC source with partitions
>> divided by time windows, but how can I dynamically increment the time
>> windows during execution? Assume that there are two workers ingesting data
>> of 2017-01-01 and 2017-01-02, the one which finishes quicker gets next task
>> for 2017-01-03. But I am not able to find out how to increment those values
>> during execution.
>> Then I looked into Structured Streaming. It looks much more promising
>> because window operations based on event time are considered during
>> streaming, which could be the solution to my use case. However, from
>> documentation and code example I did not find anything related to streaming
>> data from a growing database. Is there anything I can read to achieve my
>> goal?
>> Any suggestion is highly appreciated. Thank you very much and have a nice
>> day.
>> Best regards,
>> Yang
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail:

View raw message