spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yuanzhe Yang <>
Subject Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?
Date Tue, 03 Jan 2017 15:13:33 GMT
Hi Ayan,

Thanks a lot for your suggestion. I am currently looking into sqoop.

Concerning your suggestion for Spark, it is indeed parallelized with
multiple workers, but the job is one-off and cannot keep streaming.
Moreover, I cannot specify any "start row" in the job, it will always
ingest the entire table. So I also cannot simulate a streaming process by
starting the job in fix intervals...

Best regards,

2017-01-03 15:06 GMT+01:00 ayan guha <>:

> Hi
> While the solutions provided by others looks promising and I'd like to try
> out few of them, our old pal sqoop already "does" the job. It has a
> incremental mode where you can provide a --check-column and
> --last-modified-value combination to grab the data - and yes, sqoop
> essentially does it by running a MAP-only job which spawns number of
> parallel map task to grab data from DB.
> In Spark, you can use sqlContext.load function for JDBC and use
> partitionColumn and numPartition to define parallelism of connection.
> Best
> Ayan
> On Tue, Jan 3, 2017 at 10:49 PM, Yuanzhe Yang <> wrote:
>> Hi Ayan,
>> Thanks a lot for such a detailed response. I really appreciate it!
>> I think this use case can be generalized, because the data is immutable
>> and append-only. We only need to find one column or timestamp to track the
>> last row consumed in the previous ingestion. This pattern should be common
>> when storing sensor data. If the data is mutable, then the solution will be
>> surely difficult and vendor specific as you said.
>> The workflow you proposed is very useful. The difficulty part is how to
>> parallelize the ingestion task. With Spark when I have multiple workers
>> working on the same job, I don't know if there is a way and how to
>> dynamically change the row range each worker should process in realtime...
>> I tried to find out if there is any candidate available out of the box,
>> instead of reinventing the wheel. At this moment I have not discovered any
>> existing tool can parallelize ingestion tasks on one database. Is Sqoop a
>> proper candidate from your knowledge?
>> Thank you again and have a nice day.
>> Best regards,
>> Yang
>> 2016-12-30 8:28 GMT+01:00 ayan guha <>:
>>> "If data ingestion speed is faster than data production speed, then
>>> eventually the entire database will be harvested and those workers will
>>> start to "tail" the database for new data streams and the processing
>>> becomes real time."
>>> This part is really database dependent. So it will be hard to generalize
>>> it. For example, say you have a batch interval of 10 secs....what happens
>>> if you get more than one updates on the same row within 10 secs? You will
>>> get a snapshot of every 10 secs. Now, different databases provide different
>>> mechanisms to expose all DML changes, MySQL has binlogs, oracle has log
>>> shipping, cdc,golden gate and so on....typically it requires new product or
>>> new licenses and most likely new component installation on production db :)
>>> So, if we keep real CDC solutions out of scope, a simple snapshot
>>> solution can be achieved fairly easily by
>>> 1. Adding INSERTED_ON and UPDATED_ON columns on the source table(s).
>>> 2. Keeping a simple table level check pointing (TABLENAME,TS_MAX)
>>> 3. Running an extraction/load mechanism which will take data from DB
>>> (where INSERTED_ON > TS_MAX or UPDATED_ON>TS_MAX) and put to HDFS. This
>>> be sqoop,spark,ETL tool like informatica,ODI,SAP etc. In addition, you can
>>> directly write to Kafka as well. Sqoop, Spark supports Kafka. Most of the
>>> ETL tools would too...
>>> 4. Finally, update check point...
>>> You may "determine" checkpoint from the data you already have in HDFS if
>>> you create a Hive structure on it.
>>> Best
>>> AYan
>>> On Fri, Dec 30, 2016 at 4:45 PM, 任弘迪 <> wrote:
>>>> why not sync binlog of mysql(hopefully the data is immutable and the
>>>> table is append-only), send the log through kafka and then consume it by
>>>> spark streaming?
>>>> On Fri, Dec 30, 2016 at 9:01 AM, Michael Armbrust <
>>>>> wrote:
>>>>> We don't support this yet, but I've opened this JIRA as it sounds
>>>>> generally useful:
>>>>> In the mean time you could try implementing your own Source, but that
>>>>> is pretty low level and is not yet a stable API.
>>>>> On Thu, Dec 29, 2016 at 4:05 AM, "Yuanzhe Yang (杨远哲)" <
>>>>>> wrote:
>>>>>> Hi all,
>>>>>> Thanks a lot for your contributions to bring us new technologies.
>>>>>> I don't want to waste your time, so before I write to you, I googled,
>>>>>> checked stackoverflow and mailing list archive with keywords "streaming"
>>>>>> and "jdbc". But I was not able to get any solution to my use case.
I hope I
>>>>>> can get some clarification from you.
>>>>>> The use case is quite straightforward, I need to harvest a relational
>>>>>> database via jdbc, do something with data, and store result into
Kafka. I
>>>>>> am stuck at the first step, and the difficulty is as follows:
>>>>>> 1. The database is too large to ingest with one thread.
>>>>>> 2. The database is dynamic and time series data comes in constantly.
>>>>>> Then an ideal workflow is that multiple workers process partitions
>>>>>> data incrementally according to a time window. For example, the processing
>>>>>> starts from the earliest data with each batch containing data for
one hour.
>>>>>> If data ingestion speed is faster than data production speed, then
>>>>>> eventually the entire database will be harvested and those workers
>>>>>> start to "tail" the database for new data streams and the processing
>>>>>> becomes real time.
>>>>>> With Spark SQL I can ingest data from a JDBC source with partitions
>>>>>> divided by time windows, but how can I dynamically increment the
>>>>>> windows during execution? Assume that there are two workers ingesting
>>>>>> of 2017-01-01 and 2017-01-02, the one which finishes quicker gets
next task
>>>>>> for 2017-01-03. But I am not able to find out how to increment those
>>>>>> during execution.
>>>>>> Then I looked into Structured Streaming. It looks much more promising
>>>>>> because window operations based on event time are considered during
>>>>>> streaming, which could be the solution to my use case. However, from
>>>>>> documentation and code example I did not find anything related to
>>>>>> data from a growing database. Is there anything I can read to achieve
>>>>>> goal?
>>>>>> Any suggestion is highly appreciated. Thank you very much and have
>>>>>> nice day.
>>>>>> Best regards,
>>>>>> Yang
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail:
>>> --
>>> Best Regards,
>>> Ayan Guha
> --
> Best Regards,
> Ayan Guha

View raw message