spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ayan guha <guha.a...@gmail.com>
Subject Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?
Date Fri, 30 Dec 2016 07:28:22 GMT
"If data ingestion speed is faster than data production speed, then
eventually the entire database will be harvested and those workers will
start to "tail" the database for new data streams and the processing
becomes real time."

This part is really database dependent. So it will be hard to generalize
it. For example, say you have a batch interval of 10 secs....what happens
if you get more than one updates on the same row within 10 secs? You will
get a snapshot of every 10 secs. Now, different databases provide different
mechanisms to expose all DML changes, MySQL has binlogs, oracle has log
shipping, cdc,golden gate and so on....typically it requires new product or
new licenses and most likely new component installation on production db :)

So, if we keep real CDC solutions out of scope, a simple snapshot solution
can be achieved fairly easily by

1. Adding INSERTED_ON and UPDATED_ON columns on the source table(s).
2. Keeping a simple table level check pointing (TABLENAME,TS_MAX)
3. Running an extraction/load mechanism which will take data from DB (where
INSERTED_ON > TS_MAX or UPDATED_ON>TS_MAX) and put to HDFS. This can be
sqoop,spark,ETL tool like informatica,ODI,SAP etc. In addition, you can
directly write to Kafka as well. Sqoop, Spark supports Kafka. Most of the
ETL tools would too...
4. Finally, update check point...

You may "determine" checkpoint from the data you already have in HDFS if
you create a Hive structure on it.

Best
AYan



On Fri, Dec 30, 2016 at 4:45 PM, 任弘迪 <ryan.hd.ren@gmail.com> wrote:

> why not sync binlog of mysql(hopefully the data is immutable and the table
> is append-only), send the log through kafka and then consume it by spark
> streaming?
>
> On Fri, Dec 30, 2016 at 9:01 AM, Michael Armbrust <michael@databricks.com>
> wrote:
>
>> We don't support this yet, but I've opened this JIRA as it sounds
>> generally useful: https://issues.apache.org/jira/browse/SPARK-19031
>>
>> In the mean time you could try implementing your own Source, but that is
>> pretty low level and is not yet a stable API.
>>
>> On Thu, Dec 29, 2016 at 4:05 AM, "Yuanzhe Yang (杨远哲)" <yyz1989@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> Thanks a lot for your contributions to bring us new technologies.
>>>
>>> I don't want to waste your time, so before I write to you, I googled,
>>> checked stackoverflow and mailing list archive with keywords "streaming"
>>> and "jdbc". But I was not able to get any solution to my use case. I hope I
>>> can get some clarification from you.
>>>
>>> The use case is quite straightforward, I need to harvest a relational
>>> database via jdbc, do something with data, and store result into Kafka. I
>>> am stuck at the first step, and the difficulty is as follows:
>>>
>>> 1. The database is too large to ingest with one thread.
>>> 2. The database is dynamic and time series data comes in constantly.
>>>
>>> Then an ideal workflow is that multiple workers process partitions of
>>> data incrementally according to a time window. For example, the processing
>>> starts from the earliest data with each batch containing data for one hour.
>>> If data ingestion speed is faster than data production speed, then
>>> eventually the entire database will be harvested and those workers will
>>> start to "tail" the database for new data streams and the processing
>>> becomes real time.
>>>
>>> With Spark SQL I can ingest data from a JDBC source with partitions
>>> divided by time windows, but how can I dynamically increment the time
>>> windows during execution? Assume that there are two workers ingesting data
>>> of 2017-01-01 and 2017-01-02, the one which finishes quicker gets next task
>>> for 2017-01-03. But I am not able to find out how to increment those values
>>> during execution.
>>>
>>> Then I looked into Structured Streaming. It looks much more promising
>>> because window operations based on event time are considered during
>>> streaming, which could be the solution to my use case. However, from
>>> documentation and code example I did not find anything related to streaming
>>> data from a growing database. Is there anything I can read to achieve my
>>> goal?
>>>
>>> Any suggestion is highly appreciated. Thank you very much and have a
>>> nice day.
>>>
>>> Best regards,
>>> Yang
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>
>>>
>>
>


-- 
Best Regards,
Ayan Guha

Mime
View raw message