spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ayan guha <>
Subject Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?
Date Tue, 03 Jan 2017 17:14:00 GMT
Ahh I see what you mean....I confused two terminologies....because we were
talking about partitioning and then changed topic to identify changed data

For that, you can "construct" a dbtable as an inline view -

viewSQL = "(select * from table where <column> >
dbtable =viewSQL

refer to this

So, in summary, you have 2 things

1. Identify changed data - my suggestion to use dbtable with inline view
2. parallelism - use numPartition,lowerbound,upper bound to generate number
of partitions


On Wed, Jan 4, 2017 at 3:46 AM, Yuanzhe Yang <> wrote:

> Hi Ayan,
> Yeah, I understand your proposal, but according to here
> guide.html#jdbc-to-other-databases, it says
> Notice that lowerBound and upperBound are just used to decide the
> partition stride, not for filtering the rows in table. So all rows in the
> table will be partitioned and returned. This option applies only to reading.
> So my interpretation is all rows in the table are ingested, and this
> "lowerBound" and "upperBound" is the span of each partition. Well, I am not
> a native English speaker, maybe it means differently?
> Best regards,
> Yang
> 2017-01-03 17:23 GMT+01:00 ayan guha <>:
>> Hi
>> You need to store and capture the Max of the column you intend to use for
>> identifying new records (Ex: INSERTED_ON) after every successful run of
>> your job. Then, use the value in lowerBound option.
>> Essentially, you want to create a query like
>> select * from table where INSERTED_ON > lowerBound and
>> INSERTED_ON<upperBound
>> everytime you run the job....
>> On Wed, Jan 4, 2017 at 2:13 AM, Yuanzhe Yang <> wrote:
>>> Hi Ayan,
>>> Thanks a lot for your suggestion. I am currently looking into sqoop.
>>> Concerning your suggestion for Spark, it is indeed parallelized with
>>> multiple workers, but the job is one-off and cannot keep streaming.
>>> Moreover, I cannot specify any "start row" in the job, it will always
>>> ingest the entire table. So I also cannot simulate a streaming process by
>>> starting the job in fix intervals...
>>> Best regards,
>>> Yang
>>> 2017-01-03 15:06 GMT+01:00 ayan guha <>:
>>>> Hi
>>>> While the solutions provided by others looks promising and I'd like to
>>>> try out few of them, our old pal sqoop already "does" the job. It has a
>>>> incremental mode where you can provide a --check-column and
>>>> --last-modified-value combination to grab the data - and yes, sqoop
>>>> essentially does it by running a MAP-only job which spawns number of
>>>> parallel map task to grab data from DB.
>>>> In Spark, you can use sqlContext.load function for JDBC and use
>>>> partitionColumn and numPartition to define parallelism of connection.
>>>> Best
>>>> Ayan
>>>> On Tue, Jan 3, 2017 at 10:49 PM, Yuanzhe Yang <>
>>>> wrote:
>>>>> Hi Ayan,
>>>>> Thanks a lot for such a detailed response. I really appreciate it!
>>>>> I think this use case can be generalized, because the data is
>>>>> immutable and append-only. We only need to find one column or timestamp
>>>>> track the last row consumed in the previous ingestion. This pattern should
>>>>> be common when storing sensor data. If the data is mutable, then the
>>>>> solution will be surely difficult and vendor specific as you said.
>>>>> The workflow you proposed is very useful. The difficulty part is how
>>>>> to parallelize the ingestion task. With Spark when I have multiple workers
>>>>> working on the same job, I don't know if there is a way and how to
>>>>> dynamically change the row range each worker should process in realtime...
>>>>> I tried to find out if there is any candidate available out of the
>>>>> box, instead of reinventing the wheel. At this moment I have not discovered
>>>>> any existing tool can parallelize ingestion tasks on one database. Is
>>>>> a proper candidate from your knowledge?
>>>>> Thank you again and have a nice day.
>>>>> Best regards,
>>>>> Yang
>>>>> 2016-12-30 8:28 GMT+01:00 ayan guha <>:
>>>>>> "If data ingestion speed is faster than data production speed, then
>>>>>> eventually the entire database will be harvested and those workers
>>>>>> start to "tail" the database for new data streams and the processing
>>>>>> becomes real time."
>>>>>> This part is really database dependent. So it will be hard to
>>>>>> generalize it. For example, say you have a batch interval of 10
>>>>>> secs....what happens if you get more than one updates on the same
>>>>>> within 10 secs? You will get a snapshot of every 10 secs. Now, different
>>>>>> databases provide different mechanisms to expose all DML changes,
MySQL has
>>>>>> binlogs, oracle has log shipping, cdc,golden gate and so on....typically
>>>>>> requires new product or new licenses and most likely new component
>>>>>> installation on production db :)
>>>>>> So, if we keep real CDC solutions out of scope, a simple snapshot
>>>>>> solution can be achieved fairly easily by
>>>>>> 1. Adding INSERTED_ON and UPDATED_ON columns on the source table(s).
>>>>>> 2. Keeping a simple table level check pointing (TABLENAME,TS_MAX)
>>>>>> 3. Running an extraction/load mechanism which will take data from
>>>>>> (where INSERTED_ON > TS_MAX or UPDATED_ON>TS_MAX) and put to
HDFS. This can
>>>>>> be sqoop,spark,ETL tool like informatica,ODI,SAP etc. In addition,
you can
>>>>>> directly write to Kafka as well. Sqoop, Spark supports Kafka. Most
of the
>>>>>> ETL tools would too...
>>>>>> 4. Finally, update check point...
>>>>>> You may "determine" checkpoint from the data you already have in
>>>>>> if you create a Hive structure on it.
>>>>>> Best
>>>>>> AYan
>>>>>> On Fri, Dec 30, 2016 at 4:45 PM, 任弘迪 <>
>>>>>>> why not sync binlog of mysql(hopefully the data is immutable
and the
>>>>>>> table is append-only), send the log through kafka and then consume
it by
>>>>>>> spark streaming?
>>>>>>> On Fri, Dec 30, 2016 at 9:01 AM, Michael Armbrust <
>>>>>>>> wrote:
>>>>>>>> We don't support this yet, but I've opened this JIRA as it
>>>>>>>> generally useful:
>>>>>>>> In the mean time you could try implementing your own Source,
>>>>>>>> that is pretty low level and is not yet a stable API.
>>>>>>>> On Thu, Dec 29, 2016 at 4:05 AM, "Yuanzhe Yang (杨远哲)"
>>>>>>>>> wrote:
>>>>>>>>> Hi all,
>>>>>>>>> Thanks a lot for your contributions to bring us new technologies.
>>>>>>>>> I don't want to waste your time, so before I write to
you, I
>>>>>>>>> googled, checked stackoverflow and mailing list archive
with keywords
>>>>>>>>> "streaming" and "jdbc". But I was not able to get any
solution to my use
>>>>>>>>> case. I hope I can get some clarification from you.
>>>>>>>>> The use case is quite straightforward, I need to harvest
>>>>>>>>> relational database via jdbc, do something with data,
and store result into
>>>>>>>>> Kafka. I am stuck at the first step, and the difficulty
is as follows:
>>>>>>>>> 1. The database is too large to ingest with one thread.
>>>>>>>>> 2. The database is dynamic and time series data comes
>>>>>>>>> constantly.
>>>>>>>>> Then an ideal workflow is that multiple workers process
>>>>>>>>> of data incrementally according to a time window. For
example, the
>>>>>>>>> processing starts from the earliest data with each batch
containing data
>>>>>>>>> for one hour. If data ingestion speed is faster than
data production speed,
>>>>>>>>> then eventually the entire database will be harvested
and those workers
>>>>>>>>> will start to "tail" the database for new data streams
and the processing
>>>>>>>>> becomes real time.
>>>>>>>>> With Spark SQL I can ingest data from a JDBC source with
>>>>>>>>> partitions divided by time windows, but how can I dynamically
increment the
>>>>>>>>> time windows during execution? Assume that there are
two workers ingesting
>>>>>>>>> data of 2017-01-01 and 2017-01-02, the one which finishes
quicker gets next
>>>>>>>>> task for 2017-01-03. But I am not able to find out how
to increment those
>>>>>>>>> values during execution.
>>>>>>>>> Then I looked into Structured Streaming. It looks much
>>>>>>>>> promising because window operations based on event time
are considered
>>>>>>>>> during streaming, which could be the solution to my use
case. However, from
>>>>>>>>> documentation and code example I did not find anything
related to streaming
>>>>>>>>> data from a growing database. Is there anything I can
read to achieve my
>>>>>>>>> goal?
>>>>>>>>> Any suggestion is highly appreciated. Thank you very
much and have
>>>>>>>>> a nice day.
>>>>>>>>> Best regards,
>>>>>>>>> Yang
>>>>>>>>> ------------------------------------------------------------
>>>>>>>>> ---------
>>>>>>>>> To unsubscribe e-mail:
>>>>>> --
>>>>>> Best Regards,
>>>>>> Ayan Guha
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>> --
>> Best Regards,
>> Ayan Guha

Best Regards,
Ayan Guha

View raw message