spark-user mailing list archives

From Tathagata Das <tathagata.das1...@gmail.com>
Subject Re: Spark structured streaming: periodically refresh static data frame
Date Thu, 15 Feb 2018 02:04:49 GMT
1. Just loop like this.


def startQuery(): StreamingQuery = {
   // Define the dataframes and start the query
}

// call this on main thread
while (notShutdown) {
   val query = startQuery()
   query.awaitTermination(refreshIntervalMs)
   query.stop()
   // refresh static data
}
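
A fuller version of the loop above might look like the following. This is only a sketch under stated assumptions, not code from the thread: the paths, the `key` column, the Parquet sink, and the object name are hypothetical, and the built-in "rate" source stands in for the real stream (it is assumed to be available, which as far as I know means Spark 2.3 or later). The static data is re-read on every restart and the same checkpoint location is reused, as described further down the thread.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.StreamingQuery

object RefreshStaticJoin {
  // Hypothetical values, chosen only for illustration.
  val staticPath = "/tmp/static-data"            // Parquet directory with a bigint `key` column
  val outputPath = "/tmp/joined-output"
  val checkpointDir = "/tmp/refresh-checkpoint"
  val refreshIntervalMs = 60 * 1000L

  // Set this to false from another thread to leave the loop.
  @volatile var notShutdown = true

  def startQuery(spark: SparkSession): StreamingQuery = {
    // Re-read the static data so this run of the query sees its latest contents.
    val staticDf: DataFrame = spark.read.parquet(staticPath)

    // The "rate" source emits (timestamp, value) rows; derive a join key from `value`.
    val streamDf = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()
      .selectExpr("timestamp", "value % 10 AS key")

    streamDf.join(staticDf, "key")
      .writeStream
      .format("parquet")
      .option("path", outputPath)
      .option("checkpointLocation", checkpointDir) // same location on every restart
      .start()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("refresh-static-join").getOrCreate()

    // Runs on the main thread, which keeps the application alive.
    while (notShutdown) {
      val query = startQuery(spark)
      // Returns once the timeout elapses, or earlier if the query terminates.
      query.awaitTermination(refreshIntervalMs)
      query.stop()
    }
    spark.stop()
  }
}
```

Because the loop itself blocks the main thread, no separate awaitTermination() call without a timeout is needed to keep the application running.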


2. Yes, stream-stream joins are in 2.3.0, which will be released soon. RC3 is
available if you want to test it right now -
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc3-bin/.
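
For reference, a stream-stream join in 2.3.0 might be sketched as below. This is a hedged illustration rather than code from the thread: both inputs use the built-in "rate" source, and the column names, watermark delays, and time-range condition are assumptions picked for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

object StreamStreamJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("stream-stream-join").getOrCreate()

    // Hypothetical helper: a rate stream whose columns carry a distinct prefix,
    // so the two sides of the join have no ambiguous column names.
    def rateStream(prefix: String): DataFrame =
      spark.readStream
        .format("rate")
        .option("rowsPerSecond", "5")
        .load()
        .selectExpr(s"value AS ${prefix}Id", s"timestamp AS ${prefix}Time")

    // Watermarks let Spark bound the state it keeps for each side.
    val left = rateStream("left").withWatermark("leftTime", "10 seconds")
    val right = rateStream("right").withWatermark("rightTime", "20 seconds")

    // Inner join on the id, restricted to a one-minute event-time window.
    val joined = left.join(
      right,
      expr("""
        leftId = rightId AND
        rightTime >= leftTime AND
        rightTime <= leftTime + interval 1 minute
      """))

    joined.writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}
```

With this approach the reference data arrives as a second stream, so no periodic restart of the query is needed to see new records.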



On Wed, Feb 14, 2018 at 3:34 AM, Appu K <kutt4n@gmail.com> wrote:

> TD,
>
> Thanks a lot for the quick reply :)
>
>
> Did I understand correctly that, in the main thread, I won't be able to use
> outStream.awaitTermination() to wait for termination, since I'll be
> stopping the query from inside another thread?
>
> What would be a good approach to keep the main app long running if I’ve to
> restart queries?
>
> Should I just wait for 2.3, where I'll be able to join two structured
> streams (if the release is just a few weeks away)?
>
> Appreciate all the help!
>
> thanks
> App
>
>
>
> On 14 February 2018 at 4:41:52 PM, Tathagata Das (
> tathagata.das1565@gmail.com) wrote:
>
> Let me fix my mistake :)
> What I suggested in that earlier thread does not work. A streaming query
> that joins a streaming dataset with a batch view does not pick up updates
> to the view. The updated data is used only when you restart the query. That
> is,
> - stop the query
> - recreate the dataframes,
> - start the query on the new dataframe using the same checkpoint location
> as the previous query
>
> Note that you don't need to restart the whole process/cluster/application,
> just restart the query in the same process/cluster/application. This should
> be very fast (within a few seconds). So, unless you have latency SLAs of 1
> second, you can periodically restart the query without restarting the
> process.
>
> Apologies for my misdirections in that earlier thread. Hope this helps.
>
> TD
>
> On Wed, Feb 14, 2018 at 2:57 AM, Appu K <kutt4n@gmail.com> wrote:
>
>> More specifically,
>>
>> Quoting TD from the previous thread
>> "Any streaming query that joins a streaming dataframe with the view will
>> automatically start using the most updated data as soon as the view is
>> updated"
>>
>> Wondering if I’m doing something wrong in
>> https://gist.github.com/anonymous/90dac8efadca3a69571e619943ddb2f6
>>
>> My streaming dataframe is not using the updated data, even though the
>> view is updated!
>>
>> Thank you
>>
>>
>> On 14 February 2018 at 2:54:48 PM, Appu K (kutt4n@gmail.com) wrote:
>>
>> Hi,
>>
>> I had followed the instructions from the thread
>> https://mail-archives.apache.org/mod_mbox/spark-user/201704.mbox/%3CD1315D33-41CD-4BA3-8B77-0879F3669352@qvantel.com%3E
>> while trying to periodically reload a static data frame that gets joined
>> to a structured streaming query.
>>
>> However, the streaming query results do not reflect the data from the
>> refreshed static data frame.
>>
>> Code is here:
>> https://gist.github.com/anonymous/90dac8efadca3a69571e619943ddb2f6
>>
>> I’m using Spark 2.2.1. Any pointers would be very helpful.
>>
>> Thanks a lot
>>
>> Appu
>>
>>
>
