spark-user mailing list archives

From Gene Pang <gene.p...@gmail.com>
Subject Re: Spark structured streaming: Is it possible to periodically refresh static data frame?
Date Fri, 21 Apr 2017 21:30:31 GMT
Hi Georg,

Yes, that should be possible with Alluxio. Tachyon was renamed to Alluxio.

This article on how Alluxio is used for a Spark streaming use case
<https://www.alluxio.com/blog/qunar-performs-real-time-data-analytics-up-to-300x-faster-with-alluxio>
may be helpful.

Thanks,
Gene

On Fri, Apr 21, 2017 at 8:22 AM, Georg Heiler <georg.kf.heiler@gmail.com>
wrote:

> You could write your views to Hive or maybe Tachyon.
>
> Is the periodically updated data big?
>
> Hemanth Gudela <hemanth.gudela@qvantel.com> wrote on Fri, 21 Apr 2017
> at 16:55:
>
>> Being new to Spark, I think I need your suggestion again.
>>
>>
>>
>> #2 you can always define a batch Dataframe and register it as a view, and
>> then run a background task that periodically creates a new Dataframe with
>> updated data and re-registers it as a view with the same name
>>
>>
>>
>> I seem to have misunderstood your statement. I registered a static
>> dataframe as a temp view (“myTempView”) using createOrReplaceTempView in one
>> Spark session, and then tried re-registering a refreshed dataframe as a temp
>> view with the same name (“myTempView”) in another session. This approach
>> did not achieve what I’m aiming for, because temp views are local to one
>> Spark session.
>>
>> From Spark 2.1.0 onwards, the global temp view is a nice feature, but it still
>> would not solve my problem, because a global view cannot be updated.
>>
>>
>>
>> So after much thinking, I now understand that you must have meant a
>> background process running in the same Spark job that periodically
>> creates a new dataframe and re-registers the temp view with the same name,
>> within the same Spark session.
>>
>> Could you please give me some pointers to documentation on how to create
>> such an asynchronous background process in Spark streaming? Are Scala’s
>> “Futures” the way to achieve this?
>>
>>
>>
>> Thanks,
>>
>> Hemanth
>>
>>
>>
>>
>>
>> *From: *Tathagata Das <tathagata.das1565@gmail.com>
>>
>>
>> *Date: *Friday, 21 April 2017 at 0.03
>> *To: *Hemanth Gudela <hemanth.gudela@qvantel.com>
>>
>> *Cc: *Georg Heiler <georg.kf.heiler@gmail.com>, "user@spark.apache.org" <
>> user@spark.apache.org>
>>
>>
>> *Subject: *Re: Spark structured streaming: Is it possible to
>> periodically refresh static data frame?
>>
>>
>>
>> Here are a couple of ideas.
>>
>> 1. You can set up a Structured Streaming query to update an in-memory table.
>>
>> Look at the memory sink in the programming guide -
>> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks
>>
>> So you can query the latest table using a specified table name, and also
>> join that table with another stream. However, note that this in-memory
>> table is maintained in the driver, so you have to be careful about the
>> size of the table.
>>
>>
>>
>> 2. If you cannot define a streaming query on the slow-moving data due to
>> the unavailability of a connector for your streaming data source, then you can
>> always define a batch Dataframe and register it as a view, and then run a
>> background task that periodically creates a new Dataframe with updated data and
>> re-registers it as a view with the same name. Any streaming query that
>> joins a streaming dataframe with the view will automatically start using
>> the most updated data as soon as the view is updated.
>>
>>
>>
>> Hope this helps.
>>
>>
>>
>>
>>
>> On Thu, Apr 20, 2017 at 1:30 PM, Hemanth Gudela <
>> hemanth.gudela@qvantel.com> wrote:
>>
>> Thanks Georg for your reply.
>>
>> But I’m not sure if I fully understood your answer.
>>
>>
>>
>> If you meant joining two streams (one reading Kafka, and another reading
>> a database table), then I think it’s not possible, because
>>
>> 1. According to the documentation
>> <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#data-sources>,
>> Structured Streaming does not support a database as a streaming source.
>>
>> 2. Joining two streams is not possible yet.
>>
>>
>>
>> Regards,
>>
>> Hemanth
>>
>>
>>
>> *From: *Georg Heiler <georg.kf.heiler@gmail.com>
>> *Date: *Thursday, 20 April 2017 at 23.11
>> *To: *Hemanth Gudela <hemanth.gudela@qvantel.com>, "user@spark.apache.org"
>> <user@spark.apache.org>
>> *Subject: *Re: Spark structured streaming: Is it possible to
>> periodically refresh static data frame?
>>
>>
>>
>> What about treating the static data as a (slow) stream as well?
>>
>>
>>
>> Hemanth Gudela <hemanth.gudela@qvantel.com> wrote on Thu, 20 Apr
>> 2017 at 22:09:
>>
>> Hello,
>>
>>
>>
>> I am working on a use case where I need to join a streaming data
>> frame with a static data frame.
>>
>> The streaming data frame continuously gets data from Kafka topics,
>> whereas the static data frame fetches data from a database table.
>>
>>
>>
>> However, as the underlying database table is updated often, I
>> must somehow refresh my static data frame periodically to get the
>> latest information from the underlying database table.
>>
>>
>>
>> My questions:
>>
>> 1. Is it possible to periodically refresh a static data frame?
>>
>> 2. If refreshing a static data frame is not possible, is there a
>> mechanism to automatically stop and restart the Spark Structured Streaming
>> job, so that every time the job restarts, the static data frame gets
>> updated with the latest information from the underlying database table?
>>
>> 3. If 1) and 2) are not possible, please suggest alternatives to
>> achieve my requirement described above.
>>
>>
>>
>> Thanks,
>>
>> Hemanth
>>
>>
>>
>
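
Idea 2 in the quoted thread (a background task that periodically rebuilds the batch DataFrame and re-registers the view under the same name) could be sketched roughly as below. This is a minimal Python sketch under stated assumptions, not code from the thread: `start_periodic_refresh`, `load`, and `register` are hypothetical names, and the Spark wiring in the trailing comment (JDBC URL, table name, connection properties, interval) is a placeholder. The refresh has to run in the same Spark session as the streaming query, since temp views are local to a session.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


def start_periodic_refresh(
    load: Callable[[], T],
    register: Callable[[T], None],
    interval_seconds: float,
) -> threading.Event:
    """Load and register once immediately, then repeat in a daemon thread.

    Returns an Event; call .set() on it to stop the background loop.
    (Hypothetical helper, not part of Spark.)
    """
    stop = threading.Event()

    # Initial registration, so the view exists before any query uses it.
    register(load())

    def loop() -> None:
        # Event.wait returns False on timeout (time to refresh again)
        # and True once stop.set() has been called.
        while not stop.wait(interval_seconds):
            try:
                register(load())
            except Exception as exc:
                # A failed refresh leaves the previously registered data in place.
                print(f"refresh failed, keeping previous data: {exc}")

    threading.Thread(target=loop, name="view-refresher", daemon=True).start()
    return stop


# Assumed wiring with PySpark (jdbc_url, "my_table", and props are placeholders):
#
# stop = start_periodic_refresh(
#     load=lambda: spark.read.jdbc(jdbc_url, "my_table", properties=props),
#     register=lambda df: df.createOrReplaceTempView("myTempView"),
#     interval_seconds=300,
# )
```

Keeping the load and register steps as plain callables keeps the scheduling logic independent of Spark, so it can be exercised with stubs. In Scala, a scheduled executor on the driver would play the same role as the thread here.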
