From: Gene Pang
Date: Fri, 21 Apr 2017 14:30:31 -0700
Subject: Re: Spark structured streaming: Is it possible to periodically refresh static data frame?
To: Georg Heiler
Cc: Hemanth Gudela, Tathagata Das, user@spark.apache.org

Hi Georg,

Yes, that should be possible with Alluxio. Tachyon was renamed to Alluxio. This article on how Alluxio is used for a Spark streaming use case may be helpful.

Thanks,
Gene

On Fri, Apr 21, 2017 at 8:22 AM, Georg Heiler wrote:

You could write your views to Hive or maybe Tachyon.

Is the periodically updated data big?
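A minimal sketch of what "write your views to shared storage" could look like; the Alluxio path, JDBC options, and table names are placeholders, and the Alluxio client (or a Hive metastore, for the saveAsTable variant) is assumed to be on the classpath:

  import org.apache.spark.sql.{SaveMode, SparkSession}

  val spark = SparkSession.builder.appName("shared-view-sketch").getOrCreate()

  // A batch snapshot of the slow-moving table (connection details are placeholders).
  val snapshot = spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://dbhost/mydb")
    .option("dbtable", "reference_table")
    .load()

  // Publish the snapshot to Alluxio so other jobs can re-read the latest copy.
  snapshot.write.mode(SaveMode.Overwrite).parquet("alluxio://alluxio-master:19998/views/reference")

  // Or persist it as a Hive table instead:
  // snapshot.write.mode(SaveMode.Overwrite).saveAsTable("reference_view")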
Hemanth Gudela wrote on Fri, 21 Apr 2017 at 16:55:

Being new to Spark, I think I need your suggestion again.

"#2: you can always define a batch DataFrame and register it as a view, and then run a background thread that periodically creates a new DataFrame with updated data and re-registers it as a view with the same name."

I seem to have misunderstood your statement: I tried registering a static dataframe as a temp view ("myTempView") using createOrReplaceTempView in one Spark session, and then tried re-registering another, refreshed dataframe as a temp view with the same name ("myTempView") in another session. With this approach I failed to achieve what I am aiming for, because views are local to one Spark session.

From Spark 2.1.0 onwards, the global view is a nice feature, but it still would not solve my problem, because a global view cannot be updated.

So after much thinking, I understood that you meant a background process within the same Spark job, and the same Spark session, that periodically creates a new dataframe and re-registers the temp view under the same name.

Could you please give me some pointers to documentation on how to create such an asynchronous background process in Spark Streaming? Is Scala's "Futures" the way to achieve this?

Thanks,
Hemanth

From: Tathagata Das
Date: Friday, 21 April 2017 at 0.03
To: Hemanth Gudela
Cc: Georg Heiler, user@spark.apache.org
Subject: Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

Here are a couple of ideas.

1. You can set up a Structured Streaming query to update an in-memory table.

Look at the memory sink in the programming guide:
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks

You can query the latest table using the specified table name, and also join that table with another stream. However, note that this in-memory table is maintained in the driver, so you have to be careful about the size of the table.
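A minimal sketch of idea 1, assuming a Kafka source and Spark 2.1-era APIs; the bootstrap servers, topic, and table name are placeholders, and the Kafka source needs the spark-sql-kafka-0-10 package:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("memory-sink-sketch").getOrCreate()

  // Slow-moving reference data arriving as a stream (source details are placeholders).
  val updates = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host:9092")
    .option("subscribe", "reference-updates")
    .load()
    .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload")

  // The memory sink keeps the result as an in-memory table on the driver,
  // queryable by name: spark.sql("SELECT * FROM refData")
  val query = updates.writeStream
    .format("memory")
    .queryName("refData")
    .outputMode("append")
    .start()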
2. If you cannot define a streaming query on the slow-moving data, due to the unavailability of a connector for your data source, then you can always define a batch DataFrame and register it as a view, and then run a background thread that periodically creates a new DataFrame with updated data and re-registers it as a view with the same name. Any streaming query that joins a streaming dataframe with the view will automatically start using the most updated data as soon as the view is updated.
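A minimal sketch of idea 2, using a scheduled executor as the background thread; the JDBC options, refresh interval, and view name are placeholders, and a plain Thread or Scala Future would work as well:

  import java.util.concurrent.{Executors, TimeUnit}
  import org.apache.spark.sql.{DataFrame, SparkSession}

  val spark = SparkSession.builder.appName("view-refresh-sketch").getOrCreate()

  // Re-read the slow-moving table as a batch DataFrame (connection details are placeholders).
  def loadSnapshot(): DataFrame = spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://dbhost/mydb")
    .option("dbtable", "reference_table")
    .load()

  // Register the initial snapshot under a fixed view name.
  loadSnapshot().createOrReplaceTempView("myTempView")

  // Every 5 minutes, build a fresh DataFrame and re-register it under the
  // same name; per the suggestion above, streaming queries that join against
  // the view should pick up the new data once the view is replaced.
  val scheduler = Executors.newSingleThreadScheduledExecutor()
  scheduler.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = loadSnapshot().createOrReplaceTempView("myTempView")
  }, 5, 5, TimeUnit.MINUTES)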
Hope this helps.

On Thu, Apr 20, 2017 at 1:30 PM, Hemanth Gudela wrote:

Thanks Georg for your reply, but I am not sure I fully understood your answer.

If you meant joining two streams (one reading Kafka, and another reading a database table), then I think that is not possible, because:

1. According to the documentation, Structured Streaming does not support a database as a streaming source.
2. Joining between two streams is not possible yet.

Regards,
Hemanth
From: Georg Heiler
Date: Thursday, 20 April 2017 at 23.11
To: Hemanth Gudela, user@spark.apache.org
Subject: Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

What about treating the static data as a (slow) stream as well?
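One way to read this suggestion, given that there is no database streaming source: periodically export snapshots of the table to a directory and consume that directory with the file source, which Structured Streaming does support. The schema and export path below are placeholders:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.types.{StringType, StructType}

  val spark = SparkSession.builder.appName("slow-stream-sketch").getOrCreate()

  // The streaming file source requires an explicit schema.
  val schema = new StructType()
    .add("id", StringType)
    .add("payload", StringType)

  // Each periodic database export dropped into this directory arrives as a micro-batch.
  val slowStream = spark.readStream
    .schema(schema)
    .json("/data/reference-snapshots")

As noted above, joining two such streams was not supported at the time, which limits how far this approach goes.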

Hemanth Gudela wrote on Thu, 20 Apr 2017 at 22:09:

Hello,

I am working on a use case where I need to join a streaming data frame with a static data frame.

The streaming data frame continuously gets data from Kafka topics, whereas the static data frame fetches data from a database table.

However, as the underlying database table is updated often, I must somehow refresh my static data frame periodically to pick up the latest information from the table.

My questions:

1. Is it possible to periodically refresh the static data frame?
2. If refreshing the static data frame is not possible, is there a mechanism to automatically stop and restart the Spark Structured Streaming job, so that every time the job restarts, the static data frame is rebuilt with the latest information from the underlying database table?
3. If 1) and 2) are not possible, please suggest alternatives to achieve the requirement described above.

Thanks,
Hemanth
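For reference, a minimal sketch of the stream-static join the question starts from; the Kafka and JDBC details are placeholders, and the static side is read once, which is exactly why the refresh question arises:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("stream-static-join-sketch").getOrCreate()

  // Streaming side: events from Kafka (topic and servers are placeholders).
  val events = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host:9092")
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS event")

  // Static side: a one-time snapshot of the database table.
  val reference = spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://dbhost/mydb")
    .option("dbtable", "reference_table")
    .load()

  // Each micro-batch of events is joined against the snapshot read above;
  // without a refresh mechanism, the snapshot never changes.
  val joined = events.join(reference, Seq("id"))

  val query = joined.writeStream.format("console").start()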