Subject: Re: question on collect_list or say aggregations in general in structured streaming 2.3.0
From: Arun Mahadevan
Date: Thu, 03 May 2018 09:55:06 -0700
To: kant kodali, "user @spark" <user@spark.apache.org>

I think you need to group by a (tumbling) window and define a watermark (set a very low watermark, or even 0) to discard the state. Here the window duration becomes your logical batch.

- Arun

From: kant kodali
Date: Thursday, May 3, 2018 at 1:52 AM
To: "user @spark" <user@spark.apache.org>
Subject: Re: question on collect_list or say aggregations in general in structured streaming 2.3.0

After doing some more research on Google, it's clear that aggregations are stateful by default in Structured Streaming. So the question now is: how do I do stateless aggregations (i.e., not carrying results over from previous batches) with Structured Streaming 2.3.0? I am trying to do it with raw Spark SQL, so without flatMapGroupsWithState. And if that is not possible, is it fair to say there is no declarative way to do stateless aggregations?

On Thu, May 3, 2018 at 1:24 AM, kant kodali wrote:

Hi All,

I was under the assumption that one needs to run groupBy(window(...)) to run any stateful operation, but it looks like that is not the case, since an aggregation query like "select count(*) from some_view" is also stateful: it keeps the count from previous batches. Likewise, if I run "select collect_list(*) from some_view" with, say, maxOffsetsPerTrigger set to 1, I can see rows from previous batches at every trigger. So is it fair to say aggregations are stateful by default? I am looking for something more like the DStream (stateless) approach, where I collect a bunch of records in each batch, do some aggregation such as a count, emit the result, and in the next batch count only that batch's records, not the previous batch's.
So if I run "select collect_list(*) from some_view", I want to collect whatever rows arrive in each batch/trigger, but not rows from previous batches. How do I do that? Thanks!
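Below is a minimal sketch (in Scala, using the DataFrame API) of the tumbling-window approach Arun describes, assuming a Kafka source with an event-time column named "timestamp" and a string payload column named "value". The topic, broker address, column names, and window duration are illustrative, not from the thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{window, collect_list}

val spark = SparkSession.builder.appName("per-window-collect").getOrCreate()
import spark.implicits._

// Illustrative Kafka source; any streaming source with an event-time column works.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // illustrative
  .option("subscribe", "some_topic")                   // illustrative
  .load()
  .selectExpr("CAST(value AS STRING) AS value", "timestamp")

// Zero-delay watermark: the state for a window is dropped as soon as the
// watermark passes the end of that window, so each window's collect_list
// contains only that window's rows.
val perWindow = input
  .withWatermark("timestamp", "0 seconds")
  .groupBy(window($"timestamp", "10 seconds")) // the window duration is the "logical batch"
  .agg(collect_list($"value").as("rows"))

val query = perWindow.writeStream
  .outputMode("append") // each window is emitted once, after the watermark passes its end
  .format("console")
  .start()
```

With the zero watermark, each window behaves like an independent batch, which approximates the per-batch, DStream-style behaviour asked about here; the trade-off is that any rows arriving after their window has closed are dropped.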