spark-user mailing list archives

From 张万新 <kevinzwx1...@gmail.com>
Subject Re: [Structured Streaming]Data processing and output trigger should be decoupled
Date Thu, 31 Aug 2017 17:21:07 GMT
I think something like a state store can be used to keep the intermediate
data. For aggregations, the engine keeps processing batches of data and
updating the results in the state store (or a similar structure), and when a
trigger fires the engine just fetches the current result from the state
store and outputs it to the sink specified by the user.
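A minimal sketch of this idea (in Python rather than Spark's Scala, and not Spark's actual state-store implementation; the `StateStore` class and its methods are hypothetical): the engine folds each micro-batch into a running aggregate as it arrives, and a trigger only snapshots the current state for the sink, paying no processing cost at trigger time.

```python
from collections import defaultdict

class StateStore:
    """Hypothetical running-aggregate store; not Spark's internal API."""

    def __init__(self):
        self.counts = defaultdict(int)

    def update(self, batch):
        # Continuous path: fold each micro-batch into the running state
        # as soon as it arrives, independent of any trigger.
        for key in batch:
            self.counts[key] += 1

    def snapshot(self):
        # Trigger path: just read the current result; no extra
        # processing happens at trigger time.
        return dict(self.counts)

store = StateStore()
store.update(["a", "b", "a"])  # batches processed as they arrive
store.update(["b", "c"])

sink = store.snapshot()        # at trigger time, output is immediate
```

The point of the sketch is only the separation of concerns: `update` runs continuously, `snapshot` runs on the trigger schedule.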

Alternatively, if the processing time is shorter than the trigger interval,
could the engine first complete most of the jobs or stages during the
interval, and then, when the trigger fires, run only the final job or
stages to produce the result and output it to the sink?
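A toy timing model of this alternative (an assumption for illustration, not Spark's actual scheduling; the function and cost figures are hypothetical): if upstream stages are pre-computed eagerly during the interval, only the final stage's latency is paid after the trigger fires.

```python
TRIGGER_INTERVAL = 600  # 10-minute trigger, in seconds
UPSTREAM_COST = 12      # stages that could be pre-computed before the trigger
FINAL_STAGE_COST = 3    # work that must wait until the trigger fires

def output_delay(pipelined: bool) -> int:
    """Seconds past the trigger boundary until results reach the sink."""
    if pipelined:
        # Upstream work already finished during the interval.
        return FINAL_STAGE_COST
    # Otherwise all processing starts only when the trigger fires.
    return UPSTREAM_COST + FINAL_STAGE_COST
```

Under these made-up costs, the non-pipelined engine delivers results 15s after each trigger, while the pipelined variant delivers them after only 3s, which is the gap the proposal is about.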

Shixiong(Ryan) Zhu <shixiong@databricks.com> wrote on Thu, Aug 31, 2017 at 1:59 AM:

> I don't think that's a good idea. If the engine keeps on processing data
> but doesn't output anything, where would the intermediate data be kept?
>
> On Wed, Aug 30, 2017 at 9:26 AM, KevinZwx <kevinzwx1992@gmail.com> wrote:
>
>> Hi,
>>
>> I'm working with structured streaming, and I'm wondering whether there
>> should be some improvements about trigger.
>>
>> Currently, when I specify a trigger, e.g.
>> trigger(Trigger.ProcessingTime("10 minutes")), the engine begins
>> processing data at the times the trigger fires: 10:00:00, 10:10:00,
>> 10:20:00, etc. If the engine takes 10s to process a batch, we get the
>> output result at 10:00:10, and then the engine just waits without
>> processing any data. When the next trigger fires, the engine starts to
>> process the data that arrived during the interval, and if this time it
>> takes 15s to process the batch, we get the result at 10:10:15. This is
>> the problem.
>>
>> In my understanding, the trigger and data processing should be decoupled:
>> the engine should keep processing data as fast as possible but only
>> generate output results at each trigger, so that we can get the results
>> at exactly 10:00:00, 10:10:00, 10:20:00, and so on. Is there any existing
>> solution, or a plan to work on this?
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>>
>
