spark-dev mailing list archives

From Hemant Bhanawat <hemant9...@gmail.com>
Subject Re: Sorting on a streaming dataframe
Date Fri, 13 Apr 2018 06:42:59 GMT
Well, we want to assign snapshot ids (incrementing counters) to the
incoming records. To do that, we zip the streaming RDDs with the counter
using a modified version of ZippedWithIndexRDD. We are OK if the records
in the streaming dataframe get counters in random order, but the counter
should always be incrementing.
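
To illustrate the zipping step described above, here is a minimal sketch in
plain Python, not Spark code. The function name `zip_with_snapshot_ids`, the
list-of-lists model of partitions, and the starting value 100 are all
hypothetical; the actual modified ZippedWithIndexRDD is not shown in this
thread and may work differently.

```python
def zip_with_snapshot_ids(partitions, start):
    """Pair every record with an incrementing snapshot id, beginning at `start`.

    `partitions` is a list of lists standing in for the partitions of one
    micro-batch (an assumption made for this sketch).
    Returns the zipped partitions and the next unused counter value.
    """
    zipped = []
    next_id = start
    for partition in partitions:
        # Ids inside a partition continue from where the previous one stopped.
        zipped.append([(record, next_id + i) for i, record in enumerate(partition)])
        next_id += len(partition)
    return zipped, next_id


batch = [["a", "b"], ["c"], ["d", "e"]]
zipped, next_id = zip_with_snapshot_ids(batch, start=100)
```

Carrying `next_id` forward into the following batch is what keeps the counter
incrementing across batches in this sketch.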

This works fine until we have a failure. After a failure, we re-assign
snapshot ids to the records, and this time the same snapshot id can get
assigned to a different record. This is a problem because the primary
key in our storage engine is <recordid, snapshotid>. So we want to sort the
dataframe so that each record always gets the same snapshot id.
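
The failure scenario above can be sketched in plain Python (again, not Spark
code). The helper `assign_snapshot_ids` and the sample record ids are
hypothetical. The sketch assumes that a replayed batch contains exactly the
same records as the first attempt and that record ids are unique and sortable;
neither assumption is confirmed in this thread.

```python
def assign_snapshot_ids(records, start, sort_key=None):
    """Map each record id to a snapshot id, optionally sorting first."""
    ordered = sorted(records, key=sort_key) if sort_key is not None else list(records)
    return {record: start + i for i, record in enumerate(ordered)}


first_attempt = ["r3", "r1", "r2"]
replay = ["r2", "r3", "r1"]  # same records, different order after a failure

# Without sorting, the mapping depends on arrival order, so the replay
# pairs the same snapshot id with a different record.
unsorted_first = assign_snapshot_ids(first_attempt, 0)
unsorted_replay = assign_snapshot_ids(replay, 0)

# Sorting by record id before zipping makes the mapping repeatable.
sorted_first = assign_snapshot_ids(first_attempt, 0, sort_key=lambda r: r)
sorted_replay = assign_snapshot_ids(replay, 0, sort_key=lambda r: r)
```

Here the unsorted mappings disagree between the two attempts, while the sorted
ones are identical, which is the property needed for a stable
<recordid, snapshotid> key.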



On Fri, Apr 13, 2018 at 11:43 AM, Reynold Xin <rxin@databricks.com> wrote:

> Can you describe your use case more?
>
> On Thu, Apr 12, 2018 at 11:12 PM Hemant Bhanawat <hemant9379@gmail.com>
> wrote:
>
>> Hi Guys,
>>
>> Why is sorting on streaming dataframes not supported (unless it is in
>> complete mode)? My downstream needs me to sort the streaming dataframe.
>>
>> Hemant
>>
>
