spark-dev mailing list archives

From Holden Karau <>
Subject Re: handling of empty partitions
Date Mon, 09 Jan 2017 05:40:25 GMT
Hi Georg,

Thanks for the question along with the code (as well as posting to stack
overflow). In general, if a question is well suited for stackoverflow it's
probably better suited to the user@ list instead of the dev@ list, so I've
cc'd the user@ list for you.

As far as handling empty partitions when working with mapPartitions (and
similar), the general approach is to return an empty iterator of the
correct type when you have an empty input iterator.
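To make that pattern concrete, here is a plain-Python sketch (no Spark dependency; `partitions` is a stand-in for an RDD's partition structure, and `fill_forward_partition` is a hypothetical per-partition function, not code from the original post):

```python
def fill_forward_partition(it):
    """Per-partition fill-forward: replace None with the last seen value.

    When the input iterator is empty, the generator simply yields nothing,
    i.e. it returns an empty iterator of the correct element type --
    which is the key point for mapPartitions.
    """
    last = None
    for x in it:
        if x is None and last is not None:
            x = last
        if x is not None:
            last = x
        yield x

# Simulated mapPartitions over a partitioned dataset; one partition is empty.
partitions = [[1, None, 2], [], [None, 3]]
result = [list(fill_forward_partition(iter(p))) for p in partitions]
print(result)  # [[1, 1, 2], [], [None, 3]]
```

Note how the last partition still starts with a None: a leading missing value cannot be filled from within its own partition, which is exactly the cross-partition problem discussed below.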

It looks like your code is doing this; however, you likely have a bug in
your application logic. Namely, it assumes that if a partition has a record
missing a value, then either a previous row in the same partition has a
good value, OR the previous partition is non-empty and has a good row -
which need not be the case. You've partially fixed this problem by
collecting the last good value of each partition, and then, if you don't
have a good value at the start of a partition, looking it up in the
collected array.

However, if this happens at the same time the previous partition is
empty, you will need to keep looking back through earlier partitions
until you find one with a good value. (Note this assumes that the first
record in your dataset is valid; if it isn't, your code will still fail.)

Your solution is really close to working - it just makes some minor
assumptions which don't always hold.
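To sketch the "look back past empty partitions" idea, here is another plain-Python simulation (again no Spark; the function names are hypothetical, and the first pass stands in for the small collect() of per-partition last values the original code already does). The second pass seeds each partition by scanning backwards until it finds a partition whose last value is good:

```python
def last_good(partition):
    """Last non-None value in a partition, or None if there is none."""
    last = None
    for x in partition:
        if x is not None:
            last = x
    return last

def fill_forward(partitions):
    # Pass 1 (in Spark this would be a small collect()): the last good
    # value of each partition.
    lasts = [last_good(p) for p in partitions]

    result = []
    for i, p in enumerate(partitions):
        # Seed: walk backwards past empty (or all-None) partitions
        # until a good value is found.
        seed = None
        for j in range(i - 1, -1, -1):
            if lasts[j] is not None:
                seed = lasts[j]
                break
        out, last = [], seed
        for x in p:
            if x is None and last is not None:
                x = last
            if x is not None:
                last = x
            out.append(x)
        result.append(out)
    return result

# An empty partition AND an all-None partition sit between the good values.
partitions = [[1, None], [], [None, None], [None, 5]]
print(fill_forward(partitions))
# [[1, 1], [], [1, 1], [1, 5]]
```

The third and fourth partitions can only be filled because the backward scan skips past both the empty partition and the all-None partition, rather than stopping at the immediately preceding one.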


Holden :)

On Sun, Jan 8, 2017 at 8:30 PM, Liang-Chi Hsieh <> wrote:

> Hi Georg,
> Can you describe your question more clearly?
> Actually, the example code you posted on stackoverflow doesn't crash as
> you said in the post.
> geoHeil wrote
> > I am working on building a custom ML pipeline-model / estimator to impute
> > missing values, e.g. I want to fill with last good known value.
> > Using a window function is slow / will put the data into a single
> > partition.
> > I built some sample code using the RDD API; however, it has some None /
> > null problems with empty partitions.
> >
> > How should this be implemented properly to handle such empty partitions?
> >
> mappartitionswithindex-handling-empty-partitions
> >
> > Kind regards,
> > Georg
> -----
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> --
> View this message in context: http://apache-spark-
> partitions-tp20496p20515.html
> Sent from the Apache Spark Developers List mailing list archive at
> ---------------------------------------------------------------------
> To unsubscribe e-mail:

Cell : 425-233-8271
