spark-dev mailing list archives

From Evan Chan ...@ooyala.com>
Subject Re: SPARK-942
Date Thu, 14 Nov 2013 17:48:07 GMT
+1 for IteratorWithSizeEstimate.

I believe that today only HadoopRDDs can report fine-grained progress.
With an enhanced iterator interface (which can still expose the base
Iterator trait), we could extend fine-grained progress reporting to
every RDD that implements the enhanced iterator.

On Tue, Nov 12, 2013 at 11:07 AM, Stephen Haberman
<stephen.haberman@gmail.com> wrote:
>
>> The problem is that the iterator interface only defines 'hasNext' and
>> 'next' methods.
>
> Just a comment from the peanut gallery, but FWIW it seems like being
> able to ask "how much data is here" would be a useful thing for Spark
> to know, even if that means moving away from Iterator itself, or
> something like IteratorWithSizeEstimate/something/something.
>
> Not only for this, but so that, ideally, Spark could basically do
> dynamic partitioning.
>
> E.g. when we load a month's worth of data, it's X GB, but after a few
> maps and filters, it's X/100 GB, so it could use X/100 partitions instead.
>
> But right now all partitioning decisions are made up-front,
> via .coalesce/etc. type hints from the programmer, and it seems that if
> Spark could delay making partitioning decisions until each RDD could
> lazily eval/sample a few lines (hand waving), that would be super
> sexy from our perspective, in terms of doing automatic perf/partition
> optimization.
>
> Huge disclaimer that this is probably a big pita to implement, and
> could likely not be as worthwhile as I naively think it would be.
>
> - Stephen
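
As a rough illustration of the sizing arithmetic behind the dynamic-partitioning idea, the sketch below derives a partition count from a size estimate. The function `suggestedPartitions` and its parameters are hypothetical and not part of Spark; it assumes some estimate of the data size in bytes is already available, which is exactly the piece the plain Iterator interface does not provide today.

```scala
// Hypothetical helper: not a Spark API. Picks a partition count so that
// each partition holds roughly targetBytesPerPartition bytes.
def suggestedPartitions(
    estimatedBytes: Long,
    targetBytesPerPartition: Long,
    minPartitions: Int = 1): Int = {
  require(estimatedBytes >= 0L, "size estimate must be non-negative")
  require(targetBytesPerPartition > 0L, "target size must be positive")
  require(minPartitions >= 1, "need at least one partition")

  // Split first so the ceiling division cannot overflow for huge estimates.
  val whole = estimatedBytes / targetBytesPerPartition
  val needed = if (estimatedBytes % targetBytesPerPartition == 0L) whole else whole + 1L

  // Clamp into the Int range and respect the requested minimum.
  math.max(minPartitions.toLong, math.min(needed, Int.MaxValue.toLong)).toInt
}

val mb = 1024L * 1024L
val gb = 1024L * mb

// 100 GB loaded, then filtered down to 1/100 of that (1 GB).
val beforeFilter = suggestedPartitions(100L * gb, 128L * mb)
val afterFilter  = suggestedPartitions(1L * gb, 128L * mb)
```

With a 128 MB target, the two calls give 800 and 8 partitions, matching the X versus X/100 ratio described above. The resulting number is the kind of value a programmer currently passes by hand to `coalesce`.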



-- 
Evan Chan
Staff Engineer
ev@ooyala.com
