spark-dev mailing list archives

From Stephen Haberman <stephen.haber...@gmail.com>
Subject Re: SPARK-942
Date Tue, 12 Nov 2013 19:07:26 GMT

> The problem is that the iterator interface only defines 'hasNext' and
> 'next' methods.

Just a comment from the peanut gallery, but FWIW it seems like being
able to ask "how much data is here" would be a useful thing for Spark
to know, even if that means moving away from Iterator itself, or
adding something like an IteratorWithSizeEstimate.
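
To make the hand waving slightly more concrete, here is a rough sketch of what a size-aware iterator could look like. It is written in plain Python rather than against Spark's actual interfaces, and everything in it is hypothetical: the class name, the counters, and the use of sys.getsizeof (a shallow, approximate measure) as the size estimate.

```python
import sys


class SizeEstimatingIterator:
    """Hypothetical wrapper: a normal iterator that also tracks how much
    data has passed through it. Not a Spark API."""

    def __init__(self, source):
        self._source = iter(source)
        self.records_seen = 0
        self.bytes_seen = 0

    def __iter__(self):
        return self

    def __next__(self):
        # StopIteration from the underlying iterator propagates unchanged.
        item = next(self._source)
        self.records_seen += 1
        # sys.getsizeof is shallow, so this is only a crude stand-in
        # for a real serialized-size estimate.
        self.bytes_seen += sys.getsizeof(item)
        return item

    def average_record_bytes(self):
        """Average observed record size, or 0.0 if nothing was consumed."""
        if self.records_seen == 0:
            return 0.0
        return self.bytes_seen / self.records_seen
```

The wrapper only knows about records it has already handed out, so it answers "how much data was here" rather than "how much is left"; a real estimate would have to come from the data source or from sampling.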

Not only for this, but so that, ideally, Spark could basically do
dynamic partitioning.

E.g. when we load a month's worth of data, it's X GB, but after a few
maps and filters, it's X/100 GB, so Spark could use 1/100th as many
partitions instead.

But right now all partitioning decisions are made up-front,
via .coalesce/etc. type hints from the programmer, and it seems that if
Spark could delay making partitioning decisions until each RDD could
lazily-eval/sample a few lines (hand waving), that would be super
sexy from our perspective, in terms of doing automatic perf/partition
optimization.
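
As a sketch of the "sample a few lines" idea: run the maps/filters over a small sample, see how much the data shrinks, and scale the partition count to match. This is plain Python, not Spark code, and the function name, its parameters, and the shallow sys.getsizeof measure are all assumptions made for illustration.

```python
import sys


def suggest_partitions(sample, pipeline, input_bytes, target_partition_bytes):
    """Hypothetical helper: guess a partition count for a pipeline's output.

    sample                 -- a few input records
    pipeline               -- function taking an iterator of records and
                              returning an iterator (the maps and filters)
    input_bytes            -- total size of the full input, assumed known
    target_partition_bytes -- desired size of each output partition
    """
    sample = list(sample)
    in_bytes = sum(sys.getsizeof(r) for r in sample)
    if in_bytes == 0:
        return 1  # nothing to measure, fall back to one partition
    out_bytes = sum(sys.getsizeof(r) for r in pipeline(iter(sample)))
    # Scale the full input by the shrink ratio seen on the sample
    # (integer math avoids floating-point rounding at the boundaries).
    estimated_output_bytes = input_bytes * out_bytes // in_bytes
    # Ceiling division, with a floor of one partition.
    return max(1, -(-estimated_output_bytes // target_partition_bytes))
```

For the example above, a filter that keeps 1 record in 100 on a 100 GB input with a 1 GB target would suggest 1 partition instead of 100. How well this works depends entirely on the sample being representative, which is the hard part.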

Huge disclaimer that this is probably a big pita to implement, and
could likely not be as worthwhile as I naively think it would be.

- Stephen
