spark-user mailing list archives

From Fernando Pereira <>
Subject Re: Multiple transformations without recalculating or caching
Date Fri, 17 Nov 2017 10:11:38 GMT
Note that I have 1+ TB of data. If I didn't mind things being slow, I
wouldn't be using Spark.

On 17 November 2017 at 11:06, Sebastian Piu <> wrote:

> If you don't want to recalculate, you need to hold the results somewhere.
> If you need to save it anyway, why don't you do that and then read it back
> to get your stats?
> On Fri, 17 Nov 2017, 10:03 Fernando Pereira, <> wrote:
>> Dear Spark users
>> Is it possible to take the output of a transformation (RDD/DataFrame) and
>> feed it to two independent transformations without recalculating the first
>> transformation and without caching the whole dataset?
>> Consider the case of a very large dataset (1+ TB) that has gone through
>> several transformations, and now we want to save it but also calculate
>> some statistics per group.
>> So the best way to process it would be: for each partition, do task A, do
>> task B.
>> I don't see a way of instructing Spark to proceed that way without
>> caching to disk, which seems unnecessarily heavy. And if we don't cache,
>> Spark recalculates every partition all the way from the beginning. In
>> either case, huge file reads happen.
>> Any ideas on how to avoid it?
>> Thanks
>> Fernando
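
The "for each partition: do task A, do task B" idea from the original question can be illustrated with a minimal plain-Python sketch. This is not Spark code and not a method proposed in the thread: the record layout `(group, value)`, the CSV sink, and the names `save_and_summarize` and `merge` are all assumptions made for illustration. It only shows how a single pass over one partition could both write the rows and accumulate per-group partial statistics.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record layout: (group key, numeric value).
Row = Tuple[str, float]
# Partial statistic per group: (row count, sum of values).
Stat = Tuple[int, float]


def save_and_summarize(rows: Iterable[Row], sink: io.TextIOBase) -> Iterator[Tuple[str, Stat]]:
    """Make one pass over a partition, doing both tasks per row.

    Task A: write each row to `sink` as CSV.
    Task B: accumulate (count, sum) per group.
    """
    writer = csv.writer(sink, lineterminator="\n")
    partial: Dict[str, List[float]] = defaultdict(lambda: [0, 0.0])
    for key, value in rows:
        writer.writerow([key, value])  # task A: save the row
        acc = partial[key]             # task B: update the group's partial stats
        acc[0] += 1
        acc[1] += value
    # Emit only the small per-group summaries, not the rows themselves.
    for key, (count, total) in partial.items():
        yield key, (count, total)


def merge(a: Stat, b: Stat) -> Stat:
    """Combine the partial statistics of two partitions for the same group."""
    return a[0] + b[0], a[1] + b[1]


if __name__ == "__main__":
    # Two in-memory lists stand in for two partitions.
    partitions = [
        [("a", 1.0), ("b", 2.0), ("a", 3.0)],
        [("a", 5.0), ("c", 7.0)],
    ]
    totals: Dict[str, Stat] = {}
    for part in partitions:
        sink = io.StringIO()  # stands in for a per-partition output file
        for key, stat in save_and_summarize(part, sink):
            totals[key] = merge(totals[key], stat) if key in totals else stat
    print(totals)
```

In Spark, a function of this shape could in principle be passed to `RDD.mapPartitions`, with the per-group partial results combined by `reduceByKey`. Writing output as a side effect inside a task is the fragile part of that design: a retried or speculatively executed task would write its partition again, so the sink would need per-task output paths or some other way to make the writes idempotent.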
