spark-user mailing list archives

From Aaron Davidson <ilike...@gmail.com>
Subject Re: distinct on huge dataset
Date Mon, 24 Mar 2014 03:01:37 GMT
Ah, interesting. count() without distinct() is streaming, for instance, and
does not require that a single partition fit in memory. That said, the
behavior may change if you increase the number of partitions in your input
RDD using RDD.repartition().
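The memory trade-off described above can be sketched in plain Python (no Spark required; the helper names below are illustrative, not Spark APIs): a streaming count needs only a counter, while a hash-based distinct, like the one Spark runs after a shuffle, must hold each partition's unique keys in memory, so splitting the data across more partitions lowers the peak per-partition memory.

```python
def streaming_count(records):
    """Count an iterable without materializing it: O(1) extra memory."""
    n = 0
    for _ in records:
        n += 1
    return n

def hash_partition(records, num_partitions):
    """Route each key to a partition by hash, roughly how a shuffle
    ensures every distinct key lands in exactly one partition."""
    buckets = [[] for _ in range(num_partitions)]
    for x in records:
        buckets[hash(x) % num_partitions].append(x)
    return buckets

def partition_distinct(partitions):
    """Per-partition distinct: each partition's unique set must fit in
    memory. Returns (total distinct keys, largest per-partition set)."""
    total = 0
    peak = 0
    for part in partitions:
        seen = set(part)  # this set is the piece that can OOM
        peak = max(peak, len(seen))
        total += len(seen)
    return total, peak

data = [1, 2, 2, 3, 3, 3, 4]
print(streaming_count(data))                       # 7
print(partition_distinct(hash_partition(data, 1)))  # (4, 4): one big set
print(partition_distinct(hash_partition(data, 4)))  # (4, 1): smaller sets
```

With one partition the whole distinct set sits in a single worker's memory; with four partitions the total distinct count is unchanged but the peak per-partition set shrinks, which is the effect repartitioning to more partitions has in Spark.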


On Sun, Mar 23, 2014 at 11:47 AM, Kane <kane.isturm@gmail.com> wrote:

> Yes, there was an error in data, after fixing it - count fails with Out of
> Memory Error.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3051.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
