spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Neil Chang <iam...@gmail.com>
Subject How to partition a SparkDataFrame using all distinct column values in sparkR
Date Mon, 25 Jul 2016 22:46:25 GMT
Hi,
  This is a question regarding SparkR in spark 2.0.

Given that I have a SparkDataFrame and I want to partition it using one
column's values. Each value corresponds to a partition, all rows that
having the same column value shall go to the same partition, no more no
less.

   Seems the function repartition() doesn't do this, I have 394 unique
values, it just partitions my DataFrame into 200. If I specify the
numPartitions to 394, some mismatch happens.

Is it possible to do what I described in sparkR?
GroupBy doesn't work with udf at all.

Or can we split the DataFrame into list of small ones first, if so, what
can I use?

Thanks,
Neil

Mime
View raw message