spark-user mailing list archives

From ANDREA SPINA <74...@studenti.unimore.it>
Subject Managing Dataset API Partitions - Spark 2.0
Date Wed, 07 Sep 2016 15:04:05 GMT
Hi everyone,
I'd like to test some algorithms with the Dataset API offered by Spark 2.0.0.

So I was wondering: *what is the best way to manage Dataset partitions?*

E.g. in the data-reading phase, what I usually do is the following:

// RDD
// if I want to set a custom minimum number of partitions
val data = sc.textFile(inputPath, numPartitions)

// if I want to reshape my RDD into a new number of partitions at some point
data.repartition(newNumPartitions)

// Dataset API
// now with the Dataset API I call the repartition method directly on the Dataset
spark.read.text(inputPath).repartition(newNumberOfPartitions)
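For reference, here is a minimal, self-contained sketch of the Dataset repartitioning calls I mean (the path, partition counts, and column name are just placeholders I'm assuming for the example):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RepartitionSketch").getOrCreate()
import spark.implicits._

// spark.read.text yields a DataFrame with a single "value" column
val ds = spark.read.text("/path/to/input")

// fixed number of partitions (round-robin shuffle)
val evenly = ds.repartition(8)

// partition by an expression: rows with the same "value" hash land in the same partition
val byColumn = ds.repartition(8, $"value")

// shrink the partition count without a full shuffle
val narrowed = byColumn.coalesce(4)

println(narrowed.rdd.getNumPartitions)

As far as I understand, repartition always triggers a full shuffle, while coalesce only merges existing partitions, so coalesce should be cheaper when I only need to reduce the count.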

So I'd be glad to know whether there are *any new, valuable approaches to custom
partitioning of Datasets, either in the reading phase or at some later point.*

Thank you so much.
Andrea
-- 
*Andrea Spina*
Card no.: *74598*
Student ID (MAT): *89369*
*Computer Engineering* *[LM]* (D.M. 270)
