spark-user mailing list archives

From Raghava Mutharaju <m.vijayaragh...@gmail.com>
Subject strange HashPartitioner behavior in Spark
Date Sun, 17 Apr 2016 23:11:19 GMT
Hello All,

We are using HashPartitioner in the following way on a 3 node cluster (1
master and 2 worker nodes).

val u = sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt")
  .map[(Int, Int)] { line =>
    line.split("\\|") match {
      case Array(x, y) => (y.toInt, x.toInt)
    }
  }
  .partitionBy(new HashPartitioner(8))
  .setName("u")
  .persist()

u.count()

If we run this from the Spark shell, the data (52 MB) is split across the
two worker nodes. But if we put the same code in a Scala program and run it,
all the data goes to only one node. We have run it multiple times and this
behavior does not change. This seems strange.
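For context, a minimal editor's sketch of how HashPartitioner assigns keys (this mirrors Spark's documented logic — partition = key.hashCode modulo numPartitions, made non-negative — but the helper names here are illustrative, not Spark's actual source). Since Int keys hash to themselves, keys like these should cycle through all 8 partitions; skewed placement of the data would then be a matter of where those partitions are scheduled, not of key assignment:

```scala
// Sketch of HashPartitioner's key-to-partition logic (illustrative helpers,
// modeled on Spark's documented behavior, not copied from its source).
def nonNegativeMod(x: Int, mod: Int): Int = {
  val r = x % mod
  if (r < 0) r + mod else r
}

def getPartition(key: Any, numPartitions: Int): Int =
  nonNegativeMod(key.hashCode, numPartitions)

// Int keys hash to themselves, so consecutive keys cycle through partitions.
val assignments = (1 to 16).map(k => getPartition(k, 8))
println(assignments.distinct.sorted)  // all 8 partition ids, 0 through 7
```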

Is there some problem with the way we use HashPartitioner?

Thanks in advance.

Regards,
Raghava.
