Thanks. Repartitioning to a smaller number of partitions seems to fix my issue, but I'll keep broadcasting in mind (droprows is an integer array with about 4 million entries).

On Wed, Aug 5, 2015 at 12:34 PM, Philip Weaver wrote:
How big is droprows?

Try explicitly broadcasting it like this:

val valsrows = ...
    .filter(x => !broadcastDropRows.value.contains(x._1))
- Philip
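
For reference, a minimal sketch of what the explicit broadcast could look like end to end. The object name, the dropRows helper, the local[2] master and the sample data are all hypothetical and only there to make the snippet self-contained. Converting droprows to a Set before broadcasting is an added assumption, not something suggested in the thread: contains on an Array is a linear scan, which may matter with about 4 million entries.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object BroadcastFilterSketch {
  // Drops every (key, (index, value)) entry whose key appears in droprows.
  def dropRows(sc: SparkContext,
               entries: RDD[(Int, (Int, Double))],
               droprows: Array[Int]): RDD[(Int, (Int, Double))] = {
    // Assumption: broadcast a Set rather than the raw Array so that each
    // lookup is a hash probe instead of a scan over the whole array.
    val broadcastDropRows = sc.broadcast(droprows.toSet)
    entries.filter { case (key, _) => !broadcastDropRows.value.contains(key) }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical local setup, used only to exercise the helper.
    val conf = new SparkConf().setAppName("broadcast-filter-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val entries = sc.parallelize(Seq((0, (0, 1.0)), (1, (0, 2.0)), (2, (1, 3.0))))
      val kept = dropRows(sc, entries, Array(1)).collect().sortBy(_._1)
      println(kept.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

The broadcast ships the set to each executor once, instead of serializing droprows into the closure of every task.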

On Wed, Aug 5, 2015 at 11:54 AM, AlexG <swiftset@gmail.com> wrote:
I'm trying to load a 1 TB file whose lines i,j,v represent the values of a
matrix given as A_{ij} = v so I can convert it to a Parquet file. Only some
of the rows of A are relevant, so the following code first loads the
triplets as text, splits them into Tuple3[Int, Int, Double], drops triplets
whose rows should be skipped, then forms a Tuple2[Int, List[Tuple2[Int,
Double]]] for each row (if I'm judging datatypes correctly).

val valsrows = sc.textFile(valsinpath).map(_.split(",")).
                          map(x => (x(1).toInt, (x(0).toInt, x(2).toDouble))).
                          filter(x => !droprows.contains(x._1)).
                          groupByKey.
                          map(x => (x._1, x._2.toSeq.sortBy(_._1)))
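
To check the resulting datatype on a small input, here is a rough sketch of the same steps on plain Scala collections. It is not the Spark job itself: groupBy stands in for groupByKey, droprows is taken as a Set, and the object name and sample lines are made up for illustration. As in the code above, the second field of each line is used as the key.

```scala
object TripletShapeSketch {
  // Parses comma-separated triplets, drops entries whose key is in droprows,
  // and groups the remaining (index, value) pairs per key, sorted by index.
  def group(lines: Seq[String], droprows: Set[Int]): Map[Int, Seq[(Int, Double)]] =
    lines
      .map(_.split(","))
      .map(x => (x(1).toInt, (x(0).toInt, x(2).toDouble)))
      .filter(x => !droprows.contains(x._1))
      .groupBy(_._1) // local stand-in for groupByKey
      .map { case (key, entries) => (key, entries.map(_._2).sortBy(_._1)) }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data; key 7 is listed in droprows and is removed.
    val sample = Seq("3,0,1.5", "1,0,2.5", "2,1,4.0", "0,7,9.0")
    println(group(sample, Set(7)))
  }
}
```

Each value is a sequence of (Int, Double) pairs keyed by an Int, which matches the Tuple2[Int, Seq[Tuple2[Int, Double]]] shape described above.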

Spark hangs during a broadcast that occurs during the filter step (according
to the Spark UI). The last two lines in the log before it pauses are:

memory on 172.31.49.149:37643 (size: 4.6 KB, free: 113.8 GB)
memory on 172.31.49.159:41846 (size: 4.6 KB, free: 113.8 GB)

I've left Spark running for as long as 17 minutes, and it never
continues past this point. I'm using a cluster of 30 r3.8xlarge EC2
instances (244 GB, 32 cores) with Spark in standalone mode with 220G executor
and driver memory, and using the KryoSerializer.

Any ideas on what could be causing this hang?

