spark-user mailing list archives

From Charles Hayden <>
Subject pyspark error with zip
Date Tue, 31 Mar 2015 15:27:36 GMT

The following program fails in the zip step.

x = sc.parallelize([1, 2, 3, 1, 2, 3])
y = sc.parallelize([1, 2, 3])
z = x.distinct()
z.zip(y).collect()  # fails here

The exact error produced depends on whether a number of partitions was specified when the RDDs were created.

I understand that

the two RDDs must have the same number of partitions and the same number of elements in
each partition.
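As I understand it, zip pairs the two RDDs partition by partition, which is why both conditions are needed. A toy pure-Python model (not Spark code; toy_zip is a hypothetical name) of that behavior:

```python
def toy_zip(parts_a, parts_b):
    """Model of RDD.zip: each RDD is a list of partitions (lists)."""
    if len(parts_a) != len(parts_b):
        raise ValueError("RDDs have different numbers of partitions")
    out = []
    for pa, pb in zip(parts_a, parts_b):
        # Pairing is per-partition, so sizes must match within each pair
        if len(pa) != len(pb):
            raise ValueError("partitions have different sizes")
        out.extend(zip(pa, pb))
    return out

print(toy_zip([[1, 2], [3]], [['a', 'b'], ['c']]))
# [(1, 'a'), (2, 'b'), (3, 'c')]
```

distinct() triggers a shuffle, so even when z and y happen to have the same partition count, the elements can land in partitions of different sizes, and the per-partition check fails.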

What is the best way to work around this restriction?

I have been performing the operation with the following code, but I am hoping to find something
more efficient.

def safe_zip(left, right):
    # Key each element by its position: (index, element)
    ix_left = left.zipWithIndex().map(lambda row: (row[1], row[0]))
    ix_right = right.zipWithIndex().map(lambda row: (row[1], row[0]))
    # Join on the index, restore the original order, then drop the index
    return ix_left.join(ix_right).sortByKey().values()
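To illustrate the semantics of that index-join approach without a Spark cluster, here is a plain-Python model (plain_safe_zip is a hypothetical name, not a Spark API):

```python
def plain_safe_zip(left, right):
    """Pure-Python model of safe_zip: key by position, inner-join, sort, drop key."""
    ix_left = {i: v for i, v in enumerate(left)}    # index -> element
    ix_right = {i: v for i, v in enumerate(right)}
    common = sorted(ix_left.keys() & ix_right.keys())  # inner join on index
    return [(ix_left[i], ix_right[i]) for i in common]

print(plain_safe_zip([1, 2, 3], ['a', 'b', 'c']))
# [(1, 'a'), (2, 'b'), (3, 'c')]
```

Note that, like RDD.join, this is an inner join, so if the two inputs have different lengths the extra elements are silently dropped rather than raising an error as zip would.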
