spark-user mailing list archives

From Sergey <ser...@gmail.com>
Subject strange behavior of pyspark RDD zip
Date Fri, 01 Apr 2016 18:08:06 GMT
Hi!

I'm on Spark 1.6.1 in local mode on Windows.

I have an issue when zipping two RDDs of __equal__ size and an __equal__
number of partitions (I also tried repartitioning both RDDs to one
partition). I get the following exception when I call rdd1.zip(rdd2).count():

File "c:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 111, in main
  File "c:\spark\python\lib\pyspark.zip\pyspark\worker.py", line 106, in process
  File "c:\spark\python\lib\pyspark.zip\pyspark\serializers.py", line
263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "c:\spark\python\pyspark\rddsampler.py", line 95, in func
    for obj in iterator:
  File "c:\spark\python\lib\pyspark.zip\pyspark\serializers.py", line
322, in load_stream
    " in pair: (%d, %d)" % (len(keys), len(vals)))
ValueError: Can not deserialize RDD with different number of items in
pair: (256, 512)
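
For reference, here is a minimal sketch of the kind of pipeline that matches
the traceback. The rddsampler.py frame suggests one side went through
sample(); the RDD contents, partition counts, and seed below are just
illustrative (not my actual job), so it may not reproduce the error exactly:

from pyspark import SparkContext

sc = SparkContext("local", "zip-repro")

# Illustrative setup: two RDDs with the same number of partitions,
# where one side has gone through sample() (as the rddsampler.py
# frame in the traceback suggests).
rdd1 = sc.parallelize(range(512), 4)
rdd2 = sc.parallelize(range(1024), 4).sample(False, 0.5, seed=42)

# Even when the element counts and partition counts line up, the
# serialized batch sizes per partition can end up different
# (e.g. 256 vs 512), which the zip deserializer in serializers.py
# rejects with the ValueError shown above.
print(rdd1.zip(rdd2).count())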
