spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-2790) PySpark zip() doesn't work properly if RDDs have different serializers
Date Mon, 11 Aug 2014 18:44:11 GMT

    [ https://issues.apache.org/jira/browse/SPARK-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093133#comment-14093133 ]

Apache Spark commented on SPARK-2790:
-------------------------------------

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/1894

> PySpark zip() doesn't work properly if RDDs have different serializers
> ----------------------------------------------------------------------
>
>                 Key: SPARK-2790
>                 URL: https://issues.apache.org/jira/browse/SPARK-2790
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.0.0, 1.1.0
>            Reporter: Josh Rosen
>            Assignee: Davies Liu
>            Priority: Critical
>
> In PySpark, attempting to {{zip()}} two RDDs may fail if the RDDs have different serializers
> (e.g. batched vs. unbatched), even if those RDDs have the same number of partitions and the
> same number of elements in each partition. This problem occurs in the MLlib Python APIs, where
> we might want to zip a JavaRDD of LabeledPoints with a JavaRDD of batch-serialized Python objects.
> This is problematic because whether {{zip()}} succeeds or fails depends on the partitioning /
> batching strategy, and we don't want to expose serialization details to users.
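
To illustrate the mismatch described in the issue, here is a minimal sketch in plain Python (no Spark required). It assumes a simplified model in which a partition is a list of pickled records; the helper names (serialize_unbatched, serialize_batched, and so on) are hypothetical and are not PySpark APIs, and this is not the code from the pull request above. It only shows why two partitions with the same elements can look different at the record level when one is batched and the other is not.

```python
import pickle


def serialize_unbatched(items):
    """Hypothetical helper: one pickled record per element."""
    return [pickle.dumps(x) for x in items]


def serialize_batched(items, batch_size):
    """Hypothetical helper: one pickled record per batch of elements."""
    return [
        pickle.dumps(items[i:i + batch_size])
        for i in range(0, len(items), batch_size)
    ]


def deserialize_unbatched(records):
    """Inverse of serialize_unbatched."""
    return [pickle.loads(r) for r in records]


def deserialize_batched(records):
    """Inverse of serialize_batched: flatten every batch back into elements."""
    return [x for r in records for x in pickle.loads(r)]


# Two "partitions" holding the same number of elements.
left = [1, 2, 3, 4]
right = ["a", "b", "c", "d"]

left_records = serialize_unbatched(left)      # 4 serialized records
right_records = serialize_batched(right, 2)   # 2 serialized records

# Pairing at the level of serialized records sees unequal lengths,
# even though the element counts agree.
record_counts = (len(left_records), len(right_records))

# Decoding each side with its own scheme first restores element-wise alignment.
zipped = list(zip(deserialize_unbatched(left_records),
                  deserialize_batched(right_records)))

print(record_counts)
print(zipped)
```

In this simplified model, any pairing done on the serialized records depends on the batch size, which is the kind of serialization detail the issue says should not be visible to users.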



--
This message was sent by Atlassian JIRA
(v6.2#6252)


