spark-issues mailing list archives

From "Brad Willard (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-4778) PySpark Json and groupByKey broken
Date Sat, 10 Jan 2015 17:04:34 GMT

    [ https://issues.apache.org/jira/browse/SPARK-4778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272591#comment-14272591 ]

Brad Willard commented on SPARK-4778:
-------------------------------------

You can close this as cannot-reproduce; I've already moved to 1.2.

—
Sent from Mailbox

On Sun, Dec 28, 2014 at 5:49 PM, Josh Rosen (JIRA) <jira@apache.org> wrote:



> PySpark Json and groupByKey broken
> ----------------------------------
>
>                 Key: SPARK-4778
>                 URL: https://issues.apache.org/jira/browse/SPARK-4778
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.1.1
>         Environment: ec2 cluster launched from ec2 script
> pyspark
> c3.2xlarge 6 nodes
> hadoop major version 1
>            Reporter: Brad Willard
>
> When I run a groupByKey, it seems to create a single task after the groupByKey that never
stops executing. I'm loading a smallish JSON dataset of 4 million records. This is the
code I'm running.
> rdd = sql_context.jsonFile(hdfs_uri)   # load JSON into an RDD of Row objects
> rdd = rdd.cache() 
> grouped = rdd.map(lambda row: (row.id, row)).groupByKey(160)   # key by id, 160 partitions
> grouped.take(1)   # this never returns
> The groupByKey stage takes a few minutes, which I'd expect. However, the take operation
never completes; it hangs indefinitely.
> This is what it looks like in the UI:
> http://cl.ly/image/2k1t3I253T0x
> The only workaround I have at the moment is to run a map operation after loading from
JSON to convert all the Row objects to Python dictionaries; then things work, although
the map operation is expensive (a sketch of this workaround follows below the quote).
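For reference, a minimal sketch of that workaround, assuming Spark 1.1-era PySpark and
that the generated Row classes expose an asDict() method (if they don't in this version,
building the dict by hand over the known schema fields works the same way):

def row_to_dict(row):
    # asDict() is assumed here; on Rows without it, construct the dict
    # manually over the known fields, e.g. {"id": row.id, ...}
    return row.asDict()

rdd = sql_context.jsonFile(hdfs_uri)        # RDD of Row objects, as in the report
dicts = rdd.map(row_to_dict).cache()        # plain Python dicts instead of Rows
grouped = dicts.map(lambda d: (d["id"], d)).groupByKey(160)
grouped.take(1)                             # per the report, this completes once the Rows are gone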



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


