spark-issues mailing list archives

From "Vijay Parmar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-15000) Spark hangs indefinitely if you cache a dataframe, then show it, then do some further processing on it
Date Sat, 14 May 2016 02:16:12 GMT

    [ https://issues.apache.org/jira/browse/SPARK-15000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15283379#comment-15283379 ]

Vijay Parmar commented on SPARK-15000:
--------------------------------------

I ran the example code you provided on Spark 1.6.1, and below is the output:

scala> df2.cache
res8: df2.type = [_1: bigint, _2: bigint]

scala> df2.show
+---+---+
| _1| _2|
+---+---+
|  0|467|
|  1|315|
|  2|436|
|  3|193|
|  4|162|
|  5|495|
|  6|397|
|  7|223|
|  8|245|
|  9| 71|
| 10|  3|
| 11|464|
| 12|222|
| 13|471|
| 14|379|
| 15| 22|
| 16|176|
| 17| 79|
| 18| 82|
| 19|230|
+---+---+
only showing top 20 rows


scala> val groupresult=df2.groupBy("_2").agg(count("_1") as "count")
groupresult: org.apache.spark.sql.DataFrame = [_2: bigint, count: bigint]

scala> groupresult.show
+---+-----+                                                                     
| _2|count|
+---+-----+
| 31|    1|
|231|    2|
|432|    2|
|232|    1|
| 33|    2|
|234|    1|
|434|    2|
| 34|    1|
|435|    3|
| 35|    2|
|436|    2|
|236|    1|
| 37|    1|
|237|    1|
|239|    1|
|439|    1|
| 40|    1|
|240|    2|
|440|    2|
| 41|    1|
+---+-----+
only showing top 20 rows


scala> 


It seems to be an issue with the older versions of Spark (1.5.2 / 1.6.0), not with the current version, i.e. 1.6.1.
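
For reference, the complete sequence I ran as a single spark-shell session on 1.6.1 is below. The setup lines are copied from the reproduction code in the issue description; this is a sketch that assumes a stock spark-shell, where sc, sqlContext, and the org.apache.spark.sql.functions (e.g. count) are already in scope:

// Build the test dataframe and write it to parquet (setup from the issue description)
val r = scala.util.Random
val list = (0L to 500L).map(i => (i, r.nextInt(500).asInstanceOf[Long]))
val distData = sc.parallelize(list)
import sqlContext.implicits._
val df = distData.toDF
df.write.format("parquet").mode("overwrite").save("df_hanging_test.parquet")

// Read it back, cache, show, then do further processing
val df2 = sqlContext.read.load("df_hanging_test.parquet")
df2.cache
df2.show
val groupresult = df2.groupBy("_2").agg(count("_1") as "count")
groupresult.show  // reported to hang on 1.5.2 / 1.6.0; completed normally on 1.6.1 as shown above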

> Spark hangs indefinitely if you cache a dataframe, then show it, then do some further processing on it
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-15000
>                 URL: https://issues.apache.org/jira/browse/SPARK-15000
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.5.2, 1.6.0
>         Environment: I am running the test code on both a Hortonworks sandbox and also on AWS EMR / EC2. Issue occurs in both spark-submit and spark-shell
>            Reporter: Jamie Hutton
>
> There seems to be an issue with certain combinations of cache and show when using Spark. If you read a parquet file from disk, cache it, then perform a show operation, the system will hang (forever) if you perform further processing on it.
> The following code replicates the issue. I have run it on multiple environments, on two Spark versions, and in both spark-shell and spark-submit.
> /* Create a dataframe for our test - I did this so the test was self-contained, but you can use any parquet-format dataframe */
> val r = scala.util.Random
> val list = (0L to 500L).map(i => (i, r.nextInt(500).asInstanceOf[Long]))
> val distData = sc.parallelize(list)
> import sqlContext.implicits._
> val df = distData.toDF
> df.write.format("parquet").mode("overwrite").save("df_hanging_test.parquet")
> /* Now read the dataframe back in - this is where the test begins */
> val df2 = sqlContext.read.load("df_hanging_test.parquet")
> df2.cache
> df2.show
> val groupresult = df2.groupBy("_2").agg(count("_1") as "count")
> groupresult.show
> /* The last step hangs forever */
> If you remove either the df2.cache or the df2.show lines, the issue goes away. Also, the groupBy/agg doesn't seem to be the issue - I believe I have seen the same issue with other types of processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
