spark-user mailing list archives

From sim <...@swoop.com>
Subject Cleanup when tasks generate errors
Date Sat, 18 Jul 2015 01:14:34 GMT
I've observed a number of cases where Spark does not clean up HDFS side effects
when a job fails, especially under out-of-memory conditions. Here is an example,
produced by the following code snippet executed in spark-shell:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

ctx.jsonFile("file:///test_data/*/*/*/*.gz")
  .saveAsTable("test_data", SaveMode.Overwrite)
First run: saveAsTable terminates with an out-of-memory exception.
Second run (with more RAM for the driver & executors): fails with many
variations of java.lang.RuntimeException:
hdfs://localhost:54310/user/hive/warehouse/test_data/_temporary/0/_temporary/attempt_201507171538_0008_r_000021_0/part-r-00022.parquet
is not a Parquet file (too small)
Third run (after hdfs dfs -rm -r hdfs:///user/hive/warehouse/test_data)
succeeds.
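In the meantime, the only workaround I have is to delete any stale output before retrying. A minimal sketch of that cleanup, assuming the Hadoop FileSystem API in the same spark-shell session (the path is the one from the error message above; adjust for your warehouse location):

```scala
// Defensively remove leftovers from a failed previous run before
// retrying saveAsTable, mirroring the manual `hdfs dfs -rm -r` fix.
import org.apache.hadoop.fs.{FileSystem, Path}

val stale = new Path("hdfs:///user/hive/warehouse/test_data")
val fs = FileSystem.get(sc.hadoopConfiguration)
if (fs.exists(stale)) {
  // recursive = true removes the _temporary attempt directories too
  fs.delete(stale, true)
}
```

This is just a sketch of the manual fix, not a general answer; I'd still like to know whether Spark is expected to clean these up itself.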
What are the best practices for dealing with these types of cleanup
failures? Do they tend to come in known varieties?
Thanks,
Sim

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cleanup-when-tasks-generate-errors-tp23890.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.