spark-user mailing list archives

From sim <>
Subject Cleanup when tasks generate errors
Date Sat, 18 Jul 2015 01:14:34 GMT
I've observed a number of cases where Spark does not clean up HDFS side-effects
on errors, especially under out-of-memory conditions. Here is an example, from the
following code snippet executed in spark-shell:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode
val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._
ctx.jsonFile("file:///test_data/*/*/*/*.gz").
  saveAsTable("test_data",
First run: saveAsTable terminates with an out-of-memory exception.
Second run (with more RAM for the driver & executors): fails with many variations
of java.lang.RuntimeException: ... is not a Parquet file (too small)
Third run (after hdfs dfs -rm -r hdfs:///user/hive/warehouse/test_data)
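One defensive pattern for this situation (a sketch of my own workaround, not a
Spark mechanism) is to wrap the write in a try/catch and remove whatever partial
output it left behind before rethrowing, mirroring the manual `hdfs dfs -rm -r`
step. The helper below uses `java.nio.file` against a local path purely for
illustration; against HDFS you would instead call
`org.apache.hadoop.fs.FileSystem.delete(path, true)`. The names
`WriteWithCleanup` and `writeOrCleanUp` are hypothetical:

```scala
import java.nio.file.{Files, Path}
import java.util.Comparator

object WriteWithCleanup {
  // Recursively delete a directory tree: a local stand-in for
  // `FileSystem.delete(path, true)` on HDFS.
  def deleteRecursively(dir: Path): Unit =
    if (Files.exists(dir)) {
      Files.walk(dir)
        .sorted(Comparator.reverseOrder[Path]())  // children before parents
        .forEach(p => Files.delete(p))
    }

  // Run `write`; if it throws (e.g. an OOM during saveAsTable), remove
  // whatever it left behind under `output` before rethrowing, so that a
  // retry does not trip over half-written files such as truncated
  // "not a Parquet file (too small)" fragments.
  def writeOrCleanUp(output: Path)(write: => Unit): Unit =
    try write
    catch {
      case e: Throwable =>
        deleteRecursively(output)
        throw e
    }
}
```

This only helps when the driver survives long enough to run the catch block; a
hard executor or driver kill still leaves debris, which is part of why I'm
asking about best practices.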
What are the best practices for dealing with these types of cleanup
failures? Do they tend to come in known varieties?
