spark-user mailing list archives

From: Philip Ogren
Subject: Re: rdd.saveAsTextFile problem
Date: Thu, 02 Jan 2014 17:51:52 GMT
Not really.  In practice I write everything out to HDFS, and that is
working fine.  But I write lots of unit tests and example scripts, and it
is convenient to be able to test a Spark application (or a sequence of
Spark functions) in a completely local way, so that it doesn't depend on
any outside infrastructure (e.g., an HDFS server).  So it is convenient to
write out a small amount of data locally and manually inspect the
results, especially as I'm building up a unit or regression test.
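
To make that concrete, a self-contained local-mode check might look
roughly like the sketch below (illustration only; the object name,
data, and output path are made up):

    import org.apache.spark.SparkContext

    object LocalSaveCheck {
      def main(args: Array[String]): Unit = {
        // A "local" master means no cluster or HDFS is needed.
        val sc = new SparkContext("local", "LocalSaveCheck")
        val rdd = sc.parallelize(Seq(Array("a", "b"), Array("c", "d")))
        // Write under the build tree and inspect the part files by hand.
        rdd.map(_.mkString(", ")).saveAsTextFile("target/local-save-check")
        sc.stop()
      }
    }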

So, ultimately, writing results out to a local file isn't that important
to me.  However, I was just trying to run a simple example script that
worked before and now doesn't.


On 1/2/2014 10:28 AM, Andrew Ash wrote:
> You want to write it to a local file on the machine?  Try using
> "file:///path/to/target/mydir/" instead.
> I'm not sure what the behavior would be if you did this on a
> multi-machine cluster, though -- you may get a bit of data on each
> machine in that local directory.
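> For example, with the explicit URI (an untested sketch; the path is
> the placeholder from above):
>
>     myRdd.saveAsTextFile("file:///path/to/target/mydir/")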
> On Thu, Jan 2, 2014 at 12:22 PM, Philip Ogren <> wrote:
>     I have a very simple Spark application that looks like the following:
>     var myRdd: RDD[Array[String]] = initMyRdd()
>     println(myRdd.first.mkString(", "))
>     println(myRdd.count)
>     myRdd.saveAsTextFile("hdfs://myserver:8020/mydir")
>     myRdd.saveAsTextFile("target/mydir/")
>     The println statements work as expected.  The first saveAsTextFile
>     statement also works as expected.  The second saveAsTextFile
>     statement does not (even if the first is commented out); I get
>     the exception pasted below.  If I inspect "target/mydir", I see
>     a directory called
>     _temporary/0/_temporary/attempt_201401020953_0000_m_000000_1 that
>     contains an empty part-00000 file.  It's curious because this code
>     worked with Spark 0.8.0, and now it fails on Spark 0.8.1.
>     I happen to be running this on Windows in "local" mode at
>     the moment.  Perhaps I should try running it on my Linux box.
>     Thanks,
>     Philip
>     Exception in thread "main" org.apache.spark.SparkException: Job
>     aborted: Task 2.0:0 failed more than 0 times; aborting job
>     java.lang.NullPointerException
>         at
>     org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:827)
>         at
>     org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:825)
>         at
>     scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
>         at
>     scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at
>     org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:825)
>         at
>     org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:440)
>         at
>     org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
>         at
>     org.apache.spark.scheduler.DAGScheduler$$anon$
