spark-user mailing list archives

From Andrew Ash <and...@andrewash.com>
Subject Re: rdd.saveAsTextFile problem
Date Thu, 02 Jan 2014 18:39:17 GMT
I'm guessing it's a documentation issue, but certainly something could have
broken.

- what version of Spark? -- 0.8.1
- what mode are you running in? (local, standalone, Mesos, YARN) -- local,
on Windows
- are you using the shell or an application? -- shell?
- what language? (Scala / Java / Python) -- Scala

Can you provide a deeper error stacktrace from the executor?  Look in the
webui (port 4040) and in the stdout/stderr files.

Also, give it a shot on the linux box to see if that works.
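
If you retry there, one way to rule out relative-path resolution is to pass
an absolute file: URI built from the relative path. A minimal sketch using
only JDK classes; `localFileUri` is a hypothetical helper name, and `myRdd`
is assumed to be the RDD from your snippet:

```scala
import java.io.File

object LocalFileUri {
  // Hypothetical helper: resolve a possibly relative path against the
  // current working directory and render it as a file: URI string.
  def localFileUri(path: String): String =
    new File(path).getAbsoluteFile.toURI.toString

  def main(args: Array[String]): Unit = {
    val target = localFileUri("target/mydir")
    println(target)
    // In the Spark shell, with myRdd defined as in the quoted snippet:
    // myRdd.saveAsTextFile(target)
  }
}
```

Given that the explicit file:/// form already failed for you on Windows, I
would not expect this to fix anything by itself; it only removes the
working directory as a variable when comparing the two machines.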

Cheers!
Andrew


On Thu, Jan 2, 2014 at 1:31 PM, Philip Ogren <philip.ogren@oracle.com> wrote:

>  Yep - that works great and is what I normally do.
>
> I perhaps should have framed my email as a bug report.  The documentation
> for saveAsTextFile says you can write results out to a local file but it
> doesn't work for me per the described behavior.  It also worked before and
> now it doesn't.  So, it seems like a bug.  Should I file a Jira issue?  I
> haven't done that yet for this project but would be happy to.
>
> Thanks,
> Philip
>
>
> On 1/2/2014 11:23 AM, Andrew Ash wrote:
>
> For testing, maybe try using .collect and doing the comparison between
> expected and actual in memory rather than on disk?
>
>
> On Thu, Jan 2, 2014 at 12:54 PM, Philip Ogren <philip.ogren@oracle.com> wrote:
>
>>  I just tried your suggestion and get the same results with the
>> _temporary directory.  Thanks though.
>>
>>
>> On 1/2/2014 10:28 AM, Andrew Ash wrote:
>>
>> You want to write it to a local file on the machine?  Try using
>> "file:///path/to/target/mydir/" instead
>>
>> I'm not sure what the behavior would be if you did this on a multi-machine
>> cluster though -- you may get a bit of data on each machine in that local
>> directory.
>>
>>
>> On Thu, Jan 2, 2014 at 12:22 PM, Philip Ogren <philip.ogren@oracle.com> wrote:
>>
>>> I have a very simple Spark application that looks like the following:
>>>
>>>
>>> var myRdd: RDD[Array[String]] = initMyRdd()
>>> println(myRdd.first.mkString(", "))
>>> println(myRdd.count)
>>>
>>> myRdd.saveAsTextFile("hdfs://myserver:8020/mydir")
>>> myRdd.saveAsTextFile("target/mydir/")
>>>
>>>
>>> The println statements work as expected.  The first saveAsTextFile
>>> statement also works as expected.  The second saveAsTextFile statement does
>>> not (even if the first is commented out.)  I get the exception pasted
>>> below.  If I inspect "target/mydir" I see that there is a directory called
>>> _temporary/0/_temporary/attempt_201401020953_0000_m_000000_1 which contains
>>> an empty part-00000 file.  It's curious because this code worked before
>>> with Spark 0.8.0 and now I am running on Spark 0.8.1. I happen to be
>>> running this on Windows in "local" mode at the moment.  Perhaps I should
>>> try running it on my linux box.
>>>
>>> Thanks,
>>> Philip
>>>
>>>
>>> Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 2.0:0 failed more than 0 times; aborting job java.lang.NullPointerException
>>>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:827)
>>>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:825)
>>>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
>>>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:825)
>>>     at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:440)
>>>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
>>>     at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)
>>>
>>>
>>>
>>
>>
>
>
