spark-user mailing list archives

From Pavel Velikhov <pavel.velik...@gmail.com>
Subject Re: Spark job fails on cluster but works fine on a single machine
Date Fri, 20 Feb 2015 13:05:27 GMT
I definitely delete the file on the right HDFS, I only have one HDFS instance.

The problem seems to be in the CassandraRDD: reads consistently fail in one way or another when run on
the cluster, but single-machine reads are okay.



> On Feb 20, 2015, at 4:20 AM, Ilya Ganelin <ilganeli@gmail.com> wrote:
> 
> The stupid question is whether you're deleting the file from hdfs on the right node?
> On Thu, Feb 19, 2015 at 11:31 AM Pavel Velikhov <pavel.velikhov@gmail.com <mailto:pavel.velikhov@gmail.com>>
wrote:
> Yeah, I do manually delete the files, but it still fails with this error.
> 
>> On Feb 19, 2015, at 8:16 PM, Ganelin, Ilya <Ilya.Ganelin@capitalone.com <mailto:Ilya.Ganelin@capitalone.com>>
wrote:
>> 
>> When writing to HDFS, Spark will not overwrite existing files or directories. You
must either delete them manually or remove them with the Hadoop FileSystem API.
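A minimal sketch of that pre-delete step using the Hadoop FileSystem API, with the output path taken from the thread (this assumes the Hadoop client jars are on the classpath and that `fs.defaultFS` points at the same HDFS the job writes to):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// On a cluster, this Configuration should resolve to the same HDFS the
// job writes to (fs.defaultFS), e.g. hdfs://hadoop01:54310
val hadoopConf = new Configuration()
val fs = FileSystem.get(hadoopConf)

val out = new Path("/tmp/pavel/CassandraPipeTest.txt")
// delete(path, recursive = true) removes the directory and its contents;
// it returns false (rather than throwing) if the path does not exist
if (fs.exists(out)) fs.delete(out, true)
```

Running this before `saveAsTextFile` avoids the "Output directory already exists" failure on re-runs.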
>> 
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: Pavel Velikhov [pavel.velikhov@gmail.com <mailto:pavel.velikhov@gmail.com>]
>> Sent: Thursday, February 19, 2015 11:32 AM Eastern Standard Time
>> To: user@spark.apache.org <mailto:user@spark.apache.org>
>> Subject: Spark job fails on cluster but works fine on a single machine
>> 
>> I have a simple Spark job that goes out to Cassandra, runs a pipe and stores results:
>> 
>> val sc = new SparkContext(conf)
>> val rdd = sc.cassandraTable("keyspace", "table")
>>       .map(r => r.getInt("column") + "\t" + write(get_lemmas(r.getString("tags"))))
>>       .pipe("python3 /tmp/scripts_and_models/scripts/run.py")
>>       .map(r => convertStr(r))
>>       .coalesce(1, true)
>>       .saveAsTextFile("/tmp/pavel/CassandraPipeTest.txt")
>>       //.saveToCassandra("keyspace", "table", SomeColumns("id", "data"))
>> 
>> When run on a single machine, everything works, whether I save to an HDFS file or
to Cassandra.
>> When run on the cluster, neither works:
>> 
>>  - When saving to file, I get an exception: User class threw exception: Output directory
hdfs://hadoop01:54310/tmp/pavel/CassandraPipeTest.txt already exists
>>  - When saving to Cassandra, only 4 rows are updated with empty data (I test on a
4-machine Spark cluster)
>> 
>> Any hints on how to debug this and where the problem could be?
>> 
>> - I delete the hdfs file before running
>> - Would really like the output to hdfs to work, so I can debug
>> - Then it would be nice to save to Cassandra
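One hedged workaround for the "output directory already exists" failure, separate from deleting the path: Spark exposes a `spark.hadoop.validateOutputSpecs` setting that skips the output-spec check in `saveAsTextFile`. A sketch (the app name is illustrative; note this only papers over the stale directory, so deleting it first remains the safer option):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// spark.hadoop.* entries are forwarded into the Hadoop Configuration.
// validateOutputSpecs = false disables the pre-save check that throws
// when the output directory already exists; old files in that directory
// are not cleaned up, so stale partitions can linger.
val conf = new SparkConf()
  .setAppName("CassandraPipeTest")
  .set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(conf)
```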
>> 
> 

