spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Su She <suhsheka...@gmail.com>
Subject Re: Getting Output From a Cluster
Date Fri, 09 Jan 2015 07:41:28 GMT
1) Thank you everyone for the help once again...the support here is really
amazing and I hope to contribute soon!

2) The solution I actually ended up using was from this thread:
http://mail-archives.apache.org/mod_mbox/spark-user/201310.mbox/%3CCAFNzJ5EJxdGQJu7NbdQY6XUReq3D1pCXr+i2S99g5brCj5EEyA@mail.gmail.com%3E

in case the thread ever goes down, the soln provided by Matei:

plans.saveAsHadoopFiles("hdfs://localhost:8020/user/hue/output/completed","csv",
String.class, String.class, (Class) TextOutputFormat.class);

I had browsed a lot of similar threads that did not have answers, but found
this one from quite some time ago, so apologize for posting a question that
had been answered before.

3) Akhil, I was specifying the format as "txt", but it was not compatible

Thanks for the help!


On Thu, Jan 8, 2015 at 11:23 PM, Akhil Das <akhil@sigmoidanalytics.com>
wrote:

> saveAsHadoopFiles requires you to specify the output format which i
> believe you are not specifying anywhere and hence the program crashes.
>
> You could try something like this:
>
> Class<? extends OutputFormat<?,?>> outputFormatClass = (Class<? extends
> OutputFormat<?,?>>) (Class<?>) SequenceFileOutputFormat.class;
> 46
>
> yourStream.saveAsNewAPIHadoopFiles(hdfsUrl, "/output-location",Text.class,
> Text.class, outputFormatClass);
>
>
>
> Thanks
> Best Regards
>
> On Fri, Jan 9, 2015 at 10:22 AM, Su She <suhshekar52@gmail.com> wrote:
>
>> Yes, I am calling the saveAsHadoopFiles on the Dstream. However, when I
>> call print on the Dstream it works? If I had to do foreachRDD to
>> saveAsHadoopFile, then why is it working for print?
>>
>> Also, if I am doing foreachRDD, do I need connections, or can I simply
>> put the saveAsHadoopFiles inside the foreachRDD function?
>>
>> Thanks Yana for the help! I will play around with foreachRDD and convey
>> my results.
>>
>>
>>
>> On Thu, Jan 8, 2015 at 6:06 PM, Yana Kadiyska <yana.kadiyska@gmail.com>
>> wrote:
>>
>>> are you calling the saveAsText files on the DStream --looks like it?
>>> Look at the section called "Design Patterns for using foreachRDD" in the
>>> link you sent -- you want to do  dstream.foreachRDD(rdd =>
>>> rdd.saveAs....)
>>>
>>> On Thu, Jan 8, 2015 at 5:20 PM, Su She <suhshekar52@gmail.com> wrote:
>>>
>>>> Hello Everyone,
>>>>
>>>> Thanks in advance for the help!
>>>>
>>>> I successfully got my Kafka/Spark WordCount app to print locally.
>>>> However, I want to run it on a cluster, which means that I will have to
>>>> save it to HDFS if I want to be able to read the output.
>>>>
>>>> I am running Spark 1.1.0, which means according to this document:
>>>> https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html
>>>>
>>>> I should be able to use commands such as saveAsText/HadoopFiles.
>>>>
>>>> 1) When I try saveAsTextFiles it says:
>>>> cannot find symbol
>>>> [ERROR] symbol  : method
>>>> saveAsTextFiles(java.lang.String,java.lang.String)
>>>> [ERROR] location: class
>>>> org.apache.spark.streaming.api.java.JavaPairDStream<java.lang.String,java.lang.Integer>
>>>>
>>>> This makes some sense as saveAsTextFiles is not included here:
>>>>
>>>> http://people.apache.org/~tdas/spark-1.1.0-temp-docs/api/java/org/apache/spark/streaming/api/java/JavaPairDStream.html
>>>>
>>>> 2) When I try
>>>> saveAsHadoopFiles("hdfs://ip....us-west-1.compute.internal:8020/user/testwordcount",
>>>> "txt") it builds, but when I try running it it throws this exception:
>>>>
>>>> Exception in thread "main" java.lang.RuntimeException:
>>>> java.lang.RuntimeException: class scala.runtime.Nothing$ not
>>>> org.apache.hadoop.mapred.OutputFormat
>>>>         at
>>>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2079)
>>>>         at
>>>> org.apache.hadoop.mapred.JobConf.getOutputFormat(JobConf.java:712)
>>>>         at
>>>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1021)
>>>>         at
>>>> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940)
>>>>         at
>>>> org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$8.apply(PairDStreamFunctions.scala:632)
>>>>         at
>>>> org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$8.apply(PairDStreamFunctions.scala:630)
>>>>         at
>>>> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
>>>>         at
>>>> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
>>>>         at
>>>> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
>>>>         at scala.util.Try$.apply(Try.scala:161)
>>>>         at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
>>>>         at
>>>> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:171)
>>>>         at
>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>         at
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>         at java.lang.Thread.run(Thread.java:724)
>>>> Caused by: java.lang.RuntimeException: class scala.runtime.Nothing$ not
>>>> org.apache.hadoop.mapred.OutputFormat
>>>>         at
>>>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2073)
>>>>         ... 14 more
>>>>
>>>>
>>>> Any help is really appreciated! Thanks.
>>>>
>>>>
>>>
>>
>

Mime
View raw message