With Spark, processing is performed lazily: nothing actually runs until you call an "action", such as collect(). Another option is to write the output in a distributed manner instead of collecting it to the driver - see write.df() in SparkR.
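For example (a minimal SparkR sketch; the paths and the "value" column are illustrative):

    library(SparkR)
    sparkR.session()

    # Transformations are lazy: nothing is computed yet.
    df  <- read.df("input_dir", source = "csv", header = "true")
    df2 <- filter(df, df$value > 0)

    # write.df() triggers computation like collect() does, but the
    # executors write their partitions in parallel as CSV part-files;
    # nothing is pulled back to the driver.
    write.df(df2, path = "output_dir", source = "csv", mode = "overwrite")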

With SparkR's dapply(), passing the data from Spark (the JVM) to R so your UDF can process it can add significant overhead. Could you provide more information on your case?
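A minimal dapply() sketch (the data and schema are made up for illustration); dapply() itself is lazy, so the collect() at the end is what actually triggers the work:

    library(SparkR)
    sparkR.session()

    df <- createDataFrame(data.frame(value = c(1, 2, 3)))

    # The UDF runs once per partition on an R worker process; each
    # partition is serialized from the JVM to R and back, which is
    # where the overhead comes from.
    schema <- structType(structField("value", "double"),
                         structField("squared", "double"))
    result <- dapply(df, function(p) {
      # p is a plain R data.frame holding one partition
      data.frame(value = p$value, squared = p$value^2)
    }, schema)

    head(collect(result))  # action: this is when the UDF actually runs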


_____________________________
From: Xiao Liu1 <liuxiao@us.ibm.com>
Sent: Wednesday, January 18, 2017 11:30 AM
Subject: what does dapply actually do?
To: <user@spark.apache.org>


Hi,
I'm really new and trying to learn SparkR. I have defined a relatively complicated user-defined function and used dapply() to apply it to a SparkDataFrame. It was very fast. But I'm not sure what dapply() has actually done, because when I used collect() to see the output, which is very simple, it took a long time to get the result. I suppose maybe I don't need to use collect(), but without it, how can I output the final results, say, to a .csv file?
Thank you very much for the help.

Best Regards,
Xiao



From: Ninad Shringarpure <ninad@cloudera.com>
To: user <user@spark.apache.org>
Date: 01/18/2017 02:24 PM
Subject: Creating UUID using Spark SQL





Hi Team,

Is there a standard way of generating a unique id for each row from Spark SQL? I am looking for functionality similar to UUID generation in Hive.

Let me know if you need any additional information.

Thanks,
Ninad
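For reference, a sketch of two common approaches, shown here through SparkR (illustrative only): monotonically_increasing_id() yields unique but non-UUID 64-bit ids, while reflect() calls into java.util.UUID much like Hive's reflect UDF.

    library(SparkR)
    sparkR.session()
    df <- createDataFrame(data.frame(x = 1:3))

    # Unique (but not UUID-formatted) 64-bit id per row:
    with_id <- selectExpr(df, "*",
                          "monotonically_increasing_id() AS row_id")

    # Hive-style UUID strings via reflection; note this expression is
    # non-deterministic, so recomputing the DataFrame can produce
    # different values:
    with_uuid <- selectExpr(df, "*",
                            "reflect('java.util.UUID', 'randomUUID') AS uuid")
    showDF(with_uuid)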