spark-user mailing list archives

From Pedro Rodriguez <ski.rodrig...@gmail.com>
Subject Re: dataframe.foreach VS dataframe.collect().foreach
Date Tue, 26 Jul 2016 13:28:41 GMT
:)

Just realized you didn't get your original question answered though:

scala> import sqlContext.implicits._
import sqlContext.implicits._

scala> case class Person(age: Long, name: String)
defined class Person

scala> val df = Seq(Person(24, "pedro"), Person(22, "fritz")).toDF()
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.select("age")
res2: org.apache.spark.sql.DataFrame = [age: bigint]

scala> df.select("age").collect.map(_.getLong(0))
res3: Array[Long] = Array(24, 22)

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> df.collect.flatMap {
     | case Row(age: Long, name: String) => Seq(Tuple1(age))
     | case _ => Seq()
     | }
res7: Array[(Long,)] = Array((24,), (22,))

These docs are helpful
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Row
(1.6 docs, but should be similar in 2.0)

On Tue, Jul 26, 2016 at 7:08 AM, Gourav Sengupta <gourav.sengupta@gmail.com>
wrote:

> And Pedro has made sense of a world running amok, scared, and drunken
> stupor.
>
> Regards,
> Gourav
>
> On Tue, Jul 26, 2016 at 2:01 PM, Pedro Rodriguez <ski.rodriguez@gmail.com>
> wrote:
>
>> I am not 100% sure as I haven't tried this out, but there is a huge difference
>> between the two. Both foreach and collect are actions regardless of
>> whether or not the data frame is empty.
>>
>> Doing a collect will bring all the results back to the driver, possibly
>> forcing it to run out of memory. Foreach will apply your function to each
>> element of the DataFrame, but will do so across the cluster. This behavior
>> is useful when you need to do something custom for each element
>> (perhaps save to a db for which there is no driver, or make an http
>> request per element; be careful here, though, because of the overhead cost).
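
A minimal sketch of the difference described in the paragraph above, under
some assumptions that are not from the thread: Spark 2.x, a local master, and
a long accumulator as one way to get a result back from foreach. The object
name ForeachVsCollect and the accumulator name are hypothetical.

```scala
import org.apache.spark.sql.{Row, SparkSession}

object ForeachVsCollect {
  case class Person(age: Long, name: String)

  def run(): (Seq[Long], Long) = {
    // Assumption: Spark 2.x running with a local master, for illustration only.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("foreach-vs-collect")
      .getOrCreate()
    import spark.implicits._
    try {
      val df = Seq(Person(24, "pedro"), Person(22, "fritz")).toDF()

      // collect: every selected row is copied into driver memory,
      // so this is only safe when the result is small.
      val ages = df.select("age").collect().map(_.getLong(0)).toSeq

      // foreach: the function runs on the executors. An accumulator is one
      // way to report something back to the driver.
      val total = spark.sparkContext.longAccumulator("ageTotal")
      val addAge: Row => Unit = row => total.add(row.getLong(0))
      df.foreach(addAge)

      (ages, total.value.longValue)
    } finally spark.stop()
  }
}
```

The function is bound to a val of type Row => Unit before being passed to
foreach so that the Scala overload of foreach is selected unambiguously.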
>>
>> In your example, I am going to assume that hrecords is something like a
>> list buffer. The reason it will be empty is that each worker gets sent its
>> own empty copy of the list (it's captured in the closure for foreach) and
>> appends to that copy. The instance of the list at the driver doesn't know
>> about what happened at the workers, so it stays empty.
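
The closure point above can be illustrated without Spark at all. This is
only a sketch: it uses plain Java serialization as a stand-in for shipping a
task to a worker, and the names Appender, ClosureCopyDemo and shipToWorker are
hypothetical. Spark's real task shipping is more involved, but the effect on a
captured buffer is the same kind of copy.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the function passed to foreach:
// it captures a mutable buffer and appends to it.
class Appender(val buf: ArrayBuffer[Long]) extends (Long => Unit) with Serializable {
  def apply(age: Long): Unit = buf += age
}

object ClosureCopyDemo {
  // Serialize and deserialize a value, which yields an independent copy.
  // This imitates a closure being sent from the driver to a worker.
  def shipToWorker[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  // Returns (contents of the driver-side buffer, contents of the worker-side copy).
  def run(): (List[Long], List[Long]) = {
    val driverBuf = ArrayBuffer.empty[Long]
    val onDriver = new Appender(driverBuf)
    val onWorker = shipToWorker(onDriver) // the copy carries its own empty buffer
    Seq(24L, 22L).foreach(onWorker)       // appends go to the copy only
    (driverBuf.toList, onWorker.buf.toList)
  }
}
```

Calling ClosureCopyDemo.run() returns an empty list for the driver-side buffer
and List(24, 22) for the worker-side copy, which is the situation described
for hrecords.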
>>
>> I don't see why Chanh's comment applies here, since I am guessing the df
>> is not empty.
>>
>> On Tue, Jul 26, 2016 at 1:53 AM, kevin <kiss.kevin119@gmail.com> wrote:
>>
>>> thank you Chanh
>>>
>>> 2016-07-26 15:34 GMT+08:00 Chanh Le <giaosudau@gmail.com>:
>>>
>>>> Hi Ken,
>>>>
>>>> *blacklistDF -> just DataFrame *
>>>> Spark is lazy: only when you call something like *collect, take, write*
>>>> will it execute the whole process, *including any map or filter you do
>>>> before you collect*.
>>>> That means until you call collect, Spark *does nothing*, so your df would
>>>> not have any data -> can’t call foreach.
>>>> Calling collect executes the process -> get data -> foreach is ok.
>>>>
>>>>
>>>> On Jul 26, 2016, at 2:30 PM, kevin <kiss.kevin119@gmail.com> wrote:
>>>>
>>>>  blacklistDF.collect()
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn:
>> https://www.linkedin.com/in/pedrorodriguezscience
>>
>>
>


-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience
