spark-user mailing list archives

From Pedro Rodriguez <ski.rodrig...@gmail.com>
Subject Re: Dataset Select Function after Aggregate Error
Date Sat, 18 Jun 2016 03:59:12 GMT
Thanks Xinh and Takeshi,

I am trying to avoid map, since my impression is that it uses a Scala
closure and so is not optimized as well as column-wise operations are.
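To illustrate the difference I mean (a sketch in Spark 2.0-style syntax; the
case class, field names, and session setup are placeholders, not from the
thread):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical row type with a `uid` field, as in my example below.
case class Event(uid: String, value: Long)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(Event("a", 1L), Event("a", 2L), Event("b", 3L)).toDS()

// Lambda style: the closure body is opaque to the Catalyst optimizer,
// so each full row is deserialized just to pull out `uid`.
val viaMap = ds.map(_.uid)

// Column style: the projection is expressed as a Column expression,
// which Catalyst can analyze and push down, skipping unused fields.
val viaSelect = ds.select($"uid".as[String])
```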

Looks like the $ notation is the way to go, thanks for the help. Is there
an explanation of how it works? I imagine it is a method/function named $
defined somewhere in Scala?
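(For anyone else wondering: $ is a custom Scala string interpolator, not
special syntax. Spark's SQLImplicits, pulled in via import spark.implicits._,
defines roughly the following; this is a simplified sketch of the real
definition, not the exact source.)

```scala
import org.apache.spark.sql.ColumnName

// A string interpolator is just an implicit class wrapping StringContext.
// With this in scope, $"count" desugars to StringContext("count").$(),
// which builds a ColumnName (a Column addressed by name).
implicit class StringToColumn(val sc: StringContext) {
  def $(args: Any*): ColumnName = new ColumnName(sc.s(args: _*))
}

// So ds.select($"_1") is equivalent to ds.select(new ColumnName("_1")).
```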

Lastly, are there prelim Spark 2.0 docs? If there isn't a good
description/guide of using this syntax I would be willing to contribute
some documentation.

Pedro

On Fri, Jun 17, 2016 at 8:53 PM, Takeshi Yamamuro <linguin.m.s@gmail.com>
wrote:

> Hi,
>
> In 2.0, you can say;
> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
> ds.groupBy($"_1").count.select($"_1", $"count").show
>
>
> // maropu
>
>
> On Sat, Jun 18, 2016 at 7:53 AM, Xinh Huynh <xinh.huynh@gmail.com> wrote:
>
>> Hi Pedro,
>>
>> In 1.6.1, you can do:
>> ds.groupBy(_.uid).count().map(_._1)
>> or
>> ds.groupBy(_.uid).count().select($"value".as[String])
>>
>> It doesn't have the exact same syntax as for DataFrame.
>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
>>
>> It might be different in 2.0.
>>
>> Xinh
>>
>> On Fri, Jun 17, 2016 at 3:33 PM, Pedro Rodriguez <ski.rodriguez@gmail.com
>> > wrote:
>>
>>> Hi All,
>>>
>>> I am working with Datasets in 1.6.1 and will move to 2.0 when it's
>>> released.
>>>
>>> I am running the aggregate code below on a dataset whose rows have a
>>> uid field:
>>>
>>> ds.groupBy(_.uid).count()
>>> // res0: org.apache.spark.sql.Dataset[(String, Long)] = [_1: string,
>>> _2: bigint]
>>>
>>> This works as expected; however, attempting to run a select afterwards
>>> fails:
>>> ds.groupBy(_.uid).count().select(_._1)
>>> // error: missing parameter type for expanded function ((x$2) => x$2._1)
>>>
>>> I have tried several variants, but nothing seems to work. Below is the
>>> equivalent Dataframe code which works as expected:
>>> df.groupBy("uid").count().select("uid")
>>>
>>> Thanks!
>>> --
>>> Pedro Rodriguez
>>> PhD Student in Distributed Machine Learning | CU Boulder
>>> UC Berkeley AMPLab Alumni
>>>
>>> ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
>>> Github: github.com/EntilZha | LinkedIn:
>>> https://www.linkedin.com/in/pedrorodriguezscience
>>>
>>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>



-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience
