Looks like it was my own fault. I had Spark 2.0 cloned and built, but the spark-shell on my PATH was from 1.6.1, so that version was being used instead of 2.0. Thanks
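
For anyone else who trips on this, a quick sanity check from inside the shell shows which build is actually running:

// prints the version of whatever spark-shell is on the PATH
sc.version        // e.g. 1.6.1 vs 2.0.0-preview
// in 2.0 a SparkSession is also in scope, so spark.version works too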

On Sat, Jun 18, 2016 at 1:16 AM, Takeshi Yamamuro <linguin.m.s@gmail.com> wrote:
Which version do you use?
It passed for me on 2.0-preview, as follows:
---

Spark context available as 'sc' (master = local[*], app id = local-1466234043659).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-preview
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_31)
Type in expressions to have them evaluated.
Type :help for more information.


scala> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
ds: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]

scala> ds.groupBy($"_1").count.select($"_1", $"count").show
+---+-----+
| _1|count|
+---+-----+
|  1|    1|
|  2|    1|
+---+-----+
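
If you want to mix in functions from org.apache.spark.sql.functions, I think something like this also works in 2.0 (a rough, untested sketch along the same lines; the alias names are just for illustration):

import org.apache.spark.sql.functions._

// assumes spark.implicits._ is in scope (automatic in spark-shell)
val ds = Seq((1, 0), (2, 0), (1, 5)).toDS

// groupBy(Column) gives an untyped grouping, so agg() accepts the usual
// functions (count, max, avg, ...) and the result is a plain DataFrame
ds.groupBy($"_1")
  .agg(count("*").as("cnt"), max($"_2").as("max_2"))
  .select($"_1", $"cnt", $"max_2")
  .show()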



On Sat, Jun 18, 2016 at 3:09 PM, Pedro Rodriguez <ski.rodriguez@gmail.com> wrote:
I went ahead and downloaded/compiled Spark 2.0 to try your code snippet, Takeshi. Unfortunately, it doesn't compile.

scala> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
ds: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]

scala> ds.groupBy($"_1").count.select($"_1", $"count").show
<console>:28: error: type mismatch;
 found   : org.apache.spark.sql.ColumnName
 required: org.apache.spark.sql.TypedColumn[(org.apache.spark.sql.Row, Long),?]
              ds.groupBy($"_1").count.select($"_1", $"count").show
                                                                 ^

I also gave Xinh's suggestion a try using the code snippet below (partially from the Spark docs):
scala> val ds = Seq(Person("Andy", 32), Person("Andy", 2), Person("Pedro", 24), Person("Bob", 42)).toDS()
scala> ds.groupBy(_.name).count.select($"name".as[String]).show
org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [];
scala> ds.groupBy(_.name).count.select($"_1".as[String]).show
org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [];
scala> ds.groupBy($"name").count.select($"_1".as[String]).show
org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input columns: [];

It looks like the columns come back empty for some reason. The code below works fine for the simple aggregate:
scala> ds.groupBy(_.name).count.show

It would be great to see an idiomatic example of mixing aggregates like these with spark.sql.functions.

Pedro

On Fri, Jun 17, 2016 at 9:59 PM, Pedro Rodriguez <ski.rodriguez@gmail.com> wrote:
Thanks Xinh and Takeshi,

I am trying to avoid map, since my impression is that it uses a Scala closure and so is not optimized as well as column-wise operations are.

It looks like the $ notation is the way to go; thanks for the help. Is there an explanation of how this works? I imagine it is a method/function whose name is defined as $ in Scala?
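
My guess is it's roughly something like the sketch below (just my reading of it; the real definition lives in Spark's SQLImplicits, pulled in by the implicits import):

import org.apache.spark.sql.ColumnName

// guess: a string interpolator that turns $"name" into a ColumnName
implicit class StringToColumn(val sc: StringContext) {
  def $(args: Any*): ColumnName = new ColumnName(sc.s(args: _*))
}

// with that implicit in scope, $"count" expands to
// StringContext("count").$() and yields a Column named "count"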

Lastly, are there preliminary Spark 2.0 docs? If there isn't a good description/guide for this syntax, I would be willing to contribute some documentation.

Pedro

On Fri, Jun 17, 2016 at 8:53 PM, Takeshi Yamamuro <linguin.m.s@gmail.com> wrote:
Hi,

In 2.0, you can say:
val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
ds.groupBy($"_1").count.select($"_1", $"count").show
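
If you'd rather stay with the typed API instead of column names, I think groupByKey also works (untested sketch):

// typed alternative: key by a function; count() gives a Dataset[(Int, Long)]
val ds = Seq((1, 0), (2, 0)).toDS
ds.groupByKey(_._1).count().show()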


// maropu


On Sat, Jun 18, 2016 at 7:53 AM, Xinh Huynh <xinh.huynh@gmail.com> wrote:
Hi Pedro,

In 1.6.1, you can do:
>> ds.groupBy(_.uid).count().map(_._1)
or
>> ds.groupBy(_.uid).count().select($"value".as[String])
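
For example, with a made-up record type (a rough sketch from memory; the key column name after count() may differ in your build):

// made-up record type just for illustration
case class Event(uid: String, value: Int)
// assumes sqlContext.implicits._ is in scope (automatic in spark-shell)
val ds = Seq(Event("a", 1), Event("a", 2), Event("b", 3)).toDS()

val counts = ds.groupBy(_.uid).count()   // Dataset[(String, Long)]

// typed map avoids column names entirely
counts.map(_._1).collect()

// typed select needs a TypedColumn, hence .as[String]; "value" is the key
// column name as I remember it, so double-check against your build
counts.select($"value".as[String]).collect()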

It doesn't have exactly the same syntax as DataFrame; see http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset

It might be different in 2.0.

Xinh

On Fri, Jun 17, 2016 at 3:33 PM, Pedro Rodriguez <ski.rodriguez@gmail.com> wrote:
Hi All,

I am working on using Datasets in 1.6.1, and eventually 2.0 when it's released.

I am running the aggregate code below on a Dataset whose rows have a field uid:

ds.groupBy(_.uid).count()
// res0: org.apache.spark.sql.Dataset[(String, Long)] = [_1: string, _2: bigint]

This works as expected; however, attempts to run a select afterwards fail:
ds.groupBy(_.uid).count().select(_._1)
// error: missing parameter type for expanded function ((x$2) => x$2._1)
ds.groupBy(_.uid).count().select(_._1)

I have tried several variants, but nothing seems to work. Below is the equivalent DataFrame code, which works as expected:
df.groupBy("uid").count().select("uid")

Thanks!
--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

--
---
Takeshi Yamamuro

--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni