spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ayan guha <guha.a...@gmail.com>
Subject Re: [RDDs and Dataframes] Equivalent expressions for RDD API
Date Sun, 05 Mar 2017 10:17:53 GMT
Just as best practice, dataframe and datasets are preferred way, so try not
to resort to rdd unless you absolutely have to...

On Sun, 5 Mar 2017 at 7:10 pm, khwunchai jaengsawang <khwunchai.j@ku.th>
wrote:

> Hi Old-Scool,
>
>
> For the first question, you can specify the number of partition in any
> DataFrame by using
> repartition(numPartitions: Int, partitionExprs: Column*).
> *Example*:
> val partitioned = data.repartition(numPartitions=10).cache()
>
> For your second question, you can transform your RDD into PairRDD and use
> reduceByKey()
> *Example:*
> val pairs = data.map(row => (row(1), row(2)).reduceByKey(_+_)
>
>
> Best,
>
>   Khwunchai Jaengsawang
>   *Email*: khwunchai.j@ku.th
>   LinkedIn <https://linkedin.com/in/khwunchai> | Github
> <https://github.com/khwunchai>
>
>
> On Mar 4, 2560 BE, at 8:59 PM, Old-School <giorgos_myrianthous@outlook.com>
> wrote:
>
> Hi,
>
> I want to perform some simple transformations and check the execution time,
> under various configurations (e.g. number of cores being used, number of
> partitions etc). Since it is not possible to set the partitions of a
> dataframe , I guess that I should probably use RDDs.
>
> I've got a dataset with 3 columns as shown below:
>
> val data = file.map(line => line.split(" "))
>              .filter(lines => lines.length == 3) // ignore first line
>              .map(row => (row(0), row(1), row(2)))
>              .toDF("ID", "word-ID", "count")
> results in:
>
> +------+------------+---------+
> | ID     |  word-ID   |  count   |
> +------+------------+---------+
> |  15   |    87          |   151    |
> |  20   |    19          |   398    |
> |  15   |    19          |   21      |
> |  180 |    90          |   190    |
> +-------------------+---------+
> So how can I turn the above into an RDD in order to use e.g.
> sc.parallelize(data, 10) and set the number of partitions to say 10?
>
> Furthermore, I would also like to ask about the equivalent expression
> (using
> RDD API) for the following simple transformation:
>
> data.select("word-ID",
> "count").groupBy("word-ID").agg(sum($"count").as("count")).show()
>
>
>
> Thanks in advance
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-and-Dataframes-Equivalent-expressions-for-RDD-API-tp28455.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>
> --
Best Regards,
Ayan Guha

Mime
View raw message