spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From khwunchai jaengsawang <khwuncha...@ku.th>
Subject Re: [RDDs and Dataframes] Equivalent expressions for RDD API
Date Sun, 05 Mar 2017 08:10:13 GMT
Hi Old-Scool,


For the first question, you can specify the number of partition in any DataFrame by using
repartition(numPartitions: Int, partitionExprs: Column*).
Example:
	val partitioned = data.repartition(numPartitions=10).cache()

For your second question, you can transform your RDD into PairRDD and use reduceByKey()
Example:
	val pairs = data.map(row => (row(1), row(2)).reduceByKey(_+_)


Best,

  Khwunchai Jaengsawang
  Email: khwunchai.j@ku.th
  LinkedIn <https://linkedin.com/in/khwunchai> | Github <https://github.com/khwunchai>


> On Mar 4, 2560 BE, at 8:59 PM, Old-School <giorgos_myrianthous@outlook.com> wrote:
> 
> Hi,
> 
> I want to perform some simple transformations and check the execution time,
> under various configurations (e.g. number of cores being used, number of
> partitions etc). Since it is not possible to set the partitions of a
> dataframe , I guess that I should probably use RDDs. 
> 
> I've got a dataset with 3 columns as shown below:
> 
> val data = file.map(line => line.split(" "))
>              .filter(lines => lines.length == 3) // ignore first line
>              .map(row => (row(0), row(1), row(2)))
>              .toDF("ID", "word-ID", "count")
> results in:
> 
> +------+------------+---------+
> | ID     |  word-ID   |  count   |
> +------+------------+---------+
> |  15   |    87          |   151    |
> |  20   |    19          |   398    |
> |  15   |    19          |   21      |
> |  180 |    90          |   190    |
> +-------------------+---------+
> So how can I turn the above into an RDD in order to use e.g.
> sc.parallelize(data, 10) and set the number of partitions to say 10? 
> 
> Furthermore, I would also like to ask about the equivalent expression (using
> RDD API) for the following simple transformation:
> 
> data.select("word-ID",
> "count").groupBy("word-ID").agg(sum($"count").as("count")).show()
> 
> 
> 
> Thanks in advance
> 
> 
> 
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-and-Dataframes-Equivalent-expressions-for-RDD-API-tp28455.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> 


Mime
View raw message