From: Old-School <giorgos_myrianth...@outlook.com>
Subject: [RDDs and Dataframes] Equivalent expressions for RDD API
Date: Sat, 04 Mar 2017 13:59:31 GMT
Hi,

I want to perform some simple transformations and measure the execution time
under various configurations (e.g. number of cores used, number of partitions,
etc.). Since it is not possible to set the number of partitions of a DataFrame
directly, I guess I should probably use RDDs.
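
For reference, file (used in the snippet below) is read from a plain text file,
roughly like this; the path and the minPartitions value are just placeholders:

val file = sc.textFile("hdfs:///path/to/data.txt", 10) // ask for ~10 input partitions up front
println(file.getNumPartitions)                         // check how many partitions were actually created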

I've got a dataset with 3 columns as shown below:

import spark.implicits._ // needed for .toDF

val data = file.map(line => line.split(" "))
  .filter(fields => fields.length == 3) // keep only rows with exactly 3 fields (drops the header line)
  .map(fields => (fields(0), fields(1), fields(2)))
  .toDF("ID", "word-ID", "count")
results in:

+---+-------+-----+
| ID|word-ID|count|
+---+-------+-----+
| 15|     87|  151|
| 20|     19|  398|
| 15|     19|   21|
|180|     90|  190|
+---+-------+-----+
So how can I turn the above into an RDD, so that I can use e.g.
sc.parallelize(data, 10) and set the number of partitions to, say, 10?
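
The closest I have come up with so far is the sketch below, which goes through
the RDD that backs the DataFrame and repartitions it, but I am not sure this is
the right (or idiomatic) approach:

val rdd = data.rdd    // the RDD[org.apache.spark.sql.Row] behind the DataFrame
  .repartition(10)    // force 10 partitions
println(rdd.getNumPartitions)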

I would also like to ask about the equivalent expression (using the RDD API)
for the following simple transformation:

data.select("word-ID",
"count").groupBy("word-ID").agg(sum($"count").as("count")).show()



Thanks in advance




