I am not skilled like you gentlemen

This is what I did

1) Read the text file

val textFile = sc.textFile("/tmp/myfile.txt")

2) That produces an RDD of String.

3) Create a DF after splitting the file into an Array 

val df = textFile.map(line => line.split(",")).map(x => (x(0).toInt, x(1), x(2).toDouble)).toDF

4) Create a class for column headers

 case class Columns(col1: Int, col2: String, col3: Double)

5) Assign the column headers 

val h = df.map(p => Columns(p(0).toString.toInt, p(1).toString, p(2).toString.toDouble))

6) Only interested in column 3 > 50

 h.filter(col("Col3") > 50.0)

7) Now I just want Col3 only

h.filter(col("Col3") > 50.0).select("col3").show(5)
|             col3|
only showing top 5 rows

Does that make sense? Are there shorter ways, gurus? Can I do all of this on the RDD without a DF?
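
One possibly shorter route, sketched under assumptions: map each split line straight into the case class and call toDF once, so steps 3 to 5 collapse into a single chain. The sample rows are copied from the file quoted further down this thread, and parallelize stands in for sc.textFile so the snippet does not depend on the file existing. The app name and local[*] master are placeholders, not something from the original post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Same shape as the case class in step 4.
case class Columns(col1: Int, col2: String, col3: Double)

// In spark-shell getOrCreate() returns the existing session;
// master and appName are assumptions for a standalone run.
val spark = SparkSession.builder.master("local[*]").appName("col3-filter").getOrCreate()
import spark.implicits._

// Sample rows copied from the file quoted in this thread.
val sample = Seq(
  "74,20160905-133143,98.11218069128827594148",
  "75,20160905-133143,49.52776998815916807742",
  "76,20160905-133143,56.08029957123980984556",
  "77,20160905-133143,46.63689526544407522777",
  "78,20160905-133143,84.88227141164402181551",
  "79,20160905-133143,68.72408602520662115000"
)

// Swap parallelize(sample) for textFile("/tmp/myfile.txt") to read the real file.
val h = spark.sparkContext.parallelize(sample)
  .map(_.split(","))
  .map(p => Columns(p(0).toInt, p(1), p(2).toDouble))
  .toDF()

// Keep rows where col3 exceeds 50.0 and project only that column.
val result = h.filter(col("col3") > 50.0).select("col3")
result.show(5)
```

Because the case class already carries the column names, no separate "assign headers" step is needed.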

Then you need to refer to the third element of the array, convert it to your desired data type, and then use filter.
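
A minimal sketch of that advice, using a local Seq of sample lines so it runs in any Scala REPL. An RDD offers the same map and filter calls, so sc.textFile("/tmp/myfile.txt") should be able to stand in for the Seq; the variable names here are illustrative only.

```scala
// Sample lines copied from the file shown in this thread.
val lines = Seq(
  "74,20160905-133143,98.11218069128827594148",
  "75,20160905-133143,49.52776998815916807742",
  "76,20160905-133143,56.08029957123980984556"
)

// Split each line, take the third element (index 2), convert it to Double,
// and keep only the rows whose value exceeds 50.0.
val kept = lines
  .map(_.split(","))
  .filter(cols => cols(2).trim.toDouble > 50.0)

// Print the surviving rows in their original comma-separated form.
kept.foreach(cols => println(cols.mkString(",")))
```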

I want to filter them for values.

This is what is in array

74,20160905-133143,98.11218069128827594148

I want to filter anything > 50.0 in the third column


x.split returns an array. So after the first map you will get an RDD of arrays. What is your expected outcome from the second map?
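
Since each element is a plain Array[String], individual columns are read by index and then converted; getString is a method on Spark's Row, not on Array. A small sketch on a single sample line (the value names are made up for illustration):

```scala
// One sample line from the file in this thread.
val line = "74,20160905-133143,98.11218069128827594148"

// split returns Array[String]; elements are accessed with fields(i).
val fields: Array[String] = line.split(",")

val id: Int = fields(0).toInt          // first column as Int
val timestamp: String = fields(1)      // second column is already a String
val value: Double = fields(2).toDouble // third column as Double

println(s"$id $timestamp $value")
```

Inside an RDD the same indexing goes in the function passed to map, for example map(x => x(0)).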

This is what I get

scala> textFile.map(x => x.split(","))
res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at map at <console>:27

How can I work on individual columns? I understand they are strings.

scala> textFile.map(x => x.split(",")).map(x => (x.getString(0))
     | )
<console>:27: error: value getString is not a member of Array[String]
       textFile.map(x => x.split(",")).map(x => (x.getString(0))


Basic error: transformations like map give you back an RDD, so the split has to happen inside the map.
sc.textFile("filename").map(x => x.split(","))

I have a text file as below that I read in

74,20160905-133143,98.11218069128827594148
75,20160905-133143,49.52776998815916807742
76,20160905-133143,56.08029957123980984556
77,20160905-133143,46.63689526544407522777
78,20160905-133143,84.88227141164402181551
79,20160905-133143,68.72408602520662115000

val textFile = sc.textFile("/tmp/mytextfile.txt")

Now I want to split the rows separated by ","

scala> textFile.map(x => x.toString).split(",")
<console>:27: error: value split is not a member of org.apache.spark.rdd.RDD[String]
       textFile.map(x => x.toString).split(",")

However, the above throws an error.

Any ideas what is wrong or how I can do this if I can avoid converting it to String?


