sc.textFile("filename").map(_.split(",")).filter(arr => arr.length == 3 && arr(2).toDouble > 50).collect

This will give you an Array[Array[String]]; do with it as you wish. And please read up on RDDs.
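
One caveat about the one-liner above: arr(2).toDouble throws a NumberFormatException on any row whose third field is not numeric, and that would fail the whole job. A minimal sketch of a more defensive predicate follows, written against plain Scala collections so it runs without a cluster. The name thirdFieldAbove and the sample rows are made up for illustration.

```scala
import scala.util.Try

// Hypothetical helper: true only when the row has exactly three fields
// and the third one parses as a Double greater than the threshold.
def thirdFieldAbove(threshold: Double)(arr: Array[String]): Boolean =
  arr.length == 3 && Try(arr(2).trim.toDouble).toOption.exists(_ > threshold)

// Illustrative rows (not from a real file); the last two are malformed.
val sampleRows = Seq(
  "74,20160905-133143,98.11",
  "75,20160905-133143,49.52",
  "76,20160905-133143,not-a-number",
  "77,20160905-133143"
)

// Split each line, then keep only the rows that pass the predicate.
val kept = sampleRows.map(_.split(",")).filter(thirdFieldAbove(50.0))
kept.foreach(arr => println(arr.mkString(",")))
```

In spark-shell the same predicate should drop in as sc.textFile("filename").map(_.split(",")).filter(thirdFieldAbove(50.0)), assuming the file has the three-column layout discussed in this thread.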


On 5 Sep 2016 8:51 pm, "Ashok Kumar" <ashok34668@yahoo.com> wrote:
Thanks everyone.

I am not skilled like you gentlemen

This is what I did

1) Read the text file

val textFile = sc.textFile("/tmp/myfile.txt")

2) That produces an RDD of String.

3) Create a DF after splitting the file into an Array 

val df = textFile.map(line => line.split(",")).map(x=>(x(0).toInt,x(1).toString,x(2).toDouble)).toDF

4) Create a class for column headers

 case class Columns(col1: Int, col2: String, col3: Double)

5) Assign the column headers 

val h = df.map(p => Columns(p(0).toString.toInt, p(1).toString, p(2).toString.toDouble))

6) Only interested in column 3 > 50

 h.filter(col("Col3") > 50.0)

7) Now I just want Col3 only

h.filter(col("Col3") > 50.0).select("col3").show(5)
+-----------------+
|             col3|
+-----------------+
|95.42536350467836|
|61.56297588648554|
|76.73982017179868|
|68.86218120274728|
|67.64613810115105|
+-----------------+
only showing top 5 rows

Does that make sense? Are there shorter ways, gurus? Can I just do all this on an RDD without a DF?
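
On the question of a shorter route: steps 3 to 5 can probably be collapsed by declaring the case class first and mapping each line straight into it, after which filtering and projecting need only ordinary filter and map, with no DataFrame. The sketch below uses a plain Seq and made-up sample lines so that it runs anywhere. parse is a hypothetical helper, and the assumption is that the same functions work unchanged on the RDD returned by sc.textFile.

```scala
// Same shape as the Columns class in step 4.
case class Columns(col1: Int, col2: String, col3: Double)

// Hypothetical helper: assumes every line has three well-formed fields.
def parse(line: String): Columns = {
  val a = line.split(",")
  Columns(a(0).toInt, a(1), a(2).toDouble)
}

// Illustrative data standing in for the contents of the text file.
val fileLines = Seq(
  "74,20160905-133143,98.11",
  "75,20160905-133143,49.52",
  "76,20160905-133143,56.08"
)

// Parse, keep rows with col3 > 50.0, then keep only col3.
val col3Values = fileLines.map(parse).filter(_.col3 > 50.0).map(_.col3)
col3Values.foreach(println)
```

On an RDD this would read textFile.map(parse).filter(_.col3 > 50.0).map(_.col3).take(5). If a DataFrame is still wanted, calling toDF on the mapped RDD should pick up the col1, col2 and col3 names from the case class fields, which would make the separate header step unnecessary.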

Thanking you







On Monday, 5 September 2016, 15:19, ayan guha <guha.ayan@gmail.com> wrote:


Then you need to refer to the third element of the array, convert it to your desired data type, and then use filter.


On Tue, Sep 6, 2016 at 12:14 AM, Ashok Kumar <ashok34668@yahoo.com> wrote:
Hi,
I want to filter them for values.

This is what is in array

74,20160905-133143,98.11218069128827594148

I want to filter anything > 50.0 in the third column

Thanks




On Monday, 5 September 2016, 15:07, ayan guha <guha.ayan@gmail.com> wrote:


Hi

x.split returns an array. So, after the first map, you will get an RDD of arrays. What is your expected outcome from the second map?
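
To make that concrete: each element produced by the first map is an ordinary Array[String], so a field is read by index, as in x(0), and converted with methods such as toInt or toDouble. getString belongs to Spark SQL's Row, not to arrays. A small sketch in plain Scala, using a line in the format of the sample data:

```scala
// One line in the format of the input file.
val line = "74,20160905-133143,98.11218069128827594148"

// split returns Array[String]; index it with parentheses.
val fields: Array[String] = line.split(",")

// Convert each field to the type it is meant to have.
val id: Int = fields(0).toInt
val timestamp: String = fields(1)
val value: Double = fields(2).toDouble

println(s"id=$id timestamp=$timestamp value=$value")
```

Inside Spark the equivalent second map would be textFile.map(_.split(",")).map(x => x(0)).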

On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar <ashok34668@yahoo.com.invalid> wrote:
Thank you sir.

This is what I get

scala> textFile.map(x=> x.split(","))
res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at map at <console>:27

How can I work on individual columns? I understand they are strings.

scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
     | )
<console>:27: error: value getString is not a member of Array[String]
       textFile.map(x=> x.split(",")).map(x => (x.getString(0))

regards




On Monday, 5 September 2016, 13:51, Somasundaram Sekar <somasundar.sekar@tigeranalytics.com> wrote:


Basic error: transformations like map give you back an RDD, and an RDD has no split method. Call split on each line inside the map instead:
sc.textFile("filename").map(x => x.split(","))

On 5 Sep 2016 6:19 pm, "Ashok Kumar" <ashok34668@yahoo.com.invalid> wrote:
Hi,

I have a text file as below that I read in

74,20160905-133143,98.11218069128827594148
75,20160905-133143,49.52776998815916807742
76,20160905-133143,56.08029957123980984556
77,20160905-133143,46.63689526544407522777
78,20160905-133143,84.88227141164402181551
79,20160905-133143,68.72408602520662115000

val textFile = sc.textFile("/tmp/mytextfile.txt")

Now I want to split the rows separated by ","

scala> textFile.map(x=>x.toString).split(",")
<console>:27: error: value split is not a member of org.apache.spark.rdd.RDD[String]
       textFile.map(x=>x.toString).split(",")

However, the above throws an error.

Any ideas what is wrong or how I can do this if I can avoid converting it to String?

Thanking






--
Best Regards,
Ayan Guha




