spark-user mailing list archives

From Ashok Kumar <ashok34...@yahoo.com.INVALID>
Subject Re: Splitting columns from a text file
Date Mon, 05 Sep 2016 15:21:38 GMT
Thanks everyone.
I am not as skilled as you gentlemen.
This is what I did:
1) Read the text file
val textFile = sc.textFile("/tmp/myfile.txt")

2) That produces an RDD of String.
3) Create a DF after splitting the file into an Array 
val df = textFile.map(line => line.split(",")).map(x=>(x(0).toInt,x(1).toString,x(2).toDouble)).toDF
4) Create a class for column headers
 case class Columns(col1: Int, col2: String, col3: Double)
5) Assign the column headers 
val h = df.map(p => Columns(p(0).toString.toInt, p(1).toString, p(2).toString.toDouble))
6) Only interested in column 3 > 50
 h.filter(col("Col3") > 50.0)
7) Now I just want Col3 only
h.filter(col("Col3") > 50.0).select("col3").show(5)
+-----------------+
|             col3|
+-----------------+
|95.42536350467836|
|61.56297588648554|
|76.73982017179868|
|68.86218120274728|
|67.64613810115105|
+-----------------+
only showing top 5 rows
Does that make sense? Are there shorter ways, gurus? Can I do all this on the RDD without
a DF?
Thanking you
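
On the RDD-only question, one possible shorter route is to parse each line straight into the case class and then filter, which skips the intermediate DF. The sketch below is only an illustration under stated assumptions: a plain Scala Seq stands in for the RDD so that it runs without a Spark installation, the object name RddOnlySketch and the helper parse are made up for this example, and the three rows are copied from the sample file quoted further down the thread.

```scala
object RddOnlySketch {
  // Same shape as the case class in step 4 of the message.
  case class Columns(col1: Int, col2: String, col3: Double)

  // Hypothetical helper: split one comma-separated line and convert each field.
  def parse(line: String): Columns = {
    val f = line.split(",").map(_.trim)
    Columns(f(0).toInt, f(1), f(2).toDouble)
  }

  // Stand-in for sc.textFile(...): three rows from the sample file in the thread.
  val lines: Seq[String] = Seq(
    "74,20160905-133143,98.11218069128827594148",
    "75,20160905-133143,49.52776998815916807742",
    "76,20160905-133143,56.08029957123980984556"
  )

  // Parse, keep rows whose third column exceeds 50, then project that column.
  val col3Over50: Seq[Double] =
    lines.map(parse).filter(_.col3 > 50.0).map(_.col3)
}
```

RDD exposes map and filter with the same shape, so in spark-shell the equivalent chain would presumably be sc.textFile("/tmp/myfile.txt").map(parse).filter(_.col3 > 50.0).map(_.col3).take(5), with no DF involved.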




 

    On Monday, 5 September 2016, 15:19, ayan guha <guha.ayan@gmail.com> wrote:
 

 Then you need to refer to the third element of the array, convert it to your desired data type,
and then use filter. 
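
A minimal sketch of that advice, assuming the lines have already been split on ",". A plain Scala Seq stands in for the RDD of arrays, the names ThirdColumnFilter and third are hypothetical, and the "bad,row" entry is an invented malformed line added only to show why a guarded conversion can be useful.

```scala
import scala.util.Try

object ThirdColumnFilter {
  // Rows as they look after the first map: each line already split on ",".
  val rows: Seq[Array[String]] = Seq(
    "74,20160905-133143,98.11218069128827594148",
    "75,20160905-133143,49.52776998815916807742",
    "bad,row" // hypothetical malformed line, not from the thread
  ).map(_.split(","))

  // Third field as a Double; None if the field is missing or not numeric.
  def third(cols: Array[String]): Option[Double] =
    Try(cols(2).trim.toDouble).toOption

  // Keep only the rows whose third column is greater than 50.0.
  val kept: Seq[Array[String]] =
    rows.filter(cols => third(cols).exists(_ > 50.0))
}
```

If every line is known to be well formed, the guard can be dropped and the predicate reduces to cols => cols(2).toDouble > 50.0.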

On Tue, Sep 6, 2016 at 12:14 AM, Ashok Kumar <ashok34668@yahoo.com> wrote:

Hi, I want to filter them by value.
This is what is in the array:
74,20160905-133143,98.11218069128827594148

I want to filter anything > 50.0 in the third column
Thanks

 

    On Monday, 5 September 2016, 15:07, ayan guha <guha.ayan@gmail.com> wrote:
 

 Hi
x.split returns an array. So, after the first map, you will get an RDD of arrays. What is your expected
outcome of the 2nd map? 
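
To illustrate the point about arrays: the elements of an Array[String] are read by index, as in x(0), while getString is a method of Spark SQL's Row and does not exist on arrays. The sketch below uses one row from the sample file and plain Scala only; the object name SplitSketch is made up for the example.

```scala
object SplitSketch {
  // One row from the sample file quoted in the thread.
  val line: String = "74,20160905-133143,98.11218069128827594148"

  // split returns Array[String]; every field is still a String here.
  val fields: Array[String] = line.split(",")

  // Index with parentheses rather than getString(0).
  val first: String = fields(0)

  // Convert individual fields to the types you need.
  val id: Int = fields(0).toInt
  val value: Double = fields(2).toDouble
}
```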
On Mon, Sep 5, 2016 at 11:30 PM, Ashok Kumar <ashok34668@yahoo.com.invalid> wrote:

Thank you sir.
This is what I get
scala> textFile.map(x=> x.split(","))
res52: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[27] at map at <console>:27
How can I work on individual columns? I understand they are strings.
scala> textFile.map(x=> x.split(",")).map(x => (x.getString(0))
     | )
<console>:27: error: value getString is not a member of Array[String]
       textFile.map(x=> x.split(",")).map(x => (x.getString(0))
regards

 

    On Monday, 5 September 2016, 13:51, Somasundaram Sekar <somasundar.sekar@tigeranalytics.com> wrote:
 

 Basic error: you get back an RDD on transformations like map, so split has to be applied to each line inside the map.
sc.textFile("filename").map(x => x.split(","))
On 5 Sep 2016 6:19 pm, "Ashok Kumar" <ashok34668@yahoo.com.invalid> wrote:

Hi,
I have a text file as below that I read in
74,20160905-133143,98.11218069128827594148
75,20160905-133143,49.52776998815916807742
76,20160905-133143,56.08029957123980984556
77,20160905-133143,46.63689526544407522777
78,20160905-133143,84.88227141164402181551
79,20160905-133143,68.72408602520662115000
val textFile = sc.textFile("/tmp/mytextfile.txt")
Now I want to split the rows separated by ","
scala> textFile.map(x=>x.toString).split(",")
<console>:27: error: value split is not a member of org.apache.spark.rdd.RDD[String]
       textFile.map(x=>x.toString).split(",")
However, the above throws an error.
Any ideas what is wrong, or how I can do this without converting it to String?
Thanking



   



-- 
Best Regards,
Ayan Guha


   





   