spark-user mailing list archives

From gtinside <gtins...@gmail.com>
Subject Group by specific key and save as parquet
Date Tue, 01 Sep 2015 00:27:33 GMT
Hi,

I have a data set that I need to group by a specific key and then save as
Parquet. Please refer to the code snippet below: I query the trade table and
then group the rows by date.

val df = sqlContext.sql("SELECT * FROM trade")
val dfSchema = df.schema
val partitionKeyIndex = dfSchema.fieldNames.indexOf("date")
// group the rows by date
val groupedByPartitionKey = df.rdd.groupBy { row =>
  row.getString(partitionKeyIndex)
}
// one Boolean per group: true if saving that group failed
val failure = groupedByPartitionKey.map { row =>
  // sc and sqlContext are used inside a transformation here (nested RDD)
  val rowDF = sqlContext.createDataFrame(sc.parallelize(row._2.toSeq), dfSchema)
  val fileName = config.getTempFileName(row._1)
  try {
    val dest = new Path(fileName)
    if (DefaultFileSystem.getFS.exists(dest)) {
      DefaultFileSystem.getFS.delete(dest, true)
    }
    rowDF.saveAsParquetFile(fileName)
    false
  } catch {
    case e: Throwable =>
      logError("Failed to save parquet file")
      true
  }
}

This code does not work because it creates an RDD inside a transformation on
another RDD (a nested RDD), which Spark does not support. What is the best way
to solve this problem?

Regards,
Gaurav
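
One possible direction for the question above, offered as a sketch rather than a
definitive answer: assuming Spark 1.4 or later, the DataFrame writer can split
the output by a column itself, so no RDD has to be created inside another RDD.
The object name GroupedParquetWriter, the method names, and the basePath
parameter are hypothetical and chosen only for illustration. The output lands in
one sub-directory per key value under basePath rather than in the file names
returned by config.getTempFileName.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper object; not part of the original post or of Spark.
object GroupedParquetWriter {

  // Writes df as Parquet, with one sub-directory per distinct value of keyColumn.
  // The partitioning happens inside the writer, so no nested RDD is needed.
  def saveGroupedAsParquet(df: DataFrame, keyColumn: String, basePath: String): Unit = {
    df.write
      .mode(SaveMode.Overwrite) // replaces existing output at basePath
      .partitionBy(keyColumn)   // e.g. basePath/date=2015-08-31/
      .parquet(basePath)
  }

  // Directory expected for one key value. Assumption: the value has no special
  // characters; Spark escapes those in directory names, which this sketch ignores.
  def partitionDir(basePath: String, keyColumn: String, value: String): String =
    s"$basePath/$keyColumn=$value"
}
```

With the DataFrame from the post, a call could look like
GroupedParquetWriter.saveGroupedAsParquet(df, "date", "/tmp/trade_by_date"),
where the path is a placeholder. Note that SaveMode.Overwrite replaces
everything under basePath, which differs from the per-file delete in the
original snippet.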



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Group-by-specific-key-and-save-as-parquet-tp24527.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
