spark-user mailing list archives

From patcharee <Patcharee.Thong...@uni.no>
Subject write multiple outputs by key
Date Sat, 06 Jun 2015 15:07:46 GMT
Hi,

How can I write to multiple outputs for each key? I tried creating a custom
partitioner (sketched after the snippet below) and defining the number of
partitions, but neither works. Only a few tasks/partitions (equal to the
number of all key combinations) get large datasets; the data is not split
across all tasks/partitions. The job failed because those few tasks handled
datasets that were far too large. Below is my code snippet.

val varWFlatRDD =
  varWRDD.map(FlatMapUtilClass().flatKeyFromWrf)
    .groupByKey()   // keys are (zone, z, year, month)
    .foreach { x =>
      val z = x._1._1
      val year = x._1._2
      val month = x._1._3
      val df_table_4dim = x._2.toList.toDF()
      df_table_4dim.registerTempTable("table_4Dim")
      hiveContext.sql("INSERT OVERWRITE TABLE 4dim PARTITION " +
        "(zone=" + ZONE + ", z=" + z + ", year=" + year + ", month=" + month + ") " +
        "SELECT date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, " +
        "qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl FROM table_4Dim")
    }
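
For reference, the custom partitioner I tried looks roughly like this (a
minimal sketch only; the class name is illustrative):

import org.apache.spark.Partitioner

// Illustrative sketch: spreads keys over a fixed number of partitions,
// but every record sharing one key still lands in the same partition,
// so it cannot break up a single heavy key.
class KeyPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    val mod = key.hashCode % partitions
    if (mod < 0) mod + partitions else mod  // keep the index non-negative
  }
}

Even with this, all values for one (zone, z, year, month) key still go to a
single task, which I suspect is why the large keys overload single tasks.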


From the Spark history UI: at groupByKey there are > 1000 tasks (equal to
the parent's number of partitions); at foreach there are > 1000 tasks as
well, but only 50 tasks (the same as the number of all key combinations)
actually get datasets.
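
To make the skew concrete, here is a toy sketch of what I think is happening
(the numbers are illustrative): after groupByKey, each distinct key collapses
into a single (key, values) record, so at most 50 partitions can be non-empty
no matter how many partitions are requested.

val toy = sc.parallelize(1 to 1000000).map(i => (i % 50, i))  // 50 distinct keys
val grouped = toy.groupByKey(1000)  // ask for 1000 partitions
// count partitions that actually received data: at most 50
val nonEmpty = grouped.mapPartitions(it => Iterator(it.size)).filter(_ > 0).count()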

How can I fix this problem? Any suggestions are appreciated.

BR,
Patcharee




