spark-user mailing list archives

From "deenar.toraskar" <>
Subject Re: How to save as a single file efficiently?
Date Sat, 22 Mar 2014 00:50:28 GMT

Apologies for hijacking this thread.


On the subject of processing lots (millions) of small input files on HDFS,
what are the best practices to follow in Spark? Currently my code looks
something like this. Without coalesce there is one task and one output file
per input file; adding coalesce reduces the number of output files. I have
used mapValues because the map step preserves partitioning. Do I need a
coalesce before the first map as well?

import scala.util.{Try, Success}

val dataRDD = sc.newAPIHadoopRDD(conf,
  classOf[...], classOf[...], classOf[...])  // input format, key and value classes (elided)
val data = dataRDD.map(row => (row._1.toString,
  Try(rawSdosParser(row._2.toString(), null)))).coalesce(100)
val datatoLoad = data.filter(_._2.isSuccess).mapValues {
  case Success(s) => Try(s.iterator.toList)
}
val datatoSave = datatoLoad.filter(_._2.isSuccess).mapValues {
  case Success(s) => s
}
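For reference, a self-contained sketch of how coalesce placement affects the
number of map tasks; it uses parallelize with 1000 slices in place of the
Hadoop input (the local master and the partition counts are assumptions for
illustration, not my actual job):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("coalesce-sketch"))

// 1000 partitions stand in for 1000 small input files.
val raw = sc.parallelize(1 to 1000, numSlices = 1000)

// Coalescing after the map: the map still runs as 1000 tasks,
// then results are combined into 100 partitions for output.
val mappedThenCoalesced = raw.map(_ * 2).coalesce(100)

// Coalescing before the map: the map itself runs as only 100 tasks.
val coalescedThenMapped = raw.coalesce(100).map(_ * 2)

println(mappedThenCoalesced.partitions.size)  // 100
println(coalescedThenMapped.partitions.size)  // 100
```

Both end up with 100 output partitions, so the difference only shows up in
the task count of the map stage, not in the final file count.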

