spark-user mailing list archives

From rzykov <rzy...@gmail.com>
Subject Re: Optimizing text file parsing, many small files versus few big files
Date Thu, 20 Nov 2014 08:51:36 GMT
You could use combineTextFile from
https://github.com/RetailRocket/SparkMultiTool
It combines input files before the mappers by means of Hadoop's
CombineFileInputFormat. In our case it reduced the number of mappers from
about 100,000 to roughly 3,000 and made the job significantly faster.

Example:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

import ru.retailrocket.spark.multitool.Loaders._

object Tst {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("My App")
    val sc = new SparkContext(conf)

    val path = "file:///test/*"
    val sessions = sc.combineTextFile(path)
    // or: val sessions = sc.combineTextFile(path, size = 256, delim = "\n")
    // where size is the split size in megabytes and delim is the line-break string

    println(sessions.count())

    sc.stop()
  }
}
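If you would rather not pull in an extra dependency, a similar effect can be had by
driving Hadoop's CombineTextInputFormat directly through newAPIHadoopFile. This is
only a rough sketch, not the SparkMultiTool code path; it assumes a Hadoop 2.x
classpath, and the split-size property name is my assumption:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

import org.apache.spark.{SparkConf, SparkContext}

object TstCombine {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("My App")
    val sc = new SparkContext(conf)

    // Cap the combined split size at 256 MB
    // (property name assumed for Hadoop 2.x CombineFileInputFormat)
    sc.hadoopConfiguration.setLong(
      "mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024)

    // CombineTextInputFormat packs many small files into each split,
    // so far fewer map tasks are launched than with plain sc.textFile
    val sessions = sc
      .newAPIHadoopFile(
        "file:///test/*",
        classOf[CombineTextInputFormat],
        classOf[LongWritable],
        classOf[Text])
      .map { case (_, line) => line.toString } // copy out of the reused Writable

    println(sessions.count())

    sc.stop()
  }
}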






