spark-user mailing list archives

From rzykov <>
Subject Re: Optimizing text file parsing, many small files versus few big files
Date Thu, 20 Nov 2014 08:51:36 GMT
You could use combineTextFile from ru.retailrocket.spark.multitool (the Loaders helper imported in the code below).
It combines input files before the mappers by means of Hadoop's
CombineFileInputFormat. In our case it reduced the number of mappers from
100000 to approximately 3000 and made the job significantly faster.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

import ru.retailrocket.spark.multitool.Loaders._

object Tst {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("My App")
    val sc = new SparkContext(conf)

    val path = "file:///test/*"
    val sessions = sc.combineTextFile(path)
    // or: val sessions = sc.combineTextFile(path, size = 256, delim = "\n")
    // where size is the split size in megabytes and delim is the line-break string
  }
}
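If you would rather not add the multitool dependency, the same batching can be done with Hadoop's built-in CombineTextInputFormat (mapreduce API, Hadoop 2.x) through Spark's standard newAPIHadoopFile. This is a sketch, not the library's actual implementation; the path glob and the 256 MB split cap are placeholders to tune for your cluster:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object CombineDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("Combine demo"))

    // Cap each combined split at 256 MB (the property takes bytes).
    val hadoopConf = new Configuration(sc.hadoopConfiguration)
    hadoopConf.set("mapreduce.input.fileinputformat.split.maxsize",
      (256L * 1024 * 1024).toString)

    val sessions = sc
      .newAPIHadoopFile(
        "file:///test/*",                 // placeholder glob over many small files
        classOf[CombineTextInputFormat],  // packs many small files into one split
        classOf[LongWritable],
        classOf[Text],
        hadoopConf)
      .map { case (_, line) => line.toString } // drop byte offsets, keep the lines

    println(s"partitions: ${sessions.partitions.length}")
    sc.stop()
  }
}
```

The partition count printed at the end is the number of map tasks the job will launch, so it is a quick way to confirm the small files were actually combined.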

