spark-user mailing list archives

From Grega Kešpret <>
Subject Large input file problem
Date Sat, 12 Oct 2013 23:51:29 GMT

I'm getting a mix of failures: Java OOM errors (heap space, GC overhead
limit exceeded), "Futures timed out after [10000] milliseconds", "removing
BlockManager with no recent heartbeat", and so on. I have narrowed the cause
down to a big input file from S3. I'm trying to make Spark split this file
into several smaller chunks so that each chunk fits in memory, but so far
without luck.

I have tried:
- passing a minSplits value greater than 1 to sc.textFile
- increasing the numPartitions parameter of groupByKey
- using coalesce with numPartitions greater than 1 and shuffle = true

Basically my flow is like this:
val input = sc.textFile("s3n://.../input.gz", minSplits)
  .mapPartitions(l => (key, l))

If I call input.toDebugString, it always shows 1 partition (even when
minSplits is greater than 1). It seems that Spark is trying to ingest the
whole input at once. When I manually split the file into several smaller
ones, I was able to proceed successfully, and input.toDebugString showed
10 partitions for 10 files.
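
For reference, the manual split mentioned above can be sketched roughly as
follows. This is only an illustration, not the exact procedure I used: the
object name SplitGzip, the "-part-NNNNN.gz" naming scheme, and the
round-robin distribution of lines are all assumptions made for the example,
and it works on local files using only the JDK's java.util.zip classes.

```scala
import java.io.{BufferedReader, BufferedWriter, FileInputStream, FileOutputStream, InputStreamReader, OutputStreamWriter}
import java.nio.charset.StandardCharsets
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Hypothetical helper: rewrites one gzipped text file as several smaller
// gzipped files, so that each output file can be read as its own partition.
object SplitGzip {
  /** Distributes the lines of `inputPath` round-robin over `parts` gzipped
    * files and returns the paths of the files written. */
  def split(inputPath: String, outputPrefix: String, parts: Int): Seq[String] = {
    require(parts > 0, "parts must be positive")
    // Assumed naming scheme: <prefix>-part-00000.gz, <prefix>-part-00001.gz, ...
    val outPaths = (0 until parts).map(i => f"${outputPrefix}-part-${i}%05d.gz")
    val writers = outPaths.map { p =>
      new BufferedWriter(new OutputStreamWriter(
        new GZIPOutputStream(new FileOutputStream(p)), StandardCharsets.UTF_8))
    }
    val reader = new BufferedReader(new InputStreamReader(
      new GZIPInputStream(new FileInputStream(inputPath)), StandardCharsets.UTF_8))
    try {
      var n = 0L
      var line = reader.readLine()
      while (line != null) {
        // Line n goes to output file n mod parts.
        val w = writers((n % parts).toInt)
        w.write(line)
        w.write("\n")
        n += 1
        line = reader.readLine()
      }
    } finally {
      reader.close()
      writers.foreach(_.close())
    }
    outPaths
  }
}
```

Calling SplitGzip.split("input.gz", "input", 10) would write input-part-00000.gz
through input-part-00009.gz. Round-robin does not preserve line order across
files; if order matters, contiguous ranges of lines would be the better choice.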


