Being able to read gzip files and gzip files not being splittable are two different, orthogonal things, and both statements are correct.

Spark uses the HDFS APIs to read from HDFS and uses the available codec to
decompress the files.

However, it is the gzip format itself that doesn't lend itself to being split among multiple mappers in a meaningful way.

Now if your gzipped files aren't too small (<< 1 GB) or too large (>> 10 GB, say) and there are a lot of them, then it should be okay. If not, then you could use Snappy compression, if you have that flexibility, since Snappy-compressed files are splittable.
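To see why gzip can't be split, note that a gzip file is a single DEFLATE stream with one header at the front and no synchronization markers inside, so a reader can't start decompressing at an arbitrary byte offset. A small sketch using only the Python standard library (not Spark itself) illustrates the point:

```python
import gzip

# Build a gzip file in memory from some sample lines.
raw = b"".join(b"line %d\n" % i for i in range(1000))
compressed = gzip.compress(raw)

# Reading from the start of the stream works fine.
assert gzip.decompress(compressed) == raw

# But a "split" that begins mid-stream cannot be decompressed:
# there is no gzip header and no way to resynchronize the
# DEFLATE stream at an arbitrary offset.
second_half = compressed[len(compressed) // 2:]
try:
    gzip.decompress(second_half)
    split_readable = True
except (OSError, EOFError):
    split_readable = False

print(split_readable)  # False: a mapper handed only this half could do nothing with it
```

This is exactly why Hadoop/Spark assign a whole .gz file to a single task: only a reader that starts at byte 0 can decode it.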

Hope this helps.

On Oct 21, 2013 12:59 AM, "Ramkumar Chokkalingam" <ramkumar.au@gmail.com> wrote:
Hello group,

I have .gz files as part of my input, and when reading up on the support for gzip files, I stumbled upon this thread on StackOverflow which says that Spark supports .gz files. But a few days back I saw a mail thread here in the group pointing to this link and claiming that Spark does not handle .gz files because they are not splittable.

These two claims seem contradictory. Can anyone confirm the actual behavior? Thanks!


Ramkumar Chokkalingam,
University of Washington.