spark-user mailing list archives

From Ashish Rangole <arang...@gmail.com>
Subject Re: Support for gz files ?
Date Mon, 21 Oct 2013 13:13:28 GMT
Ramkumar,

Being able to read gzip files and gzip files not being splittable are two
different, orthogonal things, and both statements are correct.

Spark reads files from HDFS through the HDFS APIs and uses whatever codec
is available to decompress them.

However, it is the gzip format itself that doesn't lend itself to being
split among multiple mappers in a meaningful way.
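That point can be illustrated with a quick sketch using Python's standard gzip module (nothing Spark-specific here; the sample data is made up): a gzip stream decompresses fine from byte 0, but not from an arbitrary offset, because the format has no sync markers a reader could seek to.

```python
import gzip
import zlib

data = b"hello world " * 1000
blob = gzip.compress(data)

# A gzip stream decompresses fine when read from the beginning...
assert gzip.decompress(blob) == data

# ...but not from an arbitrary midpoint: there are no sync markers,
# so a reader must start at byte 0. This is why one large .gz file
# cannot be split among multiple mappers -- a split starting mid-file
# has no valid place to begin decompressing.
try:
    gzip.decompress(blob[10:])  # skip past the 10-byte gzip header
    midpoint_readable = True
except (OSError, EOFError, zlib.error):
    midpoint_readable = False

print("readable from mid-stream:", midpoint_readable)
```

So a single large .gz input necessarily ends up in a single task, which is why the file-size guidance below matters.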

Now if your gzipped files aren't too small (<< 1GB) or too large (>> 10GB,
say) and there are a lot of them, then it should be okay. If not, then you
could use Snappy compression, if you have that flexibility, since
Snappy-compressed data is splittable when stored in a container format
such as SequenceFile.

Hope this helps.
On Oct 21, 2013 12:59 AM, "Ramkumar Chokkalingam" <ramkumar.au@gmail.com>
wrote:

> Hello group,
>
> I have .gz files as part of my input, and while reading up on support
> for gzip files I stumbled upon this thread on StackOverflow <http://stackoverflow.com/questions/16302385/gzip-support-in-spark/16309699#16309699>,
> which says that Spark supports gz files. But a few days back I saw a mail
> thread here in the group pointing to this link <https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-4/compression#8ca1fda1252b67145680b3a5e9d45b2a>,
> claiming that *Spark does not handle .gz files as they are not splittable*.
>
>
> These two claims seem contradictory. Can anyone confirm the actual
> behavior? Thanks!
>
> Regards,
>
> Ramkumar Chokkalingam,
> University of Washington.
> LinkedIn <http://www.linkedin.com/in/mynameisram>
>
>
>
