spark-user mailing list archives

From Marius Soutier
Subject Re: Processing of text file in large gzip archive
Date Mon, 16 Mar 2015 10:49:00 GMT

> 1. I don't think textFile is capable of unpacking a .gz file. You need to use hadoopFile
> or newAPIHadoopFile for this.

Sorry, that's incorrect: textFile works fine on .gz files. What it can't do is compute
splits on gzip files, so if you have a single .gz file, you'll get a single partition.
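
To illustrate why splits can't be computed, here is a minimal sketch using only the Python standard library, not Spark. It assumes nothing about Spark's internals; it only shows the property of the gzip format that matters here: a gzip stream decodes from its first byte, but a reader dropped at an arbitrary offset has no header to start from. The sample data and the helper name can_decode_from are made up for the illustration.

```python
import gzip
import zlib

# Build a gzip archive in memory from some line-oriented sample text.
lines = [f"record {i}".encode() for i in range(10000)]
raw = b"\n".join(lines) + b"\n"
archive = gzip.compress(raw)

# Reading from the start works: the stream decodes sequentially.
assert gzip.decompress(archive) == raw


def can_decode_from(offset: int) -> bool:
    """Try to start decompressing at a byte offset inside the archive."""
    try:
        gzip.decompress(archive[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # No gzip header at this position, or the stream is unreadable.
        return False


start_ok = can_decode_from(0)
middle_ok = can_decode_from(len(archive) // 2)
print("decodes from start: ", start_ok)
print("decodes from middle:", middle_ok)
```

A second reader therefore can't begin at the midpoint of the file the way it could with uncompressed text, which is why one .gz file ends up as one partition.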

Processing 30 GB of gzipped data should not take that long, at least with the Scala API. I'm
not sure about Python, especially under 1.2.1.
