spark-user mailing list archives

From Ayoub Benali <benali.ayoub.i...@gmail.com>
Subject Re: Parquet compression codecs not applied
Date Sat, 10 Jan 2015 13:49:36 GMT
It worked, thanks.

This doc page
<https://spark.apache.org/docs/1.2.0/sql-programming-guide.html> recommends
using "spark.sql.parquet.compression.codec" to set the compression codec,
and I thought this setting would be forwarded to the Hive context given
that HiveContext extends SQLContext, but it was not.
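
To make the difference concrete, here is a minimal sketch of the two
settings side by side. It assumes a spark-shell session where `sc` is the
provided SparkContext, and it runs Michael's suggestion through
hiveContext.sql; the exact statement is an illustration rather than a
transcript of my session.

```scala
import org.apache.spark.sql.hive.HiveContext

// Assumes a spark-shell session where `sc` is the provided SparkContext.
val hiveContext = new HiveContext(sc)

// The key recommended by the Spark SQL programming guide. In this thread it
// had no visible effect on inserts into a Hive table STORED AS PARQUET.
hiveContext.setConf("spark.sql.parquet.compression.codec", "gzip")

// The Hive-side key suggested by Michael. This is the setting that changed
// the output once the insert went through the Hive code path.
hiveContext.sql("SET parquet.compression=GZIP")
```

The setting needs to be issued before the "insert into table ..." statement
for it to apply to the files written by that statement.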

Is this behavior intended? If not, I could open an issue with a potential
fix so that "spark.sql.parquet.compression.codec" is translated to
"parquet.compression" in the Hive context.

Otherwise, the documentation should be updated to mention that the
compression codec is set differently with HiveContext.
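
Until one of those happens, the translation can be done by hand. The helper
below is only a sketch of the idea: `setParquetCodec` is a hypothetical name,
not part of Spark, and upper-casing the value simply mirrors the form used in
Michael's suggestion (GZIP).

```scala
import org.apache.spark.sql.hive.HiveContext

// Hypothetical helper, not part of Spark: sets the codec under both keys so
// that the Spark SQL write path and the Hive write path agree.
def setParquetCodec(ctx: HiveContext, codec: String): Unit = {
  // Key documented in the Spark SQL programming guide.
  ctx.setConf("spark.sql.parquet.compression.codec", codec)
  // Hive-side key, upper-cased to match the form that worked in this thread.
  ctx.sql(s"SET parquet.compression=${codec.toUpperCase}")
}
```

Usage would look like setParquetCodec(hiveContext, "gzip"), called before
the insert statement.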

Ayoub.



2015-01-09 17:51 GMT+01:00 Michael Armbrust <michael@databricks.com>:

> This is a little confusing, but that code path is actually going through
> Hive, so the Spark SQL configuration does not help.
>
> Perhaps, try:
> set parquet.compression=GZIP;
>
> On Fri, Jan 9, 2015 at 2:41 AM, Ayoub <benali.ayoub.info@gmail.com> wrote:
>
>> Hello,
>>
>> I tried to save a table created via the Hive context as a Parquet file, but
>> whatever compression codec (uncompressed, snappy, gzip or lzo) I set via
>> setConf, like:
>>
>> setConf("spark.sql.parquet.compression.codec", "gzip")
>>
>> the size of the generated files is always the same, so it seems that the
>> Spark context ignores the compression codec that I set.
>>
>> Here is a code sample applied via the spark shell:
>>
>> import org.apache.spark.sql.hive.HiveContext
>> val hiveContext = new HiveContext(sc)
>>
>> hiveContext.sql("SET hive.exec.dynamic.partition = true")
>> hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
>> // required to make data compatible with Impala
>> hiveContext.setConf("spark.sql.parquet.binaryAsString", "true")
>> hiveContext.setConf("spark.sql.parquet.compression.codec", "gzip")
>>
>> hiveContext.sql("""create external table if not exists foo (bar STRING, ts
>> INT) Partitioned by (year INT, month INT, day INT) STORED AS PARQUET
>> Location 'hdfs://path/data/foo'""")
>>
>> hiveContext.sql("""insert into table foo partition(year, month, day) select *,
>> year(from_unixtime(ts)) as year, month(from_unixtime(ts)) as month,
>> day(from_unixtime(ts)) as day from raw_foo""")
>>
>> I tried that with Spark 1.2 and a 1.3 snapshot against Hive 0.13.
>> I also tried it with Impala on the same cluster, which applied the
>> compression codecs correctly.
>>
>> Does anyone know what the problem could be?
>>
>> Thanks,
>> Ayoub.
