spark-user mailing list archives

From Surbhit <>
Subject Spark SQL reading whole table from cache instead of required columns
Date Tue, 13 Jan 2015 09:46:31 GMT

I am using Spark 1.1.0, and I am running all the queries below in the spark-sql shell.

I have created an external parquet table using the below SQL:

create external table daily (<15 column names>)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
LOCATION '<parquet file location>';

Then I cache the table using the following set of commands:

set spark.sql.inMemoryColumnarStorage.compressed=true;
cache table daily;
select count(*) from daily; 

The in-memory size of this table after caching is ~40 G; the complete table
gets cached in memory.

Now when I run a simple query that involves only one of the table's 15
columns, the whole table (~40 G) is read from the cache instead of just that
one column, as shown by the Spark web UI. A sample query that I ran after
caching the table is:

select count(distinct col1) from daily;

I expect that only the required column should be read from the cache, since
the cached data is stored in a columnar format.
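One way to check where the columns are (or are not) being pruned is to look at the physical plan from the same spark-sql shell. This is only a diagnostic sketch; the exact plan node names vary between Spark versions, though in 1.1 a cached scan should appear as an InMemoryColumnarTableScan node listing the columns it actually reads:

```sql
-- Show the physical plan for the aggregate query. If column pruning is
-- applied to the cached relation, the in-memory scan node should list only
-- col1 rather than all 15 columns (node names may differ across versions).
EXPLAIN SELECT count(DISTINCT col1) FROM daily;
```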

Can someone please tell me whether my expectation is correct? And if it is,
what am I missing here? Is there any configuration or setting that will give
me the desired result?
