spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: cache table vs. parquet table performance
Date Wed, 16 Jan 2019 12:47:10 GMT
I believe the in-memory solution misses the storage indexes that parquet / orc have.

The in-memory solution is more suitable if you iterate over the whole data set frequently.
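
For example, a rough, untested sketch (spark-shell, using the Parquet table created further down in the thread): the "FileScan parquet" node of the physical plan lists PushedFilters, which is where Parquet's per-row-group min/max statistics come into play.

    import org.apache.spark.sql.functions.col

    // Predicate on the Parquet-backed table: the filter should show up under
    // "PushedFilters" in the FileScan parquet node of the printed physical plan.
    spark.table("event_jan_01_par")
      .where(col("day_registered") === 20190102)
      .explain()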

> On 15 Jan 2019, at 19:20, Tomas Bartalos <tomas.bartalos@gmail.com> wrote:
> 
> Hello,
> 
> I'm using the spark-thrift server and I'm looking for the best-performing way to query a
hot set of data. I'm processing records with a nested structure, containing subtypes and arrays;
one record takes up several KB.
> 
> I tried to make some improvement with cache table:
> cache table event_jan_01 as select * from events where day_registered = 20190102;
> 
> If I understood correctly, the data should be stored in an in-memory columnar format with
storage level MEMORY_AND_DISK, so data which doesn't fit into memory will be spilled to disk
(I assume also in columnar format?).
> I cached 1 day of data (1 M records) and according to the Spark UI Storage tab none of the
data was cached in memory and everything was spilled to disk. The size of the data was 5.7
GB.
> Typical queries took ~ 20 sec.
> 
> Then I tried to store the data to parquet format:
> CREATE TABLE event_jan_01_par USING parquet location "/tmp/events/jan/02" as 
> select * from event_jan_01;
> 
> The whole Parquet table took up only 178 MB.
> And typical queries took 5-10 sec.
> 
> Is it possible to tune Spark to spill the cached data in Parquet format?
> Why was the whole cached table spilled to disk while nothing stayed in memory?
> 
> Spark version: 2.4.0
> 
> Best regards,
> Tomas
> 
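
For the storage-level part quoted above: CACHE TABLE (like Dataset.cache()) defaults to MEMORY_AND_DISK, but the level can be chosen explicitly through the DataFrame API. A rough, untested sketch (spark-shell; the table and column names are taken from the thread above):

    import org.apache.spark.sql.functions.col
    import org.apache.spark.storage.StorageLevel

    // Build the hot subset and pin it with an explicit storage level
    // instead of the MEMORY_AND_DISK default used by CACHE TABLE.
    val day = spark.table("events").where(col("day_registered") === 20190102)
    day.persist(StorageLevel.MEMORY_ONLY)   // or MEMORY_AND_DISK_SER, DISK_ONLY, ...
    day.createOrReplaceTempView("event_jan_01")
    day.count()                             // materializes the cache (visible in the Storage tab)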
