spark-user mailing list archives

From 大啊 <>
Subject Re:Re:cache table vs. parquet table performance
Date Wed, 16 Jan 2019 04:26:35 GMT
So I think caching large data is not a best practice.

At 2019-01-16 12:24:34, "大啊" <> wrote:

Hi, Tomas.
Thanks for your question, it prompted some thoughts. The cache usually works best when it stores smaller data sets;
I think caching large data will consume too much memory or disk space.
Spilling the cached data in parquet format may be a good improvement.
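For the hot set, a minimal sketch of materializing to parquet once instead of caching (the table name `events_hot` is my own; the source table and filter are taken from the query quoted below):

```sql
-- Hypothetical sketch: write the hot day out as a parquet table once,
-- then query that table, rather than holding the rows in the cache.
CREATE TABLE events_hot
USING parquet
AS SELECT * FROM events WHERE day_registered = 20190102;
```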

At 2019-01-16 02:20:56, "Tomas Bartalos" <> wrote:


I'm using spark-thrift server and I'm searching for the best-performing solution to query a hot
set of data. I'm processing records with a nested structure, containing subtypes and arrays.
One record takes up several KB.

I tried to make some improvement with cache table:

cache table event_jan_01 as select * from events where day_registered = 20190102;

If I understood correctly, the data should be stored in an in-memory columnar format with storage
level MEMORY_AND_DISK, so data which doesn't fit in memory will be spilled to disk (I assume
also in columnar format?).
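As an aside, a different storage level can be requested explicitly; this is a sketch assuming a Spark version whose CACHE TABLE statement accepts a storageLevel option (check the SQL reference for your release):

```sql
-- Assumed syntax: pin the cached table to memory only (no disk spill).
CACHE TABLE event_jan_01 OPTIONS ('storageLevel' 'MEMORY_ONLY') AS
SELECT * FROM events WHERE day_registered = 20190102;
```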
I cached 1 day of data (1 M records), and according to the Spark UI storage tab none of the data
was cached to memory; everything was spilled to disk. The size of the cached data was 5.7 GB.
Typical queries took ~20 sec.

Then I tried to store the data to parquet format:

CREATE TABLE event_jan_01_par USING parquet LOCATION "/tmp/events/jan/02" AS
select * from event_jan_01;

The whole parquet table took up only 178 MB.
And typical queries took 5-10 sec.

Is it possible to tune Spark to spill the cached data in parquet format?
Why was the whole cached table spilled to disk while nothing stayed in memory?
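The in-memory columnar cache has a couple of knobs that may be relevant here; the settings below exist in Spark 2.x, but the values shown are only illustrative assumptions, not a recommendation:

```sql
-- Compress the cached columnar batches (this is already the default).
SET spark.sql.inMemoryColumnarStorage.compressed=true;
-- Rows per cached columnar batch; larger batches can improve
-- compression at the cost of bigger allocation units.
SET spark.sql.inMemoryColumnarStorage.batchSize=10000;
```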

Spark version: 2.4.0

Best regards,
