spark-user mailing list archives

From Bin Fan <fanbin...@gmail.com>
Subject Re: cache table vs. parquet table performance
Date Thu, 18 Apr 2019 05:34:40 GMT
Hi Tomas,

One option is to cache your table as Parquet files into Alluxio (which can
serve as an in-memory distributed caching layer for Spark in your case).

The Spark code would look like this:

> df.write.parquet("alluxio://master:19998/data.parquet")
> df = sqlContext.read.parquet("alluxio://master:19998/data.parquet")

(See the documentation for more details:
<http://www.alluxio.org/docs/1.8/en/compute/Spark.html#cache-dataframe-into-alluxio?utm_source=spark>)

Of course, this requires running Alluxio as a separate service (ideally
colocated with the Spark servers), but it also enables data sharing across
Spark jobs.
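
For what it's worth, here is a minimal PySpark sketch of the write-then-read
pattern from the snippet above. The helper names alluxio_uri and
cache_in_alluxio are hypothetical (not part of Spark or Alluxio), the
"overwrite" save mode is my assumption, and the host and port are simply the
ones from the example. It also assumes the Alluxio client jar is on the Spark
classpath.

```python
# Minimal sketch, not a definitive implementation.
# Assumptions: PySpark is available where this runs, the Alluxio client jar is
# on the Spark classpath, and an Alluxio master listens on master:19998.
# alluxio_uri and cache_in_alluxio are hypothetical helper names.


def alluxio_uri(path, host="master", port=19998):
    """Build an alluxio:// URI for a path in the Alluxio namespace."""
    return "alluxio://{}:{}/{}".format(host, port, path.lstrip("/"))


def cache_in_alluxio(spark, df, uri):
    """Write df to Alluxio as Parquet, then return a DataFrame read from there."""
    # "overwrite" is an assumption: it replaces any earlier copy at this URI.
    df.write.mode("overwrite").parquet(uri)
    # Any other Spark job can read the same URI, which is how data gets shared.
    return spark.read.parquet(uri)


# Usage (commented out because it needs a live Spark + Alluxio setup):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()
# hot = spark.table("events").where("day_registered = 20190102")
# cached = cache_in_alluxio(spark, hot, alluxio_uri("data.parquet"))
# cached.createOrReplaceTempView("event_jan_01")
```

Registering the result as a temp view is one way to keep querying it by name;
whether that view is visible to Thrift server sessions depends on your setup.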

- Bin




On Tue, Jan 15, 2019 at 10:29 AM Tomas Bartalos <tomas.bartalos@gmail.com>
wrote:

> Hello,
>
> I'm using the Spark Thrift server and I'm searching for the best-performing
> solution for querying a hot set of data. I'm processing records with a
> nested structure, containing subtypes and arrays. One record takes up
> several KB.
>
> I tried to make some improvement with cache table:
>
> cache table event_jan_01 as select * from events where day_registered =
> 20190102;
>
>
> If I understood correctly, the data should be stored in *in-memory
> columnar* format with storage level MEMORY_AND_DISK, so data that
> doesn't fit in memory will be spilled to disk (I assume also in columnar
> format?).
> I cached 1 day of data (1 M records), and according to the Spark UI Storage
> tab none of the data was cached in memory and everything was spilled to disk.
> The size of the data was *5.7 GB.*
> Typical queries took ~20 sec.
>
> Then I tried to store the data to parquet format:
>
> CREATE TABLE event_jan_01_par USING parquet location "/tmp/events/jan/02"
> as select * from event_jan_01;
>
>
> The whole parquet took up only *178MB.*
> And typical queries took 5-10 sec.
>
> Is it possible to tune Spark to spill the cached data in Parquet format?
> Why was the whole cached table spilled to disk, with nothing left in
> memory?
>
> Spark version: 2.4.0
>
> Best regards,
> Tomas
>
>
