spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Memory usage by Spark jobs
Date Thu, 22 Sep 2016 08:13:59 GMT
You should also take into account that Spark has different options for representing data in memory,
such as Java-serialized objects, Kryo-serialized objects, Tungsten (columnar, optionally compressed),
etc. The Tungsten footprint depends heavily on the underlying data and its sorting, especially when compressed.
Then you might also think about broadcasted data etc.
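For illustration, a minimal sketch (Scala, Spark 2.x; the Parquet path is hypothetical) of where these representations come into play:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("memory-representations")
  // Kryo is usually more compact than the default Java serialization
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

val df = spark.read.parquet("/path/to/lineitem")  // hypothetical input

// RDD caching stores objects either deserialized (MEMORY_ONLY) or
// serialized with Java/Kryo (MEMORY_ONLY_SER); the footprints differ a lot.
df.rdd.persist(StorageLevel.MEMORY_ONLY_SER)

// DataFrame caching uses the columnar (Tungsten-style) cache format,
// optionally compressed via spark.sql.inMemoryColumnarStorage.compressed.
df.persist(StorageLevel.MEMORY_ONLY)
df.count()  // materialize the cache; sizes appear in the Storage tab of the UI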

I am not aware of a specific guide, but there is also no magic behind it. Could be
a good JIRA task :)
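A rough way to measure rather than guess is Spark's own SizeEstimator; a small sketch (reusing the SparkSession and hypothetical Parquet input from the sketch above):

import org.apache.spark.util.SizeEstimator

// Estimate the JVM heap footprint of a sample of collected rows, to compare the
// in-memory representation against the on-disk Parquet size.
val sample = spark.read.parquet("/path/to/lineitem").limit(10000).collect()
println(s"Estimated heap size of 10k rows: ${SizeEstimator.estimate(sample)} bytes")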

> On 22 Sep 2016, at 08:36, Hemant Bhanawat <hemant9379@gmail.com> wrote:
> 
> I am working on profiling TPC-H queries for Spark 2.0. I see a lot of temporary object
> creation (sometimes as large as the data itself), which is justified for the kind of processing
> Spark does. But, from a production perspective, is there a guideline on how much memory should
> be allocated for processing a specific data size of, let's say, Parquet data? Also, has anyone
> investigated memory usage for the individual SQL operators like Filter, GroupBy, OrderBy,
> Exchange, etc.?
> 
> Hemant Bhanawat
> www.snappydata.io 
