spark-user mailing list archives
From "Apostolos N. Papadopoulos" <papad...@csd.auth.gr>
Subject Re: Spark job's driver program consumes too much memory
Date Fri, 07 Sep 2018 14:15:17 GMT
Dear James,

- Check the Spark documentation for the actions that return a lot of
data to the driver. collect() is one of them, but note that take(n)
and reduce() are actions as well.

Before executing collect(), find out the size of your RDD/DataFrame.
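As a rough sketch of that advice (the DataFrame name `df` and the row-count threshold are placeholders, not part of your job), you can guard a collect() like this:

```scala
// Gauge the size of the result before pulling it to the driver.
// count() runs on the executors and only returns a Long to the driver.
val n: Long = df.count()

if (n < 100000L) {
  // Small enough: materializing on the driver is acceptable.
  val rows = df.collect()
} else {
  // Too large for the driver: inspect a bounded sample instead,
  // or write the full result out to storage (see below).
  val sample = df.take(10)
}
```

take(n) only transfers n rows to the driver, so it stays cheap regardless of the full result size.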

- I cannot understand the phrase "hdfs directly from the executor". You
can specify an HDFS file as your input, and you can also use HDFS to
store your output.
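To make that concrete: when you write a DataFrame out with the standard DataFrameWriter API, each executor writes its own partitions as part files in parallel; the data does not pass through the driver. A minimal sketch (the path and `df` are placeholders for your own values):

```scala
// Write the DataFrame as Parquet to HDFS. Each executor writes the
// partitions it holds directly; the driver only coordinates the job.
df.write
  .mode("overwrite")
  .parquet("hdfs://namenode:8020/user/output/mydata")
```

This is usually the answer to driver-memory pressure: instead of collecting results to the driver, write them to distributed storage and read them back where needed.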


regards,

Apostolos



On 07/09/2018 5:04 PM, James Starks wrote:
> I have a Spark job that reads data from a database. By increasing the submit 
> parameter '--driver-memory 25g' the job works without a problem 
> locally, but not in the prod env, because the prod master does not have 
> enough capacity.
>
> So I have a few questions:
>
> - What functions, such as collect(), would cause the data to be sent 
> back to the driver program?
>   My job so far merely uses `as`, `filter`, `map`, and `filter`.
>
> - Is it possible to write data (in Parquet format, for instance) to 
> HDFS directly from the executors? If so, how can I do that (any code 
> snippet, doc for reference, or keyword to search for, since I can't find 
> anything via e.g. `spark direct executor hdfs write`)?
>
> Thanks
>
>
>
>

-- 
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papadopo@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol


