spark-user mailing list archives

From James Starks <suse...@protonmail.com.INVALID>
Subject Re: Spark job's driver program consumes too much memory
Date Fri, 07 Sep 2018 14:39:42 GMT

Is df.write.mode(...).parquet("hdfs://..") also an action? Checking the doc shows that
my Spark job doesn't call any of those action functions directly, but the saveXXXX
functions listed there resemble the df.write.mode(overwrite).parquet("hdfs://path/to/parquet-file")
call that my job uses. So I am thinking maybe that's the reason my job's driver consumes
so much memory.

https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#actions
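
From what I can tell, df.write...parquet is not on that RDD actions list, but it does
trigger a job; the difference is that each executor writes its own partitions as parquet
part-files straight to HDFS, so the rows do not travel back to the driver the way
collect() results do. A minimal sketch of the call in question (the HDFS path and app
name are made-up placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("write-action-sketch").getOrCreate()

    val df = spark.range(1000000L).toDF("id")

    // Triggers a job like an action would, but the write itself happens on
    // the executors, partition by partition, directly against HDFS.
    df.write
      .mode("overwrite")
      .parquet("hdfs://namenode:8020/path/to/parquet-file")  // placeholder path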

My Spark job's driver program consumes too much memory, so I want to prevent that by writing
data to hdfs on the executor side, instead of waiting for the data to be sent back to the
driver program (and then writing it to hdfs). This is because our worker servers have more
memory than the machine that runs the driver program. If I can write data to hdfs at the
executors, the driver memory for my Spark job can be reduced.
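
A minimal sketch of what I mean, using a partitioned JDBC read so the table scan is split
across executors (the URL, table, partition column, and bounds are made-up placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("jdbc-to-hdfs-sketch").getOrCreate()

    // Split the table scan into 16 parallel reads so no single process,
    // least of all the driver, holds the whole table at once.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb")  // placeholder
      .option("dbtable", "my_table")                         // placeholder
      .option("partitionColumn", "id")                       // assumed numeric key
      .option("lowerBound", "1")
      .option("upperBound", "10000000")                      // made-up bounds
      .option("numPartitions", "16")
      .load()

    // No collect(): rows flow database -> executors -> HDFS, and the
    // driver only coordinates the job.
    df.write.mode("overwrite").parquet("hdfs://namenode:8020/out")  // placeholder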

Otherwise, does Spark support streaming reads from a database (i.e. Spark Streaming + Spark SQL)?

Thanks for your reply.



‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On 7 September 2018 4:15 PM, Apostolos N. Papadopoulos <papadopo@csd.auth.gr> wrote:

> Dear James,
>
> -   Check the Spark documentation to see the actions that return a lot of
>     data back to the driver. One of these actions is collect(); take(x)
>     and reduce() are actions as well (see the sketch after this list).
>
>     Before executing collect(), find out the size of your RDD/DF.
>
> -   I cannot understand the phrase "hdfs directly from the executor". You
>     can specify an hdfs file as your input, and you can also use hdfs to
>     store your output.
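>
>     A rough sketch of the difference (the row count is made up for
>     illustration; spark below is a SparkSession):
>
>         import org.apache.spark.sql.SparkSession
>
>         val spark = SparkSession.builder().getOrCreate()
>         val df = spark.range(100000000L).toDF("id")  // 100M rows
>
>         df.count()    // action: a single Long comes back to the driver
>         df.take(10)   // action: returns only 10 rows
>         df.collect()  // action: materializes all 100M rows in driver memory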
>
>     regards,
>
>     Apostolos
>
>     On 07/09/2018 05:04 μμ, James Starks wrote:
>
>
> > I have a Spark job that reads data from a database. By increasing the submit
> > parameter '--driver-memory 25g' the job works without a problem
> > locally, but not in the prod env because the prod master does not have
> > enough capacity.
> > So I have a few questions:
> > -   Which functions, such as collect(), would cause the data to be sent
> >     back to the driver program?
> >     My job so far merely uses `as`, `filter`, `map`, and `filter`.
> >
> > -   Is it possible to write data (in parquet format, for instance) to
> >     hdfs directly from the executor? If so, how can I do it (any code
> >     snippet, doc for reference, or keyword to search for, since I can't
> >     find anything via e.g. `spark direct executor hdfs write`)?
> >
> >
> > Thanks
>
> --
>
> Apostolos N. Papadopoulos, Associate Professor
> Department of Informatics
> Aristotle University of Thessaloniki
> Thessaloniki, GREECE
> tel: ++0030312310991918
> email: papadopo@csd.auth.gr
> twitter: @papadopoulos_ap
> web: http://datalab.csd.auth.gr/~apostol
>
>



---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

