spark-user mailing list archives

From "Lalwani, Jayesh" <Jayesh.Lalw...@capitalone.com>
Subject Re: [Spark 2.x Core] .collect() size limit
Date Mon, 30 Apr 2018 15:36:34 GMT
Although the OS does virtualize memory, the JVM imposes its own limit, which is controlled by the
spark.executor.memory and spark.driver.memory configurations. The amount of memory the JVM allocates
is bounded by those parameters. General guidelines say that executor and driver memory should be kept
at 80-85% of available RAM, so if those guidelines are followed, *virtual memory* is moot.
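The 80-85% guideline above can be turned into a quick back-of-the-envelope check before calling .collect(). A minimal plain-Python sketch (no Spark required; the helper names and example numbers are hypothetical, not a Spark API):

```python
def usable_driver_memory_bytes(ram_gb, fraction=0.85):
    """Driver heap per the ~80-85%-of-RAM guideline, in bytes."""
    return int(ram_gb * fraction * 1024**3)

def collect_fits(num_rows, avg_row_bytes, ram_gb):
    """Rough check: would the collected result fit on the driver?"""
    return num_rows * avg_row_bytes < usable_driver_memory_bytes(ram_gb)

# 100 million rows at ~200 bytes each is ~20 GB - too big for a 16 GB driver
print(collect_fits(100_000_000, 200, 16))  # False
# 1 million rows at ~200 bytes is ~200 MB - fine
print(collect_fits(1_000_000, 200, 16))    # True
```

In practice the real result also carries JVM object overhead, so the estimate should be treated as optimistic.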
From: Deepak Goel <deicool@gmail.com>
Date: Saturday, April 28, 2018 at 12:58 PM
To: Stephen Boesch <javadba@gmail.com>
Cc: klrmowse <klrmowse@gmail.com>, "user @spark" <user@spark.apache.org>
Subject: Re: [Spark 2.x Core] .collect() size limit

I believe the virtualization of memory happens at the OS layer hiding it completely from the
application layer

On Sat, 28 Apr 2018, 22:22 Stephen Boesch, <javadba@gmail.com> wrote:
While it is certainly possible to use VM, I have seen warnings in a number of places that collect()
results must fit in memory. I'm not sure whether that applies to *all* Spark calculations, but at
the very least each specific collect() that is performed would need to be verified.

And maybe all collects do require sufficient memory - would you like to check the source code
to see whether disk-backed collects actually happen in some cases?
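For what it's worth, one documented alternative to a full collect() is RDD.toLocalIterator(), which pulls one partition at a time to the driver instead of materializing everything at once. A plain-Python analogy of the difference (no Spark needed; the partition layout is made up for illustration):

```python
# Stand-in for an RDD split into three partitions
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

def collect(parts):
    """Like collect(): materialize every partition on the driver at once."""
    return [x for part in parts for x in part]

def to_local_iterator(parts):
    """Like toLocalIterator(): yield rows one partition at a time."""
    for part in parts:
        yield from part  # only one partition's rows held locally here

print(collect(partitions))                  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(max(to_local_iterator(partitions)))   # 9, without holding all rows at once
```

With the real rdd.toLocalIterator() the driver only needs enough memory for the largest single partition rather than the whole dataset, which is often the practical answer to "my collect() doesn't fit".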

2018-04-28 9:48 GMT-07:00 Deepak Goel <deicool@gmail.com>:
There is something as *virtual memory*

On Sat, 28 Apr 2018, 21:19 Stephen Boesch, <javadba@gmail.com> wrote:
Do you have a machine with terabytes of RAM? afaik collect() requires RAM - so that would
be your limiting factor.

2018-04-28 8:41 GMT-07:00 klrmowse <klrmowse@gmail.com>:
i am currently trying to find a workaround for the Spark application i am
working on so that it does not have to use .collect()

but, for now, it is going to have to use .collect()

what is the size limit (memory on the driver) of an RDD that .collect()
can work with?

i've been scouring google-search - S.O., blogs, etc. - and everyone cautions about
.collect(), but no one specifies how huge is huge... are we talking about a few
gigabytes? terabytes?? petabytes???



thank you



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

