spark-user mailing list archives

From Deepak Goel <deic...@gmail.com>
Subject Re: [Spark 2.x Core] .collect() size limit
Date Mon, 30 Apr 2018 16:15:54 GMT
Could you please help us and point to a source for that general
guideline (80-85%)?

Even if there is a general guideline, it is probably there to keep the
performance of Spark applications high (and to *distinguish* Spark from
Hadoop). But if you are not too concerned about the *performance* hit of
spilling from memory to disk, then you could use virtual memory to your
advantage. In fact, I think the OS could do a pretty good job of data
management by keeping only the necessary data in RAM while imposing no
hard limit. (It would be great to see benchmarks, if anyone has run such
a test before.)

Also we should *tread* carefully when applying general guidelines to
problems. They might not be *relevant* at all.

Deepak
"Please stop cruelty to Animals, help by becoming a Vegan"
+91 73500 12833
deicool@gmail.com

Facebook: https://www.facebook.com/deicool
LinkedIn: www.linkedin.com/in/deicool

"Plant a Tree, Go Green"

Made In India : http://www.makeinindia.com/home

On Mon, Apr 30, 2018 at 9:06 PM, Lalwani, Jayesh <
Jayesh.Lalwani@capitalone.com> wrote:

> Although there is such a thing as virtualization of memory done at the OS
> layer, the JVM imposes its own limit, controlled by the *spark.executor.memory*
> and *spark.driver.memory* configurations. The amount of memory allocated
> by the JVM is bounded by those parameters. General guidelines say that
> executor and driver memory should be kept at 80-85% of available RAM. So,
> if the general guidelines are followed, *virtual memory* is moot.
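
To make the quoted guideline concrete: on a hypothetical 64 GB node,
keeping the JVM heaps at roughly 80% of RAM might look like the
following spark-submit invocation (the sizes and the application file
are illustrative assumptions, not a recommendation):

```shell
# Illustrative sizing only: heaps total ~56 GB on a 64 GB machine,
# leaving headroom for the OS and off-heap allocations.
# --driver-memory sets spark.driver.memory (the ceiling .collect() hits);
# --executor-memory sets spark.executor.memory (per-executor JVM heap).
spark-submit \
  --driver-memory 8g \
  --executor-memory 48g \
  my_app.py
```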
>
> *From: *Deepak Goel <deicool@gmail.com>
> *Date: *Saturday, April 28, 2018 at 12:58 PM
> *To: *Stephen Boesch <javadba@gmail.com>
> *Cc: *klrmowse <klrmowse@gmail.com>, "user @spark" <user@spark.apache.org>
> *Subject: *Re: [Spark 2.x Core] .collect() size limit
>
>
>
> I believe the virtualization of memory happens at the OS layer, hiding it
> completely from the application layer.
>
>
>
> On Sat, 28 Apr 2018, 22:22 Stephen Boesch, <javadba@gmail.com> wrote:
>
> While it is certainly possible to use VM, I have seen in a number of places
> warnings that collect() results must fit in memory. I'm not
> sure whether that applies to *all* Spark calculations, but at the very least
> each of the specific collect()s that are performed would need to be
> verified.
>
>
>
> And maybe *all* collects do require sufficient memory - would you like to
> check the source code to see whether disk-backed collects actually
> happen in some cases?
>
>
>
> 2018-04-28 9:48 GMT-07:00 Deepak Goel <deicool@gmail.com>:
>
> There is such a thing as *virtual memory*
>
>
>
> On Sat, 28 Apr 2018, 21:19 Stephen Boesch, <javadba@gmail.com> wrote:
>
> Do you have a machine with terabytes of RAM? AFAIK collect() requires
> RAM - so that would be your limiting factor.
>
>
>
> 2018-04-28 8:41 GMT-07:00 klrmowse <klrmowse@gmail.com>:
>
> I am currently trying to find a workaround for the Spark application I am
> working on so that it does not have to use .collect()
>
> but, for now, it is going to have to use .collect()
>
> What is the size limit (driver memory) on an RDD that .collect()
> can work with?
>
> I've been scouring Google - S.O., blogs, etc. - and everyone is
> cautioning about .collect(), but no one specifies how huge is huge... are
> we talking about a few gigabytes? terabytes?? petabytes???
>
>
>
> thank you
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>
>
>
>
>
>
