spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark Hamstra <>
Subject Re: Help with Initial Cluster Configuration / Tuning
Date Tue, 22 Oct 2013 18:08:43 GMT
Yes, that is certainly feasible.  But Chapter
11<>isn't written yet.

On Tue, Oct 22, 2013 at 11:02 AM, Timothy Perrigo <>wrote:

> As the newbie who started the conversation, I'd like to thank everyone for
> the feedback and the subsequent discussion.  I certainly understand the
> point that there's no magic rule book that can take the place of learning
> the ins-and-outs of distributed / cluster computing-- a certain amount of
> pain is to be expected.  I'd like to add, too, that so far, with Spark,
> this pain has been surprisingly minimal, thanks in no small part to the
> information I've gleaned (directly or indirectly) from this mailing list.
> However, any additional information is always welcome.  In my own case,
> what I think I would really benefit from would be a start-to-finish example
> of a problem that works on a large-ish dataset.  In particular, it would be
> helpful to know what parameters have to be considered, what they are set
> to, and the rationale behind how those values were obtained, as well as a
> discussion about determining a "good" cluster size / configuration for the
> example problem.  (In fact, if anyone knows of such an example, I would be
> very appreciative!).  This certainly won't make everything completely
> painless, but would be invaluable and certainly seems feasible.
> Thanks again everyone for you help and advice.
> Tim
> On Tue, Oct 22, 2013 at 12:01 PM, Mark Hamstra <>wrote:
>> Yes, there are certainly rough spots and sharp edges that we can work at
>> polishing out and rounding over; and there are people working on such
>> things.  Don't get me wrong, feedback from users about what they are
>> finding to difficult, opaque or impenetrable is useful; but I don't think
>> that the expectation that working with a framework like Spark should be
>> smooth and easy can be completely met.  Even when all of the documentation,
>> guidance, instrumentation and user interface are in place, there will still
>> be a lot for users to come to terms with.
>> On Tue, Oct 22, 2013 at 9:50 AM, Aaron Davidson <>wrote:
>>> On the other hand, I totally agree that memory usage in Spark is rather
>>> opaque, and is one area where we could do a lot better at in terms of
>>> communicating issues, through both docs and instrumentation. At least with
>>> serialization and such, you can get meaningful exceptions (hopefully), but
>>> OOMs are just blanket "something wasn't right somewhere." Debugging them
>>> empirically would require deep diving into Spark's heap allocations, which
>>> requires a lot more knowledge of Spark internals than should be required
>>> for general usage.
>>> On Tue, Oct 22, 2013 at 9:22 AM, Mark Hamstra <>wrote:
>>>> Yes, but that also illustrates the problem faced by anyone trying to
>>>> write a "little white paper or guide lines" to make newbies' experience
>>>> painless.  Distributed computing clusters are necessarily complex things,
>>>> and problems can crop up in multiple locations, layers or subsystems.  It's
>>>> just not feasible to quickly bring up to speed someone with no experience
>>>> in distributed programming and cluster systems.  It takes a lot of
>>>> knowledge, both broad and deep.  Very few people have the complete scope
>>>> knowledge and experience required, so creating, debugging and maintaining
>>>> cluster computing application almost always has to be a team effort.
>>>> Support organizations and communities can replace some of the need for
>>>> a knowledgeable and well-functioning team, but not all of it; and at some
>>>> point you have to expect that debugging is going to take a considerable
>>>> amount of painstaking, systematic effort -- including a close reading of
>>>> the available docs.
>>>> Several people are working on making more and better reference and
>>>> training material available, and some of that will include trouble-shooting
>>>> guidance, but that doesn't mean that there can ever be "one little paper"
>>>> to solve newbies' (or more experienced developers') problems or provide
>>>> adequate guidance.  There's just too much to cover and too many different
>>>> kinds or levels of initial-user knowledge to make that completely feasible.
>>>> On Tue, Oct 22, 2013 at 8:50 AM, Shay Seng <> wrote:
>>>>> Hey Mark, I didn't mean to say that the information isn't out there --
>>>>> just that when something goes wrong with spark, the scope of what could
>>>>> wrong is so large - some bad setting with JVM, serializer, akka, badly
>>>>> written scala code, algorithm wrong, check worker logs, check executor
>>>>> stderrs, ....
>>>>> When I looked at this post this morning, my initial thought wasn't
>>>>> that "countByValue" would be at fault. ...probably since I've only been
>>>>> using Scala/Spark for a month or so.
>>>>> It was just a suggestion to help newbies come up to speed more quickly
>>>>> and gain insights into how to debug issues.
>>>>> On Tue, Oct 22, 2013 at 8:14 AM, Mark Hamstra <
>>>>> > wrote:
>>>>>> There's no need to guess at that.  The docs tell you directly:
>>>>>> def countByValue(): Map[T, Long]
>>>>>> Return the count of each unique value in this RDD as a map of (value,
>>>>>> count) pairs. The final combine step happens locally on the master,
>>>>>> equivalent to running a single reduce task.
>>>>>> On Tue, Oct 22, 2013 at 7:22 AM, Shay Seng <>
>>>>>>> Hi Matei,
>>>>>>> I've seen several memory tuning queries on this mailing list,
>>>>>>> also heard the same kinds of queries at the spark meetup. In
fact the last
>>>>>>> bullet point in Josh Carver(?) slides, the guy from Bizo, was
>>>>>>> tuning is still a mystery".
>>>>>>> I certainly had lots of issues in when I first started. From
>>>>>>> issues to gc issues, things seem to run fine until you try something
>>>>>>> 500GB of data etc.
>>>>>>> I was wondering if you could write up a little white paper or
>>>>>>> guide lines on how to set memory values, and what to look at
when something
>>>>>>> goes wrong? Eg. I would never gave guessed that countByValue
happens on a
>>>>>>> single machine etc.
>>>>>>>  On Oct 21, 2013 6:18 PM, "Matei Zaharia" <>
>>>>>>> wrote:
>>>>>>>> Hi there,
>>>>>>>> The problem is that countByValue happens in only a single
>>>>>>>> task -- this is probably something we should fix but it's
basically not
>>>>>>>> designed for lots of values. Instead, do the count in parallel
as follows:
>>>>>>>> val counts = => (str, 1)).reduceByKey((a,
b) => a +
>>>>>>>> b)
>>>>>>>> If this still has trouble, you can also increase the level
>>>>>>>> parallelism of reduceByKey by passing it a second parameter
for the number
>>>>>>>> of tasks (e.g. 100).
>>>>>>>> BTW one other small thing with your code, flatMap should
>>>>>>>> work fine if your function returns an Iterator to Traversable,
so there's
>>>>>>>> no need to call toList and return a Seq in ngrams; you can
just return an
>>>>>>>> Iterator[String].
>>>>>>>> Matei
>>>>>>>> On Oct 21, 2013, at 1:05 PM, Timothy Perrigo <>
>>>>>>>> wrote:
>>>>>>>> > Hi everyone,
>>>>>>>> > I am very new to Spark, so as a learning exercise I've
set up a
>>>>>>>> small cluster consisting of 4 EC2 m1.large instances (1 master,
3 slaves),
>>>>>>>> which I'm hoping to use to calculate ngram frequencies from
text files of
>>>>>>>> various sizes (I'm not doing anything with them; I just thought
this would
>>>>>>>> be slightly more interesting than the usual 'word count'
>>>>>>>>  Currently, I'm trying to work with a 1GB text file, but
running into
>>>>>>>> memory issues.  I'm wondering what parameters I should be
setting (in
>>>>>>>> in order to properly utilize the cluster. 
Right now, I'd be
>>>>>>>> happy just to have the process complete successfully with
the 1 gig file,
>>>>>>>> so I'd really appreciate any suggestions you all might have.
>>>>>>>> >
>>>>>>>> > Here's a summary of the code I'm running through the
spark shell
>>>>>>>> on the master:
>>>>>>>> >
>>>>>>>> > def ngrams(s: String, n: Int = 3): Seq[String] = {
>>>>>>>> >   (s.split("\\s+").sliding(n)).filter(_.length ==
>>>>>>>> n).map(_.mkString(" ")).map(_.trim).toList
>>>>>>>> > }
>>>>>>>> >
>>>>>>>> > val text = sc.textFile("s3n://my-bucket/my-1gb-text-file")
>>>>>>>> >
>>>>>>>> > val mapped = text.filter(_.trim.length > 0).flatMap(ngrams(_,
>>>>>>>> >
>>>>>>>> > So far so good; the problems come during the reduce
phase.  With
>>>>>>>> small files, I was able to issue the following to calculate
the most
>>>>>>>> frequently occurring trigram:
>>>>>>>> >
>>>>>>>> > val topNgram = (mapped countByValue) reduce((a:(String,
>>>>>>>> b:(String, Long)) => if (a._2 > b._2) a else b)
>>>>>>>> >
>>>>>>>> > With the 1 gig file, though, I've been running into
>>>>>>>> errors, so I decided to split the reduction to several steps,
starting with
>>>>>>>> simply issuing countByValue of my "mapped" RDD, but I have
yet to get it to
>>>>>>>> complete successfully.
>>>>>>>> >
>>>>>>>> > SPARK_MEM is currently set to 6154m.  I also bumped
up the
>>>>>>>> spark.akka.framesize setting to 500 (though at this point,
I was grasping
>>>>>>>> at straws; I'm not sure what a "proper" value would be).
 What properties
>>>>>>>> should I be setting for a job of this size on a cluster of
3 m1.large
>>>>>>>> slaves? (The cluster was initially configured using the spark-ec2
>>>>>>>>  Also, programmatically, what should I be doing differently?
 (For example,
>>>>>>>> should I be setting the minimum number of splits when reading
the text
>>>>>>>> file?  If so, what would be a good default?).
>>>>>>>> >
>>>>>>>> > I apologize for what I'm sure are very naive questions.
 I think
>>>>>>>> Spark is a fantastic project and have enjoyed working with
it, but I'm
>>>>>>>> still very much a newbie and would appreciate any help you
all can provide
>>>>>>>> (as well as any 'rules-of-thumb' or best practices I should
be following).
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Tim Perrigo

View raw message