spark-dev mailing list archives

From Georgios Samaras <georgesamaras...@gmail.com>
Subject Re: KMeans calls takeSample() twice?
Date Wed, 31 Aug 2016 16:29:09 GMT
But as you can see in this:

  Stackoverflow question
<http://stackoverflow.com/questions/38986395/sparkkmeans-calls-takesample-twice>

this is simply not the case for me. Could it be because your data is not
big enough to reproduce it?

Or could it be that you are actually reproducing it, but the problem lies
in the UI failing to display it correctly?
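
For reference, the re-sampling behaviour that Shivaram describes further
down in this thread can be sketched roughly as below. This is a plain-Python
simulation and not Spark's code: the function name take_sample_sketch, the
fraction parameter and the simple per-element coin flip are assumptions made
for illustration only. Each pass of the loop stands in for one sampling job.

```python
import random


def take_sample_sketch(data, num, fraction, seed=0):
    """Hypothetical stand-in for a takeSample-style retry loop.

    Returns (sample, passes). Each pass models one sampling job.
    Assumption: sampling without replacement via a per-element coin flip.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    num = min(num, len(data))  # cannot return more items than exist
    rng = random.Random(seed)
    passes = 0
    while True:
        passes += 1
        # One pass over the data: keep each element with probability `fraction`.
        samples = [x for x in data if rng.random() < fraction]
        if len(samples) >= num:
            break
        # Too few elements collected: this is where a warning would be
        # logged and another pass (another job) would be run.
    rng.shuffle(samples)
    return samples[:num], passes


if __name__ == "__main__":
    sample, passes = take_sample_sketch(list(range(1000)), 20, 0.05)
    print(len(sample), "items after", passes, "pass(es)")
```

If the second pass really came from a loop like this, I would expect the
warning to appear in the logs every time, which is what I still need to
check.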

On Wed, Aug 31, 2016 at 4:38 AM, Yanbo Liang <ybliang8@gmail.com> wrote:

> I added a println at the start of the takeSample function and found that
> it was printed only once for each run of KMeans.
>
> Thanks
> Yanbo
>
> On Tue, Aug 30, 2016 at 10:31 AM, Georgios Samaras <
> georgesamarasdit@gmail.com> wrote:
>
>> Good catch Shivaram. However, the very next line states:
>>
>> // this shouldn't happen often because we use a big multiplier for the
>> initial size
>>
>> which makes me wonder whether that is really the case: I am
>> experimenting heavily right now, I have launched 30~40 jobs, and at a
>> glance I can see takeSample() being called twice in them!
>>
>> George
>>
>>
>> On Tue, Aug 30, 2016 at 10:20 AM, Shivaram Venkataraman <
>> shivaram@eecs.berkeley.edu> wrote:
>>
>>> I think takeSample itself runs multiple jobs if the number of samples
>>> collected in the first pass is not enough. The comment and code path
>>> at
>>> https://github.com/apache/spark/blob/412b0e8969215411b97efd3d0984dc6cac5d31e0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L508
>>> should explain when this happens. You can also confirm this by
>>> checking whether the logWarning shows up in your logs.
>>>
>>> Thanks
>>> Shivaram
>>>
>>> On Tue, Aug 30, 2016 at 9:50 AM, Georgios Samaras
>>> <georgesamarasdit@gmail.com> wrote:
>>> >
>>> > ---------- Forwarded message ----------
>>> > From: Georgios Samaras <georgesamarasdit@gmail.com>
>>> > Date: Tue, Aug 30, 2016 at 9:49 AM
>>> > Subject: Re: KMeans calls takeSample() twice?
>>> > To: "Sean Owen [via Apache Spark Developers List]"
>>> > <ml-node+s1001551n18788h58@n3.nabble.com>
>>> >
>>> >
>>> > I am not sure what you want me to check. Note that I see two
>>> > takeSample()s being invoked every single time I execute KMeans().
>>> > In a job I currently have running, I viewed the details and
>>> > updated the:
>>> >
>>> > StackOverflow question.
>>> >
>>> >
>>> >
>>> > On Tue, Aug 30, 2016 at 9:25 AM, Sean Owen [via Apache Spark Developers
>>> > List] <ml-node+s1001551n18788h58@n3.nabble.com> wrote:
>>> >>
>>> >> I'm not sure it's a UI bug; it really does record two different
>>> >> stages, the second of which executes quickly. I am not sure why that
>>> >> would happen off the top of my head. I don't see anything that failed
>>> >> here.
>>> >>
>>> >> Digging into those two stages and what they executed might give a clue
>>> >> to what's really going on there.
>>> >>
>>> >> On Tue, Aug 30, 2016 at 5:18 PM, gsamaras <[hidden email]> wrote:
>>> >> > Yanbo thank you for your reply. So you are saying that this is a
>>> >> > bug in the Spark UI in general, and not in the local Spark UI of
>>> >> > our cluster, where I work, right?
>>> >> >
>>> >> > George
>>> >>
>>> >
>>> >
>>> >
>>>
>>
>>
>
