spark-user mailing list archives

From Xiangrui Meng <men...@gmail.com>
Subject Re: Why k-means cluster hang for a long time?
Date Mon, 30 Mar 2015 17:00:40 GMT
Hi Xi,

Please create a JIRA ticket if it is taking too long to locate the issue.
Did you try a smaller k?

Best,
Xiangrui

On Thu, Mar 26, 2015 at 5:45 PM, Xi Shen <davidshen84@gmail.com> wrote:
> Hi Burak,
>
> After I added .repartition(sc.defaultParallelism), I can see from the log
> the partition number is set to 32. But in the Spark UI, it seems all the
> data are loaded onto one executor. Previously they were loaded onto 4
> executors.
>
> Any idea?
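One way to confirm whether those 32 partitions really all landed on one executor is to count the records in each partition. This is only a sketch: the `mapPartitionsWithIndex` call (shown as a comment, since it needs a live RDD) is standard Spark, and the per-partition sizes below are simulated to show what a fully skewed layout of the ~40k vectors would look like.

```scala
// With the real RDD you would collect per-partition record counts:
//   val sizes = data.mapPartitionsWithIndex { (i, it) =>
//     Iterator((i, it.size))
//   }.collect()
// Simulated result for 40k vectors that all landed in one of 32 partitions:
val sizes: Seq[(Int, Int)] = (0, 40000) +: (1 until 32).map(i => (i, 0))
val nonEmpty = sizes.count(_._2 > 0)
val skewed = nonEmpty < sizes.length / 2
println(s"non-empty partitions: $nonEmpty of ${sizes.length}, skewed: $skewed")
```

If most partitions are empty, the executors holding nothing simply have no tasks to run, which matches seeing all the data on one executor in the UI.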
>
>
> Thanks,
> David
>
>
> On Fri, Mar 27, 2015 at 11:01 AM Xi Shen <davidshen84@gmail.com> wrote:
>>
>> How do I get the number of cores that I specified on the command line? I
>> want to use it for "spark.default.parallelism". I have 4 executors, each
>> with 8 cores. According to
>> https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior,
>> the "spark.default.parallelism" value will be 4 * 8 = 32. I think that may
>> be too large, or inappropriate. Please give me some suggestions.
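For reference, the arithmetic behind that default can be written out explicitly. The numbers below come from this thread (4 executors with 8 cores each); reading the effective value back is shown as a comment, since it needs a live SparkContext.

```scala
// On coarse-grained cluster modes, spark.default.parallelism defaults to
// the total number of cores across all executors.
val numExecutors = 4
val coresPerExecutor = 8
val defaultParallelism = numExecutors * coresPerExecutor
println(s"expected spark.default.parallelism = $defaultParallelism")
// At runtime the effective value is available as sc.defaultParallelism,
// and it can be overridden with --conf spark.default.parallelism=N.
```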
>>
>> I have already used cache, and called count() to force the caching.
>>
>> I can try a smaller k for testing, but eventually I will have to use
>> k = 5000 or even larger, because I estimate our data set has about that
>> many clusters.
>>
>>
>> Thanks,
>> David
>>
>>
>> On Fri, Mar 27, 2015 at 10:40 AM Burak Yavuz <brkyvz@gmail.com> wrote:
>>>
>>> Hi David,
>>> The number of centroids (k=5000) seems too large and is probably the
>>> cause of the code taking too long.
>>>
>>> Can you please try the following:
>>> 1) Repartition data to the number of available cores with .repartition(numCores)
>>> 2) cache data
>>> 3) call .count() on data right before k-means
>>> 4) try k=500 (even less if possible)
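Put together, the four steps above might look like the sketch below. The Spark calls are kept as comments (they assume a live `sc`; the input path and line format are assumptions, since the real ones are not shown in this thread). The runnable part is just a plausible parse of one comma-separated line into the array that `Vectors.dense` would take.

```scala
// 1) val numCores = sc.defaultParallelism            // 32 in this thread
// 2) val data = sc.textFile("very/large/text/file")
//      .map(line => Vectors.dense(parse(line)))
//      .repartition(numCores)                        // spread across cores
//      .cache()                                      // keep vectors in memory
// 3) data.count()                                    // materialize the cache
// 4) val model = KMeans.train(data, 500, 500)        // k=500, maxIterations=500

// Runnable piece: parse one line into the double array Vectors.dense takes.
// The comma-separated format is an assumption for illustration.
def parse(line: String): Array[Double] = line.split(',').map(_.toDouble)
val features = parse("0.1,0.5,0.9")
```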
>>>
>>> Thanks,
>>> Burak
>>>
>>> On Mar 26, 2015 4:15 PM, "Xi Shen" <davidshen84@gmail.com> wrote:
>>> >
>>> > The code is very simple.
>>> >
>>> > val data = sc.textFile("very/large/text/file").map { l =>
>>> >   // turn each line into a dense vector
>>> >   Vectors.dense(...)
>>> > }
>>> >
>>> > // the resulting data set is about 40k vectors
>>> >
>>> > KMeans.train(data, k=5000, maxIterations=500)
>>> >
>>> > I just kill my application. In the log I found this:
>>> >
>>> > 15/03/26 11:42:43 INFO storage.BlockManagerMaster: Updated info of block broadcast_26_piece0
>>> > 15/03/26 23:02:57 WARN server.TransportChannelHandler: Exception in connection from workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net/100.72.84.107:56277
>>> > java.io.IOException: An existing connection was forcibly closed by the remote host
>>> >
>>> > Notice the time gap. I think it means the worker nodes did not generate
>>> > any log at all for about 12 hours... does that mean they are not working
>>> > at all?
>>> >
>>> > But when testing with a very small data set, my application works and
>>> > outputs the expected data.
>>> >
>>> >
>>> > Thanks,
>>> > David
>>> >
>>> >
>>> > On Fri, Mar 27, 2015 at 10:04 AM Burak Yavuz <brkyvz@gmail.com> wrote:
>>> >>
>>> >> Can you share the code snippet of how you call k-means? Do you cache
>>> >> the data before k-means? Did you repartition the data?
>>> >>
>>> >> On Mar 26, 2015 4:02 PM, "Xi Shen" <davidshen84@gmail.com> wrote:
>>> >>>
>>> >>> Oh, the job I talked about has run for more than 11 hours without a
>>> >>> result... it doesn't make sense.
>>> >>>
>>> >>>
>>> >>> On Fri, Mar 27, 2015 at 9:48 AM Xi Shen <davidshen84@gmail.com>
>>> >>> wrote:
>>> >>>>
>>> >>>> Hi Burak,
>>> >>>>
>>> >>>> My maxIterations is set to 500, but I think it should also stop
>>> >>>> earlier once the centroids converge, right?
>>> >>>>
>>> >>>> My Spark is 1.2.0, running on Windows 64-bit. My data set is about
>>> >>>> 40k vectors; each vector has about 300 features, all normalised. All
>>> >>>> worker nodes have sufficient memory and disk space.
>>> >>>>
>>> >>>> Thanks,
>>> >>>> David
>>> >>>>
>>> >>>>
>>> >>>> On Fri, 27 Mar 2015 02:48 Burak Yavuz <brkyvz@gmail.com> wrote:
>>> >>>>>
>>> >>>>> Hi David,
>>> >>>>>
>>> >>>>> In my experience, K-Means can appear to hang when the number of
>>> >>>>> runs is large and the data is not properly partitioned; setting the
>>> >>>>> number of runs high drastically increases the work on the executors.
>>> >>>>> If that's not the case, can you give more info on which Spark
>>> >>>>> version you are using, your setup, and your dataset?
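For what it's worth, in the MLlib 1.2 API the number of runs is set on the `KMeans` builder rather than through the short `KMeans.train` form. The sketch below shows that as a comment (it assumes spark-mllib on the classpath and a cached RDD `data`), plus a rough cost model of why runs multiplies executor work.

```scala
// Sketch (assumes spark-mllib 1.2 and a cached RDD[Vector] named `data`):
//   import org.apache.spark.mllib.clustering.KMeans
//   val model = new KMeans()
//     .setK(5000)
//     .setMaxIterations(500)
//     .setRuns(1)          // each extra run repeats the whole clustering
//     .run(data)
// Rough per-iteration cost model: distance computations scale with
// points * k * runs, so runs directly multiplies the executors' work.
val points = 40000L
val k = 5000L
val costOneRun = points * k * 1
val costTenRuns = points * k * 10
println(s"ten runs do ${costTenRuns / costOneRun}x the distance computations")
```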
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>> Burak
>>> >>>>>
>>> >>>>> On Mar 26, 2015 5:10 AM, "Xi Shen" <davidshen84@gmail.com> wrote:
>>> >>>>>>
>>> >>>>>> Hi,
>>> >>>>>>
>>> >>>>>> When I run k-means clustering with Spark, these are the last two
>>> >>>>>> lines in the log:
>>> >>>>>>
>>> >>>>>> 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
>>> >>>>>> 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5
>>> >>>>>>
>>> >>>>>> Then it hangs for a long time. There's no active job. The driver
>>> >>>>>> machine is idle. I cannot access the worker nodes, so I am not sure
>>> >>>>>> if they are busy.
>>> >>>>>>
>>> >>>>>> I understand k-means may take a long time to finish, but why is
>>> >>>>>> there no active job and no log output?
>>> >>>>>>
>>> >>>>>> Thanks,
>>> >>>>>> David
>>> >>>>>>


