spark-user mailing list archives

From Xi Shen <davidshe...@gmail.com>
Subject Re: k-means can only run on one executor with one thread?
Date Sat, 28 Mar 2015 10:11:02 GMT
My vectors have about 360 dimensions, and there are about 270k data points.
My driver has 2.9 GB of memory. I have attached a screenshot of the current
executor status. I submitted this job with "--master yarn-cluster". I have a
total of 7 worker nodes, one of which acts as the driver. In the screenshot,
you can see that all worker nodes have loaded some data, but the driver has
not loaded any data.

But the funny thing is that when I log on to the driver and check its CPU and
memory status, I see one Java process using about 18% of the CPU and about
1.6 GB of memory.
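
For reference, here is a rough sketch of the k * d estimate from Reza's
message quoted below, applied to the numbers in this thread. It assumes
8 bytes per double and counts only the raw centroid values, not JVM object
overhead; the function name is just for illustration.

```python
BYTES_PER_DOUBLE = 8  # size of one IEEE 754 double-precision value


def kmeans_model_bytes(k, d):
    """Raw size in bytes of k centroids, each with d double-valued coordinates."""
    return k * d * BYTES_PER_DOUBLE


# (k, d) pairs taken from the thread: k=1000 and k=5000 with d=360,
# plus the d=10,000 case Reza mentions as a point where trouble starts.
for k, d in [(1000, 360), (5000, 360), (1000, 10_000)]:
    size_mb = kmeans_model_bytes(k, d) / 1e6
    print(f"k={k}, d={d}: {size_mb:.2f} MB")
```

With d = 360 the model comes to a few megabytes for either k, which is far
below the 2.9 GB available on the driver.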

[image: Inline image 1]

On Sat, Mar 28, 2015 at 7:06 PM Reza Zadeh <reza@databricks.com> wrote:

> How many dimensions does your data have? The size of the k-means model is
> k * d, where d is the dimension of the data.
>
> Since you're using k=1000, if your data has dimension higher than, say,
> 10,000, you will have trouble, because k*d doubles have to fit in the
> driver.
>
> Reza
>
> On Sat, Mar 28, 2015 at 12:27 AM, Xi Shen <davidshen84@gmail.com> wrote:
>
>> I have put more details of my problem at
>> http://stackoverflow.com/questions/29295420/spark-kmeans-computation-cannot-be-distributed
>>
>> I would really appreciate it if you could take a look at this problem. I
>> have tried various settings and ways to load/partition my data, but I just
>> cannot get rid of that long pause.
>>
>>
>> Thanks,
>> David
>>
>> On Sat, Mar 28, 2015 at 2:38 PM, Xi Shen <davidshen84@gmail.com> wrote:
>>
>>> Yes, I have repartitioned the data.
>>>
>>> I tried repartitioning to the number of cores in my cluster. That did
>>> not help...
>>> I tried repartitioning to the number of centroids (the k value). That did
>>> not help either...
>>>
>>>
>>> On Sat, Mar 28, 2015 at 7:27 AM Joseph Bradley <joseph@databricks.com>
>>> wrote:
>>>
>>>> Can you try specifying the number of partitions when you load the data
>>>> to equal the number of executors?  If your ETL changes the number of
>>>> partitions, you can also repartition before calling KMeans.
>>>>
>>>>
>>>> On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen <davidshen84@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a large data set, and I expect to get 5000 clusters.
>>>>>
>>>>> I load the raw data and convert it into DenseVectors; then I
>>>>> repartition and cache it; finally I pass the RDD[Vector] to
>>>>> KMeans.train().
>>>>>
>>>>> Now the job is running, and the data is loaded. But according to the
>>>>> Spark UI, all the data is loaded onto one executor. I checked that
>>>>> executor, and its CPU workload is very low. I think it is using only 1
>>>>> of the 8 cores. The other 3 executors are all idle.
>>>>>
>>>>> Did I miss something? Is it possible to distribute the workload to all
>>>>> 4 executors?
>>>>>
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>>
>>>>
>>
>
