spark-user mailing list archives

From Andrew Ash <and...@andrewash.com>
Subject Re: the spark worker assignment Question?
Date Mon, 06 Jan 2014 17:04:34 GMT
Hi Li,

I've also found this setting confusing in the past.  Take a look at this
change -- do you think it makes the setting more clear?

https://github.com/apache/incubator-spark/pull/341/files
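
For reference, here's a minimal sketch of how the setting works in a
0.8.x-era application (the names and the value 40 are illustrative only).
The key point is that it caps the *total* cores the application takes
across the whole cluster, not per machine:

  // Spark 0.8.x: spark.cores.max is read from a Java system property
  // that must be set before the SparkContext is created.
  import org.apache.spark.SparkContext

  object CoresMaxSketch {
    def main(args: Array[String]) {
      // Illustrative: at most 40 cores in total, however the standalone
      // scheduler spreads them across the machines.
      System.setProperty("spark.cores.max", "40")
      val sc = new SparkContext("spark://master:7077", "CoresMaxSketch")
      // ... jobs ...
      sc.stop()
    }
  }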

Andrew


On Mon, Jan 6, 2014 at 8:19 AM, lihu <lihu723@gmail.com> wrote:

> Sorry for my late reply; Gmail did not notify me.
>
> This problem was my own fault.
> I took the config parameter *spark.cores.max* to be the maximum number
> of cores per machine, but in fact it is the total across the cluster.
>
> Thank you very much, Andrew and Mayur; your answers helped me
> understand the Spark system better.
>
>
>
> On Fri, Jan 3, 2014 at 2:28 AM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:
>
>> Andrew, that's a good point. I have done that when handling a large number
>> of queries. Typically, to get good response times on a large number of
>> parallel queries, you want the data replicated across many systems.
>> Regards
>> Mayur Rustagi
>> Ph: +919632149971
>> http://www.sigmoidanalytics.com
>> https://twitter.com/mayur_rustagi
>>
>>
>>
>> On Thu, Jan 2, 2014 at 11:22 PM, Andrew Ash <andrew@andrewash.com> wrote:
>>
>>> That sounds right, Mayur.
>>>
>>> Also in 0.8.1 I hear there's a new repartition method that you might be
>>> able to use to further distribute the data.  But if your data is so small
>>> that it fits in just a couple blocks, why are you using 20 machines just to
>>> process a quarter GB of data?  Is the computation on each bit extremely
>>> intensive?
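>>>
>>> A minimal sketch of that approach, assuming an existing RDD named
>>> `rdd` and the 20-machine cluster from this thread:
>>>
>>>   // Spark 0.8.1+: shuffle an RDD with few partitions into more
>>>   // partitions so later stages can run on more machines.
>>>   val spread = rdd.repartition(20)
>>>   spread.count()  // forces the shuffle; downstream work runs as 20 tasks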
>>>
>>>
>>> On Thu, Jan 2, 2014 at 12:39 PM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:
>>>
>>>> I have experienced a similar issue. The easiest fix I found was to
>>>> increase the replication of the input data to the number of workers you
>>>> want to use for processing. The RDD's partitions seem to be created on
>>>> the machines where the blocks are replicated. Please correct me if I am
>>>> wrong.
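>>>>
>>>> As a sketch (assuming the input lives in HDFS): raising the file's
>>>> replication factor, e.g. `hadoop fs -setrep -w 10 /data/input`, puts
>>>> blocks on more machines; the replicated storage levels are the
>>>> Spark-side analogue for cached data:
>>>>
>>>>   import org.apache.spark.storage.StorageLevel
>>>>   // Keep each cached partition on two nodes instead of one.
>>>>   val cached = sc.textFile("/data/input").persist(StorageLevel.MEMORY_ONLY_2)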
>>>>
>>>> Regards
>>>> Mayur
>>>>
>>>> Mayur Rustagi
>>>> Ph: +919632149971
>>>> http://www.sigmoidanalytics.com
>>>> https://twitter.com/mayur_rustagi
>>>>
>>>>
>>>>
>>>> On Thu, Jan 2, 2014 at 10:46 PM, Andrew Ash <andrew@andrewash.com> wrote:
>>>>
>>>>> Hi lihu,
>>>>>
>>>>> Maybe the data you're accessing is in HDFS and only resides on 4 of
>>>>> your 20 machines because it's only about 4 blocks (at the default
>>>>> 64 MB per block, that's around a quarter GB).  Where is your source
>>>>> data located and how is it stored?
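>>>>>
>>>>> (Quick arithmetic: 256 MB / 64 MB per block = 4 blocks, hence 4 map
>>>>> partitions and at most 4 busy machines. One workaround, sketched
>>>>> here with an assumed path, is to ask for more splits up front:)
>>>>>
>>>>>   // Spark 0.8.x: textFile's second argument is minSplits, a lower
>>>>>   // bound on the number of partitions for splittable input.
>>>>>   val data = sc.textFile("hdfs:///data/input", 20)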
>>>>>
>>>>> Andrew
>>>>>
>>>>>
>>>>> On Thu, Jan 2, 2014 at 7:53 AM, lihu <lihu723@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>    I run Spark on a cluster with 20 machines, but when I start an
>>>>>> application using the spark-shell, only 4 machines are working; the
>>>>>> others are just idle, with no memory or CPU used. I observed this
>>>>>> through the web UI.
>>>>>>
>>>>>>    I wondered whether the other machines might be busy, so I checked
>>>>>> them using the "top" and "free" commands, but they were not.
>>>>>>
>>>>>>    *So why doesn't Spark assign work to all 20 machines? This is not
>>>>>> good resource usage.*
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> *Best Wishes!*
>
> *Li Hu (李浒) | Graduate Student*
>
> *Institute for Interdisciplinary Information Sciences (IIIS
> <http://iiis.tsinghua.edu.cn/>)*
> *Tsinghua University, China*
>
> *Email: lihu723@gmail.com*
> *Tel: +86 15120081920*
> *Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/*
>
>
>
