spark-user mailing list archives

From Mayur Rustagi <mayur.rust...@gmail.com>
Subject Re: the spark worker assignment Question?
Date Thu, 02 Jan 2014 18:28:13 GMT
Andrew, that's a good point. I have done that when handling a large number
of queries. Typically, to get good response times on a large number of
queries in parallel, you want the data replicated across many systems.
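As a rough sketch of what I mean (untested; "rdd" here stands for whatever
dataset you are serving, and the storage level is just one option):

    import org.apache.spark.storage.StorageLevel

    // keep two in-memory copies of every partition so concurrent
    // queries can be scheduled on more than one node
    val cached = rdd.persist(StorageLevel.MEMORY_ONLY_2)
    cached.count()  // materialize the cache before the query load arrives
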
Regards
Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi



On Thu, Jan 2, 2014 at 11:22 PM, Andrew Ash <andrew@andrewash.com> wrote:

> That sounds right, Mayur.
>
> Also, in 0.8.1 I hear there's a new repartition method that you might be
> able to use to further distribute the data.  But if your data is so small
> that it fits in just a couple of blocks, why are you using 20 machines
> just to process a quarter GB of data?  Is the computation on each bit
> extremely intensive?
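>
> Something along these lines (an untested sketch; the path and the
> partition count are placeholders):
>
>     val rdd = sc.textFile("hdfs:///path/to/input")  // roughly 1 partition per HDFS block
>     val spread = rdd.repartition(40)                // new in 0.8.1: shuffle into 40 partitions
>     spread.count()                                  // tasks should now land on more workers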
>
>
> On Thu, Jan 2, 2014 at 12:39 PM, Mayur Rustagi <mayur.rustagi@gmail.com>wrote:
>
>> I have experienced a similar issue. The easiest fix I found was to
>> increase the replication factor of the data the workers use to match the
>> number of workers you want to use for processing. The RDD seems to be
>> created on all the machines where the blocks are replicated. Please
>> correct me if I am wrong.
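>>
>> One way to do that from the shell is the Hadoop FileSystem API (an
>> untested sketch; the path and the replication factor are placeholders):
>>
>>     import org.apache.hadoop.fs.{FileSystem, Path}
>>
>>     val fs = FileSystem.get(sc.hadoopConfiguration)
>>     // one copy of each block per worker you want to use
>>     fs.setReplication(new Path("/path/to/input"), 20.toShort)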
>>
>> Regards
>> Mayur
>>
>> Mayur Rustagi
>> Ph: +919632149971
>> http://www.sigmoidanalytics.com
>> https://twitter.com/mayur_rustagi
>>
>>
>>
>> On Thu, Jan 2, 2014 at 10:46 PM, Andrew Ash <andrew@andrewash.com> wrote:
>>
>>> Hi lihu,
>>>
>>> Maybe the data you're accessing is in HDFS and only resides on 4 of
>>> your 20 machines because it's only about 4 blocks (at the default 64 MB
>>> per block, that's around a quarter GB).  Where is your source data
>>> located, and how is it stored?
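>>>
>>> A quick way to check from the spark-shell (the path is a placeholder):
>>>
>>>     val rdd = sc.textFile("hdfs:///path/to/input")
>>>     // one partition per HDFS block by default; 4 partitions means
>>>     // at most 4 machines get tasks for this stage
>>>     println(rdd.partitions.length)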
>>>
>>> Andrew
>>>
>>>
>>> On Thu, Jan 2, 2014 at 7:53 AM, lihu <lihu723@gmail.com> wrote:
>>>
>>>> Hi,
>>>>    I run Spark on a cluster with 20 machines, but when I start an
>>>> application using the spark-shell, only 4 machines are working; the
>>>> others are just idle, with no memory or CPU used. I watched this
>>>> through the web UI.
>>>>
>>>>    I wondered whether the other machines might be busy, so I checked
>>>> them using the "top" and "free" commands, but they are not.
>>>>
>>>>   *So I just wonder: why does Spark not assign work to all 20
>>>> machines? This is not good resource usage.*
>>>>
>>>
>>
>
