spark-user mailing list archives

From Aaron Davidson <ilike...@gmail.com>
Subject Re: how to set task number?
Date Mon, 26 May 2014 03:14:44 GMT
What is the format of your input data, prior to insertion into Tachyon?


On Sun, May 25, 2014 at 7:52 PM, qingyang li <liqingyang1985@gmail.com> wrote:

> I tried "SET mapred.map.tasks=30", but it does not work; it seems Shark
> does not support this setting.
> I also tried "SET mapred.max.split.size=64000000", and it does not work
> either.
> Is there any other way to control the task number in the Shark CLI?
>
>
>
> 2014-05-26 10:38 GMT+08:00 Aaron Davidson <ilikerps@gmail.com>:
>
>> You can try setting "mapred.map.tasks" to get Hive to do the right thing.
>>
>>
>> On Sun, May 25, 2014 at 7:27 PM, qingyang li <liqingyang1985@gmail.com> wrote:
>>
>>> Hi, Aaron, thanks for sharing.
>>>
>>> I am using Shark to execute queries, and the table is created on Tachyon.
>>> I think I cannot use RDD#repartition() in the Shark CLI.
>>> Does Shark support "SET mapred.max.split.size" to control the file size?
>>> If yes, then after I create a table I can control the file number, and
>>> thus the task number.
>>> If not, does anyone know another way to control the task number in the
>>> Shark CLI?
>>>
>>>
>>> 2014-05-26 9:36 GMT+08:00 Aaron Davidson <ilikerps@gmail.com>:
>>>
>>>> How many partitions are in your input data set? A possibility is that
>>>> your input data has 10 unsplittable files, so you end up with 10
>>>> partitions. You could improve this by using RDD#repartition().
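For reference, a minimal Scala sketch of the repartition() approach; the
master, path, and counts below are illustrative, not from the thread:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setMaster("local[4]").setAppName("repartition-sketch"))

    // Hypothetical input: 10 unsplittable files would yield 10 partitions.
    val rdd = sc.textFile("tachyon://master:19998/path/to/table")

    // repartition() shuffles the data into the requested number of
    // partitions, so downstream stages run 40 tasks instead of 10.
    val repartitioned = rdd.repartition(40)
    println(repartitioned.partitions.length) // 40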
>>>>
>>>> Note that mapPartitionsWithIndex is sort of the "main processing loop"
>>>> for many Spark functions: it iterates through all the elements of the
>>>> partition and does some computation (probably running your user code) on
>>>> them.
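To illustrate the point (a hedged sketch of user-level usage, not Shark's
internal Operator code), mapPartitionsWithIndex hands each partition's
iterator to a function, which is why per-element work shows up under this
stage:

    // Reusing `rdd` from the sketch above: `idx` is the partition number,
    // `iter` an iterator over that partition's elements; whatever work is
    // done per element here is where the stage's time is spent.
    val tagged = rdd.mapPartitionsWithIndex { (idx, iter) =>
      iter.map(line => s"partition $idx: $line")
    }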
>>>>
>>>> You can see the number of partitions in your RDD by visiting the Spark
>>>> driver web interface. To access this, visit port 8080 on the host running
>>>> your Standalone Master (assuming you're running standalone mode), which
>>>> will have a link to the application web interface. The Tachyon master also
>>>> has a useful web interface, available at port 19999.
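As an aside, the partition count can also be read programmatically from the
driver (reusing the `rdd` from the sketch above), without going through the
web UI:

    // Number of partitions, and hence tasks in a map-side stage:
    println(rdd.partitions.length)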
>>>>
>>>>
>>>> On Sun, May 25, 2014 at 5:43 PM, qingyang li <liqingyang1985@gmail.com> wrote:
>>>>
>>>>> Hi Mayur, thanks for replying.
>>>>> I know a Spark application should take all cores by default. My question
>>>>> is: how do I set the task number on each core?
>>>>> If one slice means one task, how can I set the slice file size?
>>>>>
>>>>>
>>>>> 2014-05-23 16:37 GMT+08:00 Mayur Rustagi <mayur.rustagi@gmail.com>:
>>>>>
>>>>>> How many cores do you see on your Spark master (port 8080)?
>>>>>> By default, a Spark application should take all cores when you launch
>>>>>> it, unless you have set the max-cores configuration (spark.cores.max).
>>>>>>
>>>>>>
>>>>>> Mayur Rustagi
>>>>>> Ph: +1 (760) 203 3257
>>>>>> http://www.sigmoidanalytics.com
>>>>>>  @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, May 22, 2014 at 4:07 PM, qingyang li <liqingyang1985@gmail.com> wrote:
>>>>>>
>>>>>>> My aim in setting the task number is to increase the query speed, and
>>>>>>> I have also found that "mapPartitionsWithIndex at Operator.scala:333"
>>>>>>> <http://192.168.1.101:4040/stages/stage?id=17> is costing much time.
>>>>>>> So my other question is: how can I tune mapPartitionsWithIndex
>>>>>>> <http://192.168.1.101:4040/stages/stage?id=17> to bring that cost down?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2014-05-22 18:09 GMT+08:00 qingyang li <liqingyang1985@gmail.com>:
>>>>>>>
>>>>>>>> I have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40" in
>>>>>>>> shark-env.sh,
>>>>>>>> but I find there are only 10 tasks on the cluster and 2 tasks on each
>>>>>>>> machine.
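One caveat that may explain this: spark.default.parallelism sets the default
partition count for shuffle transformations (reduceByKey, join, and the
like), not for input splits, so a scan over 10 unsplittable files still
produces 10 map tasks. A minimal Scala sketch of the equivalent programmatic
setting (master, app name, and value are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative only: this property is the default partition count used
    // by shuffle operations when no explicit count is passed.
    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("parallelism-sketch")
      .set("spark.default.parallelism", "40")
    val sc = new SparkContext(conf)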
>>>>>>>>
>>>>>>>>
>>>>>>>> 2014-05-22 18:07 GMT+08:00 qingyang li <liqingyang1985@gmail.com>:
>>>>>>>>
>>>>>>>>> I have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40" in
>>>>>>>>> shark-env.sh.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2014-05-22 17:50 GMT+08:00 qingyang li <liqingyang1985@gmail.com>:
>>>>>>>>>
>>>>>>>>>> I am using Tachyon as the storage system and Shark to query a table
>>>>>>>>>> which is a big table. I have 5 machines as a Spark cluster, with 4
>>>>>>>>>> cores on each machine.
>>>>>>>>>> My questions are:
>>>>>>>>>> 1. How do I set the task number on each core?
>>>>>>>>>> 2. Where can I see how many partitions an RDD has?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
