Hi,
I have been reading the Tajo 0.9.0 source code and found that Tajo uses a
simple FIFO scheduler. I can accept that at the current stage. But when Tajo
picks a query from the scheduler queue and allocates workers for it, the
allocator only considers the available resources on a randomized worker list
and then selects a set of workers.
1) My question: why don't we consider HDFS locality here? Otherwise the
network will become the bottleneck.
I understand that Tajo currently does not use YARN as its scheduler and
ships a temporary, simple FIFO scheduler instead. I have also looked at
https://issues.apache.org/jira/browse/TAJO-540 , and I hope the new Tajo
scheduler will be similar to Sparrow. A sketch of the locality-aware
selection I have in mind follows.
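To make the idea concrete, here is a purely hypothetical sketch (the class
and method names are mine, not Tajo's) built on Hadoop's standard
FileSystem#getFileBlockLocations API: rank the candidate workers by how many
blocks of the query's input they host locally, instead of walking a random
list.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper, not actual Tajo code: rank candidate workers by how
// many HDFS blocks of the query's input they hold locally.
public class LocalityRank {

  public static List<String> rankByLocalBlocks(Configuration conf, Path input,
      List<String> workerHosts) throws IOException {
    FileSystem fs = input.getFileSystem(conf);
    final Map<String, Integer> localBlocks = new HashMap<String, Integer>();

    // Count, per host, how many blocks of the input live on that host.
    for (FileStatus file : fs.listStatus(input)) {
      for (BlockLocation block : fs.getFileBlockLocations(file, 0, file.getLen())) {
        for (String host : block.getHosts()) {
          Integer n = localBlocks.get(host);
          localBlocks.put(host, n == null ? 1 : n + 1);
        }
      }
    }

    // Workers holding more input blocks come first; Collections.sort is
    // stable, so equally ranked workers keep their original order.
    List<String> ranked = new ArrayList<String>(workerHosts);
    Collections.sort(ranked, new Comparator<String>() {
      public int compare(String a, String b) {
        Integer na = localBlocks.get(a);
        Integer nb = localBlocks.get(b);
        return (nb == null ? 0 : nb.intValue()) - (na == null ? 0 : na.intValue());
      }
    });
    return ranked;
  }
}

A real allocator would of course still intersect this ranking with the
available-resource check the current code already performs, so a busy local
worker does not stall the query.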
2) Performance related.
I set up a 10-node cluster (1 master, 9 workers); each node has 64 GB of
memory, 24 CPU cores, and 12 x 4 TB HDDs. The test data is 1.6 GB (160
million records).
It works well for my aggregation SQL tests except count(distinct), which is
very slow: about ten minutes.
Can someone give me a simple explanation of how Tajo executes
count(distinct)? A quick sketch of the rewrite I have in mind is below.
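My guess (please correct me if I am wrong) is that count(distinct) forces
every distinct value to be shuffled so it can be de-duplicated globally
before counting. This is the two-phase rewrite I would benchmark against;
the table and column names are made up for illustration:

-- the slow form
SELECT count(DISTINCT user_id) FROM logs;

-- two-phase form: de-duplicate first, then count the keys
SELECT count(*) FROM (SELECT DISTINCT user_id FROM logs) t;

I can share my tajo-site here: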
<configuration>
<property>
<name>tajo.rootdir</name>
<value>hdfs://realtime-cluster/tajo</value>
</property>
<!-- master -->
<property>
<name>tajo.master.umbilical-rpc.address</name>
<value>xx:26001</value>
</property>
<property>
<name>tajo.master.client-rpc.address</name>
<value>xx:26002</value>
</property>
<property>
<name>tajo.master.info-http.address</name>
<value>xx:26080</value>
</property>
<property>
<name>tajo.resource-tracker.rpc.address</name>
<value>xx:26003</value>
</property>
<property>
<name>tajo.catalog.client-rpc.address</name>
<value>xx:26005</value>
</property>
<!-- worker -->
<property>
<name>tajo.worker.tmpdir.locations</name>
<value>file:///data/hadoop/data1/tajo,file:///data/hadoop/data2/tajo,file:///data/hadoop/data3/tajo,file:///data/hadoop/data4/tajo,file:///data/hadoop/data5/tajo,file:///data/hadoop/data6/tajo,file:///data/hadoop/data7/tajo,file:///data/hadoop/data8/tajo,file:///data/hadoop/data9/tajo,file:///data/hadoop/data10/tajo,file:///data/hadoop/data11/tajo,file:///data/hadoop/data12/tajo</value>
</property>
<property>
<name>tajo.worker.tmpdir.cleanup-at-startup</name>
<value>true</value>
</property>
<property>
<name>tajo.worker.history.expire-interval-minutes</name>
<value>60</value>
</property>
<property>
<name>tajo.worker.resource.cpu-cores</name>
<value>24</value>
</property>
<property>
<name>tajo.worker.resource.memory-mb</name>
<value>60512</value> <!-- 3584 3 tasks + 1 qm task -->
</property>
<property>
<name>tajo.task.memory-slot-mb.default</name>
<value>3000</value> <!-- default 512 -->
</property>
<property>
<name>tajo.task.disk-slot.default</name>
<value>1.0f</value> <!-- default 0.5 -->
</property>
<property>
<name>tajo.shuffle.fetcher.parallel-execution.max-num</name>
<value>5</value>
</property>
<property>
<name>tajo.executor.external-sort.thread-num</name>
<value>2</value>
</property>
<!-- client -->
<property>
<name>tajo.rpc.client.worker-thread-num</name>
<value>4</value>
</property>
<property>
<name>tajo.cli.print.pause</name>
<value>false</value>
</property>
<!--
<property>
<name>tajo.worker.resource.dfs-dir-aware</name>
<value>true</value>
</property>
<property>
<name>tajo.worker.resource.dedicated</name>
<value>true</value>
</property>
<property>
<name>tajo.worker.resource.dedicated-memory-ratio</name>
<value>0.6</value>
</property>
-->
</configuration>
tajo-env.sh:
export TAJO_WORKER_HEAPSIZE=60000