tajo-dev mailing list archives

From Jihoon Son <jihoon...@apache.org>
Subject Re: Question about disk-aware scheduling in tajo
Date Thu, 13 Feb 2014 06:17:33 GMT
Hi, Min.

In DefaultTaskScheduler, each container is mapped to one disk of a node in the
cluster. When a container requests a task, DefaultTaskScheduler selects the
closest task and assigns it to that container. This process applies only to
local reads; the disk volume information is not considered for remote reads.
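The assignment logic can be sketched roughly as follows. This is a minimal illustration of the idea, not the actual Tajo code: the class name, method names, and data layout here are hypothetical.

```java
import java.util.*;

// Hypothetical sketch of disk-volume-aware task assignment, loosely modeled
// on the behavior described above (not the real DefaultTaskScheduler API).
public class VolumeAwareScheduler {
    // host -> (disk volume id -> pending tasks whose input block lives on that volume)
    private final Map<String, Map<Integer, Deque<String>>> pending = new HashMap<>();

    // Register a pending task keyed by the host and disk volume of its data.
    public void addTask(String host, int volumeId, String taskId) {
        pending.computeIfAbsent(host, h -> new HashMap<>())
               .computeIfAbsent(volumeId, v -> new ArrayDeque<>())
               .add(taskId);
    }

    // A container bound to (host, volumeId) requests a task: prefer a task
    // whose data is on the same disk, then any task on the same host.
    // Remote tasks get no volume-aware treatment, matching the description above.
    public String assign(String host, int volumeId) {
        Map<Integer, Deque<String>> byVolume = pending.get(host);
        if (byVolume == null) return null;
        Deque<String> sameDisk = byVolume.get(volumeId);
        if (sameDisk != null && !sameDisk.isEmpty()) return sameDisk.poll();
        for (Deque<String> queue : byVolume.values()) {
            if (!queue.isEmpty()) return queue.poll();  // same host, different disk
        }
        return null;  // no local task; caller falls back to non-local assignment
    }
}
```

The point of keying the queues by volume id (exposed via HDFS-3672) is that two containers on the same node can be steered to different physical disks, spreading I/O instead of contending on one spindle.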

In my opinion, this is sufficient for us because there are few remote tasks in
each sub query. In a test on an in-house cluster of 32 nodes, the ratio of
remote tasks to total tasks was only about 0.17% (the query was 'select
l_orderkey from lineitem', and the lineitem table was about 1 TB). Since the
number of remote tasks was very small, there was little disk contention.

Hope that answers your questions.

2014-02-13 11:00 GMT+09:00 Min Zhou <coderplay@gmail.com>:

> Hi all,
> Tajo leverages the feature supported by HDFS-3672, which exposes the disk
> volume id of each hdfs data block.  I already found the related code in
> DefaultTaskScheduler.assignToLeafTasks. Can anyone explain the logic for
> me? What does the scheduler do when the hdfs read is a remote read on
> another machine's disk?
> Thanks,
> Min
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
