tajo-dev mailing list archives

From Henry Saputra <henry.sapu...@gmail.com>
Subject Re: Question about disk-aware scheduling in tajo
Date Wed, 19 Feb 2014 19:53:46 GMT
Cool, thx Hyunsik

On Wednesday, February 19, 2014, Hyunsik Choi <hyunsik@apache.org> wrote:

> Henry,
>
> Thank you for your comment :)
>
> In my opinion, this explanation is too specific, and comments in the
> source code are a more accessible place for it. So for now, I'll add this
> explanation to the source code as comments. Later, I'll try to create some
> design documentation.
>
> Thanks,
> Hyunsik Choi
>
>
> On Wed, Feb 19, 2014 at 10:44 AM, Henry Saputra <henry.saputra@gmail.com>
> wrote:
>
> > Hyunsik,
> >
> > I'm +1 that this is worthy of a wiki page or a separate page explaining
> > the technical flow. Tracing the flow in the code is confusing.
> >
> > Maybe similar to the doc like in YARN [1] page (the more details the
> > better ^_^)
> >
> >
> > - Henry
> >
> > [1]
> >
> http://hadoop.apache.org/docs/current2/hadoop-yarn/hadoop-yarn-site/YARN.html
> >
> > On Thu, Feb 13, 2014 at 2:18 AM, Hyunsik Choi <hyunsik@apache.org>
> wrote:
> > > Hi Min,
> > >
> > > Above all, I'm very sorry for the lack of documentation in the source
> > > code. So far, we have developed Tajo with insufficient documentation,
> > > pursuing only a quick-and-dirty manner. We should add more documentation
> > > to the source code.
> > >
> > > I'm going to explain how Tajo uses disk volume locality. Before that,
> > > I would like to explain the node locality that you may already know.
> > > Similar to MR, Tajo uses three-level locality for each task: the
> > > scheduler tries the local node, the closest rack, and then a random
> > > node, in that order. In Tajo, the scheduler additionally tries the
> > > local disk volume before the local node.
> > >
> > > The important thing is that we don't need to be aware of the actual
> > > disk volume IDs on each local node; we just assign disk volumes to the
> > > TaskRunners in a node in a round-robin manner. That is sufficient to
> > > improve load balancing across disk volumes.
> > >
> > > Initially, TaskRunners are not mapped to disk volumes in each worker;
> > > the mapping happens dynamically in the scheduler. For example, suppose
> > > there are 6 local tasks for some node N1 which has 3 disk volumes (v1,
> > > v2, and v3), and three TaskRunners (T1, T2, and T3) will be running on
> > > node N1.
> > >
> > > When tasks are added to the scheduler, the scheduler gets the disk
> > > volume id from each task. Each volume id is just an integer, a logical
> > > identifier that distinguishes different disk volumes. Then, the
> > > scheduler builds a map between the disk volume ids (obtained from
> > > BlockStorageLocation) in each node and lists of tasks
> > > (DefaultTaskScheduler::addQueryUnitAttemptId). In other words, each
> > > entry in the map consists of one disk volume id and the list of tasks
> > > corresponding to that disk volume.
> > >
> > > When the first task is requested by a TaskRunner T1 in node N1, the
> > > scheduler assigns the first disk volume v1 to T1, and then schedules
> > > one task that belongs to v1. Later, when a task is requested by a
> > > different TaskRunner T2 in node N1, the scheduler assigns the second
> > > disk volume v2 to T2, and then schedules a task that belongs to v2.
> > > When a task request comes from T1 again, the scheduler schedules
> > > another task from v1, because T1 is already mapped to v1.
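The request-driven binding above can be sketched as follows. This is a simplified model, not Tajo's real scheduler: on a runner's first request it is bound to the next unassigned volume (round robin), and afterwards it keeps draining tasks from that volume's list. All names here are hypothetical:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch of dynamic TaskRunner-to-volume mapping.
public class VolumeScheduler {
  private final Map<Integer, LinkedList<String>> tasksByVolume;
  private final Queue<Integer> unassignedVolumes;
  private final Map<String, Integer> runnerToVolume = new HashMap<>();

  public VolumeScheduler(Map<Integer, LinkedList<String>> tasksByVolume) {
    this.tasksByVolume = tasksByVolume;
    this.unassignedVolumes = new ArrayDeque<>(tasksByVolume.keySet());
  }

  // Returns the next task id for this runner, or null if none remain.
  public String nextTask(String runnerId) {
    Integer vol = runnerToVolume.get(runnerId);
    if (vol == null) {
      // First request from this runner: bind it to the next free volume.
      vol = unassignedVolumes.poll();
      if (vol == null) {
        return null; // more runners than volumes: not handled in this sketch
      }
      runnerToVolume.put(runnerId, vol);
    }
    LinkedList<String> pending = tasksByVolume.get(vol);
    return (pending == null || pending.isEmpty()) ? null : pending.poll();
  }
}
```

Note that with this scheme a runner stays pinned to one volume, so two runners on the same node never compete for the same disk.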
> > >
> > > Like MR, Tajo uses dynamic scheduling, and it works very well in
> > > environments where nodes have disks with different performance. If you
> > > have additional questions, please feel free to ask.
> > >
> > > Also, I'll create a Jira issue to add this explanation to
> > > DefaultTaskScheduler.
> > >
> > > - hyunsik
> > >
> > >
> > >
> > > On Thu, Feb 13, 2014 at 5:29 PM, Min Zhou <coderplay@gmail.com> wrote:
> > >
> > >> Hi Jihoon,
> > >>
> > >> Thank you for your answer. However, it seems you didn't explain how
> > >> Tajo uses the disk information to balance the I/O overhead.
> > >>
> > >> I still can't understand the details; they're quite complex to me,
> > >> especially the class TaskBlockLocation.
> > >>
> > >>
> > >> public static class TaskBlockLocation {
> > >>     // This is a mapping from diskId to a list of pending tasks, right?
> > >>     private HashMap<Integer, LinkedList<QueryUnitAttemptId>>
> > >>         unAssignedTaskMap;
