On Mon, Aug 27, 2012 at 8:40 PM, Min Zhou <coderplay@gmail.com> wrote:
> Hi all,
>
> I was very excited that you guys decided to start Apache Drill, an open
> source version of Google's Dremel. I was a contributor to Apache Hive and am
> skilled in Hadoop-related development. We have a nearly 3000-node cluster in
> production, one of the largest clusters in the world.
>
> Dremel has become more and more popular since Google's BigQuery was released.
> I took an interest in it nearly two years ago. This paper
> (http://research.google.com/pubs/pub36632.html) describes how Dremel organizes
> records into nested columnar data. But there’s almost no information about
> how Dremel stores those columns. I have many questions on this point.
>
>
> 1. Is there one file for each column?
>
I think it is a less important implementation detail. What is important is
that you don't incur IO for non-projected columns.
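To make the point about non-projected columns concrete, here is a minimal
Python sketch. It assumes one file per column purely to keep the example
short; the Dremel paper does not say how columns are laid out on disk, and
the names write_columns, read_projection and demo are made up for
illustration.

```python
import os
import tempfile


def write_columns(directory, table):
    """Store each column in its own file, one value per line (an assumed layout)."""
    for name, values in table.items():
        with open(os.path.join(directory, name + ".col"), "w") as f:
            f.write("\n".join(str(v) for v in values))


def read_projection(directory, projected):
    """Open only the files of the projected columns.

    Returns the reassembled rows and the number of bytes on disk that
    had to be touched to answer the query.
    """
    columns, bytes_read = {}, 0
    for name in projected:
        path = os.path.join(directory, name + ".col")
        bytes_read += os.path.getsize(path)
        with open(path) as f:
            columns[name] = f.read().split("\n")
    rows = list(zip(*(columns[name] for name in projected)))
    return rows, bytes_read


def demo():
    # Hypothetical table: two narrow columns and one wide one.
    table = {
        "url": ["a.com", "b.com", "c.com"],
        "country": ["us", "cn", "de"],
        "payload": ["x" * 1000, "y" * 1000, "z" * 1000],
    }
    with tempfile.TemporaryDirectory() as d:
        write_columns(d, table)
        rows, narrow = read_projection(d, ["url", "country"])
        _, full = read_projection(d, list(table))
    return rows, narrow, full
```

Calling demo() projects only url and country, so the wide payload file is
never opened and the bytes touched stay far below those of a full scan. The
same property could be had with several columns packed into one file, as long
as the reader can seek past the columns it does not need.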
> 2. It seems that Dremel has no restriction on where data must be stored:
> local disk, GFS, or Bigtable could all be the target storage. If the data
> is in GFS, how does Dremel retrieve records from different nodes?
> How is data locality guaranteed?
>
Data locality is not mandatory. The paper states clearly that data is either
local or accessed remotely. Search the Dremel paper or slide deck for "in-situ"
and "local".
> 3. The paper states that "The blocks in each stripe are prefetched
> asynchronously; the read-ahead cache typically achieves hit rates of
> 95%." Does GFS support async prefetching?
>
>
> Have you considered the questions above? What are your answers?
>
> BTW, could I join you guys in starting such a cool project?
>
It is open to everyone.
>
>
> Thanks,
> Min
>
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
>