drill-dev mailing list archives

From Camuel Gilyadov <cam...@gmail.com>
Subject Re: A list of questions on Dremel (or Apache Drill)'s columnar storage
Date Mon, 27 Aug 2012 23:10:29 GMT
On Mon, Aug 27, 2012 at 8:40 PM, Min Zhou <coderplay@gmail.com> wrote:

> Hi all,
> I was very excited that you guys decided to start Apache Drill, an open
> source version of Google's Dremel. I was a contributor to Apache Hive, and
> am skilled in Hadoop-related development. We have a nearly 3000-node
> cluster in production, one of the largest clusters in the world.
> Dremel has become more and more popular since Google's BigQuery was
> released. I took an interest in it nearly two years ago. This paper
> (http://research.google.com/pubs/pub36632.html) describes how Dremel
> organizes records into nested columnar data. But there’s almost no
> information about how Dremel stores those columns. I have many questions
> on this point.
>    1. Is there one file for each column?

I think that is a less important implementation detail. What is important
is that you don't incur IO for non-projected columns.
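
To make the point about non-projected columns concrete, here is a minimal
sketch in Python. It assumes a hypothetical one-file-per-column layout with
one JSON value per line; that layout, the ".col" suffix, and the function
names are illustrative assumptions only, not how Dremel or Drill actually
store data. It shows that a reader which opens only the projected column
files touches far fewer bytes than the whole table occupies.

```python
import json
import os
import tempfile


def write_columns(directory, rows):
    """Store each field in its own file, one JSON value per line.

    Hypothetical layout for illustration; not Dremel's on-disk format.
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    for name, values in columns.items():
        with open(os.path.join(directory, name + ".col"), "w") as f:
            for value in values:
                f.write(json.dumps(value) + "\n")


def read_projection(directory, projected):
    """Read only the projected columns and return (rows, bytes_read)."""
    bytes_read = 0
    data = {}
    for name in projected:
        path = os.path.join(directory, name + ".col")
        bytes_read += os.path.getsize(path)
        with open(path) as f:
            data[name] = [json.loads(line) for line in f]
    # Stitch the projected columns back into row-shaped dicts.
    column_values = [data[name] for name in projected]
    rows = [dict(zip(projected, values)) for values in zip(*column_values)]
    return rows, bytes_read


def demo():
    """Project one small column out of a table with a large payload column."""
    rows = [
        {"url": "http://a", "country": "us", "payload": "x" * 1000},
        {"url": "http://b", "country": "gb", "payload": "y" * 1000},
    ]
    with tempfile.TemporaryDirectory() as directory:
        write_columns(directory, rows)
        result, cost = read_projection(directory, ["country"])
        total = sum(
            os.path.getsize(os.path.join(directory, name))
            for name in os.listdir(directory)
        )
    return result, cost, total


if __name__ == "__main__":
    result, cost, total = demo()
    print(result)
    print("bytes read:", cost, "of", total)
```

Whether the columns live in separate files or in separate regions of one
file, the same property holds as long as the reader can seek past the
columns it does not need.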

>    2. It seems that Dremel does not restrict where data must be stored:
>    local disk, GFS, or Bigtable could all be the target storage. If the
>    data is in GFS, how does Dremel retrieve records from different nodes?
>    How is data locality guaranteed?

Data locality is not mandatory. The paper states clearly that data is either
local or accessed remotely. Search the Dremel paper or slide deck for
"in-situ" and "local".

>    3. The paper says that "The blocks in each stripe are prefetched
>    asynchronously; the read-ahead cache typically achieves hit rates of
>    95%." Does GFS support async prefetching?
> Have you considered the questions above? What are your answers?
> BTW, could I join you guys to start such a cool project?

It is open to everyone.

> Thanks,
> Min
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
