drill-dev mailing list archives

From Min Zhou <coderp...@gmail.com>
Subject A list of questions on Dremel (or Apache Drill)'s columnar storage
Date Mon, 27 Aug 2012 17:40:41 GMT
Hi all,

I was very excited that you guys decided to start Apache Drill, an open
version of Google's Dremel. I am a contributor to Apache Hive and
skilled in Hadoop-related development. We have a nearly 3000-node
cluster in production, one of the largest clusters in the world.

Dremel has become more and more popular since Google's BigQuery was
released. I took an interest in it nearly two years ago. The Dremel paper
describes how Dremel organizes records into nested columnar data, but
there is almost no information on how Dremel stores those columns. I have
several questions on this point.

   1. Is there one file for each column?
   2. It seems that Dremel places no restriction on where data must be
   stored: local disk, GFS, or Bigtable could all be the target storage.
   If the data is in GFS, how does Dremel retrieve records from different
   nodes? How is data locality guaranteed?
   3. The paper states that "The blocks in each stripe are prefetched
   asynchronously; the read-ahead cache typically achieves hit rates of
   95%." Does GFS support async prefetching?

Have you considered the questions above? What are your answers?

BTW, could I join you guys in starting such a cool project?


My research interests are distributed systems, parallel computing, and
bytecode-based virtual machines.

My profile:
My blog:
