drill-dev mailing list archives

From Camuel Gilyadov <cam...@gmail.com>
Subject Re: A list of questions on Dremel (or Apache Drill)'s columnar storage
Date Mon, 27 Aug 2012 23:10:29 GMT
On Mon, Aug 27, 2012 at 8:40 PM, Min Zhou <coderplay@gmail.com> wrote:

> Hi all,
>
> I was very excited that you guys decided to start Apache Drill, an open
> source version of Google's Dremel. I was a contributor to Apache Hive and
> am skilled in Hadoop-related development. We run a cluster of nearly 3000
> nodes in production, one of the largest clusters in the world.
>
> Dremel has become more and more popular since Google's BigQuery was
> released. I took an interest in it nearly two years ago. The paper
> (http://research.google.com/pubs/pub36632.html) describes how Dremel
> organizes records into nested columnar data, but there is almost no
> information about how Dremel stores those columns. I have many questions
> on this point.
>
>
>    1. Is there one file for each column?
>

I think that is a less important implementation detail. What is important is
that you don't incur IO for non-projected columns.
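
To illustrate what "no IO for non-projected columns" means, here is a minimal
sketch. It is not Dremel's or Drill's actual storage format; ColumnStore,
scan, and bytes_read are hypothetical names, and each column is simply kept
in its own in-memory byte buffer so the bytes touched by a query can be
counted. It assumes every row has the same fields.

```python
class ColumnStore:
    """Toy columnar layout: one byte buffer per column (hypothetical, not Dremel's format)."""

    def __init__(self, rows):
        # Pivot row-oriented records into per-column value lists.
        columns = {}
        for row in rows:
            for name, value in row.items():
                columns.setdefault(name, []).append(str(value))
        # Serialize each column independently, as if each were its own file or stripe.
        self.buffers = {
            name: "\n".join(values).encode("utf-8")
            for name, values in columns.items()
        }
        self.bytes_read = 0  # counts simulated IO

    def scan(self, projected):
        """Read only the projected columns; other buffers are never touched."""
        result = {}
        for name in projected:
            data = self.buffers[name]
            self.bytes_read += len(data)
            result[name] = data.decode("utf-8").split("\n")
        return result


store = ColumnStore([
    {"url": "a", "lang": "en", "size": 10},
    {"url": "b", "lang": "fr", "size": 20},
])
urls = store.scan(["url"])
total_bytes = sum(len(buf) for buf in store.buffers.values())
```

Whether the buffers map to one file per column or to column regions inside a
shared file does not change the result: a scan that projects only "url" reads
only the "url" bytes.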

>    2. It seems that Dremel does not restrict where data is stored: local
>    disk, GFS, or Bigtable could all be the target storage. If the data is
>    in GFS, how does Dremel retrieve records from different nodes?
>    How is data locality guaranteed?
>

Data locality is not mandatory. The paper clearly states that data is either
local or accessed remotely. Search the Dremel paper or slide deck for
"in-situ" and "local".
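
As a rough sketch of "local if possible, remote otherwise" (an assumption
about how a reader might behave, not something taken from the Dremel paper or
from GFS), a reader could prefer a replica on its own host and fall back to
any other replica. pick_replica and the host names are hypothetical.

```python
def pick_replica(block_replicas, local_host):
    """Return (host, is_local): prefer a replica on local_host, else use a remote one."""
    if not block_replicas:
        raise ValueError("block has no replicas")
    if local_host in block_replicas:
        return local_host, True
    # No local copy: locality is an optimization, so read remotely instead of failing.
    return block_replicas[0], False


local_choice = pick_replica(["node1", "node2"], "node2")
remote_choice = pick_replica(["node1", "node2"], "node3")
```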


>    3. The paper states that "The blocks in each stripe are prefetched
>    asynchronously; the read-ahead cache typically achieves hit rates of
>    95%." Does GFS support async prefetching?
>
>
> Have you considered the questions above? What are your answers?
>
> BTW, could I join you guys on such a cool project?
>

It is open to everyone.


>
>
> Thanks,
> Min
>
> --
> My research interests are distributed systems, parallel computing, and
> bytecode-based virtual machines.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
>
