drill-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Querying wide rows with Drill
Date Tue, 11 Nov 2014 20:54:34 GMT
From the context of the original post, I don't expect to see more than
dozens to perhaps thousands of columns.  In fact, it is common for this
wide table format to represent the minority of the data and for a blob-ish
array format to make up the remainder.  I think that the approaches
detailed in our book on time series [1] are quite similar to the OP's.

The retrieval typically requires reading thousands to hundreds of thousands
of rows and results in thousands to low millions of rows after flattening.
There would likely be some performance boost if the data source produced the
fully flattened form directly, but I am dubious that the improvement would
be large compared to the flatten approach.
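
As a concrete sketch of that flatten approach (table name, column family,
and row-key layout here are all made up), the Drill SQL looks something like
this; KVGEN is the function that grew out of the "mappify" operation Steven
mentions below:

    -- one record per (row key, qualifier, value) triple from a hypothetical
    -- HBase table `metrics` whose wide, sparse columns live in family `f`
    SELECT sub.rk           AS row_key,
           sub.kv.`key`     AS qualifier,
           sub.kv.`value`   AS cell_value
    FROM (
      -- KVGEN turns the column-family map into an array of {key, value}
      -- pairs; FLATTEN then expands that array into one record per pair
      SELECT CONVERT_FROM(t.row_key, 'UTF8') AS rk,
             FLATTEN(KVGEN(t.f))             AS kv
      FROM hbase.`metrics` t
    ) sub;

Once flattened, these are ordinary records, so the usual filtering and
aggregation applies downstream.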

I base this on experience with the Java code base in OpenTSDB.  There,
it is common for even large blob-format queries to be dominated by actual
processing rather than by data marshalling from the database format.  Drill
is likely to do even better since it can fully parallelize the marshalling
across many drillbits.


On Tue, Nov 11, 2014 at 2:45 PM, Steven Phillips <sphillips@maprtech.com>
wrote:

> To clarify, when I said a new HBaseRecordReader, I was referring to the
> Drill class that reads data using the HBase client and writes into the
> ValueVectors. In the current implementation, we have a vector for each
> column, which would mean for a sparse table, we would end up with
> potentially millions of vectors, which would not be very efficient at all.
> In the new implementation, we would simply have a RepeatedMapVector, with a
> Key and Value vector nested inside. You are correct that this will work
> without any special support from the DB layer.
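
For concreteness, with made-up data, here is roughly the difference in shape
for a single sparse row of a hypothetical table `metrics`. Against the
current reader, SELECT * FROM hbase.`metrics` comes back with one column per
distinct qualifier seen anywhere in the scan:

    {"row_key": "sensor-42",
     "f": {"t1415700000": "12.5", "t1415700060": "12.7", ...}}

while the key/value form would come back as a single repeated map, i.e. one
nested key vector and one nested value vector no matter how many distinct
qualifiers the scan touches:

    {"row_key": "sensor-42",
     "f": [{"key": "t1415700000", "value": "12.5"},
           {"key": "t1415700060", "value": "12.7"}]}
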
> On Tue, Nov 11, 2014 at 12:37 PM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
> > On Tue, Nov 11, 2014 at 1:46 PM, Steven Phillips <sphillips@maprtech.com>
> > wrote:
> >
> > > For this to really work well in your case, I think we need to be able
> > > to push the "mappify" operation into the scan. In other words, we need
> > > the HBase scan to output the records in the desired key/value format.
> > > Currently, the HBase scan will output in the normal, sparse column
> > > schema, and then a separate operator would convert it.
> > >
> > > One way to do this would be to write a new HBaseRecordReader that
> > > outputs in the key/value mode, and then have a system/session option
> > > to set which mode to use.
> > >
> >
> > Actually, I think that what you suggest would be plenty fast even without
> > any special support in the DB layer.  The key limitation is rows per
> > second retrieved from the DB, not rows per second processed by Drill.
> >
> > This is *very* exciting.
> >
> --
>  Steven Phillips
>  Software Engineer
>  mapr.com
