drill-user mailing list archives

From Jacques Nadeau <jacques.dr...@gmail.com>
Subject Re: Getting plugged in... (Cassandra and Drill?)
Date Tue, 22 Jan 2013 22:41:24 GMT
Hey Brian,

Yeah, the storage engine APIs haven't been defined yet.  Expounding a bit
on the high-level goals, including what we had in the JIRA:

The primary interface is the Storage Engine Capabilities API.  It should
describe everything that the particular storage engine supports.  This
includes whether the storage engine supports serialization and
deserialization, and what types of logical operator capabilities it
supports internally.  It also needs to include a description of statistics
capabilities (e.g. supports approximate row keys, average row size, total
data size, data distribution statistics, etc.) and metadata capabilities.

Statistics API: Provide the actual statistics information that is utilized
during query planning.
Metadata API: Provide information about the available sub data sources
(tables, keyspaces, etc.) along with locality information, schema
information, type information, primary and secondary index types,
partitioning information, etc.  Portions of this information are used in
query parsing, others in query planning, and others in execution planning.
Deserialization API: Convert a particular data source into one of our two
canonical in-memory formats (row-based or column-based).  Additionally,
support particular types of logical operation pushdown.
Serialization API: Serialize the in-memory format back into the persistent
storage format.
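
As a rough illustration of how the pieces above could fit together, here is
a minimal Java sketch.  Since the storage engine APIs are not defined yet,
every name in it (StorageEngineSketch, Capabilities, Statistics, Metadata,
Deserializer, InMemoryEngine, scan) is hypothetical and not part of Drill.
It only shows the row-based format and a single pushdown (LIMIT), and it
leaves serialization out; treat it as a starting point for discussion, not
as a proposed design.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch only: every name here is invented for illustration.
final class StorageEngineSketch {

    /** Logical operations an engine might evaluate internally (pushdown). */
    enum LogicalOp { FILTER, PROJECT, LIMIT }

    /** Capabilities API: describes what the engine supports. */
    interface Capabilities {
        boolean supportsSerialization();
        boolean supportsDeserialization();
        Set<LogicalOp> pushdownOps();
    }

    /** Statistics API: numbers a query planner could consult. */
    interface Statistics {
        long approximateRowCount(String table);
    }

    /** Metadata API: the sub data sources (tables, keyspaces, ...) available. */
    interface Metadata {
        List<String> tables();
    }

    /** Deserialization API: read a source into a row-based in-memory form. */
    interface Deserializer {
        // limit < 0 means "no limit"; an engine without LIMIT pushdown may ignore it.
        List<Object[]> readRows(String table, int limit);
    }

    interface StorageEngine {
        Capabilities capabilities();
        Statistics statistics();
        Metadata metadata();
        Deserializer deserializer();
    }

    /** Toy engine backed by a map, standing in for a real store. */
    static final class InMemoryEngine implements StorageEngine {
        private final Map<String, List<Object[]>> data = new LinkedHashMap<>();

        void put(String table, List<Object[]> rows) { data.put(table, rows); }

        public Capabilities capabilities() {
            return new Capabilities() {
                public boolean supportsSerialization() { return false; }
                public boolean supportsDeserialization() { return true; }
                public Set<LogicalOp> pushdownOps() { return EnumSet.of(LogicalOp.LIMIT); }
            };
        }
        public Statistics statistics() { return table -> data.get(table).size(); }
        public Metadata metadata() { return () -> new ArrayList<>(data.keySet()); }
        public Deserializer deserializer() {
            return (table, limit) -> {
                List<Object[]> rows = data.get(table);
                int n = limit < 0 ? rows.size() : Math.min(limit, rows.size());
                return new ArrayList<>(rows.subList(0, n));
            };
        }
    }

    /** Planner-side helper: push LIMIT down only if the engine advertises it. */
    static List<Object[]> scan(StorageEngine engine, String table, int limit) {
        boolean pushdown = engine.capabilities().pushdownOps().contains(LogicalOp.LIMIT);
        List<Object[]> rows = engine.deserializer().readRows(table, pushdown ? limit : -1);
        // Apply the limit here too, so engines without pushdown still behave correctly.
        int n = limit < 0 ? rows.size() : Math.min(limit, rows.size());
        return rows.subList(0, n);
    }

    public static void main(String[] args) {
        InMemoryEngine engine = new InMemoryEngine();
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {"a", 1});
        rows.add(new Object[] {"b", 2});
        rows.add(new Object[] {"c", 3});
        engine.put("entities", rows);
        System.out.println(engine.metadata().tables());
        System.out.println(engine.statistics().approximateRowCount("entities"));
        System.out.println(scan(engine, "entities", 2).size());
    }
}
```

The point of routing everything through capabilities() is that the planner
never has to know which store it is talking to: it asks what the engine can
do internally and falls back to doing the work itself otherwise.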

If you wanted to take a look at other projects' existing interfaces around
each of these things and then try to draw up a design, that would be really


On Mon, Jan 21, 2013 at 8:20 PM, Brian O'Neill <bone@alumni.brown.edu> wrote:

> Hey crew. Thanks for all the useful replies.
> With respect to data model/selective queries:
> Understood.  I am open to, and anticipated, creating wide-row indexes that
> would cut down on the range queries.  With the right number of wide-row
> indexes that support the appropriate dimensions, we can probably cut down
> on the requisite full table scans.
> I'm even open to creating a CF/table specifically to support the Dremel
> data model.  (And I'm looking at the recent release of Cassandra native
> support for collections to see if they help with that approach)
> http://brianoneill.blogspot.com/2013/01/native-support-for-collections-in.html
> For cases where wide-rows can't be constructed (e.g. we can't fully
> anticipate the dimensions needed), we might be able to handle full-table
> scans if we made the Drill API implementation aware of the
> partitions/token-space in Cassandra.  I saw that you mention locality on
> DRILL-13; vnode information from Cassandra might help there.  With that, at
> least you could send the queries to the right host.
> (thinking out loud)
> Regardless, I can certainly come up with a straw-man data model that I
> believe is common in the Cassandra community, and we can brainstorm to see
> what makes sense.
> I'm certainly game for taking on DRILL-16 and contributing to DRILL-13.
> Solving this is a priority for us and Drill seems promising.
> I didn't see any pointers to the Storage Engine API on the issue.  I've
> got the code down from github, but didn't see much:
> bone@zen:~/git/boneill42/incubator-drill/sandbox-> find . -name '*.java' | grep storage
> ./prototype/contrib/storage-hbase/src/main/java/org/apache/drill/App.java
> ./prototype/contrib/storage-hbase/src/test/java/org/apache/drill/AppTest.java
> Can anybody point me in the right direction?
> -brian
> ---
> Brian O'Neill
> Lead Architect, Software Development
> Health Market Science
> The Science of Better Results
> 2700 Horizon Drive • King of Prussia, PA • 19406
> M: 215.588.6024 • @boneill42 <http://www.twitter.com/boneill42> •
> healthmarketscience.com
> This information transmitted in this email message is for the intended
> recipient only and may contain confidential and/or privileged material. If
> you received this email in error and are not the intended recipient, or
> the person responsible to deliver it to the intended recipient, please
> contact the sender at the email above and delete this email and any
> attachments and destroy any copies thereof. Any review, retransmission,
> dissemination, copying or other use of, or taking any action in reliance
> upon, this information by persons or entities other than the intended
> recipient is strictly prohibited.
> On 1/21/13 2:23 PM, "Jacques Nadeau" <jacques.drill@gmail.com> wrote:
> >Hey Brian,
> >
> >Welcome to the list!
> >
> >Here are some thoughts
> >
> >On Sun, Jan 20, 2013 at 8:37 PM, Brian O'Neill <bone@alumni.brown.edu> wrote:
> >
> >> Last week, Brad Anderson came up and presented at the PhillyDB meetup.
> >> http://www.slideshare.net/boorad/phillydb-talk-beyond-batch
> >>
> >> He gave us an overview of Drill, and I'm curious...
> >>
> >> Presently, we heavily use Storm + Cassandra.
> >>
> >> http://brianoneill.blogspot.com/2012/08/a-big-data-trifecta-storm-kafka-and.html
> >>
> >> We treat CRUD operations as events. Then within Storm we calculate
> >> aggregate counts of entities flowing through the system by various
> >> dimensions.   That works well, but we still need an ad hoc reporting
> >> capability, and a way to report on data in the system that is not
> >> active (historical).
> >>
> >> Seems like a great use case for Drill.
> >
> >
> >> Would it be possible to use the Drill engine against a Cassandra
> >> backend?
> >> If so, what does that mean?   (implementing some API?)
> >>
> >
> >Yes.  One of our goals is to have a defined storage engine API with
> >required and optional features to add new data sources.  In fact, we have
> >DRILL-16 which is dependent on DRILL-13 which specifically outlines this
> >goal.  DRILL-13 is the base API and DRILL-16 is the Cassandra
> >implementation.  Depending on your level of interest and time, we would
> >love to have some help on DRILL-13.
> >
> >>
> >> I assume that performance would be terrible unless somehow the data is
> >> stored using the columnar data format from the Dremel paper.  Is that
> >> accurate?  Does anyone know if anyone has attempted a translation of
> >> that format to Cassandra?
> >>
> >One of the visions behind Dremel and Drill is that full table scans are
> >okay.  Part of the reason is the compact format of the data and the fact
> >that you only read important columns.  I'd expect that for many schema
> >designs, in-situ querying of Cassandra could be pretty effective.
> >
> >One of the things we've talked about is supporting caching
> >transformations.  E.g. the first time you query a source, it may be
> >automatically reorganized in a more efficient format.  This works really
> >well with HDFS's write-once scheme.  Harder with something like Cassandra
> >depending on how you're using it.
> >
> >
> >
> >> Regardless, I'm very interested in getting involved and no stranger to
> >> getting my hands dirty.
> >> Let me know if you can provide any direction. (our entities are
> >> currently stored in JSON in Cassandra)
> >>
> >>
> >As mentioned above, if you wanted to start a discussion and work on
> >DRILL-13, that would be very helpful.  Since we're still very much in
> >alpha development right now, another helpful item would be to document
> >your rough schema, available secondary indexes and example queries/needs
> >on the wiki.  You could then translate those into Drill Logical plan
> >syntax.  We could use these as early test cases to ensure the system will
> >support these effectively.
> >
> >
> >Welcome,
> >
> >Jacques
> >
> >
> >
> >> -brian
> >>
> >>
> >> --
> >> Brian ONeill
> >> Lead Architect, Health Market Science (http://healthmarketscience.com)
> >> mobile:215.588.6024
> >> blog: http://brianoneill.blogspot.com/
> >> twitter: @boneill42
> >>
