drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Brian O'Neill <b...@alumni.brown.edu>
Subject Re: High-level architecture
Date Thu, 31 Jan 2013 20:43:04 GMT
Great points. Thanks Ted.

I'm not sure if it is possible, but if there were a Storm topology deployment option, I think
there might be appetite for that since it would reduce the operations/admin complexity significantly
for consumers that already have Storm deployed.  (IMHO) I would be willing to sacrifice some
performance to maintain only one set of distributed processing infrastructure.  

With respect to locality information, I think Storm will eventually need to add out-of-band
information to optimize the tuple routing.  We developed the storm-cassandra bolt, and I'm
eager to get to the point where we can supply ring/token information to Storm so it can route
the tuples to the nodes that contain the data.

(Maybe it gets carried around in the tuple and leveraged by the underlying infrastructure
-- much like Nathan did with transaction id for Trident?)

But I fully appreciate your points. (especially regarding java-centricity, serialization,
kryo, etc.)


Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42  • healthmarketscience.com

On Jan 30, 2013, at 3:16 PM, Ted Dunning wrote:

> On Wed, Jan 30, 2013 at 11:53 AM, Brian O'Neill <bone@alumni.brown.edu>wrote:
>> ...
>> How do we intend to distribute the execution engine across a set of
>> machines?
> There are a variety of thoughts.  These include:
> - custom built execution controller similar to Storm's Nimbus
> - use Storm's Nimbus
> - use the custom built controller via Yarn.  Or Mesos.  Or the MapR warden
> - start them by hand.
> Obviously the last option will be the one that is used in initial testing.
> Any thought to deploying the engine as a Storm topology?
> Using Storm probably limits the performance that we can get.  Storm's
> performance is creditable but not super awesomely impressive.
> Some of the performance issues with Storm include:
> - limited to Java.  This may or may not make a difference in the end in
> terms of performance, but we definitely want flexibility here.  Java can be
> awesomely fast (see LMAX and Disruptor), but C++ may be more predictable.
> We definitely *don't* want to decide for all time right now which option
> we take and we definitely *do* want to have the C++ option in our
> hip-pocket later regardless of how we build execution engines now.  Part of
> Storm's limitations here have to do with the use of Kryo instead of a
> portable serializer like protobufs.
> - the designs I have seen or heard batting around tend to deal with batches
> of records represented in an ephemeral column oriented design.  It will
> also be important for records to be kept in unmaterialized, virtual form to
> minimize copying, especially when flattening arrays and such.  Storm allows
> tuples to be kept in memory when bolts are on the same machine, but insists
> on serializing and deserializing them at the frontier.  To control this, we
> would have to nest serializations which seems a bit like incipient insanity.
> Other issues include:
> - Drill execution engines will need access to a considerable amount of
> out-of-band information such as schemas and statistics.  How do we manage
> that in a restrictive paradigm like Storm
> - Storm hides location from Bolts.  Drill needs to make decisions based on
> location of execution engines and data.

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message