spot-dev mailing list archives

From Michael Ridley <mrid...@cloudera.com>
Subject Re: [Discuss] - Future plans for Spot-ingest
Date Mon, 17 Apr 2017 21:16:50 GMT
Also agree with that.  This is where my knowledge could be better, but I
know I have seen it in the past where some Parquet files were written in a
way that Impala could read them but Hive could not.  As much as possible it
seems good to avoid that.  I'm not sure what the least common
denominator/safest way to write would be.  Maybe you have more experience
with that, Mark?

Regardless of what writes the Parquet files, I think it's not a bad design
goal to have them be as broadly readable as possible (e.g., if we do use
Impala to write them, it would be good to have them still readable by
Spark, Hive, etc.).

Michael
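
To illustrate the compatibility goal above, here is a minimal sketch, assuming
Spark ends up being the writer. The two setting names are real Spark SQL
options; the legacy format one corresponds to the "legacy mode" Mark mentions
in the quoted message below. The helper name, the dictionary name, and the
chosen values are hypothetical and are not part of Spot.

```python
from typing import Dict, List

# Real Spark SQL setting names; the values are an assumption about what
# "broadly readable" should mean when Hive and Impala are also readers.
COMPAT_PARQUET_CONF: Dict[str, str] = {
    # Write Parquet the way Spark 1.4 and earlier did. For example, decimals
    # are written as fixed-length byte arrays, which Hive and Impala use.
    "spark.sql.parquet.writeLegacyFormat": "true",
    # Snappy is a commonly supported Parquet compression codec.
    "spark.sql.parquet.compression.codec": "snappy",
}


def spark_submit_conf_args(conf: Dict[str, str]) -> List[str]:
    """Render settings as `--conf key=value` arguments for spark-submit.

    Hypothetical helper for illustration only. Keys are sorted so the
    generated command line is stable between runs.
    """
    args: List[str] = []
    for key in sorted(conf):
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args


if __name__ == "__main__":
    # Print the arguments so they can be pasted after `spark-submit`.
    print(" ".join(spark_submit_conf_args(COMPAT_PARQUET_CONF)))
```

Keeping these settings in one place would make it easier to check files from
the same job against Hive, Impala, and Spark readers before settling on a
default; whether these values are the safest choice is untested here.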

On Mon, Apr 17, 2017 at 4:12 PM, Mark Grover <mark@apache.org> wrote:

> Thanks, all, for your opinions.
>
> I think it's good to consider two things:
> 1. What do (we think) users care about?
> 2. What's the cost of changing things?
>
> About #1, I think users care more about what format the data is written in
> than how it is written. I'd argue that whether we use Hive, MR, or a custom
> Parquet writer is not as important to them, as long as we maintain
> data/format compatibility.
> About #2, having worked on several projects, I find that it's rather
> difficult to keep up with Parquet. Even in Spark, there are a few different
> ways to write to Parquet: there's a regular mode and a legacy mode
> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L44>,
> which continues to cause confusion
> <https://issues.apache.org/jira/browse/SPARK-20297> to this day. Parquet
> itself is pretty dependent on Hadoop
> <https://github.com/Parquet/parquet-mr/search?l=Maven+POM&q=hadoop&type=&utf8=%E2%9C%93>,
> and just integrating it with systems that have a lot of developers (like Spark
> <https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=spark+parquet+jiras>)
> is still a lot of work.
>
> I personally think we should leverage higher level tools like Hive, or
> Spark to write data in widespread formats (Parquet, being a very good
> example) but I personally wouldn't encourage us to manage the writers
> ourselves.
>
> Thoughts?
> Mark
>
> On Mon, Apr 17, 2017 at 11:44 AM, Michael Ridley <mridley@cloudera.com>
> wrote:
>
> > Without having given it too terribly much thought, that seems like an OK
> > approach.
> >
> > Michael
> >
> > On Mon, Apr 17, 2017 at 2:33 PM, Nathanael Smith <nathanael@apache.org>
> > wrote:
> >
> > > I think the question is rather whether we can write the data generically
> > > to HDFS as Parquet without the use of Hive/Impala.
> > >
> > > Today we write Parquet data using the Hive/MapReduce method.
> > > As part of the redesign I'd like to use libraries for this as opposed to
> > > a Hadoop dependency.
> > > I think it would be preferable to use the Python master to write the data
> > > into the format we want, then do normalization of the data in Spark
> > > Streaming.
> > > Any thoughts?
> > >
> > > - Nathanael
> > >
> > >
> > >
> > > > On Apr 17, 2017, at 11:08 AM, Michael Ridley <mridley@cloudera.com>
> > > wrote:
> > > >
> > > > I had thought that the plan was to write the data in Parquet in HDFS
> > > > ultimately.
> > > >
> > > > Michael
> > > >
> > > > On Sun, Apr 16, 2017 at 11:55 AM, kant kodali <kanth909@gmail.com>
> > > wrote:
> > > >
> > > >> Hi Mark,
> > > >>
> > > >> Thank you so much for hearing my argument. I definitely understand that
> > > >> you guys have a bunch of things to do. My only concern is that I hope it
> > > >> doesn't take too long to support other backends. For example, @Kenneth
> > > >> gave the example that the LAMP stack has not moved away from MySQL yet,
> > > >> which essentially means it's probably been a decade? I see that in the
> > > >> current architecture the results from Python multiprocessing or Spark
> > > >> Streaming are written back to HDFS. If so, can we write them in Parquet
> > > >> format, so that users are able to plug in any query engine? Again, I am
> > > >> not pushing you guys to do this right away or anything; I am just seeing
> > > >> if there is a way for me to get started in parallel. If it's not
> > > >> feasible, that's fine. I just wanted to share my 2 cents, and I am glad
> > > >> my argument is heard!
> > > >>
> > > >> Thanks much!
> > > >>
> > > >> On Fri, Apr 14, 2017 at 1:38 PM, Mark Grover <mark@apache.org>
> wrote:
> > > >>
> > > >>> Hi Kant,
> > > >>> Just wanted to make sure you don't feel like we are ignoring your
> > > >>> comment:-) I hear you about pluggability.
> > > >>>
> > > >>> The design can and should be pluggable, but the project has one stack
> > > >>> it ships with out of the box, one stack that's the default in the sense
> > > >>> that it's the most tested and so on. And, for us, that's our current
> > > >>> stack. If we were to take Apache Hive as an example, it shipped (and
> > > >>> ships) with MapReduce as the default execution engine. At some point,
> > > >>> Apache Tez came along and wanted Hive to run on Tez, so they made a
> > > >>> bunch of things pluggable to run Hive on Tez (instead of the only option
> > > >>> up until then: Hive-on-MR), and then Apache Spark came and re-used some
> > > >>> of that pluggability and even added some more so Hive-on-Spark could
> > > >>> become a reality. In the same way, I don't think anyone here disagrees
> > > >>> that pluggability is a good thing, but it's hard to do pluggability
> > > >>> right, and at the right level, unless one has a clear use case in mind.
> > > >>>
> > > >>> As a project, we have many things to do, and I personally think the
> > > >>> biggest bang for the buck for us in making Spot a really solid, best
> > > >>> cyber security solution isn't pluggability but the things we are
> > > >>> working on: a better user interface, a common/unified approach to
> > > >>> storing and modeling data, etc.
> > > >>>
> > > >>> Having said that, we are open; if it's important to you or someone
> > > >>> else, we'd be happy to receive and review those patches.
> > > >>>
> > > >>> Thanks!
> > > >>> Mark
> > > >>>
> > > >>> On Fri, Apr 14, 2017 at 10:14 AM, kant kodali <kanth909@gmail.com>
> > > >> wrote:
> > > >>>
> > > >>>> Thanks Ross! And yes, option C sounds good to me as well; however, I
> > > >>>> just think the distributed SQL query engine and the resource manager
> > > >>>> should be pluggable.
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <alan.d.ross@intel.com> wrote:
> > > >>>>
> > > >>>>> Option C is to use Python on the front end of the ingest pipeline and
> > > >>>>> Spark/Scala on the back end.
> > > >>>>>
> > > >>>>> Option A uses Python workers on the back end.
> > > >>>>>
> > > >>>>> Option B uses all Scala.
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> -----Original Message-----
> > > >>>>> From: kant kodali [mailto:kanth909@gmail.com]
> > > >>>>> Sent: Friday, April 14, 2017 9:53 AM
> > > >>>>> To: dev@spot.incubator.apache.org
> > > >>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
> > > >>>>>
> > > >>>>> What is option C? Am I missing an email or something?
> > > >>>>>
> > > >>>>> On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <chokha@integralops.com> wrote:
> > > >>>>>
> > > >>>>>> +1 for Python 3.x
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> On 4/14/2017 11:59 AM, Austin Leahy wrote:
> > > >>>>>>
> > > >>>>>>> I think that C is the strong solution; getting the ingest really
> > > >>>>>>> strong is going to lower barriers to adoption. Doing it in Python
> > > >>>>>>> will open up the ingest portion of the project to many more
> > > >>>>>>> developers.
> > > >>>>>>>
> > > >>>>>>> Before it comes up, I would like to throw the following on the
> > > >>>>>>> pile... Major Python projects (Django/Flask, others) are dropping
> > > >>>>>>> 2.x support in releases scheduled in the next 6 to 8 months. Hadoop
> > > >>>>>>> projects in general tend to lag in modern Python support; let's
> > > >>>>>>> please build this in 3.5 so that we don't have to immediately expect
> > > >>>>>>> a rebuild in the pipeline.
> > > >>>>>>>
> > > >>>>>>> -Vote C
> > > >>>>>>>
> > > >>>>>>> Thanks Nate
> > > >>>>>>>
> > > >>>>>>> Austin
> > > >>>>>>>
> > > >>>>>>> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <alan@apache.org>
> > > >> wrote:
> > > >>>>>>>
> > > >>>>>>>> I really like option C because it gives a lot of flexibility for
> > > >>>>>>>> ingest (Python vs. Scala) but still has the robust Spark Streaming
> > > >>>>>>>> back end for performance.
> > > >>>>>>>>
> > > >>>>>>>> Thanks for putting this together Nate.
> > > >>>>>>>>
> > > >>>>>>>> Alan
> > > >>>>>>>>
> > > >>>>>>>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <chokha@integralops.com> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> I agree. We should continue making the existing stack more mature at
> > > >>>>>>>>> this point. Maybe if we have enough community support we can add
> > > >>>>>>>>> additional datastores.
> > > >>>>>>>>>
> > > >>>>>>>>> Chokha.
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> On 4/14/17 11:10 AM, kenneth@floss.cat wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> Hi Kant,
> > > >>>>>>>>>>
> > > >>>>>>>>>> YARN is the standard scheduler in Hadoop. If you're using
> > > >>>>>>>>>> Hive+Spark, then sure you'll have YARN.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Haven't seen any Hive on Mesos so far. As said, Spot is based on a
> > > >>>>>>>>>> quite standard Hadoop stack and I wouldn't switch too many pieces
> > > >>>>>>>>>> yet.
> > > >>>>>>>>>>
> > > >>>>>>>>>> In most open source projects you start by relying on a well-known
> > > >>>>>>>>>> stack and then you begin to support other DB back ends once it's
> > > >>>>>>>>>> quite mature. Think of the loads of LAMP apps which haven't been
> > > >>>>>>>>>> ported away from MySQL yet.
> > > >>>>>>>>>>
> > > >>>>>>>>>> In any case, you'll need high-performance SQL + Massive Storage
> > > >>>>>>>>>> + Machine Learning + Massive Ingestion, and... ATM, that can
> > > >>>>>>>>>> only be provided by Hadoop.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Regards!
> > > >>>>>>>>>>
> > > >>>>>>>>>> Kenneth
> > > >>>>>>>>>>
> > > >>>>>>>>>> On 2017-04-14 12:56, kant kodali wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> Hi Kenneth,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thanks for the response. I think you made a case for HDFS;
> > > >>>>>>>>>>> however, users may want to use S3 or some other FS, in which
> > > >>>>>>>>>>> case they can use Auxilio (hoping that there are no changes
> > > >>>>>>>>>>> needed within Spot, in which case I can agree to that). For
> > > >>>>>>>>>>> example, Netflix stores all their data in S3.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> The distributed SQL query engine, I would say, should be
> > > >>>>>>>>>>> pluggable with whatever the user may want to use, and there are
> > > >>>>>>>>>>> a bunch of them out there. Sure, Impala is better than Hive, but
> > > >>>>>>>>>>> what if users are already using something else like Drill or
> > > >>>>>>>>>>> Presto?
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> I personally would not assume that users are willing to deploy
> > > >>>>>>>>>>> all of that and make their existing stack more complicated; at
> > > >>>>>>>>>>> the very least I would say it is an uphill battle. Things have
> > > >>>>>>>>>>> been changing rapidly in the big data space, so whatever we
> > > >>>>>>>>>>> think is standard won't be standard anymore, but importantly
> > > >>>>>>>>>>> there shouldn't be any reason why we shouldn't be flexible,
> > > >>>>>>>>>>> right?
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Also, I am not sure why only YARN. Why not make that more
> > > >>>>>>>>>>> flexible too, so users can pick Mesos or standalone?
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> I think flexibility is key for wide adoption, rather than the
> > > >>>>>>>>>>> tightly coupled architecture.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thanks!
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza <kenneth@floss.cat> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> PS: you need a big data platform to be able to collect all those
> > > >>>>>>>>>>>> netflows and logs.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Spot isn't intended for SMBs, that's clear. You then need loads
> > > >>>>>>>>>>>> of data to get ML working properly, and somewhere to run those
> > > >>>>>>>>>>>> algorithms. That is Hadoop.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Regards!
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Kenneth
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Sent from my Mi phone
> > > >>>>>>>>>>>> On Apr 14, 2017 4:04 AM, kant kodali <kanth909@gmail.com> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Hi,
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Thanks for starting this thread. Here is my feedback.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> I somehow think the architecture is too complicated for wide
> > > >>>>>>>>>>>> adoption since it requires installing the following:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> HDFS.
> > > >>>>>>>>>>>> HIVE.
> > > >>>>>>>>>>>> IMPALA.
> > > >>>>>>>>>>>> KAFKA.
> > > >>>>>>>>>>>> SPARK (YARN).
> > > >>>>>>>>>>>> YARN.
> > > >>>>>>>>>>>> Zookeeper.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Currently there are way too many dependencies, which discourages
> > > >>>>>>>>>>>> a lot of users from using it because they have to go through
> > > >>>>>>>>>>>> deployment of all that required software. I think for wide
> > > >>>>>>>>>>>> adoption we should minimize the dependencies and have a more
> > > >>>>>>>>>>>> pluggable architecture. For example, I am not sure why both
> > > >>>>>>>>>>>> HIVE & IMPALA are required. Why not just use Spark SQL, since
> > > >>>>>>>>>>>> it's already a dependency? Or, say, users may want to use their
> > > >>>>>>>>>>>> own distributed query engine, such as Apache Drill or something
> > > >>>>>>>>>>>> else. We should be flexible enough to provide that option.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Also, I see that HDFS is used such that collectors can receive
> > > >>>>>>>>>>>> file paths through Kafka and be able to read a file. How big
> > > >>>>>>>>>>>> are these files? Do we really need HDFS for this? Why not
> > > >>>>>>>>>>>> provide more ways to send data, such as sending data directly
> > > >>>>>>>>>>>> through Kafka, or, say, just leaving it up to the user to
> > > >>>>>>>>>>>> specify the file location as an argument to the collector
> > > >>>>>>>>>>>> process?
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Finally, I learnt that to generate NetFlow data one would
> > > >>>>>>>>>>>> require specific hardware. This really means Apache Spot is
> > > >>>>>>>>>>>> not meant for everyone. I thought Apache Spot could be used to
> > > >>>>>>>>>>>> analyze the network traffic of any machine, but if it requires
> > > >>>>>>>>>>>> specific hardware then I think it is targeted at a specific
> > > >>>>>>>>>>>> group of people.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> The real strength of Apache Spot should mainly be just
> > > >>>>>>>>>>>> analyzing network traffic through ML.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Thanks!
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <nathan.l.segerlind@intel.com> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> Thanks, Nate,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Nate.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> -----Original Message-----
> > > >>>>>>>>>>>>> From: Nate Smith [mailto:natedogs911@gmail.com]
> > > >>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:26 PM
> > > >>>>>>>>>>>>> To: user@spot.incubator.apache.org
> > > >>>>>>>>>>>>> Cc: dev@spot.incubator.apache.org; private@spot.incubator.apache.org
> > > >>>>>>>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> I was really hoping it came through OK. Oh well :) Here it is
> > > >>>>>>>>>>>>> in image form:
> > > >>>>>>>>>>>>> http://imgur.com/a/DUDsD
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <nathan.l.segerlind@intel.com> wrote:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> The diagram became garbled in the text format.
> > > >>>>>>>>>>>>>> Could you resend it as a PDF?
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>> Nate
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> -----Original Message-----
> > > >>>>>>>>>>>>>> From: Nathanael Smith [mailto:nathanael@apache.org]
> > > >>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:01 PM
> > > >>>>>>>>>>>>>> To: private@spot.incubator.apache.org; dev@spot.incubator.apache.org; user@spot.incubator.apache.org
> > > >>>>>>>>>>>>>> Subject: [Discuss] - Future plans for Spot-ingest
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> How would you like to see Spot-ingest change?
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> A. Continue development on the Python Master/Worker with focus
> > > >>>>>>>>>>>>>>    on performance / error handling / logging
> > > >>>>>>>>>>>>>> B. Develop Scala-based ingest to be in line with the code base
> > > >>>>>>>>>>>>>>    from ingest, ML, to OA (UI to continue being IPython/JS)
> > > >>>>>>>>>>>>>> C. Python ingest Worker with Scala-based Spark code for
> > > >>>>>>>>>>>>>>    normalization and input into DB
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Including the high level diagram:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> [ASCII diagram, garbled in the archive. Recoverable labels:
> > > >>>>>>>>>>>>>> Master (A. Python, B. Scala, C. Python), with the note
> > > >>>>>>>>>>>>>> "Running on a worker node in the Hadoop cluster"; Worker
> > > >>>>>>>>>>>>>> (A. Python); Spark Streaming (B. Scala, C. Scala); Local FS
> > > >>>>>>>>>>>>>> (Binary/Text Log files); hdfs; Hive / Impala (Parquet). An
> > > >>>>>>>>>>>>>> image version is linked above: http://imgur.com/a/DUDsD]
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Please let me know your thoughts,
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> - Nathanael
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Michael Ridley <mridley@cloudera.com>
> > > > office: (650) 352-1337
> > > > mobile: (571) 438-2420
> > > > Senior Solutions Architect
> > > > Cloudera, Inc.
> > >
> > >
> >
> >
> > --
> > Michael Ridley <mridley@cloudera.com>
> > office: (650) 352-1337
> > mobile: (571) 438-2420
> > Senior Solutions Architect
> > Cloudera, Inc.
> >
>



-- 
Michael Ridley <mridley@cloudera.com>
office: (650) 352-1337
mobile: (571) 438-2420
Senior Solutions Architect
Cloudera, Inc.
