spot-dev mailing list archives

From Alan Ross <a...@apache.org>
Subject Re: [Discuss] - Future plans for Spot-ingest
Date Fri, 14 Apr 2017 15:51:56 GMT
I really like option C because it gives a lot of flexibility for ingest
(python vs scala) but still has the robust spark streaming backend for
performance.

Thanks for putting this together Nate.

Alan

On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <
chokha@integralops.com> wrote:

> I agree. We should continue making the existing stack more mature at
> this point. Maybe if we have enough community support we can add
> additional datastores.
>
> Chokha.
>
>
> On 4/14/17 11:10 AM, kenneth@floss.cat wrote:
> > Hi Kant,
> >
> >
> > YARN is the standard scheduler in Hadoop. If you're using Hive+Spark,
> > then sure you'll have YARN.
> >
> > Haven't seen any HIVE on Mesos so far. As said, Spot is based on a
> > quite standard Hadoop stack and I wouldn't switch too many pieces yet.
> >
> > In most open source projects you start by relying on a well-known stack
> > and then begin to support other DB backends once it's quite
> > mature. Think of the loads of LAMP apps which haven't been ported away
> > from MySQL yet.
> >
> > In any case, you'll need high-performance SQL + massive storage +
> > machine learning + massive ingestion, and... ATM, that can only be
> > provided by Hadoop.
> >
> > Regards!
> >
> > Kenneth
> >
> > On 2017-04-14 12:56, kant kodali wrote:
> >> Hi Kenneth,
> >>
> >> Thanks for the response. I think you made a case for HDFS; however, users
> >> may want to use S3 or some other FS, in which case they can use Alluxio
> >> (hoping that no changes are needed within Spot, in which case I can
> >> agree to that). For example, Netflix stores all their data in S3.
> >>
> >> The distributed SQL query engine, I would say, should be pluggable with
> >> whatever the user may want to use, and there are a bunch of them out there.
> >> Sure, Impala is better than Hive, but what if users are already using
> >> something else like Drill or Presto?
> >>
> >> Personally, I would not assume that users are willing to deploy all of
> >> that and make their existing stack more complicated; at the very least, I
> >> would say it is an uphill battle. Things have been changing rapidly in the
> >> big data space, so whatever we think is standard won't be standard anymore.
> >> More importantly, there shouldn't be any reason why we shouldn't be
> >> flexible, right?
> >>
> >> Also, I am not sure why only YARN. Why not make that more flexible too, so
> >> users can pick Mesos or standalone?
> >>
> >> I think flexibility is key to wide adoption, rather than a tightly
> >> coupled architecture.
> >>
> >> Thanks!
> >>
> >> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza <kenneth@floss.cat>
> >> wrote:
> >>
> >>> PS: you need a big data platform to be able to collect all those netflows
> >>> and logs.
> >>>
> >>> Spot isn't intended for SMBs, that's clear. You also need loads of data to
> >>> get ML working properly, and somewhere to run those algorithms. That is
> >>> Hadoop.
> >>>
> >>> Regards!
> >>>
> >>> Kenneth
> >>>
> >>>
> >>>
> >>> Sent from my Mi phone
> >>> On Apr 14, 2017 4:04 AM, kant kodali <kanth909@gmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Thanks for starting this thread. Here is my feedback.
> >>>
> >>> I somehow think the architecture is too complicated for wide adoption,
> >>> since it requires installing the following:
> >>>
> >>> HDFS.
> >>> HIVE.
> >>> IMPALA.
> >>> KAFKA.
> >>> SPARK (YARN).
> >>> YARN.
> >>> Zookeeper.
> >>>
> >>> Currently there are way too many dependencies, which discourages a lot of
> >>> users from using it because they have to go through deployment of all
> >>> that required software. I think for wide adoption we should minimize the
> >>> dependencies and have a more pluggable architecture. For example, I am
> >>> not sure why both HIVE and IMPALA are required. Why not just use Spark
> >>> SQL, since it's already a dependency? Or users may want to use a
> >>> distributed query engine of their own choice, such as Apache Drill or
> >>> something else. We should be flexible enough to provide that option.
> >>>
> >>> Also, I see that HDFS is used so that collectors can receive file paths
> >>> through Kafka and then read the file. How big are these files? Do we
> >>> really need HDFS for this? Why not provide more ways to send data, such as
> >>> sending data directly through Kafka, or just leaving it up to the user to
> >>> specify the file location as an argument to the collector process?
> >>>
> >>> Finally, I learnt that generating NetFlow data requires specific
> >>> hardware. This really means Apache Spot is not meant for everyone.
> >>> I thought Apache Spot could be used to analyze the network traffic of any
> >>> machine, but if it requires specific hardware then I think it is
> >>> targeted at a specific group of people.
> >>>
> >>> The real strength of Apache Spot should mainly be just analyzing network
> >>> traffic through ML.
> >>>
> >>> Thanks!
> >>>
> >>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <
> >>> nathan.l.segerlind@intel.com> wrote:
> >>>
> >>> > Thanks, Nate,
> >>> >
> >>> > Nate.
> >>> >
> >>> >
> >>> > -----Original Message-----
> >>> > From: Nate Smith [mailto:natedogs911@gmail.com]
> >>> > Sent: Thursday, April 13, 2017 4:26 PM
> >>> > To: user@spot.incubator.apache.org
> >>> > Cc: dev@spot.incubator.apache.org; private@spot.incubator.apache.org
> >>> > Subject: Re: [Discuss] - Future plans for Spot-ingest
> >>> >
> >>> > I was really hoping it came through ok,
> >>> > Oh well :)
> >>> > Here’s an image form:
> >>> > http://imgur.com/a/DUDsD
> >>> >
> >>> >
> >>> > > On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <
> >>> > nathan.l.segerlind@intel.com> wrote:
> >>> > >
> >>> > > The diagram became garbled in the text format.
> >>> > > Could you resend it as a pdf?
> >>> > >
> >>> > > Thanks,
> >>> > > Nate
> >>> > >
> >>> > > -----Original Message-----
> >>> > > From: Nathanael Smith [mailto:nathanael@apache.org]
> >>> > > Sent: Thursday, April 13, 2017 4:01 PM
> >>> > > To: private@spot.incubator.apache.org; dev@spot.incubator.apache.org;
> >>> > >     user@spot.incubator.apache.org
> >>> > > Subject: [Discuss] - Future plans for Spot-ingest
> >>> > >
> >>> > > How would you like to see Spot-ingest change?
> >>> > >
> >>> > > A. continue development on the Python Master/Worker with focus on
> >>> > >    performance / error handling / logging
> >>> > > B. Develop Scala based ingest to be inline with code base from ingest,
> >>> > >    ml, to OA (UI to continue being ipython/JS)
> >>> > > C. Python ingest Worker with Scala based Spark code for normalization
> >>> > >    and input into DB
> >>> > >
> >>> > > Including the high level diagram:
> >>> > > +------------------------------------------------------------------------------------------+
> >>> > > | +--------------------------+                                  +-----------------+        |
> >>> > > | | Master                   |  A. B. C.                        | Worker          |        |
> >>> > > | |    A. Python             +---------------+      A.          |   A. Python     |        |
> >>> > > | |    B. Scala              |               |    +------------->                 +----+   |
> >>> > > | |    C. Python             |               |    |             |                 |    |   |
> >>> > > | +---^------+---------------+               |    |             +-----------------+    |   |
> >>> > > |     |      |                               |    |                                    |   |
> >>> > > |     |      |                               |    |                                    |   |
> >>> > > |     |     +Note--------------+             |    |             +-----------------+    |   |
> >>> > > |     |     |Running on a      |             |    |             | Spark Streaming |    |   |
> >>> > > |     |     |worker node in    |             |    |      B. C.  | B. Scala        |    |   |
> >>> > > |     |     |the Hadoop cluster|             |    |    +--------> C. Scala        +-+  |   |
> >>> > > |     |     +------------------+             |    |    |        |                 | |  |   |
> >>> > > |   A.|                                      |    |    |        +-----------------+ |  |   |
> >>> > > |   B.|                                      |    |    |                            |  |   |
> >>> > > |   C.|                                      |    |    |                            |  |   |
> >>> > > | +----------------------+          +-v------+----+----+-+           +--------------v--v-+ |
> >>> > > | |                      |          |                    |           |                   | |
> >>> > > | |   Local FS:          |          |    hdfs            |           |  Hive / Impala    | |
> >>> > > | |  - Binary/Text       |          |                    |           |   - Parquet -     | |
> >>> > > | |    Log files -       |          |                    |           |                   | |
> >>> > > | |                      |          |                    |           |                   | |
> >>> > > | +----------------------+          +--------------------+           +-------------------+ |
> >>> > > +------------------------------------------------------------------------------------------+
> >>> > >
> >>> > > Please let me know your thoughts,
> >>> > >
> >>> > > - Nathanael
> >>> > >
> >>> > >
> >>> > >
> >>> >
> >>> >
> >>>
> >>>
> >
>
>
