spot-dev mailing list archives

From kenn...@floss.cat
Subject Re: [Discuss] - Future plans for Spot-ingest
Date Fri, 14 Apr 2017 17:50:19 GMT
+1


On 2017-04-14 17:59, Austin Leahy wrote:
> I think that C is the strong solution; getting the ingest really strong is
> going to lower barriers to adoption. Doing it in Python will open up the
> ingest portion of the project to include many more developers.
> 
> Before it comes up I would like to throw the following on the pile... Major
> Python projects (Django, Flask, and others) are dropping 2.x support in
> releases scheduled in the next 6 to 8 months. Hadoop projects in general tend
> to lag in modern Python support, so let's please build this in 3.5 so that we
> don't have to immediately expect a rebuild in the pipeline.
> 
> -Vote C
> 
> Thanks Nate
> 
> Austin
> 
> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <alan@apache.org> wrote:
> 
>> I really like option C because it gives a lot of flexibility for ingest
>> (python vs scala) but still has the robust spark streaming backend for
>> performance.
>> 
>> Thanks for putting this together Nate.
>> 
>> Alan
>> 
>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <
>> chokha@integralops.com> wrote:
>> 
>> > I agree. We should continue making the existing stack more mature at
>> > this point. Maybe if we have enough community support we can add
>> > additional datastores.
>> >
>> > Chokha.
>> >
>> >
>> > On 4/14/17 11:10 AM, kenneth@floss.cat wrote:
>> > > Hi Kant,
>> > >
>> > >
>> > > YARN is the standard scheduler in Hadoop. If you're using Hive+Spark,
>> > > then sure you'll have YARN.
>> > >
>> > > Haven't seen any HIVE on Mesos so far. As said, Spot is based on a
>> > > quite standard Hadoop stack and I wouldn't switch too many pieces yet.
>> > >
>> > > In most open source projects you start by relying on a well-known stack
>> > > and then you begin to support other DB backends once it's quite
>> > > mature. Think of the loads of LAMP apps which haven't been ported away
>> > > from MySQL yet.
>> > >
>> > > In any case, you'll need high-performance SQL + Massive Storage +
>> > > Machine Learning + Massive Ingestion, and... ATM, that can only be
>> > > provided by Hadoop.
>> > >
>> > > Regards!
>> > >
>> > > Kenneth
>> > >
>> > > On 2017-04-14 12:56, kant kodali wrote:
>> > >> Hi Kenneth,
>> > >>
>> > >> Thanks for the response. I think you made a case for HDFS; however,
>> > >> users may want to use S3 or some other FS, in which case they can use
>> > >> Alluxio (hoping that no changes are needed within Spot, in which case
>> > >> I can agree to that). For example, Netflix stores all their data in S3.
>> > >>
>> > >> The distributed SQL query engine, I would say, should be pluggable with
>> > >> whatever users may want to use, and there are a bunch of them out there.
>> > >> Sure, Impala is better than Hive, but what if users are already using
>> > >> something else like Drill or Presto?
>> > >>
>> > >> Personally, I would not assume that users are willing to deploy all of
>> > >> that and make their existing stack more complicated; at the very least
>> > >> I would say it is an uphill battle. Things have been changing rapidly in
>> > >> the Big Data space, so whatever we think is standard won't be standard
>> > >> anymore. More importantly, there shouldn't be any reason why we
>> > >> shouldn't be flexible, right?
>> > >>
>> > >> Also, I am not sure why only YARN. Why not make that more flexible too,
>> > >> so users can pick Mesos or standalone?
>> > >>
>> > >> I think flexibility is key for wide adoption, rather than a tightly
>> > >> coupled architecture.
>> > >>
>> > >> Thanks!
>> > >>
>> > >> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza <kenneth@floss.cat>
>> > >> wrote:
>> > >>
>> > >>> PS: you need a big data platform to be able to collect all those
>> > >>> netflows and logs.
>> > >>>
>> > >>> Spot isn't intended for SMBs, that's clear. You need loads of data to
>> > >>> get ML working properly, and somewhere to run those algorithms. That
>> > >>> is Hadoop.
>> > >>>
>> > >>> Regards!
>> > >>>
>> > >>> Kenneth
>> > >>>
>> > >>>
>> > >>>
>> > >>> Sent from my Mi phone
>> > >>> On Apr 14, 2017 4:04 AM, kant kodali <kanth909@gmail.com> wrote:
>> > >>>
>> > >>> Hi,
>> > >>>
>> > >>> Thanks for starting this thread. Here is my feedback.
>> > >>>
>> > >>> I somehow think the architecture is too complicated for wide adoption,
>> > >>> since it requires installing the following:
>> > >>>
>> > >>> HDFS.
>> > >>> HIVE.
>> > >>> IMPALA.
>> > >>> KAFKA.
>> > >>> SPARK (YARN).
>> > >>> YARN.
>> > >>> Zookeeper.
>> > >>>
>> > >>> Currently there are way too many dependencies, which discourages a lot
>> > >>> of users from using it because they have to go through the deployment
>> > >>> of all that required software. I think for wide adoption we should
>> > >>> minimize the dependencies and have a more pluggable architecture. For
>> > >>> example, I am not sure why Hive and Impala are both required. Why not
>> > >>> just use Spark SQL, since it's already a dependency? Or users may want
>> > >>> to use their own distributed query engine, such as Apache Drill or
>> > >>> something else. We should be flexible enough to provide that option.
>> > >>>
>> > >>> Also, I see that HDFS is used so that collectors can receive file
>> > >>> paths through Kafka and then read the file. How big are these files?
>> > >>> Do we really need HDFS for this? Why not provide more ways to send
>> > >>> data, such as sending it directly through Kafka, or just leaving it up
>> > >>> to the user to specify the file location as an argument to the
>> > >>> collector process?
>> > >>> Finally, I learnt that generating NetFlow data requires specific
>> > >>> hardware. This really means Apache Spot is not meant for everyone.
>> > >>> I thought Apache Spot could be used to analyze the network traffic of
>> > >>> any machine, but if it requires specific hardware then I think it is
>> > >>> targeted at a specific group of people.
>> > >>>
>> > >>> The real strength of Apache Spot should mainly be just analyzing
>> > >>> network traffic through ML.
>> > >>>
>> > >>> Thanks!
>> > >>>
>> > >>>
>> > >>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <
>> > >>> nathan.l.segerlind@intel.com> wrote:
>> > >>>
>> > >>> > Thanks, Nate,
>> > >>> >
>> > >>> > Nate.
>> > >>> >
>> > >>> >
>> > >>> > -----Original Message-----
>> > >>> > From: Nate Smith [mailto:natedogs911@gmail.com]
>> > >>> > Sent: Thursday, April 13, 2017 4:26 PM
>> > >>> > To: user@spot.incubator.apache.org
>> > >>> > Cc: dev@spot.incubator.apache.org; private@spot.incubator.apache.org
>> > >>> > Subject: Re: [Discuss] - Future plans for Spot-ingest
>> > >>> >
>> > >>> > I was really hoping it came through ok,
>> > >>> > Oh well :)
>> > >>> > Here’s an image form:
>> > >>> > http://imgur.com/a/DUDsD
>> > >>> >
>> > >>> >
>> > >>> > > On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <
>> > >>> > nathan.l.segerlind@intel.com> wrote:
>> > >>> > >
>> > >>> > > The diagram became garbled in the text format.
>> > >>> > > Could you resend it as a pdf?
>> > >>> > >
>> > >>> > > Thanks,
>> > >>> > > Nate
>> > >>> > >
>> > >>> > > -----Original Message-----
>> > >>> > > From: Nathanael Smith [mailto:nathanael@apache.org]
>> > >>> > > Sent: Thursday, April 13, 2017 4:01 PM
>> > >>> > > To: private@spot.incubator.apache.org; dev@spot.incubator.apache.org; user@spot.incubator.apache.org
>> > >>> > > Subject: [Discuss] - Future plans for Spot-ingest
>> > >>> > >
>> > >>> > > How would you like to see Spot-ingest change?
>> > >>> > >
>> > >>> > > A. Continue development on the Python Master/Worker with focus on
>> > >>> > >    performance / error handling / logging
>> > >>> > > B. Develop Scala based ingest to be inline with code base from ingest,
>> > >>> > >    ml, to OA (UI to continue being ipython/JS)
>> > >>> > > C. Python ingest Worker with Scala based Spark code for normalization
>> > >>> > >    and input into DB
>> > >>> > >
>> > >>> > > Including the high level diagram:
>> > >>> > > +------------------------------------------------------------------------------------------+
>> > >>> > > | +--------------------------+                                  +-----------------+        |
>> > >>> > > | | Master                   |  A. B. C.                        | Worker          |        |
>> > >>> > > | |    A. Python             +---------------+      A.          |   A. Python     |        |
>> > >>> > > | |    B. Scala              |               |    +------------->                 +----+   |
>> > >>> > > | |    C. Python             |               |    |             |                 |    |   |
>> > >>> > > | +---^------+---------------+               |    |             +-----------------+    |   |
>> > >>> > > |     |      |                               |    |                                    |   |
>> > >>> > > |     |      |                               |    |                                    |   |
>> > >>> > > |     |     +Note--------------+             |    |             +-----------------+    |   |
>> > >>> > > |     |     |Running on a      |             |    |             | Spark Streaming |    |   |
>> > >>> > > |     |     |worker node in    |             |    |     B. C.   | B. Scala        |    |   |
>> > >>> > > |     |     |the Hadoop cluster|             |    |    +--------> C. Scala        +-+  |   |
>> > >>> > > |     |     +------------------+             |    |    |        |                 | |  |   |
>> > >>> > > |   A.|                                      |    |    |        +-----------------+ |  |   |
>> > >>> > > |   B.|                                      |    |    |                            |  |   |
>> > >>> > > |   C.|                                      |    |    |                            |  |   |
>> > >>> > > | +----------------------+          +-v------+----+----+-+           +--------------v--v-+ |
>> > >>> > > | |                      |          |                    |           |                   | |
>> > >>> > > | |   Local FS:          |          |    hdfs            |           | Hive / Impala     | |
>> > >>> > > | |  - Binary/Text       |          |                    |           |  - Parquet -      | |
>> > >>> > > | |    Log files -       |          |                    |           |                   | |
>> > >>> > > | |                      |          |                    |           |                   | |
>> > >>> > > | +----------------------+          +--------------------+           +-------------------+ |
>> > >>> > > +------------------------------------------------------------------------------------------+
>> > >>> > >
>> > >>> > > Please let me know your thoughts,
>> > >>> > >
>> > >>> > > - Nathanael
>> > >>> > >
>> > >>> > >
>> > >>> > >
>> > >>> >
>> > >>> >
>> > >>>
>> > >>>
>> > >
>> >
>> >
>> 
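
One way to read kant's suggestion about collector inputs (a file path announced through Kafka, the payload sent directly through Kafka, or a file location passed as an argument) is as one small interface with interchangeable sources. The sketch below is only an illustration in plain Python 3.8+: `RecordSource`, `LocalFileSource`, `InlineMessageSource`, and `collect` are hypothetical names, not part of Spot-ingest, and no Kafka or HDFS client is used. The message-based source is a stand-in built on an in-memory list.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator, List, Protocol


class RecordSource(Protocol):
    """Anything a collector can pull raw records from (hypothetical interface)."""

    def records(self) -> Iterator[bytes]:
        ...


@dataclass
class LocalFileSource:
    """File location given directly, e.g. as a command-line argument."""

    path: Path

    def records(self) -> Iterator[bytes]:
        # Stream the file line by line so large captures are not loaded at once.
        with self.path.open("rb") as fh:
            for line in fh:
                yield line.rstrip(b"\r\n")


@dataclass
class InlineMessageSource:
    """Payloads carried in the messages themselves (in-memory stand-in for a topic)."""

    messages: List[bytes]

    def records(self) -> Iterator[bytes]:
        yield from self.messages


def collect(source: RecordSource) -> List[bytes]:
    """Drain a source, skipping empty records; the caller never sees the transport."""
    return [record for record in source.records() if record]
```

With this shape, `collect(LocalFileSource(Path("flows.txt")))` and `collect(InlineMessageSource([b"a,1", b"b,2"]))` go through the same code path, so adding another transport would mean adding one more class rather than changing the collector.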

