spot-dev mailing list archives

From kant kodali <kanth...@gmail.com>
Subject Re: [Discuss] - Future plans for Spot-ingest
Date Sun, 16 Apr 2017 15:55:00 GMT
Hi Mark,

Thank you so much for hearing my argument. And I definitely understand that
you guys have a bunch of things to do. My only concern is that I hope it
doesn't take too long to support other backends. For example, @Kenneth gave
the example of LAMP stack apps that have not moved away from MySQL yet, which
essentially means it has probably been a decade? I see that in the current
architecture the results from Python multiprocessing or Spark
Streaming are written back to HDFS. If so, can we write them in Parquet
format, so that users can plug in any query engine? Again, I am not
pushing you guys to do this right away or anything; I am just
seeing if there is a way for me to get started in parallel. If it is not
feasible, that's fine. I just wanted to share my 2 cents, and I am glad my
argument is heard!
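
To make the pluggability idea concrete, here is a minimal sketch of an output
layer where the serialization format is chosen by name at run time. Everything
in it (ResultWriter, WRITERS, register, get_writer) is a hypothetical name for
illustration and is not taken from the Spot code base. JSON lines and CSV stand
in for Parquet only because they are in the Python standard library; a Parquet
writer would need a third-party package such as pyarrow and would be registered
the same way.

```python
import csv
import io
import json
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Type


class ResultWriter(ABC):
    """Hypothetical interface: turns ingest result rows into bytes."""

    @abstractmethod
    def write(self, rows: Iterable[dict]) -> bytes:
        ...


# Registry mapping a format name to the writer class that handles it.
WRITERS: Dict[str, Type[ResultWriter]] = {}


def register(name: str):
    """Class decorator that adds a writer to the registry under `name`."""
    def decorator(cls):
        WRITERS[name] = cls
        return cls
    return decorator


@register("jsonl")
class JsonLinesWriter(ResultWriter):
    def write(self, rows):
        # One JSON object per line, keys sorted for stable output.
        lines = (json.dumps(row, sort_keys=True) + "\n" for row in rows)
        return "".join(lines).encode("utf-8")


@register("csv")
class CsvWriter(ResultWriter):
    def write(self, rows):
        rows = list(rows)
        if not rows:
            return b""
        buf = io.StringIO()
        # Column order is taken from the sorted keys of the first row.
        out = csv.DictWriter(buf, fieldnames=sorted(rows[0]), lineterminator="\n")
        out.writeheader()
        out.writerows(rows)
        return buf.getvalue().encode("utf-8")


def get_writer(name: str) -> ResultWriter:
    """Look up a writer by the name given in (hypothetical) configuration."""
    try:
        return WRITERS[name]()
    except KeyError:
        raise ValueError(f"unknown output format: {name!r}") from None
```

With this shape, the ingest code would only call get_writer(...) with a
configured name, and supporting another format means adding one registered
class rather than changing the pipeline.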

Thanks much!

On Fri, Apr 14, 2017 at 1:38 PM, Mark Grover <mark@apache.org> wrote:

> Hi Kant,
> Just wanted to make sure you don't feel like we are ignoring your
> comment:-) I hear you about pluggability.
>
> The design can and should be pluggable but the project has one stack it
> ships out of the box with, one stack that's the default stack in the sense
> that it's the most tested and so on. And, for us, that's our current stack.
> If we were to take Apache Hive as an example, it shipped (and ships) with
> MapReduce as the default configuration engine. At some point, Apache Tez
> came along and wanted Hive to run on Tez, so they made a bunch of things
> pluggable to run Hive on Tez (instead of the only option up-until then:
> Hive-on-MR) and then Apache Spark came and re-used some of that
> pluggability and even added some more so Hive-on-Spark could become a
> reality. In the same way, I don't think anyone disagrees here that
> pluggability is a good thing but it's hard to do pluggability right, and at
> the right level, unless one has a clear use-case in mind.
>
> As a project, we have many things to do and I personally think the biggest
> bang for the buck for us in making Spot a really solid and the best cyber
> security solution isn't pluggability but the things we are working on - a
> better user interface, a common/unified approach to storing and modeling
> data, etc.
>
> Having said that, we are open, if it's important to you or someone else,
> we'd be happy to receive and review those patches.
>
> Thanks!
> Mark
>
> On Fri, Apr 14, 2017 at 10:14 AM, kant kodali <kanth909@gmail.com> wrote:
>
> > Thanks Ross! And yes, option C sounds good to me as well; however, I just
> > think the distributed SQL query engine and the resource manager should be
> > pluggable.
> >
> >
> >
> >
> > On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <alan.d.ross@intel.com>
> > wrote:
> >
> > > Option C is to use python on the front end of ingest pipeline and
> > > spark/scala on the back end.
> > >
> > > Option A uses python workers on the backend
> > >
> > > Option B uses all scala.
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: kant kodali [mailto:kanth909@gmail.com]
> > > Sent: Friday, April 14, 2017 9:53 AM
> > > To: dev@spot.incubator.apache.org
> > > Subject: Re: [Discuss] - Future plans for Spot-ingest
> > >
> > > What is option C ? am I missing an email or something?
> > >
> > > On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <
> > > chokha@integralops.com> wrote:
> > >
> > > > +1 for Python 3.x
> > > >
> > > >
> > > >
> > > > On 4/14/2017 11:59 AM, Austin Leahy wrote:
> > > >
> > > > >> I think that C is the strong solution, getting the ingest really
> > > > >> strong is going to lower barriers to adoption. Doing it in Python
> > > > >> will open up the ingest portion of the project to include many more
> > > > >> developers.
> > > >>
> > > > >> Before it comes up I would like to throw the following on the pile...
> > > > >> Major python projects (django/flask, others) are dropping 2.x support
> > > > >> in releases scheduled in the next 6 to 8 months. Hadoop projects in
> > > > >> general tend to lag in modern python support, so let's please build
> > > > >> this in 3.5 so that we don't have to immediately expect a rebuild in
> > > > >> the pipeline.
> > > >>
> > > >> -Vote C
> > > >>
> > > >> Thanks Nate
> > > >>
> > > >> Austin
> > > >>
> > > > >> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <alan@apache.org> wrote:
> > > > >>
> > > > >>> I really like option C because it gives a lot of flexibility for
> > > > >>> ingest (python vs scala) but still has the robust spark streaming
> > > > >>> backend for performance.
> > > >>>
> > > >>> Thanks for putting this together Nate.
> > > >>>
> > > >>> Alan
> > > >>>
> > > >>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <
> > > >>> chokha@integralops.com> wrote:
> > > >>>
> > > > >>>> I agree. We should continue making the existing stack more mature at
> > > > >>>> this point. Maybe if we have enough community support we can add
> > > > >>>> additional datastores.
> > > >>>>
> > > >>>> Chokha.
> > > >>>>
> > > >>>>
> > > >>>> On 4/14/17 11:10 AM, kenneth@floss.cat wrote:
> > > >>>>
> > > >>>>> Hi Kant,
> > > >>>>>
> > > >>>>>
> > > > >>>>> YARN is the standard scheduler in Hadoop. If you're using
> > > > >>>>> Hive+Spark, then sure you'll have YARN.
> > > > >>>>>
> > > > >>>>> Haven't seen any Hive on Mesos so far. As said, Spot is based on a
> > > > >>>>> quite standard Hadoop stack and I wouldn't switch too many pieces
> > > > >>>>> yet.
> > > > >>>>>
> > > > >>>>> In most open source projects you start relying on a well-known
> > > > >>>>> stack and then you begin to support other DB backends once it's
> > > > >>>>> quite mature. Think of the loads of LAMP apps which haven't been
> > > > >>>>> ported away from MySQL yet.
> > > > >>>>>
> > > > >>>>> In any case, you'll need a high performance SQL + Massive Storage
> > > > >>>>> + Machine Learning + Massive Ingestion, and... ATM, that can be
> > > > >>>>> only provided by Hadoop.
> > > >>>>>
> > > >>>>> Regards!
> > > >>>>>
> > > >>>>> Kenneth
> > > >>>>>
> > > >>>>> A 2017-04-14 12:56, kant kodali escrigué:
> > > >>>>>
> > > >>>>>> Hi Kenneth,
> > > >>>>>>
> > > > >>>>>> Thanks for the response. I think you made a case for HDFS;
> > > > >>>>>> however, users may want to use S3 or some other FS, in which case
> > > > >>>>>> they can use Auxilio (hoping that there are no changes needed
> > > > >>>>>> within Spot, in which case I can agree to that). For example,
> > > > >>>>>> Netflix stores all their data in S3.
> > > > >>>>>>
> > > > >>>>>> The distributed SQL query engine, I would say, should be pluggable
> > > > >>>>>> with whatever users may want to use, and there are a bunch of them
> > > > >>>>>> out there. Sure, Impala is better than Hive, but what if users are
> > > > >>>>>> already using something else like Drill or Presto?
> > > > >>>>>>
> > > > >>>>>> Me personally, I would not assume that users are willing to deploy
> > > > >>>>>> all of that and make their existing stack more complicated; at the
> > > > >>>>>> very least I would say it is an uphill battle. Things have been
> > > > >>>>>> changing rapidly in the big data space, so whatever we think is
> > > > >>>>>> standard won't be standard anymore, but importantly there
> > > > >>>>>> shouldn't be any reason why we shouldn't be flexible, right?
> > > > >>>>>>
> > > > >>>>>> Also, I am not sure why only YARN? Why not make that also more
> > > > >>>>>> flexible so users can pick Mesos or standalone?
> > > > >>>>>>
> > > > >>>>>> I think flexibility is key for wide adoption, rather than the
> > > > >>>>>> tightly coupled architecture.
> > > >>>>>>
> > > >>>>>> Thanks!
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza
> > > >>>>>> <kenneth@floss.cat>
> > > >>>>>> wrote:
> > > >>>>>>
> > > > >>>>>>> PS: you need a big data platform to be able to collect all those
> > > > >>>>>>> netflows and logs.
> > > > >>>>>>>
> > > > >>>>>>> Spot isn't intended for SMBs, that's clear, then you need loads
> > > > >>>>>>> of data to get ML working properly, and somewhere to run those
> > > > >>>>>>> algorithms. That is Hadoop.
> > > >>>>>>>
> > > >>>>>>> Regards!
> > > >>>>>>>
> > > >>>>>>> Kenneth
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> Sent from my Mi phone
> > > > >>>>>>> On kant kodali <kanth909@gmail.com>, Apr 14, 2017 4:04 AM wrote:
> > > >>>>>>>
> > > >>>>>>> Hi,
> > > >>>>>>>
> > > >>>>>>> Thanks for starting this thread. Here is my feedback.
> > > >>>>>>>
> > > > >>>>>>> I somehow think the architecture is too complicated for wide
> > > > >>>>>>> adoption since it requires installing the following:
> > > >>>>>>>
> > > >>>>>>> HDFS.
> > > >>>>>>> HIVE.
> > > >>>>>>> IMPALA.
> > > >>>>>>> KAFKA.
> > > >>>>>>> SPARK (YARN).
> > > >>>>>>> YARN.
> > > >>>>>>> Zookeeper.
> > > >>>>>>>
> > > > >>>>>>> Currently there are way too many dependencies, which discourages
> > > > >>>>>>> a lot of users from using it because they have to go through
> > > > >>>>>>> deployment of all that required software. I think for wide
> > > > >>>>>>> adoption we should minimize the dependencies and have a more
> > > > >>>>>>> pluggable architecture. For example, I am not sure why HIVE &
> > > > >>>>>>> IMPALA are both required. Why not just use Spark SQL, since it's
> > > > >>>>>>> already a dependency? Or, say, users may want to use their own
> > > > >>>>>>> distributed query engine they like, such as Apache Drill or
> > > > >>>>>>> something else. We should be flexible enough to provide that
> > > > >>>>>>> option.
> > > >>>>>>>
> > > > >>>>>>> Also, I see that HDFS is used such that collectors can receive
> > > > >>>>>>> file paths through Kafka and be able to read a file. How big
> > > > >>>>>>> are these files? Do we really need HDFS for this? Why not provide
> > > > >>>>>>> more ways to send data, such as sending data directly through
> > > > >>>>>>> Kafka, or, say, just leaving it up to the user to specify the
> > > > >>>>>>> file location as an argument to the collector process?
> > > >>>>>>>
> > > > >>>>>>> Finally, I learnt that to generate NetFlow data one would
> > > > >>>>>>> require specific hardware. This really means Apache Spot is
> > > > >>>>>>> not meant for everyone. I thought Apache Spot could be used to
> > > > >>>>>>> analyze the network traffic of any machine, but if it requires
> > > > >>>>>>> specific hardware then I think it is targeted at a specific
> > > > >>>>>>> group of people.
> > > > >>>>>>>
> > > > >>>>>>> The real strength of Apache Spot should mainly be just analyzing
> > > > >>>>>>> network traffic through ML.
> > > >>>>>>>
> > > >>>>>>> Thanks!
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > > >>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <
> > > >>>>>>> nathan.l.segerlind@intel.com> wrote:
> > > >>>>>>>
> > > >>>>>>> Thanks, Nate,
> > > >>>>>>>>
> > > >>>>>>>> Nate.
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> -----Original Message-----
> > > >>>>>>>> From: Nate Smith [mailto:natedogs911@gmail.com]
> > > >>>>>>>> Sent: Thursday, April 13, 2017 4:26 PM
> > > >>>>>>>> To: user@spot.incubator.apache.org
> > > > >>>>>>>> Cc: dev@spot.incubator.apache.org; private@spot.incubator.apache.org
> > > > >>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
> > > >>>>>>>>
> > > > >>>>>>>> I was really hoping it came through ok, Oh well :) Here’s an image
> > > > >>>>>>>> form:
> > > >>>>>>>> http://imgur.com/a/DUDsD
> > > >>>>>>>>
> > > >>>>>>>>
> > > > >>>>>>>> On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <
> > > > >>>>>>>> nathan.l.segerlind@intel.com> wrote:
> > > >>>>>>>>
> > > > >>>>>>>>> The diagram became garbled in the text format.
> > > >>>>>>>>> Could you resend it as a pdf?
> > > >>>>>>>>>
> > > >>>>>>>>> Thanks,
> > > >>>>>>>>> Nate
> > > >>>>>>>>>
> > > >>>>>>>>> -----Original Message-----
> > > >>>>>>>>> From: Nathanael Smith [mailto:nathanael@apache.org]
> > > >>>>>>>>> Sent: Thursday, April 13, 2017 4:01 PM
> > > > >>>>>>>>> To: private@spot.incubator.apache.org; dev@spot.incubator.apache.org;
> > > > >>>>>>>>> user@spot.incubator.apache.org
> > > > >>>>>>>>> Subject: [Discuss] - Future plans for Spot-ingest
> > > > >>>>>>>>>
> > > > >>>>>>>>> How would you like to see Spot-ingest change?
> > > >>>>>>>>>
> > > > >>>>>>>>> A. Continue development on the Python Master/Worker with focus on
> > > > >>>>>>>>>    performance / error handling / logging
> > > > >>>>>>>>> B. Develop Scala based ingest to be inline with code base from
> > > > >>>>>>>>>    ingest, ml, to OA (UI to continue being ipython/JS)
> > > > >>>>>>>>> C. Python ingest Worker with Scala based Spark code for
> > > > >>>>>>>>>    normalization and input into DB
> > > >>>>>>>>
> > > >>>>>>>>> Including the high level diagram:
> > > >>>>>>>>> +-----------------------------------------------------------
> > > >>>>>>>>>
> > > >>>>>>>> -------------------------------+
> > > >>>>>>>>
> > > >>>>>>>>> | +--------------------------+
> > > >>>>>>>>>
> > > >>>>>>>> +-----------------+        |
> > > >>>>>>>>
> > > >>>>>>>>> | | Master                   |  A. B.
C.
> >   |
> > > >>>>>>>>>
> > > >>>>>>>> Worker          |        |
> > > >>>>>>>>
> > > >>>>>>>>> | |    A. Python             +---------------+
     A.
> > > >>>>>>>>>
> > > >>>>>>>> |   A.
> > > >>>>>>>
> > > >>>>>>>> Python     |        |
> > > >>>>>>>>
> > > >>>>>>>>> | |    B. Scala              |       
       |
> > +------------->
> > > >>>>>>>>>
> > > >>>>>>>>           +----+   |
> > > >>>>>>>>
> > > >>>>>>>>> | |    C. Python             |       
       |    |
> >  |
> > > >>>>>>>>>
> > > >>>>>>>>           |    |   |
> > > >>>>>>>>
> > > >>>>>>>>> | +---^------+---------------+       
       |    |
> > > >>>>>>>>>
> > > >>>>>>>>   +-----------------+    |   |
> > > >>>>>>>>
> > > >>>>>>>>> |     |      |                       
       |    |
> > > >>>>>>>>>
> > > >>>>>>>>                |   |
> > > >>>>>>>>
> > > >>>>>>>>> |     |      |                       
       |    |
> > > >>>>>>>>>
> > > >>>>>>>>                |   |
> > > >>>>>>>>
> > > >>>>>>>>> |     |     +Note--------------+     
       |    |
> > > >>>>>>>>>
> > > >>>>>>>>   +-----------------+    |   |
> > > >>>>>>>>
> > > >>>>>>>>> |     |     |Running on a      |     
       |    |
> >  |
> > > >>>>>>>>>
> > > >>>>>>>> Spark
> > > >>>>>>>
> > > >>>>>>>> Streaming |    |   |
> > > >>>>>>>>
> > > >>>>>>>>> |     |     |worker node in    |     
       |    |      B.
> C.
> > > >>>>>>>>>
> > > >>>>>>>> | B.
> > > >>>>>>>
> > > >>>>>>>> Scala        |    |   |
> > > >>>>>>>>
> > > >>>>>>>>> |     |     |the Hadoop cluster|     
       |    |
> > > >>>>>>>>>
> > > >>>>>>>> +--------> C.
> > > >>>>>>>
> > > >>>>>>>> Scala        +-+  |   |
> > > >>>>>>>>
> > > >>>>>>>>> |     |     +------------------+     
       |    |    |
> >   |
> > > >>>>>>>>>
> > > >>>>>>>>           | |  |   |
> > > >>>>>>>>
> > > >>>>>>>>> |   A.|                              
       |    |    |
> > > >>>>>>>>>
> > > >>>>>>>> +-----------------+ |  |   |
> > > >>>>>>>>
> > > >>>>>>>>> |   B.|                              
       |    |    |
> > > >>>>>>>>>
> > > >>>>>>>>              |  |   |
> > > >>>>>>>>
> > > >>>>>>>>> |   C.|                              
       |    |    |
> > > >>>>>>>>>
> > > >>>>>>>>              |  |   |
> > > >>>>>>>>
> > > >>>>>>>>> | +----------------------+          +-v------+----+----+-+
> > > >>>>>>>>>
> > > >>>>>>>>   +--------------v--v-+ |
> > > >>>>>>>>
> > > >>>>>>>>> | |                      |          |
> > > >>>>>>>>>
> > > >>>>>>>> |           |
> > > >>>>>>>
> > > >>>>>>>>                   | |
> > > >>>>>>>>
> > > >>>>>>>>> | |   Local FS:          |          |
   hdfs
> > > >>>>>>>>>
> > > >>>>>>>> |           |
> > > >>>>>>>
> > > >>>>>>>> Hive / Impala    | |
> > > >>>>>>>>
> > > >>>>>>>>> | |  - Binary/Text       |          |
> > > >>>>>>>>>
> > > >>>>>>>> |           |
> > > >>>>>>>
> > > >>>>>>>>   - Parquet -     | |
> > > >>>>>>>>
> > > >>>>>>>>> | |    Log files -       |          |
> > > >>>>>>>>>
> > > >>>>>>>> |           |
> > > >>>>>>>
> > > >>>>>>>>                   | |
> > > >>>>>>>>
> > > >>>>>>>>> | |                      |          |
> > > >>>>>>>>>
> > > >>>>>>>> |           |
> > > >>>>>>>
> > > >>>>>>>>                   | |
> > > >>>>>>>>
> > > >>>>>>>>> | +----------------------+          +--------------------+
> > > >>>>>>>>>
> > > >>>>>>>>   +-------------------+ |
> > > >>>>>>>>
> > > >>>>>>>>> +-----------------------------------------------------------
> > > >>>>>>>>>
> > > >>>>>>>> -------------------------------+
> > > >>>>>>>>
> > > >>>>>>>>> Please let me know your thoughts,
> > > >>>>>>>>>
> > > >>>>>>>>> - Nathanael
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>
> > > >
> > >
> >
>
