spot-dev mailing list archives

From Mark Grover <m...@apache.org>
Subject Re: [Discuss] - Future plans for Spot-ingest
Date Fri, 14 Apr 2017 20:38:44 GMT
Hi Kant,
Just wanted to make sure you don't feel like we are ignoring your
comment:-) I hear you about pluggability.

The design can and should be pluggable but the project has one stack it
ships out of the box with, one stack that's the default stack in the sense
that it's the most tested and so on. And, for us, that's our current stack.
If we were to take Apache Hive as an example, it shipped (and ships) with
MapReduce as the default execution engine. At some point, Apache Tez
came along and wanted Hive to run on Tez, so they made a bunch of things
pluggable to run Hive on Tez (instead of the only option up until then,
Hive-on-MR). Then Apache Spark came and reused some of that
pluggability and even added some more so Hive-on-Spark could become a
reality. In the same way, I don't think anyone here disagrees that
pluggability is a good thing, but it's hard to do pluggability right, and at
the right level, unless one has a clear use case in mind.
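
To make "pluggable, but with one tested default" concrete, here is a minimal
Python sketch. It is only an illustration under stated assumptions: the names
QueryEngine, ImpalaEngine, SparkSqlEngine, register_engine and get_engine are
hypothetical and are not part of Spot, and the engines only echo the query
instead of talking to a real backend.

```python
from typing import Callable, Dict, Optional


class QueryEngine:
    """Hypothetical base class for a SQL engine backend (not a Spot API)."""

    name = "base"

    def run(self, sql: str) -> str:
        raise NotImplementedError


class ImpalaEngine(QueryEngine):
    """Stand-in for the default, most-tested backend."""

    name = "impala"

    def run(self, sql: str) -> str:
        # A real backend would submit the query; this sketch only echoes it.
        return f"[{self.name}] {sql}"


class SparkSqlEngine(QueryEngine):
    """Stand-in for an alternative backend contributed later."""

    name = "spark-sql"

    def run(self, sql: str) -> str:
        return f"[{self.name}] {sql}"


# Registry mapping an engine name to a factory that builds it.
_ENGINES: Dict[str, Callable[[], QueryEngine]] = {}

# The stack the project ships and tests out of the box (assumed here).
DEFAULT_ENGINE = "impala"


def register_engine(name: str, factory: Callable[[], QueryEngine]) -> None:
    """Make a backend selectable by name."""
    _ENGINES[name] = factory


def get_engine(name: Optional[str] = None) -> QueryEngine:
    """Return the named backend, or the default when no name is given."""
    key = name or DEFAULT_ENGINE
    if key not in _ENGINES:
        raise ValueError(f"unknown engine {key!r}; known: {sorted(_ENGINES)}")
    return _ENGINES[key]()


register_engine("impala", ImpalaEngine)
register_engine("spark-sql", SparkSqlEngine)

print(get_engine().run("SELECT 1"))
print(get_engine("spark-sql").run("SELECT 1"))
```

The point of the sketch is that the default stays the only path that has to
be heavily tested, while a contributed backend only needs to register itself
under a new name.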

As a project, we have many things to do, and I personally think the biggest
bang for the buck for us in making Spot a really solid cyber security
solution, and the best one, isn't pluggability but the things we are working
on: a better user interface, a common/unified approach to storing and
modeling data, etc.

Having said that, we are open: if it's important to you or someone else,
we'd be happy to receive and review those patches.

Thanks!
Mark

On Fri, Apr 14, 2017 at 10:14 AM, kant kodali <kanth909@gmail.com> wrote:

> Thanks Ross! And yes, option C sounds good to me as well; however, I just
> think the distributed SQL query engine and the resource manager should be
> pluggable.
>
>
>
>
> On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <alan.d.ross@intel.com>
> wrote:
>
> > Option C is to use Python on the front end of the ingest pipeline and
> > Spark/Scala on the back end.
> >
> > Option A uses Python workers on the back end.
> >
> > Option B uses all Scala.
> >
> >
> >
> > -----Original Message-----
> > From: kant kodali [mailto:kanth909@gmail.com]
> > Sent: Friday, April 14, 2017 9:53 AM
> > To: dev@spot.incubator.apache.org
> > Subject: Re: [Discuss] - Future plans for Spot-ingest
> >
> > What is option C? Am I missing an email or something?
> >
> > On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <
> > chokha@integralops.com> wrote:
> >
> > > +1 for Python 3.x
> > >
> > >
> > >
> > > On 4/14/2017 11:59 AM, Austin Leahy wrote:
> > >
> > >> I think that C is the strong solution; getting the ingest really
> > >> strong is going to lower barriers to adoption. Doing it in Python
> > >> will open up the ingest portion of the project to many more
> > >> developers.
> > >>
> > >> Before it comes up, I would like to throw the following on the pile:
> > >> major Python projects (Django, Flask, and others) are dropping 2.x
> > >> support in releases scheduled in the next 6 to 8 months. Hadoop
> > >> projects in general tend to lag in modern Python support, so let's
> > >> please build this in 3.5 so that we don't have to immediately expect
> > >> a rebuild in the pipeline.
> > >>
> > >> -Vote C
> > >>
> > >> Thanks Nate
> > >>
> > >> Austin
> > >>
> > >> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <alan@apache.org> wrote:
> > >>
> > >>> I really like option C because it gives a lot of flexibility for
> > >>> ingest (python vs scala) but still has the robust spark streaming
> > >>> backend for performance.
> > >>>
> > >>> Thanks for putting this together Nate.
> > >>>
> > >>> Alan
> > >>>
> > >>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <
> > >>> chokha@integralops.com> wrote:
> > >>>
> > >>>> I agree. We should continue making the existing stack more mature
> > >>>> at this point. Maybe if we have enough community support we can add
> > >>>> additional datastores.
> > >>>>
> > >>>> Chokha.
> > >>>>
> > >>>>
> > >>>> On 4/14/17 11:10 AM, kenneth@floss.cat wrote:
> > >>>>
> > >>>>> Hi Kant,
> > >>>>>
> > >>>>>
> > >>>>> YARN is the standard scheduler in Hadoop. If you're using
> > >>>>> Hive+Spark, then sure you'll have YARN.
> > >>>>>
> > >>>>> Haven't seen any Hive on Mesos so far. As said, Spot is based on a
> > >>>>> quite standard Hadoop stack and I wouldn't switch too many pieces
> > >>>>> yet.
> > >>>>>
> > >>>>> In most open source projects you start by relying on a well-known
> > >>>>> stack and then you begin to support other DB backends once it's
> > >>>>> quite mature. Think of the loads of LAMP apps which haven't been
> > >>>>> ported away from MySQL yet.
> > >>>>>
> > >>>>> In any case, you'll need high-performance SQL + massive storage
> > >>>>> + machine learning + massive ingestion, and ATM that can only be
> > >>>>> provided by Hadoop.
> > >>>>>
> > >>>>> Regards!
> > >>>>>
> > >>>>> Kenneth
> > >>>>>
> > >>>>> On 2017-04-14 12:56, kant kodali wrote:
> > >>>>>
> > >>>>>> Hi Kenneth,
> > >>>>>>
> > >>>>>> Thanks for the response. I think you made a case for HDFS;
> > >>>>>> however, users may want to use S3 or some other FS, in which case
> > >>>>>> they can use Auxilio (hoping that there are no changes needed
> > >>>>>> within Spot, in which case I can agree to that). For example,
> > >>>>>> Netflix stores all their data in S3.
> > >>>>>>
> > >>>>>> The distributed SQL query engine, I would say, should be pluggable
> > >>>>>> with whatever the user may want to use, and there are a bunch of
> > >>>>>> them out there. Sure, Impala is better than Hive, but what if users
> > >>>>>> are already using something else like Drill or Presto?
> > >>>>>>
> > >>>>>> Personally, I would not assume that users are willing to deploy
> > >>>>>> all of that and make their existing stack more complicated; at the
> > >>>>>> very least I would say it is an uphill battle. Things have been
> > >>>>>> changing rapidly in the big data space, so whatever we think is
> > >>>>>> standard won't be standard anymore, but importantly there shouldn't
> > >>>>>> be any reason why we shouldn't be flexible, right?
> > >>>>>>
> > >>>>>> Also, I am not sure why only YARN. Why not make that more
> > >>>>>> flexible too, so users can pick Mesos or standalone?
> > >>>>>>
> > >>>>>> I think flexibility, rather than a tightly coupled architecture,
> > >>>>>> is key to wide adoption.
> > >>>>>>
> > >>>>>> Thanks!
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza
> > >>>>>> <kenneth@floss.cat>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> PS: you need a big data platform to be able to collect all those
> > >>>>>>> netflows and logs.
> > >>>>>>>
> > >>>>>>> Spot isn't intended for SMBs, that's clear. You then need loads
> > >>>>>>> of data to get ML working properly, and somewhere to run those
> > >>>>>>> algorithms. That is Hadoop.
> > >>>>>>>
> > >>>>>>> Regards!
> > >>>>>>>
> > >>>>>>> Kenneth
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Sent from my Mi phone
> > >>>>>>> On Apr 14, 2017 4:04 AM, kant kodali <kanth909@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>> Hi,
> > >>>>>>>
> > >>>>>>> Thanks for starting this thread. Here is my feedback.
> > >>>>>>>
> > >>>>>>> I somehow think the architecture is too complicated for wide
> > >>>>>>> adoption since it requires installing the following:
> > >>>>>>>
> > >>>>>>> HDFS.
> > >>>>>>> HIVE.
> > >>>>>>> IMPALA.
> > >>>>>>> KAFKA.
> > >>>>>>> SPARK (YARN).
> > >>>>>>> YARN.
> > >>>>>>> Zookeeper.
> > >>>>>>>
> > >>>>>>> Currently there are way too many dependencies, which discourages
> > >>>>>>> a lot of users from using it because they have to go through the
> > >>>>>>> deployment of all that required software. I think for wide
> > >>>>>>> adoption we should minimize the dependencies and have a more
> > >>>>>>> pluggable architecture. For example, I am not sure why Hive and
> > >>>>>>> Impala are both required. Why not just use Spark SQL, since it's
> > >>>>>>> already a dependency? Or users may want to use a distributed
> > >>>>>>> query engine of their own choice, such as Apache Drill or
> > >>>>>>> something else. We should be flexible enough to provide that
> > >>>>>>> option.
> > >>>>>>>
> > >>>>>>> Also, I see that HDFS is used so that collectors can receive
> > >>>>>>> file paths through Kafka and then read the file. How big are
> > >>>>>>> these files? Do we really need HDFS for this? Why not provide
> > >>>>>>> more ways to send data, such as sending data directly through
> > >>>>>>> Kafka, or just leaving it up to the user to specify the file
> > >>>>>>> location as an argument to the collector process?
> > >>>>>>>
> > >>>>>>> Finally, I learnt that generating NetFlow data requires specific
> > >>>>>>> hardware. This really means Apache Spot is not meant for
> > >>>>>>> everyone. I thought Apache Spot could be used to analyze the
> > >>>>>>> network traffic of any machine, but if it requires specific
> > >>>>>>> hardware then I think it is targeted at a specific group of
> > >>>>>>> people.
> > >>>>>>>
> > >>>>>>> The real strength of Apache Spot should mainly be just analyzing
> > >>>>>>> network traffic through ML.
> > >>>>>>>
> > >>>>>>> Thanks!
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <
> > >>>>>>> nathan.l.segerlind@intel.com> wrote:
> > >>>>>>>
> > >>>>>>> Thanks, Nate,
> > >>>>>>>>
> > >>>>>>>> Nate.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> -----Original Message-----
> > >>>>>>>> From: Nate Smith [mailto:natedogs911@gmail.com]
> > >>>>>>>> Sent: Thursday, April 13, 2017 4:26 PM
> > >>>>>>>> To: user@spot.incubator.apache.org
> > >>>>>>>> Cc: dev@spot.incubator.apache.org;
> > >>>>>>>> private@spot.incubator.apache.org
> > >>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
> > >>>>>>>>
> > >>>>>>>> I was really hoping it came through ok, Oh well :) Here’s an
> > >>>>>>>> image form:
> > >>>>>>>> http://imgur.com/a/DUDsD
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <
> > >>>>>>>> nathan.l.segerlind@intel.com> wrote:
> > >>>>>>>>
> > >>>>>>>>> The diagram became garbled in the text format.
> > >>>>>>>>> Could you resend it as a pdf?
> > >>>>>>>>>
> > >>>>>>>>> Thanks,
> > >>>>>>>>> Nate
> > >>>>>>>>>
> > >>>>>>>>> -----Original Message-----
> > >>>>>>>>> From: Nathanael Smith [mailto:nathanael@apache.org]
> > >>>>>>>>> Sent: Thursday, April 13, 2017 4:01 PM
> > >>>>>>>>> To: private@spot.incubator.apache.org;
> > >>>>>>>>> dev@spot.incubator.apache.org;
> > >>>>>>>>> user@spot.incubator.apache.org
> > >>>>>>>>> Subject: [Discuss] - Future plans for Spot-ingest
> > >>>>>>>>>
> > >>>>>>>>> How would you like to see Spot-ingest change?
> > >>>>>>>>>
> > >>>>>>>>> A. continue development on the Python Master/Worker with focus
> > >>>>>>>>>    on performance / error handling / logging
> > >>>>>>>>> B. Develop Scala based ingest to be inline with code base from
> > >>>>>>>>>    ingest, ml, to OA (UI to continue being ipython/JS)
> > >>>>>>>>> C. Python ingest Worker with Scala based Spark code for
> > >>>>>>>>>    normalization and input into DB
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>>> Including the high level diagram:
> > >>>>>>>>> [ASCII diagram garbled by text wrapping; an image version is
> > >>>>>>>>> linked in the follow-up message. Boxes recoverable from the
> > >>>>>>>>> text: Master (A. Python, B. Scala, C. Python), Worker
> > >>>>>>>>> (A. Python), Spark Streaming (B. Scala, C. Scala), Local FS
> > >>>>>>>>> (binary/text log files), hdfs, and Hive / Impala (Parquet),
> > >>>>>>>>> plus a note reading "Running on a worker node in the Hadoop
> > >>>>>>>>> cluster".]
> > >>>>>>>>
> > >>>>>>>>> Please let me know your thoughts,
> > >>>>>>>>>
> > >>>>>>>>> - Nathanael
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>
> > >
> >
>
