spot-dev mailing list archives

From Jon Zeolla <JonZeo...@apache.org>
Subject Re: [Discuss] - Future plans for Spot-ingest
Date Fri, 14 Apr 2017 17:45:45 GMT
I vote for Option C as well, and +1 to the preference of Python 3.x

Jon

On Fri, Apr 14, 2017 at 1:15 PM kant kodali <kanth909@gmail.com> wrote:

> Thanks Ross! And yes, option C sounds good to me as well; however, I just
> think the distributed SQL query engine and the resource manager should be
> pluggable.
>
>
>
>
> On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <alan.d.ross@intel.com>
> wrote:
>
> > Option C is to use python on the front end of ingest pipeline and
> > spark/scala on the back end.
> >
> > Option A uses python workers on the backend
> >
> > Option B uses all scala.
> >
> >
> >
> > -----Original Message-----
> > From: kant kodali [mailto:kanth909@gmail.com]
> > Sent: Friday, April 14, 2017 9:53 AM
> > To: dev@spot.incubator.apache.org
> > Subject: Re: [Discuss] - Future plans for Spot-ingest
> >
> > What is option C? Am I missing an email or something?
> >
> > On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <
> > chokha@integralops.com> wrote:
> >
> > > +1 for Python 3.x
> > >
> > >
> > >
> > > On 4/14/2017 11:59 AM, Austin Leahy wrote:
> > >
> > >> I think that C is the strong solution; getting the ingest really
> > >> strong is going to lower barriers to adoption. Doing it in Python
> > >> will open up the ingest portion of the project to include many more
> > >> developers.
> > >>
> > >> Before it comes up I would like to throw the following on the pile...
> > >> Major Python projects (Django, Flask, and others) are dropping 2.x
> > >> support in releases scheduled in the next 6 to 8 months. Hadoop
> > >> projects in general tend to lag in modern Python support; let's
> > >> please build this in 3.5 so that we don't have to immediately expect
> > >> a rebuild in the pipeline.
> > >>
> > >> -Vote C
> > >>
> > >> Thanks Nate
> > >>
> > >> Austin
> > >>
> > >> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <alan@apache.org> wrote:
> > >>
> > >>> I really like option C because it gives a lot of flexibility for
> > >>> ingest (python vs scala) but still has the robust spark streaming
> > >>> backend for performance.
> > >>>
> > >>> Thanks for putting this together Nate.
> > >>>
> > >>> Alan
> > >>>
> > >>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <
> > >>> chokha@integralops.com> wrote:
> > >>>
> > >>>> I agree. We should continue making the existing stack more mature
> > >>>> at this point. Maybe if we have enough community support we can add
> > >>>> additional datastores.
> > >>>>
> > >>>> Chokha.
> > >>>>
> > >>>>
> > >>>> On 4/14/17 11:10 AM, kenneth@floss.cat wrote:
> > >>>>
> > >>>>> Hi Kant,
> > >>>>>
> > >>>>>
> > >>>>> YARN is the standard scheduler in Hadoop. If you're using
> > >>>>> Hive+Spark, then sure you'll have YARN.
> > >>>>>
> > >>>>> Haven't seen any Hive on Mesos so far. As said, Spot is based on a
> > >>>>> quite standard Hadoop stack and I wouldn't switch too many pieces
> > >>>>> yet.
> > >>>>>
> > >>>>> In most open source projects you start relying on a well-known
> > >>>>> stack and then you begin to support other DB backends once it's
> > >>>>> quite mature. Think of the loads of LAMP apps which haven't been
> > >>>>> ported away from MySQL yet.
> > >>>>>
> > >>>>> In any case, you'll need a high performance SQL + Massive Storage
> > >>>>> + Machine Learning + Massive Ingestion, and... ATM, that can be
> > >>>>> only provided by Hadoop.
> > >>>>>
> > >>>>> Regards!
> > >>>>>
> > >>>>> Kenneth
> > >>>>>
> > >>>>> On 2017-04-14 12:56, kant kodali wrote:
> > >>>>>
> > >>>>>> Hi Kenneth,
> > >>>>>>
> > >>>>>> Thanks for the response. I think you made a case for HDFS;
> > >>>>>> however, users may want to use S3 or some other FS, in which case
> > >>>>>> they can use Alluxio (hoping that there are no changes needed
> > >>>>>> within Spot, in which case I can agree to that). For example,
> > >>>>>> Netflix stores all their data in S3.
> > >>>>>>
> > >>>>>> The distributed SQL query engine, I would say, should be pluggable
> > >>>>>> with whatever the user may want to use, and there are a bunch of
> > >>>>>> them out there. Sure, Impala is better than Hive, but what if
> > >>>>>> users are already using something else like Drill or Presto?
> > >>>>>>
> > >>>>>> Personally, I would not assume that users are willing to deploy
> > >>>>>> all of that and make their existing stack more complicated; at the
> > >>>>>> very least I would say it is an uphill battle. Things have been
> > >>>>>> changing rapidly in the big data space, so whatever we think is
> > >>>>>> standard won't be standard anymore, but importantly there
> > >>>>>> shouldn't be any reason why we shouldn't be flexible, right?
> > >>>>>>
> > >>>>>> Also, I am not sure why only YARN? Why not make that also more
> > >>>>>> flexible so users can pick Mesos or standalone?
> > >>>>>>
> > >>>>>> I think flexibility is key for wide adoption, rather than a
> > >>>>>> tightly coupled architecture.
> > >>>>>>
> > >>>>>> Thanks!
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza
> > >>>>>> <kenneth@floss.cat>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> PS: you need a big data platform to be able to collect all those
> > >>>>>>> netflows and logs.
> > >>>>>>>
> > >>>>>>> Spot isn't intended for SMBs, that's clear, then you need loads
> > >>>>>>> of data to get ML working properly, and somewhere to run those
> > >>>>>>> algorithms. That is Hadoop.
> > >>>>>>>
> > >>>>>>> Regards!
> > >>>>>>>
> > >>>>>>> Kenneth
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Sent from my Mi phone
> > >>>>>>> On Apr 14, 2017 4:04 AM, kant kodali <kanth909@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>> Hi,
> > >>>>>>>
> > >>>>>>> Thanks for starting this thread. Here is my feedback.
> > >>>>>>>
> > >>>>>>> I somehow think the architecture is too complicated for wide
> > >>>>>>> adoption since it requires installing the following:
> > >>>>>>>
> > >>>>>>> HDFS.
> > >>>>>>> HIVE.
> > >>>>>>> IMPALA.
> > >>>>>>> KAFKA.
> > >>>>>>> SPARK (YARN).
> > >>>>>>> YARN.
> > >>>>>>> Zookeeper.
> > >>>>>>>
> > >>>>>>> Currently there are way too many dependencies, which discourages
> > >>>>>>> a lot of users from using it because they have to go through
> > >>>>>>> deployment of all that required software. I think for wide
> > >>>>>>> adoption we should minimize the dependencies and have a more
> > >>>>>>> pluggable architecture. For example, I am not sure why HIVE and
> > >>>>>>> IMPALA are both required. Why not just use Spark SQL, since it's
> > >>>>>>> already a dependency? Or, say, users may want to use their own
> > >>>>>>> distributed query engine, such as Apache Drill or something
> > >>>>>>> else. We should be flexible enough to provide that option.
> > >>>>>>>
> > >>>>>>> Also, I see that HDFS is used so that collectors can receive
> > >>>>>>> file paths through Kafka and be able to read a file. How big
> > >>>>>>> are these files? Do we really need HDFS for this? Why not
> > >>>>>>> provide more ways to send data, such as sending data directly
> > >>>>>>> through Kafka or, say, just leaving it up to the user to specify
> > >>>>>>> the file location as an argument to the collector process?
> > >>>>>>>
> > >>>>>>> Finally, I learnt that generating NetFlow data requires
> > >>>>>>> specific hardware. This really means Apache Spot is not meant
> > >>>>>>> for everyone. I thought Apache Spot could be used to analyze the
> > >>>>>>> network traffic of any machine, but if it requires specific
> > >>>>>>> hardware then I think it is targeted at a specific group of
> > >>>>>>> people.
> > >>>>>>>
> > >>>>>>> The real strength of Apache Spot should mainly be just analyzing
> > >>>>>>> network traffic through ML.
> > >>>>>>>
> > >>>>>>> Thanks!
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <
> > >>>>>>> nathan.l.segerlind@intel.com> wrote:
> > >>>>>>>
> > >>>>>>> Thanks, Nate,
> > >>>>>>>>
> > >>>>>>>> Nate.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> -----Original Message-----
> > >>>>>>>> From: Nate Smith [mailto:natedogs911@gmail.com]
> > >>>>>>>> Sent: Thursday, April 13, 2017 4:26 PM
> > >>>>>>>> To: user@spot.incubator.apache.org
> > >>>>>>>> Cc: dev@spot.incubator.apache.org;
> > >>>>>>>> private@spot.incubator.apache.org
> > >>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
> > >>>>>>>>
> > >>>>>>>> I was really hoping it came through ok. Oh well :) Here’s an
> > >>>>>>>> image form:
> > >>>>>>>> http://imgur.com/a/DUDsD
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>> On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <
> > >>>>>>>>> nathan.l.segerlind@intel.com> wrote:
> > >>>>>>>>
> > >>>>>>>>> The diagram became garbled in the text format.
> > >>>>>>>>> Could you resend it as a pdf?
> > >>>>>>>>>
> > >>>>>>>>> Thanks,
> > >>>>>>>>> Nate
> > >>>>>>>>>
> > >>>>>>>>> -----Original Message-----
> > >>>>>>>>> From: Nathanael Smith [mailto:nathanael@apache.org]
> > >>>>>>>>> Sent: Thursday, April 13, 2017 4:01 PM
> > >>>>>>>>> To: private@spot.incubator.apache.org;
> > >>>>>>>>> dev@spot.incubator.apache.org;
> > >>>>>>>>> user@spot.incubator.apache.org
> > >>>>>>>>> Subject: [Discuss] - Future plans for Spot-ingest
> > >>>>>>>>>
> > >>>>>>>>> How would you like to see Spot-ingest change?
> > >>>>>>>>>
> > >>>>>>>>> A. Continue development on the Python Master/Worker with focus
> > >>>>>>>>>    on performance / error handling / logging
> > >>>>>>>>> B. Develop Scala based ingest to be inline with code base from
> > >>>>>>>>>    ingest, ml, to OA (UI to continue being ipython/JS)
> > >>>>>>>>> C. Python ingest Worker with Scala based Spark code for
> > >>>>>>>>>    normalization and input into DB
> > >>>>>>>>>
> > >>>>>>>>> Including the high level diagram:
> > >>>>>>>>> [ASCII diagram garbled in the plain-text archive. Recoverable
> > >>>>>>>>> box labels: Master (A. Python, B. Scala, C. Python); Worker
> > >>>>>>>>> (A. Python); Spark Streaming (B. Scala, C. Scala); a note
> > >>>>>>>>> reading "Running on a worker node in the Hadoop cluster";
> > >>>>>>>>> Local FS (Binary/Text Log files); hdfs; Hive / Impala
> > >>>>>>>>> (Parquet).]
> > >>>>>>>>
> > >>>>>>>>> Please let me know your thoughts,
> > >>>>>>>>>
> > >>>>>>>>> - Nathanael
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>
> > >
> >
>
-- 

Jon
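
A minimal sketch of the pluggability point raised in the thread (a distributed SQL query engine that can be swapped without touching the code that uses it). Everything named here is hypothetical: `QueryEngine`, `SqliteEngine`, `top_talkers`, and the `flows` table are illustrative assumptions and not part of Spot. In-memory SQLite stands in for Impala, Hive, Drill, or Presto only so that the example runs anywhere with Python 3.8+.

```python
import sqlite3
from typing import Protocol, Sequence


class QueryEngine(Protocol):
    """Hypothetical interface: anything that can run SQL and return rows."""

    def query(self, sql: str) -> Sequence[tuple]:
        ...


class SqliteEngine:
    """Stand-in engine backed by in-memory SQLite (illustration only)."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")
        # Toy table standing in for ingested flow records (assumed schema).
        self._conn.execute(
            "CREATE TABLE flows (src TEXT, dst TEXT, bytes INTEGER)"
        )
        self._conn.executemany(
            "INSERT INTO flows VALUES (?, ?, ?)",
            [("10.0.0.1", "10.0.0.2", 1200), ("10.0.0.1", "10.0.0.3", 300)],
        )

    def query(self, sql: str) -> Sequence[tuple]:
        return self._conn.execute(sql).fetchall()


def top_talkers(engine: QueryEngine) -> Sequence[tuple]:
    """Depends only on the QueryEngine interface, not on a specific engine."""
    return engine.query(
        "SELECT src, SUM(bytes) AS total FROM flows "
        "GROUP BY src ORDER BY total DESC"
    )


if __name__ == "__main__":
    print(top_talkers(SqliteEngine()))
```

Under these assumptions, supporting another engine would mean writing one more class with a `query` method; `top_talkers` would stay unchanged.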
