spot-dev mailing list archives

From kenn...@floss.cat
Subject Re: [Discuss] - Future plans for Spot-ingest
Date Wed, 19 Apr 2017 20:22:09 GMT

Replying to myself: AVRO for ingestion, Parquet for storage.

Regards!

Kenneth
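
[Editorial sketch, not part of the original thread.] The point behind "Avro for ingestion" is that an Avro schema pins field names and types up front, which CSV cannot do. The following plain-Python illustration (no Avro library required; the "flow" record fields are hypothetical) mimics what an Avro reader enforces, using Avro's JSON schema style:

```python
# Illustration only: an Avro-style schema declared as JSON, and a
# type check of the kind an Avro reader performs automatically.
# A CSV reader has no schema to check against.
schema = {
    "type": "record",
    "name": "flow",  # hypothetical netflow-style record
    "fields": [
        {"name": "src_ip", "type": "string"},
        {"name": "dst_ip", "type": "string"},
        {"name": "dst_port", "type": "int"},
        {"name": "bytes_sent", "type": "long"},
    ],
}

# Map Avro primitive type names to Python types for the check.
AVRO_TO_PY = {"string": str, "int": int, "long": int}

def validate(record: dict, schema: dict) -> bool:
    """Return True if every schema field is present with the right type."""
    return all(
        f["name"] in record
        and isinstance(record[f["name"]], AVRO_TO_PY[f["type"]])
        for f in schema["fields"]
    )

good = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2",
        "dst_port": 443, "bytes_sent": 1500}
bad = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2",
       "dst_port": "443", "bytes_sent": 1500}  # port came in as a string

print(validate(good, schema), validate(bad, schema))  # True False
```

With CSV, the `bad` record would slip through and every downstream consumer would have to re-guess the types; with a schema-bearing format the mismatch is caught at ingest.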

On 2017-04-19 22:05, Austin Leahy wrote:
> I think Kafka is probably a red herring. It's an industry go-to in the
> application world because of redundancy, but the type and volumes of
> network telemetry that we are talking about here will bog Kafka down
> unless you dedicate really serious hardware to just the Kafka
> implementation. It's essentially the next level of the problem that the
> team was already running into when RabbitMQ was queueing in data.
> 
> On Wed, Apr 19, 2017 at 12:33 PM Mark Grover <mark@apache.org> wrote:
> 
>> On Wed, Apr 19, 2017 at 10:19 AM, Smith, Nathanael P <
>> nathanael.p.smith@intel.com> wrote:
>> 
>> > Mark,
>> >
>> > just digesting the below.
>> >
>> > Backing up in my thought process, I was thinking that the ingest master
>> > (first point of entry into the system) would want to put the data into a
>> > standard serializable format. I was thinking that libraries (such as
>> > pyarrow in this case) could help by writing the data in parquet format
>> > early in the process. You are probably correct that at this point in time
>> > it might not be worth the time and can be kept in the backlog.
>> > That being said, I still think the master should produce data in a
>> > standard format. What, in your opinion (and I open this up to others,
>> > of course), would be the most logical format? The most basic would be
>> > to just keep it as a .csv.
>> >
>> > The master will likely write data to a staging directory in HDFS,
>> > where the spark streaming job will pick it up for normalization and
>> > writing to parquet in the correct block sizes and partitions.
>> >
>> 
>> Hi Nate,
>> Avro is usually preferred for such a standard format, because it
>> asserts a schema (types, etc.), which CSV doesn't, and it allows for
>> schema evolution, which CSV may or may not support depending on the
>> type of evolution. And that's something I have seen done very commonly.
>> 
>> Now, if the data were in Kafka before it gets to the master, one could
>> argue that the master could just send metadata to the workers (topic
>> name, partition number, offset start and end) and the workers could
>> read from Kafka directly. I do understand that'd be a much different
>> architecture than the current one, but if you think it's a good idea
>> too, we could document that, say in a JIRA, and (de-)prioritize it
>> (and in line with the rest of the discussion on this thread, it's not
>> the top-most priority).
>> Thoughts?
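
[Editorial sketch, not part of the original thread.] The master's side of that metadata-only handoff could look like the following; the names and the planning function are illustrative, not an actual Spot interface:

```python
# Sketch: instead of shipping data, the master hands each worker a
# (topic, partition, start offset, end offset) tuple and the worker
# reads that range from Kafka itself.
from typing import Dict, List, Tuple

Assignment = Tuple[str, int, int, int]  # topic, partition, start, end

def plan_assignments(topic: str,
                     end_offsets: Dict[int, int],
                     committed: Dict[int, int]) -> List[Assignment]:
    """One assignment per partition: read from the last committed
    offset up to (but not including) the current end offset.
    Partitions with no new data are skipped."""
    return [
        (topic, part, committed.get(part, 0), end)
        for part, end in sorted(end_offsets.items())
        if committed.get(part, 0) < end
    ]

# Example: 3 partitions; partition 2 has no new messages.
plan = plan_assignments(
    "spot-flows",
    end_offsets={0: 1200, 1: 950, 2: 400},
    committed={0: 1000, 1: 0, 2: 400},
)
print(plan)  # [('spot-flows', 0, 1000, 1200), ('spot-flows', 1, 0, 950)]
```

A worker would then use its Kafka client to seek to the start offset of its assigned partition and consume up to the end offset, which is essentially what Spark's direct Kafka integration does internally.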
>> 
>> - Nathanael
>> >
>> >
>> >
>> > > On Apr 17, 2017, at 1:12 PM, Mark Grover <mark@apache.org> wrote:
>> > >
>> > > Thanks, all, for your opinions.
>> > >
>> > > I think it's good to consider two things:
>> > > 1. What do (we think) users care about?
>> > > 2. What's the cost of changing things?
>> > >
>> > > About #1, I think users care more about what format the data is
>> > > written in than how the data is written. I'd argue that whether it
>> > > uses Hive, MR, or a custom Parquet writer is not as important to
>> > > them as long as we maintain data/format compatibility.
>> > > About #2, having worked on several projects, I find that it's rather
>> > > difficult to keep up with Parquet. Even in Spark, there are a few
>> > > different ways to write to Parquet - there's a regular mode and a
>> > > legacy mode
>> > > <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L44>
>> > > which continues to cause confusion
>> > > <https://issues.apache.org/jira/browse/SPARK-20297> to date. Parquet
>> > > itself is pretty dependent on Hadoop
>> > > <https://github.com/Parquet/parquet-mr/search?l=Maven+POM&q=hadoop&type=&utf8=%E2%9C%93>
>> > > and just integrating it with systems with a lot of developers (like
>> > > Spark
>> > > <https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=spark+parquet+jiras>)
>> > > is still a lot of work.
>> > >
>> > > I personally think we should leverage higher level tools like Hive, or
>> > > Spark to write data in widespread formats (Parquet, being a very good
>> > > example) but I personally wouldn't encourage us to manage the writers
>> > > ourselves.
>> > >
>> > > Thoughts?
>> > > Mark
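
[Editorial note, not part of the original thread.] For reference, the regular/legacy split linked above is toggled by a single Spark configuration key; a minimal `spark-defaults.conf` fragment (assuming Spark 2.x) would be:

```properties
# Write Parquet in the pre-Spark-1.4 layout for compatibility with
# older readers such as Hive and Impala; defaults to false.
spark.sql.parquet.writeLegacyFormat  true
```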
>> > >
>> > > On Mon, Apr 17, 2017 at 11:44 AM, Michael Ridley <mridley@cloudera.com
>> >
>> > > wrote:
>> > >
>> > >> Without having given it too terribly much thought, that seems like an
>> OK
>> > >> approach.
>> > >>
>> > >> Michael
>> > >>
>> > >> On Mon, Apr 17, 2017 at 2:33 PM, Nathanael Smith <
>> nathanael@apache.org>
>> > >> wrote:
>> > >>
>> > >>> I think the question is whether we can write the data generically
>> > >>> to HDFS as parquet without the use of hive/impala.
>> > >>>
>> > >>> Today we write parquet data using the hive/mapreduce method.
>> > >>> As part of the redesign I'd like to use libraries for this, as
>> > >>> opposed to a Hadoop dependency.
>> > >>> I think it would be preferred to use the python master to write
>> > >>> the data into the format we want, then do normalization of the
>> > >>> data in spark streaming.
>> > >>> Any thoughts?
>> > >>>
>> > >>> - Nathanael
>> > >>>
>> > >>>
>> > >>>
>> > >>>> On Apr 17, 2017, at 11:08 AM, Michael Ridley <mridley@cloudera.com>
>> > >>> wrote:
>> > >>>>
>> > >>>> I had thought that the plan was to write the data in Parquet in HDFS
>> > >>>> ultimately.
>> > >>>>
>> > >>>> Michael
>> > >>>>
>> > >>>> On Sun, Apr 16, 2017 at 11:55 AM, kant kodali <kanth909@gmail.com>
>> > >>> wrote:
>> > >>>>
>> > >>>>> Hi Mark,
>> > >>>>>
>> > >>>>> Thank you so much for hearing my argument. And I definitely
>> > >>>>> understand that you guys have a bunch of things to do. My only
>> > >>>>> concern is that I hope it doesn't take too long to support other
>> > >>>>> backends. For example, @Kenneth gave the example of the LAMP
>> > >>>>> stack not having moved away from MySQL yet, which essentially
>> > >>>>> means it's probably a decade? I see that in the current
>> > >>>>> architecture the results from either python multiprocessing or
>> > >>>>> Spark Streaming are written back to HDFS. If so, can we write
>> > >>>>> them in parquet format, such that users are able to plug in any
>> > >>>>> query engine? Again, I am not pushing you guys to do this right
>> > >>>>> away or anything, just seeing if there is a way for me to get
>> > >>>>> started in parallel. If it's not feasible, that's fine; I just
>> > >>>>> wanted to share my 2 cents, and I am glad my argument is heard!
>> > >>>>>
>> > >>>>> Thanks much!
>> > >>>>>
>> > >>>>> On Fri, Apr 14, 2017 at 1:38 PM, Mark Grover <mark@apache.org>
>> > wrote:
>> > >>>>>
>> > >>>>>> Hi Kant,
>> > >>>>>> Just wanted to make sure you don't feel like we are ignoring
>> > >>>>>> your comment :-) I hear you about pluggability.
>> > >>>>>>
>> > >>>>>> The design can and should be pluggable, but the project has one
>> > >>>>>> stack it ships with out of the box - one default stack, in the
>> > >>>>>> sense that it's the most tested and so on. And, for us, that's
>> > >>>>>> our current stack.
>> > >>>>>> If we were to take Apache Hive as an example, it shipped (and
>> > >>>>>> ships) with MapReduce as the default execution engine. At some
>> > >>>>>> point, Apache Tez came along and wanted Hive to run on Tez, so
>> > >>>>>> they made a bunch of things pluggable to run Hive on Tez
>> > >>>>>> (instead of the only option up until then: Hive-on-MR), and
>> > >>>>>> then Apache Spark came and re-used some of that pluggability
>> > >>>>>> and even added some more so Hive-on-Spark could become a
>> > >>>>>> reality. In the same way, I don't think anyone here disagrees
>> > >>>>>> that pluggability is a good thing, but it's hard to do
>> > >>>>>> pluggability right, and at the right level, unless one has a
>> > >>>>>> clear use-case in mind.
>> > >>>>>>
>> > >>>>>> As a project, we have many things to do, and I personally think
>> > >>>>>> the biggest bang for the buck in making Spot a really solid,
>> > >>>>>> best-in-class cyber security solution isn't pluggability but
>> > >>>>>> the things we are working on - a better user interface, a
>> > >>>>>> common/unified approach to storing and modeling data, etc.
>> > >>>>>>
>> > >>>>>> Having said that, we are open; if it's important to you or
>> > >>>>>> someone else, we'd be happy to receive and review those
>> > >>>>>> patches.
>> > >>>>>>
>> > >>>>>> Thanks!
>> > >>>>>> Mark
>> > >>>>>>
>> > >>>>>> On Fri, Apr 14, 2017 at 10:14 AM, kant kodali <kanth909@gmail.com
>> >
>> > >>>>> wrote:
>> > >>>>>>
>> > >>>>>>> Thanks, Ross! And yes, option C sounds good to me as well;
>> > >>>>>>> however, I just think the distributed SQL query engine and the
>> > >>>>>>> resource manager should be pluggable.
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <
>> > >> alan.d.ross@intel.com>
>> > >>>>>>> wrote:
>> > >>>>>>>
>> > >>>>>>>> Option C is to use python on the front end of ingest pipeline
>> and
>> > >>>>>>>> spark/scala on the back end.
>> > >>>>>>>>
>> > >>>>>>>> Option A uses python workers on the backend
>> > >>>>>>>>
>> > >>>>>>>> Option B uses all scala.
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> -----Original Message-----
>> > >>>>>>>> From: kant kodali [mailto:kanth909@gmail.com]
>> > >>>>>>>> Sent: Friday, April 14, 2017 9:53 AM
>> > >>>>>>>> To: dev@spot.incubator.apache.org
>> > >>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
>> > >>>>>>>>
>> > >>>>>>>> What is option C ? am I missing an email or something?
>> > >>>>>>>>
>> > >>>>>>>> On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <
>> > >>>>>>>> chokha@integralops.com> wrote:
>> > >>>>>>>>
>> > >>>>>>>>> +1 for Python 3.x
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> On 4/14/2017 11:59 AM, Austin Leahy wrote:
>> > >>>>>>>>>
>> > >>>>>>>>>> I think that C is the strong solution; getting the ingest
>> > >>>>>>>>>> really strong is going to lower barriers to adoption. Doing
>> > >>>>>>>>>> it in Python will open up the ingest portion of the project
>> > >>>>>>>>>> to include many more developers.
>> > >>>>>>>>>>
>> > >>>>>>>>>> Before it comes up, I would like to throw the following on
>> > >>>>>>>>>> the pile... Major python projects (Django/Flask, among
>> > >>>>>>>>>> others) are dropping 2.x support in releases scheduled in
>> > >>>>>>>>>> the next 6 to 8 months. Hadoop projects in general tend to
>> > >>>>>>>>>> lag in modern python support; let's please build this in
>> > >>>>>>>>>> 3.5 so that we don't have to immediately expect a rebuild
>> > >>>>>>>>>> in the pipeline.
>> > >>>>>>>>>>
>> > >>>>>>>>>> -Vote C
>> > >>>>>>>>>>
>> > >>>>>>>>>> Thanks Nate
>> > >>>>>>>>>>
>> > >>>>>>>>>> Austin
>> > >>>>>>>>>>
>> > >>>>>>>>>> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <alan@apache.org>
>> > >>>>> wrote:
>> > >>>>>>>>>>
>> > >>>>>>>>>>> I really like option C because it gives a lot of
>> > >>>>>>>>>>> flexibility for ingest (python vs scala) but still has the
>> > >>>>>>>>>>> robust spark streaming backend for performance.
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> Thanks for putting this together Nate.
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> Alan
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <
>> > >>>>>>>>>>> chokha@integralops.com> wrote:
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> I agree. We should continue making the existing stack more
>> > >> mature
>> > >>>>>> at
>> > >>>>>>>>>>>> this point. Maybe if we have enough community support we can
>> > >> add
>> > >>>>>>>>>>>> additional datastores.
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> Chokha.
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> On 4/14/17 11:10 AM, kenneth@floss.cat wrote:
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>>> Hi Kant,
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>> YARN is the standard scheduler in Hadoop. If you're
>> > >>>>>>>>>>>>> using Hive+Spark, then sure, you'll have YARN.
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>> I haven't seen any Hive on Mesos so far. As said, Spot
>> > >>>>>>>>>>>>> is based on a quite standard Hadoop stack, and I
>> > >>>>>>>>>>>>> wouldn't switch too many pieces yet.
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>> In most open-source projects you start by relying on a
>> > >>>>>>>>>>>>> well-known stack, and then you begin to support other DB
>> > >>>>>>>>>>>>> backends once it's quite mature. Think of the loads of
>> > >>>>>>>>>>>>> LAMP apps which haven't been ported away from MySQL yet.
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>> In any case, you'll need high performance SQL + massive
>> > >>>>>>>>>>>>> storage + machine learning + massive ingestion, and, at
>> > >>>>>>>>>>>>> the moment, that can only be provided by Hadoop.
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>> Regards!
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>> Kenneth
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>> On 2017-04-14 12:56, kant kodali wrote:
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> Hi Kenneth,
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> Thanks for the response. I think you made a case for
>> > >>>>>>>>>>>>>> HDFS; however, users may want to use S3 or some other
>> > >>>>>>>>>>>>>> FS, in which case they can use Alluxio (hoping that
>> > >>>>>>>>>>>>>> there are no changes needed within Spot, in which case
>> > >>>>>>>>>>>>>> I can agree to that). For example, Netflix stores all
>> > >>>>>>>>>>>>>> their data in S3.
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> The distributed SQL query engine, I would say, should
>> > >>>>>>>>>>>>>> be pluggable with whatever users may want to use, and
>> > >>>>>>>>>>>>>> there are a bunch of them out there.
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> Sure, Impala is better than Hive, but what if users are
>> > >>>>>>>>>>>>>> already using something else like Drill or Presto?
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> Personally, I would not assume that users are willing
>> > >>>>>>>>>>>>>> to deploy all of that and make their existing stack
>> > >>>>>>>>>>>>>> more complicated; at the very least, I would say it is
>> > >>>>>>>>>>>>>> an uphill battle. Things have been changing rapidly in
>> > >>>>>>>>>>>>>> the big data space, so whatever we think is standard
>> > >>>>>>>>>>>>>> won't be standard anymore; but more importantly, there
>> > >>>>>>>>>>>>>> shouldn't be any reason why we shouldn't be flexible,
>> > >>>>>>>>>>>>>> right?
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> Also, I am not sure why only YARN? Why not make that
>> > >>>>>>>>>>>>>> more flexible too, so users can pick Mesos or
>> > >>>>>>>>>>>>>> standalone.
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> I think flexibility is the key to wide adoption, rather
>> > >>>>>>>>>>>>>> than a tightly coupled architecture.
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> Thanks!
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza
>> > >>>>>>>>>>>>>> <kenneth@floss.cat>
>> > >>>>>>>>>>>>>> wrote:
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> PS: you need a big data platform to be able to collect
>> > >>>>>>>>>>>>>>> all those netflows and logs.
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> Spot isn't intended for SMBs, that's clear: you need
>> > >>>>>>>>>>>>>>> loads of data to get ML working properly, and
>> > >>>>>>>>>>>>>>> somewhere to run those algorithms. That is Hadoop.
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> Regards!
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> Kenneth
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> Sent from my Mi phone
>> > >>>>>>>>>>>>>>> On Apr 14, 2017, 4:04 AM, kant kodali
>> > >>>>>>>>>>>>>>> <kanth909@gmail.com> wrote:
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> Hi,
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> Thanks for starting this thread. Here is my feedback.
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> I somehow think the architecture is too complicated
>> > >>>>>>>>>>>>>>> for wide adoption, since it requires installing the
>> > >>>>>>>>>>>>>>> following:
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> HDFS.
>> > >>>>>>>>>>>>>>> HIVE.
>> > >>>>>>>>>>>>>>> IMPALA.
>> > >>>>>>>>>>>>>>> KAFKA.
>> > >>>>>>>>>>>>>>> SPARK (YARN).
>> > >>>>>>>>>>>>>>> YARN.
>> > >>>>>>>>>>>>>>> Zookeeper.
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> Currently there are way too many dependencies, which
>> > >>>>>>>>>>>>>>> discourages a lot of users, because they have to go
>> > >>>>>>>>>>>>>>> through deployment of all that required software. I
>> > >>>>>>>>>>>>>>> think for wide adoption we should minimize the
>> > >>>>>>>>>>>>>>> dependencies and have a more pluggable architecture.
>> > >>>>>>>>>>>>>>> For example, I am not sure why both HIVE and IMPALA
>> > >>>>>>>>>>>>>>> are required. Why not just use Spark SQL, since it's
>> > >>>>>>>>>>>>>>> already a dependency? Or users may want to use their
>> > >>>>>>>>>>>>>>> own distributed query engine, such as Apache Drill or
>> > >>>>>>>>>>>>>>> something else. We should be flexible enough to
>> > >>>>>>>>>>>>>>> provide that option.
>> > >>>>>>>>>>>>>>> Also, I see that HDFS is used such that collectors can
>> > >>>>>>>>>>>>>>> receive file paths through Kafka and be able to read a
>> > >>>>>>>>>>>>>>> file. How big are these files? Do we really need HDFS
>> > >>>>>>>>>>>>>>> for this? Why not provide more ways to send data, such
>> > >>>>>>>>>>>>>>> as sending data directly through Kafka, or just
>> > >>>>>>>>>>>>>>> leaving it up to the user to specify the file location
>> > >>>>>>>>>>>>>>> as an argument to the collector process?
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> Finally, I learnt that to generate NetFlow data one
>> > >>>>>>>>>>>>>>> would require specific hardware. This really means
>> > >>>>>>>>>>>>>>> Apache Spot is not meant for everyone. I thought
>> > >>>>>>>>>>>>>>> Apache Spot could be used to analyze the network
>> > >>>>>>>>>>>>>>> traffic of any machine, but if it requires specific
>> > >>>>>>>>>>>>>>> hardware then I think it is targeted at a specific
>> > >>>>>>>>>>>>>>> group of people.
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> The real strength of Apache Spot should mainly be just
>> > >>>>>>>>>>>>>>> analyzing network traffic through ML.
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> Thanks!
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <
>> > >>>>>>>>>>>>>>> nathan.l.segerlind@intel.com> wrote:
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> Thanks, Nate,
>> > >>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>> Nate.
>> > >>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>> -----Original Message-----
>> > >>>>>>>>>>>>>>>> From: Nate Smith [mailto:natedogs911@gmail.com]
>> > >>>>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:26 PM
>> > >>>>>>>>>>>>>>>> To: user@spot.incubator.apache.org
>> > >>>>>>>>>>>>>>>> Cc: dev@spot.incubator.apache.org;
>> > >>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>> private@spot.incubator.apache.org
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
>> > >>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>> I was really hoping it came through OK. Oh well :)
>> > >>>>>>>>>>>>>>>> Here’s an image form:
>> > >>>>>>>>>>>>>>>> http://imgur.com/a/DUDsD
>> > >>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>> On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <
>> > >>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>> nathan.l.segerlind@intel.com> wrote:
>> > >>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>> The diagram became garbled in the text format.
>> > >>>>>>>>>>>>>>>>> Could you resend it as a pdf?
>> > >>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>> Thanks,
>> > >>>>>>>>>>>>>>>>> Nate
>> > >>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>> -----Original Message-----
>> > >>>>>>>>>>>>>>>>> From: Nathanael Smith [mailto:nathanael@apache.org]
>> > >>>>>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:01 PM
>> > >>>>>>>>>>>>>>>>> To: private@spot.incubator.apache.org;
>> > >>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>> dev@spot.incubator.apache.org;
>> > >>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>> user@spot.incubator.apache.org
>> > >>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>> Subject: [Discuss] - Future plans for Spot-ingest
>> > >>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>> How would you like to see Spot-ingest change?
>> > >>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>> A. Continue development on the Python Master/Worker
>> > >>>>>>>>>>>>>>>>>    with focus on performance / error handling /
>> > >>>>>>>>>>>>>>>>>    logging.
>> > >>>>>>>>>>>>>>>>> B. Develop a Scala-based ingest to be in line with
>> > >>>>>>>>>>>>>>>>>    the code base from ingest and ml to OA (UI to
>> > >>>>>>>>>>>>>>>>>    continue being ipython/JS).
>> > >>>>>>>>>>>>>>>>> C. Python ingest Worker with Scala-based Spark code
>> > >>>>>>>>>>>>>>>>>    for normalization and input into the DB.
>> > >>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>> Including the high level diagram (the ASCII art was
>> > >>>>>>>>>>>>>>>>> garbled by quoting; see the image link above). In
>> > >>>>>>>>>>>>>>>>> summary: a Master (A. Python, B. Scala, C. Python)
>> > >>>>>>>>>>>>>>>>> reads binary/text log files from the local FS and
>> > >>>>>>>>>>>>>>>>> hands work to Workers (A. Python; B./C. Scala via
>> > >>>>>>>>>>>>>>>>> Spark Streaming) running on worker nodes in the
>> > >>>>>>>>>>>>>>>>> Hadoop cluster, which write to HDFS and Hive /
>> > >>>>>>>>>>>>>>>>> Impala as Parquet.
>> > >>>>>>>>>>>>>>>>> Please let me know your thoughts,
>> > >>>>>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>>>> - Nathanael
>> > >>>>>>>>>>>>>>>>>
>> > >>>> --
>> > >>>> Michael Ridley <mridley@cloudera.com>
>> > >>>> office: (650) 352-1337
>> > >>>> mobile: (571) 438-2420
>> > >>>> Senior Solutions Architect
>> > >>>> Cloudera, Inc.
>> > >>>
>> > >>>
>> > >>
>> > >>
>> > >>
>> >
>> >
>> 

