spot-dev mailing list archives

From kant kodali <kanth...@gmail.com>
Subject Re: [Discuss] - Future plans for Spot-ingest
Date Fri, 21 Apr 2017 14:33:28 GMT
@Michael Ridley There are a few ways to do this.

 1. There is a FileStream Source Connector in Kafka Connect!


http://docs.confluent.io/3.1.0/connect/connect-filestream/filestream_connector.html#filesource-connector
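For reference, a minimal standalone config for that connector might look like the sketch below; the connector name, file path, and topic are placeholder assumptions, not values from this thread.

```properties
# Kafka Connect FileStream source (standalone mode): tails the given
# file and publishes each new line to a Kafka topic.
# The file path and topic name below are placeholders.
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/var/log/syslog
topic=spot-ingest
```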

 2. Any file-listener API, in any language, combined with a Kafka producer
would also work!

https://github.com/seb-m/pyinotify/wiki
https://github.com/seb-m/pyinotify/wiki/List-of-Examples


On Thu, Apr 20, 2017 at 7:49 PM, Smith, Nathanael P <
nathanael.p.smith@intel.com> wrote:

> If you want to code a quick POC I will run it on our data. This sounds
> great Austin.
>
> - nathanael
>
> > On Apr 20, 2017, at 2:41 PM, Austin Leahy <Austin@digitalminion.com>
> wrote:
> >
> > So this is basically why the Flume suggestion has come up. Flume natively
> > acts as a syslog listener and will write files to almost anything (HDFS,
> > Hive, HBase, S3).
> >
> >> On Thu, Apr 20, 2017 at 8:15 AM Michael Ridley <mridley@cloudera.com>
> wrote:
> >>
> >> When we say ingest from Kafka, what does that mean?  I understand we can
> >> read from Kafka to ingest into the cluster, but how will the data get to
> >> Kafka and what data are we talking about?  My understanding is that
> right
> >> now the primary data sources would be Netflow and Syslog, neither of
> which
> >> writes to Kafka natively so we would need something like StreamSets in
> the
> >> middle.  Certainly StreamSets UDP source -> Kafka would work.
> >>
> >> Michael
> >>
> >>> On Wed, Apr 19, 2017 at 7:05 PM, kant kodali <kanth909@gmail.com>
> wrote:
> >>>
> >>> Sure, I guess Kafka has something called Kafka Connect, but it may not
> >>> be as mature as Flume; I only heard about it recently.
> >>>
> >>> On Wed, Apr 19, 2017 at 3:39 PM, Austin Leahy <
> Austin@digitalminion.com>
> >>> wrote:
> >>>
> >>>> The advantage of Flume, or a Flume/Kafka hybrid, is that the team
> >>>> doesn't have to build sinks for any new source types added to the
> >>>> project; they just create configs pointing to the landing pad.
> >>>> On Wed, Apr 19, 2017 at 3:31 PM kant kodali <kanth909@gmail.com>
> >> wrote:
> >>>>
> >>>>> What kind of benchmarks are we looking for? Just throughput? I am
> >>>>> assuming this is for ingestion. I haven't seen anything faster than
> >>>>> Kafka, and that is because of its simplicity: the publisher appends
> >>>>> messages to a file (the so-called partition in Kafka) and clients
> >>>>> just do sequential reads from that file, so it's a matter of disk
> >>>>> throughput. The benchmark numbers I have for Kafka are at the very
> >>>>> least 75K messages/sec with 1KB messages on an m4.xlarge, which by
> >>>>> default has EBS storage (EBS is network-attached SSD disk). The
> >>>>> network-attached disk has a max throughput of 125MB/s (m4.xlarge has
> >>>>> 1 Gigabit), but if we were to deploy on ephemeral storage (local SSD)
> >>>>> and a 10 Gigabit network we would easily get 5-10X more.
> >>>>>
> >>>>> No idea about Flume.
> >>>>>
> >>>>> Finally, I am not trying to pitch for Kafka, but it is the fastest I
> >>>>> have seen; if someone has better numbers for Flume then we should use
> >>>>> that. I would also suspect benchmarks for Kafka vs Flume are already
> >>>>> available online, or we can try it with our own datasets.
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> On Wed, Apr 19, 2017 at 3:09 PM, Austin Leahy <
> >>> Austin@digitalminion.com>
> >>>>> wrote:
> >>>>>
> >>>>>> I am happy to create and test a Flume source... #intelteam would
> >>>>>> need to create the benchmark by deploying it and pointing a data
> >>>>>> source at it... since I don't have a good enough volume of source
> >>>>>> data handy.
> >>>>>> On Wed, Apr 19, 2017 at 3:04 PM Ross, Alan D <
> >> alan.d.ross@intel.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> We discussed this a bit in our staff meeting today.  I would like
> >>>>>>> to see some benchmarking of the different approaches (Kafka, Flume,
> >>>>>>> etc.) to see what the numbers look like. Is anyone in the community
> >>>>>>> willing to volunteer for this work?
> >>>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Austin Leahy [mailto:Austin@digitalminion.com]
> >>>>>>> Sent: Wednesday, April 19, 2017 1:05 PM
> >>>>>>> To: dev@spot.incubator.apache.org
> >>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
> >>>>>>>
> >>>>>>> I think Kafka is probably a red herring. It's an industry go-to in
> >>>>>>> the application world because of its redundancy, but the type and
> >>>>>>> volumes of network telemetry that we are talking about here will
> >>>>>>> bog Kafka down unless you dedicate really serious hardware to just
> >>>>>>> the Kafka implementation. It's essentially the next level of the
> >>>>>>> problem that the team was already running into when rabbitMQ was
> >>>>>>> queueing in data.
> >>>>>>>
> >>>>>>> On Wed, Apr 19, 2017 at 12:33 PM Mark Grover <mark@apache.org>
> >>>> wrote:
> >>>>>>>
> >>>>>>>> On Wed, Apr 19, 2017 at 10:19 AM, Smith, Nathanael P <
> >>>>>>>> nathanael.p.smith@intel.com> wrote:
> >>>>>>>>
> >>>>>>>>> Mark,
> >>>>>>>>>
> >>>>>>>>> just digesting the below.
> >>>>>>>>>
> >>>>>>>>> Backing up in my thought process, I was thinking that the
> >>> ingest
> >>>>>>>>> master (first point of entry into the system) would want to
> >> put
> >>>> the
> >>>>>>>>> data into a standard serializable format. I was thinking that
> >>>>>>>>> libraries (such as pyarrow in this case) could help by
> >> writing
> >>>> the
> >>>>>>>>> data in parquet format early in the process. You are probably
> >>>>>>>>> correct that at this point in time it might not be worth the
> >>> time
> >>>>> and
> >>>>>>> can be kept in the backlog.
> >>>>>>>>> That being said, I still think the master should produce data
> >>> in
> >>>> a
> >>>>>>>>> standard format, what in your opinion (and I open this up of
> >>>> course
> >>>>>>>>> to
> >>>>>>>>> others) would be the most logical format?
> >>>>>>>>> the most basic would be to just keep it as a .csv.
> >>>>>>>>>
> >>>>>>>>> The master will likely write data to a staging directory in
> >>> HDFS
> >>>>>>>>> where
> >>>>>>>> the
> >>>>>>>>> spark streaming job will pick it up for normalization/writing
> >>> to
> >>>>>>>>> parquet
> >>>>>>>> in
> >>>>>>>>> the correct block sizes and partitions.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Nate,
> >>>>>>>> Avro is usually preferred for such a standard format - because
> >> it
> >>>>>>>> asserts a schema (types, etc.) which CSV doesn't and it allows
> >>> for
> >>>>>>>> schema evolution which depending on the type of evolution, CSV
> >>> may
> >>>> or
> >>>>>>> may not support.
> >>>>>>>> And, that's something I have seen being done very commonly.
> >>>>>>>>
> >>>>>>>> Now, if the data were in Kafka before it gets to master, one
> >>> could
> >>>>>>>> argue that the master could just send metadata to the workers
> >>>> (topic
> >>>>>>>> name, partition number, offset start and end) and the workers
> >>> could
> >>>>>>>> read from Kafka directly. I do understand that'd be a much
> >>>> different
> >>>>>>>> architecture than the current one, but if you think it's a good
> >>>> idea
> >>>>>>>> too, we could document that, say in a JIRA, and (de-)prioritize
> >>> it
> >>>>>>>> (and in line with the rest of the discussion on this thread,
> >> it's
> >>>> not
> >>>>>>> the top-most priority).
> >>>>>>>> Thoughts?
> >>>>>>>>
> >>>>>>>> - Nathanael
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> On Apr 17, 2017, at 1:12 PM, Mark Grover <mark@apache.org>
> >>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Thanks, all, for your opinions.
> >>>>>>>>>>
> >>>>>>>>>> I think it's good to consider two things:
> >>>>>>>>>> 1. What do (we think) users care about?
> >>>>>>>>>> 2. What's the cost of changing things?
> >>>>>>>>>>
> >>>>>>>>>> About #1, I think users care more about what format data is
> >>>>>>>>>> written
> >>>>>>>> than
> >>>>>>>>>> how the data is written. I'd argue whether that uses Hive,
> >>> MR,
> >>>> or
> >>>>>>>>>> a
> >>>>>>>>> custom
> >>>>>>>>>> Parquet writer is not as important to them as long as we
> >>>> maintain
> >>>>>>>>>> data/format compatibility.
> >>>>>>>>>> About #2, having worked on several projects, I find that
> >> it's
> >>>>>>>>>> rather difficult to keep up with Parquet. Even in Spark,
> >>> there
> >>>>> are
> >>>>>>>>>> a few
> >>>>>>>>> different
> >>>>>>>>>> ways to write to Parquet - there's a regular mode, and a
> >>> legacy
> >>>>>>>>>> mode <
> >> https://github.com/apache/spark/blob/master/sql/core/
> >>>>>>>>> src/main/scala/org/apache/spark/sql/execution/
> >>>> datasources/parquet/
> >>>>>>>>> ParquetWriteSupport.scala#L44>
> >>>>>>>>>> which
> >>>>>>>>>> continues to cause confusion
> >>>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-20297> to this day.
> >>>>>>>>>> Parquet itself is pretty dependent on Hadoop
> >>>>>>>>>> <https://github.com/Parquet/parquet-mr/search?l=Maven+POM&
> >>>>>>>>> q=hadoop&type=&utf8=%E2%9C%93>
> >>>>>>>>>> and,
> >>>>>>>>>> just integrating it with systems with a lot of developers
> >>> (like
> >>>>>>>>>> Spark <
> >>>>> https://www.google.com/webhp?sourceid=chrome-instant&ion=1&
> >>>>>>>>> espv=2&ie=UTF-8#q=spark+parquet+jiras>)
> >>>>>>>>>> is still a lot of work.
> >>>>>>>>>>
> >>>>>>>>>> I personally think we should leverage higher level tools
> >> like
> >>>>>>>>>> Hive, or Spark to write data in widespread formats
> >> (Parquet,
> >>>>> being
> >>>>>>>>>> a very good
> >>>>>>>>>> example) but I personally wouldn't encourage us to manage
> >> the
> >>>>>>>>>> writers ourselves.
> >>>>>>>>>>
> >>>>>>>>>> Thoughts?
> >>>>>>>>>> Mark
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Apr 17, 2017 at 11:44 AM, Michael Ridley
> >>>>>>>>>> <mridley@cloudera.com
> >>>>>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Without having given it too terribly much thought, that
> >>> seems
> >>>>>>>>>>> like an
> >>>>>>>> OK
> >>>>>>>>>>> approach.
> >>>>>>>>>>>
> >>>>>>>>>>> Michael
> >>>>>>>>>>>
> >>>>>>>>>>> On Mon, Apr 17, 2017 at 2:33 PM, Nathanael Smith <
> >>>>>>>> nathanael@apache.org>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> I think the question is whether we can write the data
> >>>>>>>>>>>> generically to HDFS as Parquet without the use of Hive/Impala.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Today we write parquet data using the hive/mapreduce
> >>> method.
> >>>>>>>>>>>> As part of the redesign I'd like to use libraries for this as
> >>>>>>>>>>>> opposed to a Hadoop dependency.
> >>>>>>>>>>>> I think it would be preferred to use the python master to
> >>>> write
> >>>>>>>>>>>> the
> >>>>>>>>> data
> >>>>>>>>>>>> into the format we want, then do normalization of the
> >> data
> >>> in
> >>>>>>>>>>>> spark streaming.
> >>>>>>>>>>>> Any thoughts?
> >>>>>>>>>>>>
> >>>>>>>>>>>> - Nathanael
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Apr 17, 2017, at 11:08 AM, Michael Ridley
> >>>>>>>>>>>>> <mridley@cloudera.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I had thought that the plan was to write the data in
> >>> Parquet
> >>>>> in
> >>>>>>>>>>>>> HDFS ultimately.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Michael
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Sun, Apr 16, 2017 at 11:55 AM, kant kodali
> >>>>>>>>>>>>> <kanth909@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Mark,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thank you so much for hearing my argument, and I definitely
> >>>>>>>>>>>>>> understand that you guys have a bunch of things to do. My
> >>>>>>>>>>>>>> only concern is that I hope it doesn't take too long to
> >>>>>>>>>>>>>> support other backends. For example, @Kenneth gave the
> >>>>>>>>>>>>>> example of the LAMP stack not having moved away from MySQL
> >>>>>>>>>>>>>> yet, which essentially means it could take a decade. I see
> >>>>>>>>>>>>>> that in the current architecture the results from Python
> >>>>>>>>>>>>>> multiprocessing or Spark Streaming are written back to HDFS.
> >>>>>>>>>>>>>> If so, can we write them in Parquet format, such that users
> >>>>>>>>>>>>>> are able to plug in any query engine? Again, I am not
> >>>>>>>>>>>>>> pushing you guys to do this right away; I am just seeing if
> >>>>>>>>>>>>>> there is a way for me to get started in parallel. If it's
> >>>>>>>>>>>>>> not feasible, that's fine; I just wanted to share my 2 cents
> >>>>>>>>>>>>>> and I am glad my argument is heard!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks much!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Fri, Apr 14, 2017 at 1:38 PM, Mark Grover <
> >>>>> mark@apache.org>
> >>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Kant,
> >>>>>>>>>>>>>>> Just wanted to make sure you don't feel like we are
> >>>> ignoring
> >>>>>>>>>>>>>>> your
> >>>>>>>>>>>>>>> comment:-) I hear you about pluggability.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The design can and should be pluggable but the project
> >>> has
> >>>>>>>>>>>>>>> one
> >>>>>>>> stack
> >>>>>>>>>>> it
> >>>>>>>>>>>>>>> ships out of the box with, one stack that's the
> >> default
> >>>>> stack
> >>>>>>>>>>>>>>> in
> >>>>>>>> the
> >>>>>>>>>>>>>> sense
> >>>>>>>>>>>>>>> that it's the most tested and so on. And, for us,
> >> that's
> >>>> our
> >>>>>>>> current
> >>>>>>>>>>>>>> stack.
> >>>>>>>>>>>>>>> If we were to take Apache Hive as an example, it
> >> shipped
> >>>>> (and
> >>>>>>>> ships)
> >>>>>>>>>>>> with
> >>>>>>>>>>>>>>> MapReduce as the default configuration engine. At some
> >>>>> point,
> >>>>>>>> Apache
> >>>>>>>>>>>> Tez
> >>>>>>>>>>>>>>> came along and wanted Hive to run on Tez, so they
> >> made a
> >>>>>>>>>>>>>>> bunch of
> >>>>>>>>>>>> things
> >>>>>>>>>>>>>>> pluggable to run Hive on Tez (instead of the only
> >> option
> >>>>>>>>>>>>>>> up-until
> >>>>>>>>>>> then:
> >>>>>>>>>>>>>>> Hive-on-MR) and then Apache Spark came and re-used
> >> some
> >>> of
> >>>>>>>>>>>>>>> that pluggability and even added some more so
> >>>> Hive-on-Spark
> >>>>>>>>>>>>>>> could
> >>>>>>>> become
> >>>>>>>>> a
> >>>>>>>>>>>>>>> reality. In the same way, I don't think anyone disagrees
> >>>>>>>>>>>>>>> here that pluggability is a good thing, but it's hard to do
> >>>>>>>>>>>>>>> pluggability right, and at the right level, unless one has
> >>>>>>>>>>>>>>> a clear use-case in mind.
> >> mind.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> As a project, we have many things to do and I
> >> personally
> >>>>>>>>>>>>>>> think the
> >>>>>>>>>>>>>> biggest
> >>>>>>>>>>>>>>> bang for the buck for us in making Spot a really solid
> >>> and
> >>>>>>>>>>>>>>> the
> >>>>>>>> best
> >>>>>>>>>>>> cyber
> >>>>>>>>>>>>>>> security solution isn't pluggability but the things we
> >>> are
> >>>>>>>>>>>>>>> working
> >>>>>>>>> on
> >>>>>>>>>>>> - a
> >>>>>>>>>>>>>>> better user interface, a common/unified approach to
> >>>> storing
> >>>>>>>>>>>>>>> and
> >>>>>>>>>>>> modeling
> >>>>>>>>>>>>>>> data, etc.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Having said that, we are open, if it's important to
> >> you
> >>> or
> >>>>>>>>>>>>>>> someone
> >>>>>>>>>>>> else,
> >>>>>>>>>>>>>>> we'd be happy to receive and review those patches.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks!
> >>>>>>>>>>>>>>> Mark
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Fri, Apr 14, 2017 at 10:14 AM, kant kodali
> >>>>>>>>>>>>>>> <kanth909@gmail.com
> >>>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks Ross! and yes option C sounds good to me as
> >> well
> >>>>>>>>>>>>>>>> however I
> >>>>>>>>>>> just
> >>>>>>>>>>>>>>>> think the distributed SQL query engine and the resource
> >>>>>>>>>>>>>>>> manager should be pluggable.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <
> >>>>>>>>>>> alan.d.ross@intel.com>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Option C is to use python on the front end of ingest
> >>>>>>>>>>>>>>>>> pipeline
> >>>>>>>> and
> >>>>>>>>>>>>>>>>> spark/scala on the back end.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Option A uses python workers on the backend
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Option B uses all scala.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>>>>>>> From: kant kodali [mailto:kanth909@gmail.com]
> >>>>>>>>>>>>>>>>> Sent: Friday, April 14, 2017 9:53 AM
> >>>>>>>>>>>>>>>>> To: dev@spot.incubator.apache.org
> >>>>>>>>>>>>>>>>> Subject: Re: [Discuss] - Future plans for
> >> Spot-ingest
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> What is option C ? am I missing an email or
> >> something?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Fri, Apr 14, 2017 at 9:15 AM, Chokha
> >> Palayamkottai
> >>> <
> >>>>>>>>>>>>>>>>> chokha@integralops.com> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> +1 for Python 3.x
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On 4/14/2017 11:59 AM, Austin Leahy wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I think that C is the strongest solution; getting the
> >>>>>>>>>>>>>>>>>>> ingest really strong is going to lower barriers to
> >>>>>>>>>>>>>>>>>>> adoption. Doing it in Python will open up the ingest
> >>>>>>>>>>>>>>>>>>> portion of the project to many more developers.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Before it comes up I would like to throw the
> >>> following
> >>>>> on
> >>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>> pile...
> >>>>>>>>>>>>>>>>>>> Major Python projects (Django, Flask, and others) are
> >>>>>>>>>>>>>>>>>>> dropping 2.x support in releases scheduled in the next
> >>>>>>>>>>>>>>>>>>> 6 to 8 months. Hadoop projects in general tend to lag
> >>>>>>>>>>>>>>>>>>> in modern Python support, so let's please build this in
> >>>>>>>>>>>>>>>>>>> 3.5 so that we don't have to immediately expect a
> >>>>>>>>>>>>>>>>>>> rebuild in the pipeline.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> -Vote C
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks Nate
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Austin
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross
> >>>>>>>>>>>>>>>>>>> <alan@apache.org>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I really like option C because it gives a lot of
> >>>>>>>>>>>>>>>>>>> flexibility
> >>>>>>>> for
> >>>>>>>>>>>>>>>>>>> ingest
> >>>>>>>>>>>>>>>>>>>> (python vs scala) but still has the robust spark
> >>>>>>>>>>>>>>>>>>>> streaming
> >>>>>>>>>>>>>> backend
> >>>>>>>>>>>>>>>>>>>> for performance.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks for putting this together Nate.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Alan
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha
> >>>> Palayamkottai <
> >>>>>>>>>>>>>>>>>>>> chokha@integralops.com> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I agree. We should continue making the existing
> >>> stack
> >>>>>>>>>>>>>>>>>>>> more
> >>>>>>>>>>> mature
> >>>>>>>>>>>>>>> at
> >>>>>>>>>>>>>>>>>>>>> this point. Maybe if we have enough community
> >>>> support
> >>>>>>>>>>>>>>>>>>>>> we can
> >>>>>>>>>>> add
> >>>>>>>>>>>>>>>>>>>>> additional datastores.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Chokha.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On 4/14/17 11:10 AM, kenneth@floss.cat wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Hi Kant,
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> YARN is the standard scheduler in Hadoop. If
> >>> you're
> >>>>>>>>>>>>>>>>>>>>>> using
> >>>>>>>>>>>>>>>>>>>>>> Hive+Spark, then sure you'll have YARN.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Haven't seen any HIVE on Mesos so far. As said,
> >>>> Spot
> >>>>>>>>>>>>>>>>>>>>>> is
> >>>>>>>> based
> >>>>>>>>>>>>>> on
> >>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>>>> quite standard Hadoop stack and I wouldn't
> >> switch
> >>>> too
> >>>>>>>>>>>>>>>>>>>>>> many
> >>>>>>>>>>>>>> pieces
> >>>>>>>>>>>>>>>>> yet.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> In most Opensource projects you start relying
> >> on
> >>> a
> >>>>>>>> well-known
> >>>>>>>>>>>>>>>>>>>>>> stack and then you begin to support other DB
> >>>> backends
> >>>>>>>>>>>>>>>>>>>>>> once
> >>>>>>>>>>> it's
> >>>>>>>>>>>>>>>>>>>>>> quite mature. Think of the loads of LAMP apps which
> >>>>>>>>>>>>>>>>>>>>>> haven't been ported away from MySQL yet.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> In any case, you'll need high-performance SQL +
> >>>>>>>>>>>>>>>>>>>>>> Massive Storage + Machine Learning + Massive
> >>>>>>>>>>>>>>>>>>>>>> Ingestion, and... ATM, that can only be provided by
> >>>>>>>>>>>>>>>>>>>>>> Hadoop.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Regards!
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Kenneth
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On 2017-04-14 12:56, kant kodali wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Hi Kenneth,
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Thanks for the response.  I think you made a
> >>> case
> >>>>> for
> >>>>>>>>>>>>>>>>>>>>>>> HDFS however users may want to use S3 or some
> >>>> other
> >>>>>>>>>>>>>>>>>>>>>>> FS in which
> >>>>>>>>>>>>>> case
> >>>>>>>>>>>>>>>>>>>>>>> they can use Alluxio (hoping that there are no
> >>>>>>>>>>>>>>>>>>>>>>> changes
> >>>>>>>>> needed
> >>>>>>>>>>>>>>>>>>>>>>> within Spot in which case I
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> can
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> agree to that). For example, Netflix stores all
> >>>>>>>>>>>>>>>>>>>>>>> their data in S3.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> The distributed SQL query engine, I would say,
> >>>>>>>>>>>>>>>>>>>>>>> should be pluggable with whatever the user wants to
> >>>>>>>>>>>>>>>>>>>>>>> use, and there are a bunch of them out there.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> sure
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Impala is better than Hive, but what if users are
> >>>>>>>>>>>>>>>>>>>>> already
> >>>>>>>> using
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> something
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> else like Drill or Presto?
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Me personally, would not assume that users are
> >>>>>>>>>>>>>>>>>>>>>>> willing to
> >>>>>>>>>>>>>> deploy
> >>>>>>>>>>>>>>>>>>>>>>> all
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> that and make their existing stack more
> >>>>>>>>>>>>>>>>>>>>>>> complicated; at the very least I would say it is an
> >>>>>>>>>>>>>>>>>>>>>>> uphill battle. Things have been
> >>>> changing
> >>>>>>>> rapidly
> >>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>>>>>>>>>> Big
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> data
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> space, so whatever we think is standard won't be
> >>>>>>>>>>>>>>>>>>>>> standard anymore; more importantly, there shouldn't
> >>>>>>>>>>>>>>>>>>>>> be any reason why we shouldn't be flexible, right?
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Also I am not sure why only YARN? why not make
> >>>> that
> >>>>>>>>>>>>>>>>>>>>>>> also
> >>>>>>>>> more
> >>>>>>>>>>>>>>>>>>>>>>> flexible so users can pick Mesos or
> >> standalone.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> I think flexibility is key for wide adoption,
> >>>>>>>>>>>>>>>>>>>>>>> rather than a tightly coupled architecture.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Thanks!
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth
> >> Peiruza
> >>>>>>>>>>>>>>>>>>>>>>> <kenneth@floss.cat>
> >>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> PS: you need a big data platform to be able to
> >>>>>>>>>>>>>>>>>>>>>>> collect all
> >>>>>>>>>>>>>> those
> >>>>>>>>>>>>>>>>>>>>>>>> netflows
> >>>>>>>>>>>>>>>>>>>>>>>> and logs.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Spot isn't intended for SMBs, that's clear: you
> >>>>>>>>>>>>>>>>>>>>>>>> need loads of data to get ML working properly, and
> >>>>>>>>>>>>>>>>>>>>>>>> somewhere to run those algorithms. That is Hadoop.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Regards!
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Kenneth
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Sent from my Mi phone
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> On Apr 14, 2017 4:04 AM, kant kodali
> >>>>>>>>>>>>>>>>>>>>>>>> <kanth909@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Thanks for starting this thread. Here is my
> >>>>> feedback.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> I somehow think the architecture is too
> >>>>>>>>>>>>>>>>>>>>>>>> complicated for wide adoption, since it requires
> >>>>>>>>>>>>>>>>>>>>>>>> installing the following:
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> HDFS.
> >>>>>>>>>>>>>>>>>>>>>>>> HIVE.
> >>>>>>>>>>>>>>>>>>>>>>>> IMPALA.
> >>>>>>>>>>>>>>>>>>>>>>>> KAFKA.
> >>>>>>>>>>>>>>>>>>>>>>>> SPARK (YARN).
> >>>>>>>>>>>>>>>>>>>>>>>> YARN.
> >>>>>>>>>>>>>>>>>>>>>>>> Zookeeper.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Currently there are way too many dependencies,
> >>>>>>>>>>>>>>>>>>>>>>>> which discourages a lot of users from using it,
> >>>>>>>>>>>>>>>>>>>>>>>> because they have to go through deployment of all
> >>>>>>>>>>>>>>>>>>>>>>>> that required software. I think for wide adoption
> >>>>>>>>>>>>>>>>>>>>>>>> we should minimize the dependencies and have a
> >>>>>>>>>>>>>>>>>>>>>>>> more pluggable architecture. For example, I am
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> sure why HIVE & IMPALA are both required. Why not
> >>>>>>>>>>>>>>>>>>>>>>>> just use Spark SQL, since it's already a
> >>>>>>>>>>>>>>>>>>>>>>>> dependency? Or users may want to use their own
> >>>>>>>>>>>>>>>>>>>>>>>> distributed query engine, such as Apache Drill or
> >>>>>>>>>>>>>>>>>>>>>>>> something else; we should be flexible enough to
> >>>>>>>>>>>>>>>>>>>>>>>> provide that option.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Also, I see that HDFS is used such that collectors
> >>>>>>>>>>>>>>>>>>>>>>>> can receive file paths through Kafka and then read
> >>>>>>>>>>>>>>>>>>>>>>>> the files. How big are these files? Do we really
> >>>>>>>>>>>>>>>>>>>>>>>> need HDFS for this? Why not provide more ways to
> >>>>>>>>>>>>>>>>>>>>>>>> send data, such as sending data directly through
> >>>>>>>>>>>>>>>>>>>>>>>> Kafka, or just leaving it up to the user to
> >>>>>>>>>>>>>>>>>>>>>>>> specify the file location as an argument to the
> >>>>>>>>>>>>>>>>>>>>>>>> collector process.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Finally, I learnt that to generate Netflow data
> >>>>>>>>>>>>>>>>>>>>>>>> one would require specific hardware. This really
> >>>>>>>>>>>>>>>>>>>>>>>> means Apache Spot is not meant for everyone.
> >>>>>>>>>>>>>>>>>>>>>>>> I thought Apache Spot could be used to analyze the
> >>>>>>>>>>>>>>>>>>>>>>>> network traffic of any machine, but if it requires
> >>>>>>>>>>>>>>>>>>>>>>>> specific hardware then I think it is targeted at a
> >>>>>>>>>>>>>>>>>>>>>>>> specific group of people.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> The real strength of Apache Spot should mainly be
> >>>>>>>>>>>>>>>>>>>>>>>> analyzing network traffic through ML.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Thanks!
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind,
> >>>> Nathan
> >>>>> L
> >>>>>>>>>>>>>>>>>>>>>>>> < nathan.l.segerlind@intel.com> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Thanks, Nate,
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Nate.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>>>>>>>>>>>>>>> From: Nate Smith [mailto:
> >>> natedogs911@gmail.com]
> >>>>>>>>>>>>>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:26 PM
> >>>>>>>>>>>>>>>>>>>>>>>>> To: user@spot.incubator.apache.org
> >>>>>>>>>>>>>>>>>>>>>>>>> Cc: dev@spot.incubator.apache.org;
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> private@spot.incubator.apache.org
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Subject: Re: [Discuss] - Future plans for
> >>>> Spot-ingest
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> I was really hoping it came through ok, Oh
> >>> well
> >>>> :)
> >>>>>>>> Here’s
> >>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>>>>>>>>> image form:
> >>>>>>>>>>>>>>>>>>>>>>>>> http://imgur.com/a/DUDsD
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <nathan.l.segerlind@intel.com> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> The diagram became garbled in the text format.
> >>>>>>>>>>>>>>>>>>>>>>>>>> Could you resend it as a pdf?
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>>> Nate
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>>>>>>>>>>>>>>>> From: Nathanael Smith [mailto:nathanael@apache.org]
> >>>>>>>>>>>>>>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:01 PM
> >>>>>>>>>>>>>>>>>>>>>>>>>> To: private@spot.incubator.apache.org; dev@spot.incubator.apache.org; user@spot.incubator.apache.org
> >>>>>>>>>>>>>>>>>>>>>>>>>> Subject: [Discuss] - Future plans for Spot-ingest
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> How would you like to see Spot-ingest change?
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> A. Continue development on the Python Master/Worker with focus on performance / error handling / logging
> >>>>>>>>>>>>>>>>>>>>>>>>>> B. Develop Scala-based ingest to be in line with the code base from ingest, ml, to OA (UI to continue being ipython/JS)
> >>>>>>>>>>>>>>>>>>>>>>>>>> C. Python ingest Worker with Scala-based Spark code for normalization and input into DB
> >>>>>>>>>>>>>>>>>>>>>>>>>
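[Editor's note: option C above splits the pipeline into a Python Worker that picks up files and Spark code that normalizes records before they land in the DB. As an illustrative sketch only (this is not Spot's actual code; the CSV field layout and names below are assumptions), the normalization step for a flow-style record could look like:]

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical field layout for a CSV-style flow record; real netflow
# exports differ, so adjust the field list to match your collector.
FIELDS = ["ts", "src_ip", "dst_ip", "src_port", "dst_port", "proto", "bytes"]

def normalize_line(line):
    """Parse one raw CSV line into a normalized dict ready for storage."""
    row = next(csv.reader(io.StringIO(line)))
    rec = dict(zip(FIELDS, row))
    # Coerce types so downstream columnar storage (e.g. Parquet) stays consistent.
    rec["ts"] = datetime.fromtimestamp(int(rec["ts"]), tz=timezone.utc).isoformat()
    rec["src_port"] = int(rec["src_port"])
    rec["dst_port"] = int(rec["dst_port"])
    rec["bytes"] = int(rec["bytes"])
    return rec

rec = normalize_line("1492783200,10.0.0.1,10.0.0.2,443,51512,tcp,1200")
print(rec["src_ip"], rec["bytes"])
```

In option C this logic would live in Scala/Spark; the shape of the transformation is the same either way.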
> >>>>>>>>>>>>>>>>>>>>>>>>>> Including the high level diagram:
> >>>>>>>>>>>>>>>>>>>>>>>>>> [ASCII diagram garbled by quoting; image at http://imgur.com/a/DUDsD. It shows a Master (A. Python / B. Scala / C. Python) feeding a Worker (A. Python / B. Scala / C. Scala), with a note that the Scala workers run as Spark Streaming jobs on worker nodes in the Hadoop cluster; the Worker reads binary/text log files from the local FS and writes Parquet into HDFS for Hive / Impala.]
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Please let me know your thoughts,
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> - Nathanael
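[Editor's note: whichever option is chosen, the Master/Worker split in the diagram is at heart a dispatcher handing file paths to a pool of workers. A minimal stdlib-only Python sketch of that pattern (hypothetical names, not Spot's actual ingest code):]

```python
import queue
import threading

def worker(q, results):
    # Each worker pulls a file path off the queue, "ingests" it,
    # and records the result. A None sentinel shuts the worker down.
    while True:
        path = q.get()
        if path is None:
            q.task_done()
            break
        results.append(f"ingested:{path}")
        q.task_done()

def master(paths, num_workers=2):
    # The master enqueues work, then waits for every worker to finish.
    q = queue.Queue()
    results = []
    threads = [threading.Thread(target=worker, args=(q, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for p in paths:
        q.put(p)
    for _ in threads:
        q.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results

out = master(["flow1.bin", "flow2.bin", "dns1.pcap"])
print(sorted(out))
```

In Spot the "ingest" step would hand the file to Spark (or Kafka) rather than building a string, but the dispatch skeleton is the same.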
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>> Michael Ridley <mridley@cloudera.com>
> >>>>>>>>>>>>> office: (650) 352-1337
> >>>>>>>>>>>>> mobile: (571) 438-2420
> >>>>>>>>>>>>> Senior Solutions Architect
> >>>>>>>>>>>>> Cloudera, Inc.
>
>
