spot-dev mailing list archives

From "Smith, Nathanael P" <nathanael.p.sm...@intel.com>
Subject Re: [Discuss] - Future plans for Spot-ingest
Date Fri, 21 Apr 2017 02:49:04 GMT
If you want to code a quick POC, I will run it on our data. This sounds great, Austin. 

- nathanael 

> On Apr 20, 2017, at 2:41 PM, Austin Leahy <Austin@digitalminion.com> wrote:
> 
> So this is basically why the flume suggestion has come up. Flume natively
> acts as a syslog listener and will write files to basically anything (HDFS,
> Hive, HBase, S3).
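For context on what a Flume syslog source would be consuming, here is a minimal sketch of parsing an RFC 3164-style syslog line in Python. This is an illustration only; the regex and field handling are assumptions for the sketch, not Flume internals.

```python
import re

# Minimal RFC 3164-style syslog line parser -- an illustrative sketch
# only, not what Flume's syslog source does internally.
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>"                                    # <facility*8 + severity>
    r"(?P<timestamp>[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<msg>.*)$"
)

def parse_syslog(line):
    """Return a dict of syslog fields, or None if the line doesn't match."""
    m = SYSLOG_RE.match(line)
    if not m:
        return None
    fields = m.groupdict()
    pri = int(fields.pop("pri"))
    fields["facility"], fields["severity"] = divmod(pri, 8)
    return fields

rec = parse_syslog("<34>Apr 20 02:49:04 gw01 sshd[123]: Failed password for root")
```

A listener such as Flume's would do this parse (plus framing over UDP/TCP) before handing events to a sink.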
> 
>> On Thu, Apr 20, 2017 at 8:15 AM Michael Ridley <mridley@cloudera.com> wrote:
>> 
>> When we say ingest from Kafka, what does that mean?  I understand we can
>> read from Kafka to ingest into the cluster, but how will the data get to
>> Kafka and what data are we talking about?  My understanding is that right
>> now the primary data sources would be Netflow and Syslog, neither of which
>> writes to Kafka natively so we would need something like StreamSets in the
>> middle.  Certainly StreamSets UDP source -> Kafka would work.
>> 
>> Michael
>> 
>>> On Wed, Apr 19, 2017 at 7:05 PM, kant kodali <kanth909@gmail.com> wrote:
>>> 
>>> Sure. I guess Kafka has something called Kafka Connect, but it may not
>>> be as mature as Flume; I only heard about it recently.
>>> 
>>> On Wed, Apr 19, 2017 at 3:39 PM, Austin Leahy <Austin@digitalminion.com>
>>> wrote:
>>> 
>>>> The advantage of Flume, or a Flume/Kafka hybrid, is that the team
>>>> doesn't have to build sinks for any new source types added to the
>>>> project; just create configs pointing to the landing pad.
>>>> On Wed, Apr 19, 2017 at 3:31 PM kant kodali <kanth909@gmail.com>
>>>> wrote:
>>>> 
>>>>> What kind of benchmarks are we looking for? Just throughput? I am
>>>>> assuming this is for ingestion. I haven't seen anything faster than
>>>>> Kafka, and that is because of its simplicity: the publisher appends
>>>>> messages to a file (the so-called partition in Kafka) and clients
>>>>> just do sequential reads from that file, so it's a matter of disk
>>>>> throughput. The benchmark numbers I have for Kafka are at the very
>>>>> least 75K messages/sec, where each message is 1KB, on m4.xlarge,
>>>>> which by default has EBS storage (EBS is network-attached SSD disk).
>>>>> The network-attached disk has a max throughput of 125MB/s (m4.xlarge
>>>>> has 1 Gigabit), but if we were to deploy on ephemeral storage
>>>>> (local SSD) and a 10 Gigabit network we would easily get 5-10X more.
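As a back-of-envelope check of the numbers quoted above (75K msg/s at 1 KB each against a ~125 MB/s EBS ceiling), the arithmetic works out like this:

```python
# Back-of-envelope check of the numbers quoted in the thread: 75K msg/s
# at 1 KB each, against the ~125 MB/s EBS throughput ceiling cited for
# an m4.xlarge.
msgs_per_sec = 75_000
msg_bytes = 1024

observed_mb_per_sec = msgs_per_sec * msg_bytes / 1_000_000   # ~76.8 MB/s
ebs_mb_per_sec = 125
headroom = ebs_mb_per_sec / observed_mb_per_sec              # ~1.6x left

# The claimed 5-10X uplift on local SSD + 10 Gbit networking would put
# the ceiling roughly in the 380-770 MB/s range.
uplift_range_mb = (5 * observed_mb_per_sec, 10 * observed_mb_per_sec)
```

So the quoted benchmark already sits within ~2x of the EBS limit, which is consistent with the claim that the disk, not Kafka, is the bottleneck.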
>>>>> 
>>>>> No idea about flume.
>>>>> 
>>>>> Finally, I am not trying to pitch for Kafka; however, it is the
>>>>> fastest I have seen, and if someone has better numbers for Flume then
>>>>> we should use that. Also, I would suspect there are benchmarks for
>>>>> Kafka vs. Flume available online already, or we can try it with our
>>>>> own datasets.
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> On Wed, Apr 19, 2017 at 3:09 PM, Austin Leahy <
>>> Austin@digitalminion.com>
>>>>> wrote:
>>>>> 
>>>>>> I am happy to create and test a Flume source... #intelteam would
>>>>>> need to create the benchmark by deploying it and pointing a data
>>>>>> source at it... since I don't have a high enough volume of source
>>>>>> data handy.
>>>>>> On Wed, Apr 19, 2017 at 3:04 PM Ross, Alan D <
>> alan.d.ross@intel.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> We discussed this in our staff meeting a bit today.  I would like
>>> to
>>>>> see
>>>>>>> some benchmarking of different approaches (kafka, flume, etc) to
>>> see
>>>>> what
>>>>>>> the numbers look like. Is anyone in the community willing to
>>>> volunteer
>>>>> on
>>>>>>> this work?
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Austin Leahy [mailto:Austin@digitalminion.com]
>>>>>>> Sent: Wednesday, April 19, 2017 1:05 PM
>>>>>>> To: dev@spot.incubator.apache.org
>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
>>>>>>> 
>>>>>>> I think Kafka is probably a red herring. It's an industry go-to in
>>>>>>> the application world because of redundancy, but the type and
>>>>>>> volumes of network telemetry that we are talking about here will
>>>>>>> bog Kafka down unless you dedicate really serious hardware to just
>>>>>>> the Kafka implementation. It's essentially the next level of the
>>>>>>> problem that the team was already running into when RabbitMQ was
>>>>>>> queueing in data.
>>>>>>> 
>>>>>>> On Wed, Apr 19, 2017 at 12:33 PM Mark Grover <mark@apache.org>
>>>> wrote:
>>>>>>> 
>>>>>>>> On Wed, Apr 19, 2017 at 10:19 AM, Smith, Nathanael P <
>>>>>>>> nathanael.p.smith@intel.com> wrote:
>>>>>>>> 
>>>>>>>>> Mark,
>>>>>>>>> 
>>>>>>>>> just digesting the below.
>>>>>>>>> 
>>>>>>>>> Backing up in my thought process, I was thinking that the ingest
>>>>>>>>> master (first point of entry into the system) would want to put
>>>>>>>>> the data into a standard serializable format. I was thinking that
>>>>>>>>> libraries (such as pyarrow in this case) could help by writing
>>>>>>>>> the data in Parquet format early in the process. You are probably
>>>>>>>>> correct that at this point in time it might not be worth the time
>>>>>>>>> and can be kept in the backlog.
>>>>>>>>> That being said, I still think the master should produce data in
>>>>>>>>> a standard format. What, in your opinion (and I open this up of
>>>>>>>>> course to others), would be the most logical format? The most
>>>>>>>>> basic would be to just keep it as a .csv.
>>>>>>>>> 
>>>>>>>>> The master will likely write data to a staging directory in HDFS
>>>>>>>>> where the Spark streaming job will pick it up for
>>>>>>>>> normalization/writing to Parquet in the correct block sizes and
>>>>>>>>> partitions.
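The staging-directory handoff described above implies some partitioned path layout. A small sketch of one plausible, Hive-style scheme (the base path and `y=/m=/d=/h=` layout are hypothetical, not the paths Spot actually uses):

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

def staging_path(source, ts, base="/user/spot/staging"):
    """Build a Hive-style partitioned staging path (y=/m=/d=/h=).

    A hypothetical layout for the staging-directory handoff described
    in the thread -- not the path scheme Spot actually uses."""
    return str(PurePosixPath(base) / source /
               f"y={ts.year}" / f"m={ts.month:02d}" /
               f"d={ts.day:02d}" / f"h={ts.hour:02d}")

p = staging_path("flow", datetime(2017, 4, 19, 15, 9, tzinfo=timezone.utc))
```

Partitioning by time this way lets the downstream Spark streaming job pick up only the new hour's files for normalization.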
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Nate,
>>>>>>>> Avro is usually preferred for such a standard format, because it
>>>>>>>> asserts a schema (types, etc.), which CSV doesn't, and it allows
>>>>>>>> for schema evolution, which, depending on the type of evolution,
>>>>>>>> CSV may or may not support. And that's something I have seen being
>>>>>>>> done very commonly.
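The Avro-vs-CSV point above can be made concrete. An Avro schema is just JSON that pins down field names and types; the record and field names below are hypothetical, and the `conforms` helper is a stdlib stand-in for what a real Avro library enforces:

```python
import json

# Illustrative Avro-style schema for a netflow-ish record; the field
# names are hypothetical, not Spot's actual schema. Avro asserts types
# up front, which CSV cannot.
FLOW_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "NetflowEvent",
  "fields": [
    {"name": "ts",       "type": "long"},
    {"name": "src_ip",   "type": "string"},
    {"name": "dst_ip",   "type": "string"},
    {"name": "in_bytes", "type": "long"}
  ]
}
""")

AVRO_TO_PY = {"long": int, "string": str}

def conforms(record, schema):
    """Minimal structural check in lieu of a real Avro library."""
    fields = schema["fields"]
    return (set(record) == {f["name"] for f in fields} and
            all(isinstance(record[f["name"]], AVRO_TO_PY[f["type"]])
                for f in fields))

ok = conforms({"ts": 1492732800, "src_ip": "10.0.0.1",
               "dst_ip": "10.0.0.2", "in_bytes": 4096}, FLOW_SCHEMA)
```

A CSV row carries none of this: every field is a string until some downstream consumer guesses otherwise, which is exactly the evolution problem mentioned above.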
>>>>>>>> 
>>>>>>>> Now, if the data were in Kafka before it gets to the master, one
>>>>>>>> could argue that the master could just send metadata to the
>>>>>>>> workers (topic name, partition number, offset start and end) and
>>>>>>>> the workers could read from Kafka directly. I do understand that'd
>>>>>>>> be a much different architecture than the current one, but if you
>>>>>>>> think it's a good idea too, we could document that, say in a JIRA,
>>>>>>>> and (de-)prioritize it (and in line with the rest of the
>>>>>>>> discussion on this thread, it's not the top-most priority).
>>>>>>>> Thoughts?
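The metadata-only handoff described above (master sends topic, partition, and an offset range; workers fetch the bytes themselves) could be sketched roughly like this. The shape is hypothetical, for illustration, not an existing Spot structure:

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class KafkaWorkUnit:
    """What the master would hand a worker under the metadata-only
    scheme sketched in the thread: where to read, not the bytes
    themselves. A hypothetical shape, not an existing Spot structure."""
    topic: str
    partition: int
    offset_start: int  # inclusive
    offset_end: int    # exclusive

    def to_json(self):
        # Small enough to ship over any control channel.
        return json.dumps(asdict(self))

unit = KafkaWorkUnit("spot-flow", 3, 1_000_000, 1_050_000)
msg_count = unit.offset_end - unit.offset_start
```

The appeal of this design is that the control-plane message stays tiny and workers parallelize naturally per partition; the cost is the much different architecture acknowledged above.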
>>>>>>>> 
>>>>>>>> - Nathanael
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Apr 17, 2017, at 1:12 PM, Mark Grover <mark@apache.org>
>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Thanks, all, for your opinions.
>>>>>>>>>> 
>>>>>>>>>> I think it's good to consider two things:
>>>>>>>>>> 1. What do (we think) users care about?
>>>>>>>>>> 2. What's the cost of changing things?
>>>>>>>>>> 
>>>>>>>>>> About #1, I think users care more about what format data is
>>>>>>>>>> written in than how the data is written. I'd argue whether that
>>>>>>>>>> uses Hive, MR, or a custom Parquet writer is not as important to
>>>>>>>>>> them as long as we maintain data/format compatibility.
>>>>>>>>>> About #2, having worked on several projects, I find that it's
>>>>>>>>>> rather difficult to keep up with Parquet. Even in Spark, there
>>>>>>>>>> are a few different ways to write to Parquet - there's a regular
>>>>>>>>>> mode, and a legacy mode
>>>>>>>>>> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L44>
>>>>>>>>>> which continues to cause confusion
>>>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-20297> to date.
>>>>>>>>>> Parquet itself is pretty dependent on Hadoop
>>>>>>>>>> <https://github.com/Parquet/parquet-mr/search?l=Maven+POM&q=hadoop&type=&utf8=%E2%9C%93>
>>>>>>>>>> and just integrating it with systems with a lot of developers
>>>>>>>>>> (like Spark
>>>>>>>>>> <https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=spark+parquet+jiras>)
>>>>>>>>>> is still a lot of work.
>>>>>>>>>> 
>>>>>>>>>> I personally think we should leverage higher-level tools like
>>>>>>>>>> Hive or Spark to write data in widespread formats (Parquet being
>>>>>>>>>> a very good example), but I wouldn't encourage us to manage the
>>>>>>>>>> writers ourselves.
>>>>>>>>>> 
>>>>>>>>>> Thoughts?
>>>>>>>>>> Mark
>>>>>>>>>> 
>>>>>>>>>> On Mon, Apr 17, 2017 at 11:44 AM, Michael Ridley
>>>>>>>>>> <mridley@cloudera.com
>>>>>>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Without having given it too terribly much thought, that
>>> seems
>>>>>>>>>>> like an
>>>>>>>> OK
>>>>>>>>>>> approach.
>>>>>>>>>>> 
>>>>>>>>>>> Michael
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Apr 17, 2017 at 2:33 PM, Nathanael Smith <
>>>>>>>> nathanael@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> I think the question is rather whether we can write the data
>>>>>>>>>>>> generically to HDFS as Parquet without the use of Hive/Impala.
>>>>>>>>>>>> 
>>>>>>>>>>>> Today we write Parquet data using the Hive/MapReduce method.
>>>>>>>>>>>> As part of the redesign I'd like to use libraries for this as
>>>>>>>>>>>> opposed to a Hadoop dependency.
>>>>>>>>>>>> I think it would be preferred to use the Python master to
>>>>>>>>>>>> write the data into the format we want, then do normalization
>>>>>>>>>>>> of the data in Spark streaming.
>>>>>>>>>>>> Any thoughts?
>>>>>>>>>>>> 
>>>>>>>>>>>> - Nathanael
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Apr 17, 2017, at 11:08 AM, Michael Ridley
>>>>>>>>>>>>> <mridley@cloudera.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I had thought that the plan was to write the data in
>>> Parquet
>>>>> in
>>>>>>>>>>>>> HDFS ultimately.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Michael
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sun, Apr 16, 2017 at 11:55 AM, kant kodali
>>>>>>>>>>>>> <kanth909@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Mark,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thank you so much for hearing my argument. And I definitely
>>>>>>>>>>>>>> understand that you guys have a bunch of things to do. My
>>>>>>>>>>>>>> only concern is that I hope it doesn't take too long to
>>>>>>>>>>>>>> support other backends. For example, @Kenneth had given the
>>>>>>>>>>>>>> example of the LAMP stack not having moved away from MySQL
>>>>>>>>>>>>>> yet, which essentially means it's probably a decade? I see
>>>>>>>>>>>>>> that in the current architecture the results from Python
>>>>>>>>>>>>>> multiprocessing or Spark Streaming are written back to HDFS.
>>>>>>>>>>>>>> If so, can we write them in Parquet format, such that users
>>>>>>>>>>>>>> should be able to plug in any query engine? Again, I am not
>>>>>>>>>>>>>> pushing you guys to do this right away or anything, just
>>>>>>>>>>>>>> seeing if there is a way for me to get started in parallel;
>>>>>>>>>>>>>> if it's not feasible, that's fine. I just wanted to share my
>>>>>>>>>>>>>> 2 cents and I am glad my argument is heard!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks much!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Fri, Apr 14, 2017 at 1:38 PM, Mark Grover <
>>>>> mark@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi Kant,
>>>>>>>>>>>>>>> Just wanted to make sure you don't feel like we are
>>>> ignoring
>>>>>>>>>>>>>>> your
>>>>>>>>>>>>>>> comment:-) I hear you about pluggability.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The design can and should be pluggable, but the project has
>>>>>>>>>>>>>>> one stack it ships out of the box with, one stack that's
>>>>>>>>>>>>>>> the default in the sense that it's the most tested and so
>>>>>>>>>>>>>>> on. And, for us, that's our current stack.
>>>>>>>>>>>>>>> If we were to take Apache Hive as an example, it shipped
>>>>>>>>>>>>>>> (and ships) with MapReduce as the default execution engine.
>>>>>>>>>>>>>>> At some point, Apache Tez came along and wanted Hive to run
>>>>>>>>>>>>>>> on Tez, so they made a bunch of things pluggable to run
>>>>>>>>>>>>>>> Hive on Tez (instead of the only option up until then:
>>>>>>>>>>>>>>> Hive-on-MR), and then Apache Spark came and re-used some of
>>>>>>>>>>>>>>> that pluggability and even added some more so that
>>>>>>>>>>>>>>> Hive-on-Spark could become a
>>>>>>>>>>>>>>> reality. In the same way, I don't think anyone here
>>>>>>>>>>>>>>> disagrees that pluggability is a good thing, but it's hard
>>>>>>>>>>>>>>> to do pluggability right, and at the right level, unless
>>>>>>>>>>>>>>> one has a clear use-case in mind.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> As a project, we have many things to do, and I personally
>>>>>>>>>>>>>>> think the biggest bang for the buck in making Spot a really
>>>>>>>>>>>>>>> solid and the best cyber security solution isn't
>>>>>>>>>>>>>>> pluggability but the things we are working on - a better
>>>>>>>>>>>>>>> user interface, a common/unified approach to storing and
>>>>>>>>>>>>>>> modeling data, etc.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Having said that, we are open: if it's important to you or
>>>>>>>>>>>>>>> someone else, we'd be happy to receive and review those
>>>>>>>>>>>>>>> patches.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>> Mark
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Fri, Apr 14, 2017 at 10:14 AM, kant kodali
>>>>>>>>>>>>>>> <kanth909@gmail.com
>>>>>>>>> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks, Ross! And yes, option C sounds good to me as well;
>>>>>>>>>>>>>>>> however, I just think the distributed SQL query engine and
>>>>>>>>>>>>>>>> the resource manager should be pluggable.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <
>>>>>>>>>>> alan.d.ross@intel.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Option C is to use Python on the front end of the ingest
>>>>>>>>>>>>>>>>> pipeline and Spark/Scala on the back end.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Option A uses Python workers on the backend.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Option B uses all Scala.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>> From: kant kodali [mailto:kanth909@gmail.com]
>>>>>>>>>>>>>>>>> Sent: Friday, April 14, 2017 9:53 AM
>>>>>>>>>>>>>>>>> To: dev@spot.incubator.apache.org
>>>>>>>>>>>>>>>>> Subject: Re: [Discuss] - Future plans for
>> Spot-ingest
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> What is option C ? am I missing an email or
>> something?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Fri, Apr 14, 2017 at 9:15 AM, Chokha
>> Palayamkottai
>>> <
>>>>>>>>>>>>>>>>> chokha@integralops.com> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> +1 for Python 3.x
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On 4/14/2017 11:59 AM, Austin Leahy wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I think that C is the strong solution; getting the
>>>>>>>>>>>>>>>>>>> ingest really strong is going to lower barriers to
>>>>>>>>>>>>>>>>>>> adoption. Doing it in Python will open up the ingest
>>>>>>>>>>>>>>>>>>> portion of the project to many more developers.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Before it comes up, I would like to throw the following
>>>>>>>>>>>>>>>>>>> on the pile... Major Python projects (Django/Flask,
>>>>>>>>>>>>>>>>>>> among others) are dropping 2.x support in releases
>>>>>>>>>>>>>>>>>>> scheduled in the next 6 to 8 months. Hadoop projects in
>>>>>>>>>>>>>>>>>>> general tend to lag in modern Python support; let's
>>>>>>>>>>>>>>>>>>> please build this in 3.5 so that we don't have to
>>>>>>>>>>>>>>>>>>> immediately expect a rebuild in the pipeline.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> -Vote C
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks Nate
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Austin
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross
>>>>>>>>>>>>>>>>>>> <alan@apache.org>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I really like option C because it gives a lot of
>>>>>>>>>>>>>>>>>>> flexibility
>>>>>>>> for
>>>>>>>>>>>>>>>>>>> ingest
>>>>>>>>>>>>>>>>>>>> (python vs scala) but still has the robust spark
>>>>>>>>>>>>>>>>>>>> streaming
>>>>>>>>>>>>>> backend
>>>>>>>>>>>>>>>>>>>> for performance.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks for putting this together Nate.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Alan
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha
>>>> Palayamkottai <
>>>>>>>>>>>>>>>>>>>> chokha@integralops.com> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I agree. We should continue making the existing
>>> stack
>>>>>>>>>>>>>>>>>>>> more
>>>>>>>>>>> mature
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>> this point. Maybe if we have enough community
>>>> support
>>>>>>>>>>>>>>>>>>>>> we can
>>>>>>>>>>> add
>>>>>>>>>>>>>>>>>>>>> additional datastores.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Chokha.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On 4/14/17 11:10 AM, kenneth@floss.cat wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hi Kant,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> YARN is the standard scheduler in Hadoop. If you're
>>>>>>>>>>>>>>>>>>>>>> using Hive+Spark, then sure, you'll have YARN.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I haven't seen any Hive on Mesos so far. As said,
>>>>>>>>>>>>>>>>>>>>>> Spot is based on a quite standard Hadoop stack and I
>>>>>>>>>>>>>>>>>>>>>> wouldn't switch too many pieces yet.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> In most open-source projects you start by relying on
>>>>>>>>>>>>>>>>>>>>>> a well-known stack and then begin to support other
>>>>>>>>>>>>>>>>>>>>>> DB backends once it's quite mature. Think of the
>>>>>>>>>>>>>>>>>>>>>> loads of LAMP apps which haven't been ported away
>>>>>>>>>>>>>>>>>>>>>> from MySQL yet.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> In any case, you'll need high-performance SQL +
>>>>>>>>>>>>>>>>>>>>>> massive storage + machine learning + massive
>>>>>>>>>>>>>>>>>>>>>> ingestion, and... at the moment, that can only be
>>>>>>>>>>>>>>>>>>>>>> provided by Hadoop.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Regards!
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Kenneth
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On 2017-04-14 12:56, kant kodali wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Hi Kenneth,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thanks for the response. I think you made a case
>>>>>>>>>>>>>>>>>>>>>>> for HDFS; however, users may want to use S3 or some
>>>>>>>>>>>>>>>>>>>>>>> other FS, in which case they can use Auxilio
>>>>>>>>>>>>>>>>>>>>>>> (hoping that there are no changes needed within
>>>>>>>>>>>>>>>>>>>>>>> Spot, in which case I can agree to that). For
>>>>>>>>>>>>>>>>>>>>>>> example, Netflix stores all their data in S3.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> The distributed SQL query engine, I would say,
>>>>>>>>>>>>>>>>>>>>>>> should be pluggable with whatever the user may want
>>>>>>>>>>>>>>>>>>>>>>> to use, and there are a bunch of them out there.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> sure
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Impala is better than Hive, but what if users are
>>>>>>>>>>>>>>>>>>>>> already using something else like Drill or Presto?
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Me personally, I would not assume that users are
>>>>>>>>>>>>>>>>>>>>>>> willing to deploy all of that and make their
>>>>>>>>>>>>>>>>>>>>>>> existing stack more complicated; at the very least
>>>>>>>>>>>>>>>>>>>>>>> I would say it is an uphill battle. Things have
>>>>>>>>>>>>>>>>>>>>>>> been changing rapidly in the big data space, so
>>>>>>>>>>>>>>>>>>>>>>> whatever we think is standard won't be standard
>>>>>>>>>>>>>>>>>>>>>>> anymore; but importantly, there shouldn't be any
>>>>>>>>>>>>>>>>>>>>>>> reason why we shouldn't be flexible, right?
>>>>>>>>>>>>>>>>>>>>>>> flexible right.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Also, I am not sure why only YARN? Why not make
>>>>>>>>>>>>>>>>>>>>>>> that more flexible too, so users can pick Mesos or
>>>>>>>>>>>>>>>>>>>>>>> standalone.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I think flexibility is key for wide adoption,
>>>>>>>>>>>>>>>>>>>>>>> rather than a tightly coupled architecture.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth
>> Peiruza
>>>>>>>>>>>>>>>>>>>>>>> <kenneth@floss.cat>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> PS: you need a big data platform to be able to
>>>>>>>>>>>>>>>>>>>>>>>> collect all those netflows and logs.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Spot isn't intended for SMBs, that's clear; you
>>>>>>>>>>>>>>>>>>>>>>>> need loads of data to get ML working properly, and
>>>>>>>>>>>>>>>>>>>>>>>> somewhere to run those algorithms. That is Hadoop.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Regards!
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Kenneth
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Sent from my Mi phone
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Apr 14, 2017 4:04 AM, kant kodali <kanth909@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thanks for starting this thread. Here is my
>>>>> feedback.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> I somehow think the architecture is too
>>>>>>>>>>>>>>>>>>>>>>>> complicated for wide adoption, since it requires
>>>>>>>>>>>>>>>>>>>>>>>> installing the following.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> HDFS.
>>>>>>>>>>>>>>>>>>>>>>>> HIVE.
>>>>>>>>>>>>>>>>>>>>>>>> IMPALA.
>>>>>>>>>>>>>>>>>>>>>>>> KAFKA.
>>>>>>>>>>>>>>>>>>>>>>>> SPARK (YARN).
>>>>>>>>>>>>>>>>>>>>>>>> YARN.
>>>>>>>>>>>>>>>>>>>>>>>> Zookeeper.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Currently there are way too many dependencies,
>>>>>>>>>>>>>>>>>>>>>>>> which discourages a lot of users from using it,
>>>>>>>>>>>>>>>>>>>>>>>> because they have to go through deployment of all
>>>>>>>>>>>>>>>>>>>>>>>> that required software. I think for wide adoption
>>>>>>>>>>>>>>>>>>>>>>>> we should minimize the dependencies and have a
>>>>>>>>>>>>>>>>>>>>>>>> more pluggable architecture. For example, I am
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> sure why HIVE & IMPALA are both required. Why not
>>>>>>>>>>>>>>>>>>>>> just use Spark SQL, since it's already a dependency?
>>>>>>>>>>>>>>>>>>>>> Or say users may want to use their own distributed
>>>>>>>>>>>>>>>>>>>>> query engine they like, such as Apache Drill or
>>>>>>>>>>>>>>>>>>>>> something else; we should be flexible enough to
>>>>>>>>>>>>>>>>>>>>> provide that option.
>>>>>>>>>>>>>>>>>>>>>>>> option
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Also, I see that HDFS is used such that collectors
>>>>>>>>>>>>>>>>>>>>>>>> can receive file paths through Kafka and be able
>>>>>>>>>>>>>>>>>>>>>>>> to read a file. How big are these files? Do we
>>>>>>>>>>>>>>>>>>>>>>>> really need HDFS for this? Why not provide more
>>>>>>>>>>>>>>>>>>>>>>>> ways to send data, such as sending data directly
>>>>>>>>>>>>>>>>>>>>>>>> through Kafka, or just leaving it up to the user
>>>>>>>>>>>>>>>>>>>>>>>> to specify the file location as an argument to the
>>>>>>>>>>>>>>>>>>>>>>>> collector process.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Finally, I learnt that to generate Netflow data
>>>>>>>>>>>>>>>>>>>>>>>> one would require specific hardware. This really
>>>>>>>>>>>>>>>>>>>>>>>> means Apache Spot is not meant for everyone. I
>>>>>>>>>>>>>>>>>>>>>>>> thought Apache Spot could be used to analyze the
>>>>>>>>>>>>>>>>>>>>>>>> network traffic of any machine, but if it requires
>>>>>>>>>>>>>>>>>>>>>>>> specific hardware then I think it is targeted at a
>>>>>>>>>>>>>>>>>>>>>>>> specific group of people.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> The real strength of Apache Spot should
>> mainly
>>> be
>>>>>>>>>>>>>>>>>>>>>>>> just
>>>>>>>>>>>>>>> analyzing
>>>>>>>>>>>>>>>>>>>>>>>> network traffic through ML.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind,
>>>> Nathan
>>>>> L
>>>>>>>>>>>>>>>>>>>>>>>> < nathan.l.segerlind@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thanks, Nate,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Nate.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>>>>>>>> From: Nate Smith [mailto:
>>> natedogs911@gmail.com]
>>>>>>>>>>>>>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:26 PM
>>>>>>>>>>>>>>>>>>>>>>>>> To: user@spot.incubator.apache.org
>>>>>>>>>>>>>>>>>>>>>>>>> Cc: dev@spot.incubator.apache.org;
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> private@spot.incubator.apache.org
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Subject: Re: [Discuss] - Future plans for
>>>> Spot-ingest
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I was really hoping it came through ok, Oh
>>> well
>>>> :)
>>>>>>>> Here’s
>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>>>>>>>> image form:
>>>>>>>>>>>>>>>>>>>>>>>>> http://imgur.com/a/DUDsD
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Apr 13, 2017, at 4:05 PM, Segerlind,
>> Nathan
>>>> L <
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> nathan.l.segerlind@intel.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> The diagram became garbled in the text
>>> format.
>>>>>>>>>>>>>>>>>>>>>>>>>> Could you resend it as a pdf?
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>> Nate
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>>>>>>>>> From: Nathanael Smith
>>>>>>>>>>>>>>>>>>>>>>>>>> [mailto:nathanael@apache.org]
>>>>>>>>>>>>>>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:01 PM
>>>>>>>>>>>>>>>>>>>>>>>>>> To: private@spot.incubator.apache.org;
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> dev@spot.incubator.apache.org;
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> user@spot.incubator.apache.org
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Subject: [Discuss] - Future plans for
>>>> Spot-ingest
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> How would you like to see Spot-ingest
>> change?
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> A. continue development on the Python
>>>>>>>>>>>>>>>>>>>>>>>>>>    Master/Worker with focus on performance /
>>>>>>>>>>>>>>>>>>>>>>>>>>    error handling / logging
>>>>>>>>>>>>>>>>>>>>>>>>>> B. Develop Scala-based ingest to be inline with
>>>>>>>>>>>>>>>>>>>>>>>>>>    the code base from ingest, ml, to OA (UI to
>>>>>>>>>>>>>>>>>>>>>>>>>>    continue being ipython/JS)
>>>>>>>>>>>>>>>>>>>>>>>>>> C. Python ingest Worker with Scala-based Spark
>>>>>>>>>>>>>>>>>>>>>>>>>>    code for normalization and input into DB
>>>>>>>>>>>>>>>>>>>>>>>>> 
> Including the high level diagram:
> 
> [Diagram: the Master (A. Python / B. Scala / C. Python) feeds the Worker
> (A. Python / B. Scala / C. Scala); for B. and C. the Worker runs as Spark
> Streaming code on a worker node in the Hadoop cluster. The Master writes
> Binary/Text log files to the local FS; the Worker writes Parquet to hdfs
> for Hive / Impala.]
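Option C in the quoted proposal pushes parsing and normalization into Spark code. As a rough illustration of the per-record work that step implies, here is a minimal sketch (in Python; the field names and record layout are made up for illustration and are not Spot's actual schema) of normalizing an RFC 3164-style syslog line into flat columns:

```python
import re

# Illustrative only: the kind of per-record normalization an ingest
# Worker would apply before rows land in Parquet for Hive / Impala.
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d+)>"                       # RFC 3164 priority, e.g. <34>
    r"(?P<timestamp>\w{3}\s+\d+\s[\d:]+)\s"  # e.g. Oct 11 22:14:15
    r"(?P<host>\S+)\s"
    r"(?P<message>.*)$"
)

def normalize_syslog(line: str) -> dict:
    """Parse one raw syslog line into a flat dict of columns."""
    m = SYSLOG_RE.match(line)
    if m is None:
        # Keep unparseable input instead of dropping it silently.
        return {"raw": line, "parse_error": True}
    rec = m.groupdict()
    pri = int(rec.pop("pri"))
    # RFC 3164 encodes priority as facility * 8 + severity.
    rec["facility"], rec["severity"] = divmod(pri, 8)
    rec["parse_error"] = False
    return rec

row = normalize_syslog("<34>Oct 11 22:14:15 mymachine su: 'su root' failed")
# row["facility"] == 4, row["severity"] == 2, row["host"] == "mymachine"
```

In the Scala/Spark shape of option C, the same function would run inside a Spark Streaming map over a Kafka (or Flume-fed) stream, with the resulting rows written out as Parquet.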
> 
> Please let me know your thoughts,
> 
> - Nathanael
> --
> Michael Ridley <mridley@cloudera.com>
> office: (650) 352-1337
> mobile: (571) 438-2420
> Senior Solutions Architect
> Cloudera, Inc.