spot-dev mailing list archives

From Michael Ridley <mrid...@cloudera.com>
Subject Re: [Discuss] - Future plans for Spot-ingest
Date Wed, 19 Apr 2017 20:45:51 GMT
One point I wanted to add the other day is that we probably do need to
write to Avro for the streaming ingest since Parquet doesn't do so great
with streaming ingest.  But then we can convert from Avro to Parquet using
whatever tool (SparkSQL, Hive, whatever) for query.  Whether the Avro is
persisted is an open question in my mind.
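
To make the "row format for ingest, columnar format for query" split concrete, here is a minimal standard-library sketch. It is not real Avro or Parquet: JSON lines stand in for the append-friendly row format and a dict of lists stands in for the columnar layout. The field names (src_ip, bytes) and both helper functions are hypothetical.

```python
import io
import json


def append_record(staging, record):
    """Append one record as a JSON line (row-oriented, cheap per-record write)."""
    staging.write(json.dumps(record) + "\n")


def to_columns(staging_text):
    """Pivot a batch of staged rows into a column-oriented dict of lists."""
    rows = [json.loads(line) for line in staging_text.splitlines() if line]
    columns = {}
    for i, row in enumerate(rows):
        for key, value in row.items():
            # A column first seen at row i is back-filled with None for earlier rows.
            columns.setdefault(key, [None] * i).append(value)
        for values in columns.values():
            # Pad columns that this row did not provide.
            if len(values) < i + 1:
                values.append(None)
    return columns


# Streaming side: records arrive one at a time and are appended as rows.
staging = io.StringIO()
for flow in (
    {"src_ip": "10.0.0.1", "bytes": 512},
    {"src_ip": "10.0.0.2", "bytes": 2048},
    {"src_ip": "10.0.0.1", "bytes": 64},
):
    append_record(staging, flow)

# Batch side: convert the staged rows to columns in one pass for querying.
columns = to_columns(staging.getvalue())
```

Whether the staged rows are kept after conversion corresponds to the open question above about persisting the Avro.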

Michael

On Wed, Apr 19, 2017 at 4:22 PM, <kenneth@floss.cat> wrote:

>
> Replying to myself, AVRO for ingestion, Parquet for storage.
>
> Regards!
>
> Kenneth
>
> On 2017-04-19 22:05, Austin Leahy wrote:
>
>> I think Kafka is probably a red herring. It's an industry go-to in the
>> application world because of redundancy, but the type and volumes of
>> network telemetry that we are talking about here will bog Kafka down
>> unless you dedicate really serious hardware to just the Kafka
>> implementation. It's essentially the next level of the problem that the
>> team was already running into when RabbitMQ was queueing in data.
>>
>> On Wed, Apr 19, 2017 at 12:33 PM Mark Grover <mark@apache.org> wrote:
>>
>> On Wed, Apr 19, 2017 at 10:19 AM, Smith, Nathanael P <
>>> nathanael.p.smith@intel.com> wrote:
>>>
>>> > Mark,
>>> >
>>> > just digesting the below.
>>> >
>>> > Backing up in my thought process, I was thinking that the ingest master
>>> > (first point of entry into the system) would want to put the data into
>>> a
>>> > standard serializable format. I was thinking that libraries (such as
>>> > pyarrow in this case) could help by writing the data in parquet format
>>> > early in the process. You are probably correct that at this point in
>>> time
>>> > it might not be worth the time and can be kept in the backlog.
>>> > That being said, I still think the master should produce data in a
>>> > standard format, what in your opinion (and I open this up of course to
>>> > others) would be the most logical format?
>>> > the most basic would be to just keep it as a .csv.
>>> >
>>> > The master will likely write data to a staging directory in HDFS where
>>> the
>>> > spark streaming job will pick it up for normalization/writing to
>>> parquet
>>> in
>>> > the correct block sizes and partitions.
>>> >
>>>
>>> Hi Nate,
>>> Avro is usually preferred for such a standard format - because it
>>> asserts a
>>> schema (types, etc.) which CSV doesn't and it allows for schema evolution
>>> which depending on the type of evolution, CSV may or may not support.
>>> And, that's something I have seen being done very commonly.
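
As a rough illustration of the two properties mentioned above (a schema that asserts types, and evolution through defaults), here is a standard-library sketch. It only mimics the idea and is not the Avro library or its API; the schema layout, the field names and the conform helper are all hypothetical.

```python
# Marker for fields that have no default and must be present in every record.
REQUIRED = object()

# Hypothetical schemas: (field name, Python type, default or REQUIRED).
SCHEMA_V1 = [("src_ip", str, REQUIRED), ("bytes", int, REQUIRED)]
# Evolution: version 2 adds a field with a default, so old records stay readable.
SCHEMA_V2 = SCHEMA_V1 + [("protocol", str, "unknown")]


def conform(record, schema):
    """Return the record as seen through the schema, or raise if it does not fit."""
    out = {}
    for name, ftype, default in schema:
        if name in record:
            value = record[name]
        elif default is not REQUIRED:
            value = default
        else:
            raise ValueError(f"missing required field: {name}")
        if not isinstance(value, ftype):
            raise TypeError(
                f"{name}: expected {ftype.__name__}, got {type(value).__name__}"
            )
        out[name] = value
    return out


# A record written under version 1 is still readable under version 2.
old_record = {"src_ip": "10.0.0.1", "bytes": 512}
upgraded = conform(old_record, SCHEMA_V2)

# A CSV row carries no types: every value arrives as text, so it fails the check.
csv_row = dict(zip(["src_ip", "bytes"], "10.0.0.1,512".split(",")))
try:
    conform(csv_row, SCHEMA_V1)
    csv_row_accepted = True
except TypeError:
    csv_row_accepted = False
```

With CSV the type check has to be bolted on by every reader; a schema-carrying format makes it part of the data.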
>>>
>>> Now, if the data were in Kafka before it gets to master, one could argue
>>> that the master could just send metadata to the workers (topic name,
>>> partition number, offset start and end) and the workers could read from
>>> Kafka directly. I do understand that'd be a much different architecture
>>> than the current one, but if you think it's a good idea too, we could
>>> document that, say in a JIRA, and (de-)prioritize it (and in line with
>>> the
>>> rest of the discussion on this thread, it's not the top-most priority).
>>> Thoughts?
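
A small sketch of what the metadata-only hand-off described above could look like, assuming the master splits one partition's offset range into fixed-size chunks. It uses only the standard library and no Kafka client; the WorkUnit type, the split_range helper, the topic name and the inclusive-start/exclusive-end convention are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkUnit:
    """Metadata a master could send to a worker instead of the data itself."""
    topic: str
    partition: int
    start_offset: int  # inclusive (assumed convention)
    end_offset: int    # exclusive (assumed convention)


def split_range(topic, partition, start, end, chunk_size):
    """Split the offsets [start, end) of one partition into worker-sized units."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        WorkUnit(topic, partition, lo, min(lo + chunk_size, end))
        for lo in range(start, end, chunk_size)
    ]


# Offsets 100..349 of partition 0, at most 100 messages per worker.
units = split_range("flow", 0, 100, 350, 100)
```

Each worker would then read its own offset range from Kafka directly, so the payload never passes through the master.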
>>>
>>> - Nathanael
>>> >
>>> >
>>> >
>>> > > On Apr 17, 2017, at 1:12 PM, Mark Grover <mark@apache.org> wrote:
>>> > >
>>> > > Thanks, all, for your opinions.
>>> > >
>>> > > I think it's good to consider two things:
>>> > > 1. What do (we think) users care about?
>>> > > 2. What's the cost of changing things?
>>> > >
>>> > > About #1, I think users care more about what format data is written in
>>> > > than how the data is written. I'd argue whether that uses Hive, MR, or a
>>> > > custom Parquet writer is not as important to them as long as we maintain
>>> > > data/format compatibility.
>>> > > About #2, having worked on several projects, I find that it's rather
>>> > > difficult to keep up with Parquet. Even in Spark, there are a few
>>> > > different ways to write to Parquet - there's a regular mode, and a
>>> > > legacy mode
>>> > > <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L44>
>>> > > which continues to cause confusion
>>> > > <https://issues.apache.org/jira/browse/SPARK-20297> to date. Parquet
>>> > > itself is pretty dependent on Hadoop
>>> > > <https://github.com/Parquet/parquet-mr/search?l=Maven+POM&q=hadoop&type=&utf8=%E2%9C%93>
>>> > > and just integrating it with systems with a lot of developers (like Spark
>>> > > <https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=spark+parquet+jiras>)
>>> > > is still a lot of work.
>>> > >
>>> > > I personally think we should leverage higher level tools like Hive,
>>> or
>>> > > Spark to write data in widespread formats (Parquet, being a very good
>>> > > example) but I personally wouldn't encourage us to manage the writers
>>> > > ourselves.
>>> > >
>>> > > Thoughts?
>>> > > Mark
>>> > >
>>> > > On Mon, Apr 17, 2017 at 11:44 AM, Michael Ridley <
>>> mridley@cloudera.com
>>> >
>>> > > wrote:
>>> > >
>>> > >> Without having given it too terribly much thought, that seems like
>>> an
>>> OK
>>> > >> approach.
>>> > >>
>>> > >> Michael
>>> > >>
>>> > >> On Mon, Apr 17, 2017 at 2:33 PM, Nathanael Smith <
>>> nathanael@apache.org>
>>> > >> wrote:
>>> > >>
>>> > >>> I think the question is rather whether we can write the data
>>> > >>> generically to HDFS as Parquet without the use of Hive/Impala?
>>> > >>>
>>> > >>> Today we write Parquet data using the Hive/MapReduce method.
>>> > >>> As part of the redesign I'd like to use libraries for this as opposed
>>> > >>> to a Hadoop dependency.
>>> > >>> I think it would be preferred to use the Python master to write the
>>> > >>> data into the format we want, then do normalization of the data in
>>> > >>> Spark Streaming.
>>> > >>> Any thoughts?
>>> > >>>
>>> > >>> - Nathanael
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>> On Apr 17, 2017, at 11:08 AM, Michael Ridley <
>>> mridley@cloudera.com>
>>> > >>> wrote:
>>> > >>>>
>>> > >>>> I had thought that the plan was to write the data in Parquet in
>>> HDFS
>>> > >>>> ultimately.
>>> > >>>>
>>> > >>>> Michael
>>> > >>>>
>>> > >>>> On Sun, Apr 16, 2017 at 11:55 AM, kant kodali <kanth909@gmail.com
>>> >
>>> > >>> wrote:
>>> > >>>>
>>> > >>>>> Hi Mark,
>>> > >>>>>
>>> > >>>>> Thank you so much for hearing my argument. And I definitely
>>> > >>>>> understand that you guys have a bunch of things to do. My only
>>> > >>>>> concern is that I hope it doesn't take too long to support other
>>> > >>>>> backends. For example, @Kenneth had given the example of LAMP stack
>>> > >>>>> apps that have not moved away from MySQL yet, which essentially
>>> > >>>>> means it's probably a decade? I see that in the current architecture
>>> > >>>>> the results from Python multiprocessing or Spark Streaming are
>>> > >>>>> written back to HDFS. If so, can we write them in Parquet format,
>>> > >>>>> such that users should be able to plug in any query engine? But
>>> > >>>>> again, I am not pushing you guys to do this right away or anything;
>>> > >>>>> I'm just seeing if there is a way for me to get started in parallel,
>>> > >>>>> and if it's not feasible, it's fine. I just wanted to share my
>>> > >>>>> 2 cents and I am glad my argument is heard!
>>> > >>>>>
>>> > >>>>> Thanks much!
>>> > >>>>>
>>> > >>>>> On Fri, Apr 14, 2017 at 1:38 PM, Mark Grover <mark@apache.org>
>>> > wrote:
>>> > >>>>>
>>> > >>>>>> Hi Kant,
>>> > >>>>>> Just wanted to make sure you don't feel like we are ignoring
>>> your
>>> > >>>>>> comment:-) I hear you about pluggability.
>>> > >>>>>>
>>> > >>>>>> The design can and should be pluggable but the project has one
>>> stack
>>> > >> it
>>> > >>>>>> ships out of the box with, one stack that's the default stack in
>>> the
>>> > >>>>> sense
>>> > >>>>>> that it's the most tested and so on. And, for us, that's our
>>> current
>>> > >>>>> stack.
>>> > >>>>>> If we were to take Apache Hive as an example, it shipped (and
>>> ships)
>>> > >>> with
>>> > >>>>>> MapReduce as the default configuration engine. At some point,
>>> Apache
>>> > >>> Tez
>>> > >>>>>> came along and wanted Hive to run on Tez, so they made a bunch
>>> of
>>> > >>> things
>>> > >>>>>> pluggable to run Hive on Tez (instead of the only option
>>> up-until
>>> > >> then:
>>> > >>>>>> Hive-on-MR) and then Apache Spark came and re-used some of that
>>> > >>>>>> pluggability and even added some more so Hive-on-Spark could
>>> become
>>> > a
>>> > >>>>>> reality. In the same way, I don't think anyone disagrees here that
>>> > >>>>>> pluggability is a good thing but it's hard to do pluggability right,
>>> > >>>>>> and at the right level, unless one has a clear use-case in mind.
>>> > >>>>>>
>>> > >>>>>> As a project, we have many things to do and I personally think
>>> the
>>> > >>>>> biggest
>>> > >>>>>> bang for the buck for us in making Spot a really solid and the
>>> best
>>> > >>> cyber
>>> > >>>>>> security solution isn't pluggability but the things we are
>>> working
>>> > on
>>> > >>> - a
>>> > >>>>>> better user interface, a common/unified approach to storing and
>>> > >>> modeling
>>> > >>>>>> data, etc.
>>> > >>>>>>
>>> > >>>>>> Having said that, we are open, if it's important to you or
>>> someone
>>> > >>> else,
>>> > >>>>>> we'd be happy to receive and review those patches.
>>> > >>>>>>
>>> > >>>>>> Thanks!
>>> > >>>>>> Mark
>>> > >>>>>>
>>> > >>>>>> On Fri, Apr 14, 2017 at 10:14 AM, kant kodali <
>>> kanth909@gmail.com
>>> >
>>> > >>>>> wrote:
>>> > >>>>>>
>>> > >>>>>>> Thanks Ross! and yes option C sounds good to me as well
>>> however I
>>> > >> just
>>> > >>>>>>> think Distributed Sql query engine  and the resource manager
>>> should
>>> > >> be
>>> > >>>>>>> pluggable.
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>> On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <
>>> > >> alan.d.ross@intel.com>
>>> > >>>>>>> wrote:
>>> > >>>>>>>
>>> > >>>>>>>> Option C is to use python on the front end of ingest pipeline
>>> and
>>> > >>>>>>>> spark/scala on the back end.
>>> > >>>>>>>>
>>> > >>>>>>>> Option A uses python workers on the backend
>>> > >>>>>>>>
>>> > >>>>>>>> Option B uses all scala.
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>> -----Original Message-----
>>> > >>>>>>>> From: kant kodali [mailto:kanth909@gmail.com]
>>> > >>>>>>>> Sent: Friday, April 14, 2017 9:53 AM
>>> > >>>>>>>> To: dev@spot.incubator.apache.org
>>> > >>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
>>> > >>>>>>>>
>>> > >>>>>>>> What is option C ? am I missing an email or something?
>>> > >>>>>>>>
>>> > >>>>>>>> On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <
>>> > >>>>>>>> chokha@integralops.com> wrote:
>>> > >>>>>>>>
>>> > >>>>>>>>> +1 for Python 3.x
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>>
>>> > >>>>>>>>> On 4/14/2017 11:59 AM, Austin Leahy wrote:
>>> > >>>>>>>>>
>>> > >>>>>>>>>> I think that C is the strong solution, getting the ingest
>>> really
>>> > >>>>>>>>>> strong is going to lower barriers to adoption. Doing it in
>>> > Python
>>> > >>>>>>>>>> will open up the ingest portion of the project to include
>>> many
>>> > >>>>> more
>>> > >>>>>>>> developers.
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> Before it comes up I would like to throw the following on the
>>> > >>>>>>>>>> pile... Major Python projects (Django/Flask, others) are
>>> > >>>>>>>>>> dropping 2.x support in releases scheduled in the next 6 to 8
>>> > >>>>>>>>>> months. Hadoop projects in general tend to lag in modern Python
>>> > >>>>>>>>>> support; let's please build this in 3.5 so that we don't have
>>> > >>>>>>>>>> to immediately expect a rebuild in the pipeline.
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> -Vote C
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> Thanks Nate
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> Austin
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <alan@apache.org>
>>> > >>>>> wrote:
>>> > >>>>>>>>>>
>>> > >>>>>>>>>> I really like option C because it gives a lot of flexibility
>>> for
>>> > >>>>>>>>>> ingest
>>> > >>>>>>>>>>> (python vs scala) but still has the robust spark streaming
>>> > >>>>> backend
>>> > >>>>>>>>>>> for performance.
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> Thanks for putting this together Nate.
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> Alan
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <
>>> > >>>>>>>>>>> chokha@integralops.com> wrote:
>>> > >>>>>>>>>>>
>>> > >>>>>>>>>>> I agree. We should continue making the existing stack more
>>> > >> mature
>>> > >>>>>> at
>>> > >>>>>>>>>>>> this point. Maybe if we have enough community support we
>>> can
>>> > >> add
>>> > >>>>>>>>>>>> additional datastores.
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> Chokha.
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>> On 4/14/17 11:10 AM, kenneth@floss.cat wrote:
>>> > >>>>>>>>>>>>
>>> > >>>>>>>>>>>>> Hi Kant,
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>> YARN is the standard scheduler in Hadoop. If you're using
>>> > >>>>>>>>>>>>> Hive+Spark, then sure you'll have YARN.
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>> Haven't seen any HIVE on Mesos so far. As said, Spot is
>>> based
>>> > >>>>> on
>>> > >>>>>> a
>>> > >>>>>>>>>>>>> quite standard Hadoop stack and I wouldn't switch too
>>> many
>>> > >>>>> pieces
>>> > >>>>>>>> yet.
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>> In most Opensource projects you start relying on a
>>> well-known
>>> > >>>>>>>>>>>>> stack and then you begin to support other DB backends
>>> once
>>> > >> it's
>>> > >>>>>>>>>>>>> quite mature. Think of the loads of LAMP apps which haven't
>>> > >>>>>>>>>>>>> been ported away from MySQL yet.
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>> In any case, you'll need a high performance SQL + Massive
>>> > >>>>> Storage
>>> > >>>>>>>>>>>>> + Machine Learning + Massive Ingestion, and... ATM, that
>>> can
>>> > >> be
>>> > >>>>>>>>>>>>> only provided by Hadoop.
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>> Regards!
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>> Kenneth
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>> On 2017-04-14 12:56, kant kodali wrote:
>>> > >>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> Hi Kenneth,
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> Thanks for the response. I think you made a case for HDFS;
>>> > >>>>>>>>>>>>>> however, users may want to use S3 or some other FS, in which
>>> > >>>>>>>>>>>>>> case they can use Auxilio (hoping that there are no changes
>>> > >>>>>>>>>>>>>> needed within Spot, in which case I can agree to that). For
>>> > >>>>>>>>>>>>>> example, Netflix stores all their data in S3.
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> The distributed SQL query engine, I would say, should be
>>> > >>>>>>>>>>>>>> pluggable with whatever the user may want to use, and there
>>> > >>>>>>>>>>>>>> are a bunch of them out there. Sure, Impala is better than
>>> > >>>>>>>>>>>>>> Hive, but what if users are already using something else
>>> > >>>>>>>>>>>>>> like Drill or Presto?
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> Personally, I would not assume that users are willing to
>>> > >>>>>>>>>>>>>> deploy all of that and make their existing stack more
>>> > >>>>>>>>>>>>>> complicated; at the very least I would say it is an uphill
>>> > >>>>>>>>>>>>>> battle. Things have been changing rapidly in the big data
>>> > >>>>>>>>>>>>>> space, so whatever we think is standard won't be standard
>>> > >>>>>>>>>>>>>> anymore, but importantly there shouldn't be any reason why
>>> > >>>>>>>>>>>>>> we shouldn't be flexible, right?
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> Also I am not sure why only YARN? Why not make that also
>>> > >>>>>>>>>>>>>> more flexible so users can pick Mesos or standalone?
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> I think flexibility is key for wide adoption, rather than a
>>> > >>>>>>>>>>>>>> tightly coupled architecture.
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> Thanks!
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza
>>> > >>>>>>>>>>>>>> <kenneth@floss.cat>
>>> > >>>>>>>>>>>>>> wrote:
>>> > >>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> PS: you need a big data platform to be able to collect all
>>> > >>>>>>>>>>>>>>> those netflows and logs.
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> Spot isn't intended for SMBs, that's clear, then you need
>>> > >>>>>>>>>>>>>>> loads of data to get ML working properly, and somewhere to
>>> > >>>>>>>>>>>>>>> run those algorithms. That is Hadoop.
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> Regards!
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> Kenneth
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> Sent from my Mi phone
>>> > >>>>>>>>>>>>>>> On Apr 14, 2017 4:04 AM, kant kodali <kanth909@gmail.com> wrote:
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> Hi,
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> Thanks for starting this thread. Here is my feedback.
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> I somehow think the architecture is too complicated for
>>> > >>>>>>>>>>>>>>> wide adoption since it requires installing the following:
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> HDFS.
>>> > >>>>>>>>>>>>>>> HIVE.
>>> > >>>>>>>>>>>>>>> IMPALA.
>>> > >>>>>>>>>>>>>>> KAFKA.
>>> > >>>>>>>>>>>>>>> SPARK (YARN).
>>> > >>>>>>>>>>>>>>> YARN.
>>> > >>>>>>>>>>>>>>> Zookeeper.
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> Currently there are way too many dependencies, which
>>> > >>>>>>>>>>>>>>> discourages a lot of users from using it because they have
>>> > >>>>>>>>>>>>>>> to go through deployment of all that required software. I
>>> > >>>>>>>>>>>>>>> think for wide adoption we should minimize the dependencies
>>> > >>>>>>>>>>>>>>> and have a more pluggable architecture. For example, I am
>>> > >>>>>>>>>>>>>>> not sure why HIVE & IMPALA are both required. Why not just
>>> > >>>>>>>>>>>>>>> use Spark SQL since it's already a dependency, or say users
>>> > >>>>>>>>>>>>>>> may want to use their own distributed query engine they
>>> > >>>>>>>>>>>>>>> like, such as Apache Drill or something else? We should be
>>> > >>>>>>>>>>>>>>> flexible enough to provide that option.
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> Also, I see that HDFS is used such that collectors can
>>> > >>>>>>>>>>>>>>> receive file paths through Kafka and be able to read a
>>> > >>>>>>>>>>>>>>> file. How big are these files? Do we really need HDFS for
>>> > >>>>>>>>>>>>>>> this? Why not provide more ways to send data, such as
>>> > >>>>>>>>>>>>>>> sending data directly through Kafka or, say, just leaving
>>> > >>>>>>>>>>>>>>> it up to the user to specify the file location as an
>>> > >>>>>>>>>>>>>>> argument to the collector process?
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> Finally, I learnt that to generate NetFlow data one would
>>> > >>>>>>>>>>>>>>> require specific hardware. This really means Apache Spot
>>> > >>>>>>>>>>>>>>> is not meant for everyone. I thought Apache Spot could be
>>> > >>>>>>>>>>>>>>> used to analyze the network traffic of any machine, but if
>>> > >>>>>>>>>>>>>>> it requires specific hardware then I think it is targeted
>>> > >>>>>>>>>>>>>>> at a specific group of people.
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> The real strength of Apache Spot should mainly be just
>>> > >>>>>>>>>>>>>>> analyzing network traffic through ML.
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> Thanks!
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <
>>> > >>>>>>>>>>>>>>> nathan.l.segerlind@intel.com> wrote:
>>> > >>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>> Thanks, Nate,
>>> > >>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>> Nate.
>>> > >>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>> -----Original Message-----
>>> > >>>>>>>>>>>>>>>> From: Nate Smith [mailto:natedogs911@gmail.com]
>>> > >>>>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:26 PM
>>> > >>>>>>>>>>>>>>>> To: user@spot.incubator.apache.org
>>> > >>>>>>>>>>>>>>>> Cc: dev@spot.incubator.apache.org; private@spot.incubator.apache.org
>>> > >>>>>>>>>>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
>>> > >>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>> I was really hoping it came through ok, Oh well :) Here’s
>>> > >>>>>>>>>>>>>>>> an image form: http://imgur.com/a/DUDsD
>>> > >>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>> On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <nathan.l.segerlind@intel.com> wrote:
>>> > >>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>> The diagram became garbled in the text format.
>>> > >>>>>>>>>>>>>>>>> Could you resend it as a pdf?
>>> > >>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>> Thanks,
>>> > >>>>>>>>>>>>>>>>> Nate
>>> > >>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>> -----Original Message-----
>>> > >>>>>>>>>>>>>>>>> From: Nathanael Smith [mailto:nathanael@apache.org]
>>> > >>>>>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:01 PM
>>> > >>>>>>>>>>>>>>>>> To: private@spot.incubator.apache.org; dev@spot.incubator.apache.org; user@spot.incubator.apache.org
>>> > >>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>> Subject: [Discuss] - Future plans for Spot-ingest
>>> > >>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>> How would you like to see Spot-ingest change?
>>> > >>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>> A. Continue development on the Python Master/Worker with
>>> > >>>>>>>>>>>>>>>>>    focus on performance / error handling / logging
>>> > >>>>>>>>>>>>>>>>> B. Develop Scala based ingest to be inline with code base
>>> > >>>>>>>>>>>>>>>>>    from ingest, ml, to OA (UI to continue being ipython/JS)
>>> > >>>>>>>>>>>>>>>>> C. Python ingest Worker with Scala based Spark code for
>>> > >>>>>>>>>>>>>>>>>    normalization and input into DB
>>> > >>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>> Including the high level diagram:
>>> > >>>>>>>>>>>>>>>>> [ASCII diagram, garbled by mail re-wrapping. Recoverable
>>> > >>>>>>>>>>>>>>>>> labels: Master (A. Python, B. Scala, C. Python); Worker
>>> > >>>>>>>>>>>>>>>>> (A. Python); Spark Streaming (B. Scala, C. Scala); Note:
>>> > >>>>>>>>>>>>>>>>> running on a worker node in the Hadoop cluster; Local FS
>>> > >>>>>>>>>>>>>>>>> (binary/text log files); hdfs; Hive / Impala (Parquet).]
>>> > >>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>> Please let me know your thoughts,
>>> > >>>>>>>>>>>>>>>>>
>>> > >>>>>>>>>>>>>>>>> - Nathanael
>>> > >>>>>>>>>>>>>>>>>
>>> > >>>> --
>>> > >>>> Michael Ridley <mridley@cloudera.com>
>>> > >>>> office: (650) 352-1337
>>> > >>>> mobile: (571) 438-2420
>>> > >>>> Senior Solutions Architect
>>> > >>>> Cloudera, Inc.
>>> > >>>
>>> > >>>
>>> > >>
>>> > >>
>>> > >>
>>> >
>>> >
>>>
>>>
>


-- 
Michael Ridley <mridley@cloudera.com>
office: (650) 352-1337
mobile: (571) 438-2420
Senior Solutions Architect
Cloudera, Inc.
