spot-dev mailing list archives

From kant kodali <kanth...@gmail.com>
Subject Re: [Discuss] - Future plans for Spot-ingest
Date Fri, 14 Apr 2017 16:53:22 GMT
What is option C? Am I missing an email or something?

On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <
chokha@integralops.com> wrote:

> +1 for Python 3.x
>
>
>
> On 4/14/2017 11:59 AM, Austin Leahy wrote:
>
>> I think C is the strongest option: getting the ingest really solid is
>> going to lower barriers to adoption, and doing it in Python will open up
>> the ingest portion of the project to many more developers.
>>
>> Before it comes up, I would like to throw the following on the pile: major
>> Python projects (Django, Flask, and others) are dropping 2.x support in
>> releases scheduled over the next 6 to 8 months. Hadoop projects in general
>> tend to lag in modern Python support, so let's please build this on 3.5 so
>> that we don't have to expect an immediate rebuild in the pipeline.
>>
>> -Vote C
>>
>> Thanks Nate
>>
>> Austin
>>
>> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <alan@apache.org> wrote:
>>
>> I really like option C because it gives a lot of flexibility for ingest
>>> (python vs scala) but still has the robust spark streaming backend for
>>> performance.
>>>
>>> Thanks for putting this together Nate.
>>>
>>> Alan
>>>
>>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <
>>> chokha@integralops.com> wrote:
>>>
>>> I agree. We should continue making the existing stack more mature at
>>>> this point. Maybe if we have enough community support we can add
>>>> additional datastores.
>>>>
>>>> Chokha.
>>>>
>>>>
>>>> On 4/14/17 11:10 AM, kenneth@floss.cat wrote:
>>>>
>>>>> Hi Kant,
>>>>>
>>>>>
>>>>> YARN is the standard scheduler in Hadoop. If you're using Hive+Spark,
>>>>> then sure you'll have YARN.
>>>>>
>>>>> Haven't seen any HIVE on Mesos so far. As said, Spot is based on a
>>>>> quite standard Hadoop stack and I wouldn't switch too many pieces yet.
>>>>>
>>>>> In most open-source projects you start by relying on a well-known
>>>>> stack and then begin to support other DB backends once the project is
>>>>> quite mature. Think of the loads of LAMP apps which haven't been ported
>>>>> away from MySQL yet.
>>>>>
>>>>> In any case, you'll need high-performance SQL + massive storage +
>>>>> machine learning + massive ingestion, and... at the moment, that can
>>>>> only be provided by Hadoop.
>>>>>
>>>>> Regards!
>>>>>
>>>>> Kenneth
>>>>>
>>>>> On 2017-04-14 12:56, kant kodali wrote:
>>>>>
>>>>>> Hi Kenneth,
>>>>>>
>>>>>> Thanks for the response. I think you made a case for HDFS, however
>>>>>> users may want to use S3 or some other FS, in which case they can use
>>>>>> Alluxio (hoping that there are no changes needed within Spot, in which
>>>>>> case I can agree to that). For example, Netflix stores all their data
>>>>>> in S3.
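Just to illustrate the S3 case: whether the files sit behind Alluxio or are
read directly over s3a, on the Spark side it is only a different URI. A rough
sketch (the bucket, path, and the assumption that the hadoop-aws jars and
credentials are already configured are all mine):

    from pyspark.sql import SparkSession

    # Illustration only: reading the same Parquet data from S3 instead of HDFS.
    # Assumes hadoop-aws / aws-java-sdk are on the classpath and credentials
    # are configured; the bucket and layout below are made up.
    spark = SparkSession.builder.appName("spot-s3-sketch").getOrCreate()

    flows = spark.read.parquet("s3a://example-spot-bucket/flow/y=2017/m=04/d=14")
    print(flows.count())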
>>>>>>
>>>>>> The distributed SQL query engine, I would say, should be pluggable
>>>>>> with whatever the user may want to use, and there are a bunch of them
>>>>>> out there. Sure, Impala is better than Hive, but what if users are
>>>>>> already using something else like Drill or Presto?
>>>>>>
>>>>>> Personally, I would not assume that users are willing to deploy all of
>>>>>> that and make their existing stack more complicated; at the very least
>>>>>> I would say it is an uphill battle. Things have been changing rapidly
>>>>>> in the Big Data space, so whatever we think is standard won't be
>>>>>> standard anymore, but more importantly there shouldn't be any reason
>>>>>> why we shouldn't be flexible, right?
>>>>>>
>>>>>> Also, I am not sure why only YARN? Why not make that also more
>>>>>> flexible, so users can pick Mesos or standalone.
>>>>>>
>>>>>> I think flexibility is key for wide adoption, rather than a tightly
>>>>>> coupled architecture.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza <kenneth@floss.cat>
>>>>>> wrote:
>>>>>>
>>>>>>> PS: you need a big data platform to be able to collect all those
>>>>>>> netflows and logs.
>>>>>>>
>>>>>>> Spot isn't intended for SMBs, that's clear, since you need loads of
>>>>>>> data to get ML working properly, and somewhere to run those
>>>>>>> algorithms. That is Hadoop.
>>>>>>>
>>>>>>> Regards!
>>>>>>>
>>>>>>> Kenneth
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Sent from my Mi phone
>>>>>>> On kant kodali <kanth909@gmail.com>, Apr 14, 2017 4:04 AM wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Thanks for starting this thread. Here is my feedback.
>>>>>>>
>>>>>>> I somehow think the architecture is too complicated for wide adoption
>>>>>>> since it requires installing the following:
>>>>>>>
>>>>>>> HDFS.
>>>>>>> HIVE.
>>>>>>> IMPALA.
>>>>>>> KAFKA.
>>>>>>> SPARK (YARN).
>>>>>>> YARN.
>>>>>>> Zookeeper.
>>>>>>>
>>>>>>> Currently there are way too many dependencies, which discourages a
>>>>>>> lot of users from using it, because they have to go through deployment
>>>>>>> of all that required software. I think for wide adoption we should
>>>>>>> minimize the dependencies and have a more pluggable architecture. For
>>>>>>> example, I am not sure why both HIVE & IMPALA are required? Why not
>>>>>>> just use Spark SQL, since it is already a dependency, or users may
>>>>>>> simply want to use their own distributed query engine, such as Apache
>>>>>>> Drill or something else. We should be flexible enough to provide that
>>>>>>> option.
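To make the Spark SQL point concrete, here is a rough sketch of the kind of
query I mean; the Parquet path and the column names are only my assumptions
about how the normalized flow data lands, not Spot's actual layout:

    from pyspark.sql import SparkSession

    # Sketch: query the normalized records with the Spark SQL engine that the
    # ingest pipeline already depends on, instead of requiring Hive + Impala.
    spark = SparkSession.builder.appName("spot-sql-sketch").getOrCreate()

    # Assumed location/schema of the Parquet written by the ingest step.
    flows = spark.read.parquet("hdfs:///user/spot/flow/y=2017/m=04/d=14")
    flows.createOrReplaceTempView("flow")

    # The same kind of aggregation the OA layer would otherwise push to Impala.
    top_talkers = spark.sql("""
        SELECT sip, dip, SUM(ibyt) AS total_bytes
        FROM flow
        GROUP BY sip, dip
        ORDER BY total_bytes DESC
        LIMIT 20
    """)
    top_talkers.show()

The same Parquet files could just as well be queried by Drill or Presto, which
is really the point about keeping the engine pluggable.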
>>>>>>>
>>>>>>> Also, I see that HDFS is used so that collectors can receive file
>>>>>>> paths through Kafka and be able to read a file. How big are these
>>>>>>> files? Do we really need HDFS for this? Why not provide more ways to
>>>>>>> send data, such as sending data directly through Kafka, or just
>>>>>>> leaving it up to the user to specify the file location as an argument
>>>>>>> to the collector process.
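For the collector itself, something along these lines is what I have in mind,
as a minimal sketch only; the flags and the kafka-python usage are purely
illustrative, not Spot's current interface:

    import argparse
    from kafka import KafkaConsumer  # pip install kafka-python

    def read_records(args):
        """Yield raw records from a local file or straight from Kafka, so
        HDFS is not a hard requirement just to hand data to the worker."""
        if args.file:
            # The user points the collector directly at a file on the local FS.
            with open(args.file, "rb") as source:
                for line in source:
                    yield line
        else:
            # Or the raw data itself (not just a path) is published to Kafka.
            consumer = KafkaConsumer(args.topic,
                                     bootstrap_servers=args.brokers.split(","))
            for message in consumer:
                yield message.value

    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="toy collector sketch")
        parser.add_argument("--file", help="path to a local capture/log file")
        parser.add_argument("--topic", default="spot-raw",
                            help="Kafka topic carrying raw records")
        parser.add_argument("--brokers", default="localhost:9092")
        for record in read_records(parser.parse_args()):
            print(len(record))  # placeholder for parsing/normalization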
>>>>>>>
>>>>>>> Finally, I learnt that to generate NetFlow data one would require
>>>>>>> specific hardware. This really means Apache Spot is not meant for
>>>>>>> everyone. I thought Apache Spot could be used to analyze the network
>>>>>>> traffic of any machine, but if it requires specific hardware then I
>>>>>>> think it is targeted at a specific group of people.
>>>>>>>
>>>>>>> The real strength of Apache Spot should mainly be just analyzing
>>>>>>> network traffic through ML.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <
>>>>>>> nathan.l.segerlind@intel.com> wrote:
>>>>>>>
>>>>>>> Thanks, Nate,
>>>>>>>>
>>>>>>>> Nate.
>>>>>>>>
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Nate Smith [mailto:natedogs911@gmail.com]
>>>>>>>> Sent: Thursday, April 13, 2017 4:26 PM
>>>>>>>> To: user@spot.incubator.apache.org
>>>>>>>> Cc: dev@spot.incubator.apache.org; private@spot.incubator.apache.org
>>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
>>>>>>>>
>>>>>>>> I was really hoping it came through ok,
>>>>>>>> Oh well :)
>>>>>>>> Here’s an image form:
>>>>>>>> http://imgur.com/a/DUDsD
>>>>>>>>
>>>>>>>>
>>>>>>>> On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <
>>>>>>>> nathan.l.segerlind@intel.com> wrote:
>>>>>>>>
>>>>>>>>> The diagram became garbled in the text format.
>>>>>>>>> Could you resend it as a pdf?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Nate
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Nathanael Smith [mailto:nathanael@apache.org]
>>>>>>>>> Sent: Thursday, April 13, 2017 4:01 PM
>>>>>>>>> To: private@spot.incubator.apache.org; dev@spot.incubator.apache.org;
>>>>>>>>> user@spot.incubator.apache.org
>>>>>>>>> Subject: [Discuss] - Future plans for Spot-ingest
>>>>>>>>>
>>>>>>>>> How would you like to see Spot-ingest change?
>>>>>>>>>
>>>>>>>>> A. Continue development on the Python Master/Worker with focus on
>>>>>>>>>    performance / error handling / logging.
>>>>>>>>> B. Develop Scala-based ingest to be in line with the code base from
>>>>>>>>>    ingest, ml, to OA (UI to continue being ipython/JS).
>>>>>>>>> C. Python ingest Worker with Scala-based Spark code for
>>>>>>>>>    normalization and input into the DB.
>>>>>>>>>
>>>>>>>>> Including the high level diagram:
>>>>>>>>> [The ASCII diagram was garbled by the archive's text formatting; the
>>>>>>>>> image version is at http://imgur.com/a/DUDsD. Its components: a
>>>>>>>>> Master (A. Python / B. Scala / C. Python), a Worker (A. Python /
>>>>>>>>> B. Scala / C. Scala, on Spark Streaming), a note "Running on a
>>>>>>>>> worker node in the Hadoop cluster", the local FS (binary/text log
>>>>>>>>> files), HDFS, and Hive / Impala (Parquet).]
>>>>>>>>>
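To make option C a bit more concrete, the Python side of that hand-off could
look roughly like the sketch below; the topic name, jar, and Scala class are
invented for illustration and are not Spot's actual interfaces:

    import subprocess
    from kafka import KafkaConsumer  # pip install kafka-python

    # Sketch of an option C worker: Python stays the glue that watches the
    # ingest topic, while the heavy normalization runs as a Scala Spark job.
    consumer = KafkaConsumer("spot-ingest-flow",
                             bootstrap_servers="localhost:9092",
                             group_id="spot-workers")

    for message in consumer:
        # Assume the master publishes the location of a newly landed file.
        file_path = message.value.decode("utf-8")
        # Hand the file to the (hypothetical) Scala normalization job, which
        # writes Parquet that Hive/Impala or Spark SQL can then query.
        subprocess.check_call([
            "spark-submit",
            "--master", "yarn",
            "--class", "org.apache.spot.ingest.FlowNormalizer",  # made-up class
            "spot-ingest-assembly.jar",                           # made-up jar
            file_path,
        ])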
>>>>>>>>> Please let me know your thoughts,
>>>>>>>>>
>>>>>>>>> - Nathanael
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>
>
