spot-dev mailing list archives

From Chokha Palayamkottai <cho...@integralops.com>
Subject Re: [Discuss] - Future plans for Spot-ingest
Date Fri, 14 Apr 2017 16:15:30 GMT
+1 for Python 3.x


On 4/14/2017 11:59 AM, Austin Leahy wrote:
> I think that C is the strong solution, getting the ingest really strong is
> going to lower barriers to adoption. Doing it in Python will open up the
> ingest portion of the project to include many more developers.
>
> Before it comes up, I would like to throw the following on the pile... Major
> Python projects (Django, Flask, and others) are dropping 2.x support in
> releases scheduled in the next 6 to 8 months. Hadoop projects in general tend
> to lag in modern Python support; let's please build this in 3.5 so that we
> don't have to immediately expect a rebuild in the pipeline.
>
> -Vote C
>
> Thanks Nate
>
> Austin
>
> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <alan@apache.org> wrote:
>
>> I really like option C because it gives a lot of flexibility for ingest
>> (Python vs. Scala) but still has the robust Spark Streaming backend for
>> performance.
>>
>> Thanks for putting this together Nate.
>>
>> Alan
>>
>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <
>> chokha@integralops.com> wrote:
>>
>>> I agree. We should continue making the existing stack more mature at
>>> this point. Maybe if we have enough community support we can add
>>> additional datastores.
>>>
>>> Chokha.
>>>
>>>
>>> On 4/14/17 11:10 AM, kenneth@floss.cat wrote:
>>>> Hi Kant,
>>>>
>>>>
>>>> YARN is the standard scheduler in Hadoop. If you're using Hive+Spark,
>>>> then sure you'll have YARN.
>>>>
>>>> Haven't seen any HIVE on Mesos so far. As said, Spot is based on a
>>>> quite standard Hadoop stack and I wouldn't switch too many pieces yet.
>>>>
>>>> In most open-source projects you start by relying on a well-known stack
>>>> and then begin to support other DB backends once it's quite
>>>> mature. Think of the loads of LAMP apps which haven't been ported away
>>>> from MySQL yet.
>>>>
>>>> In any case, you'll need high-performance SQL + Massive Storage +
>>>> Machine Learning + Massive Ingestion, and... ATM, that can only be
>>>> provided by Hadoop.
>>>>
>>>> Regards!
>>>>
>>>> Kenneth
>>>>
>>>> On 2017-04-14 12:56, kant kodali wrote:
>>>>> Hi Kenneth,
>>>>>
>>>>> Thanks for the response. I think you made a case for HDFS; however, users
>>>>> may want to use S3 or some other FS, in which case they can use Alluxio
>>>>> (hoping that there are no changes needed within Spot, in which case I can
>>>>> agree to that). For example, Netflix stores all their data in S3.
>>>>>
>>>>> The distributed SQL query engine, I would say, should be pluggable with
>>>>> whatever the user may want to use, and there are a bunch of them out there.
>>>>> Sure, Impala is better than Hive, but what if users are already using
>>>>> something else like Drill or Presto?
>>>>>
>>>>> Personally, I would not assume that users are willing to deploy all of
>>>>> that and make their existing stack more complicated; at the very least, I
>>>>> would say it is an uphill battle. Things have been changing rapidly in the
>>>>> big data space, so whatever we think is standard won't be standard anymore.
>>>>> More importantly, there shouldn't be any reason why we shouldn't be
>>>>> flexible, right?
>>>>>
>>>>> Also, I am not sure why only YARN. Why not make that more flexible too, so
>>>>> users can pick Mesos or standalone?
>>>>>
>>>>> I think flexibility is key to wide adoption, rather than a tightly
>>>>> coupled architecture.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza <kenneth@floss.cat>
>>>>> wrote:
>>>>>
>>>>>> PS: you need a big data platform to be able to collect all those netflows
>>>>>> and logs.
>>>>>>
>>>>>> Spot isn't intended for SMBs, that's clear. You also need loads of data to
>>>>>> get ML working properly, and somewhere to run those algorithms. That is
>>>>>> Hadoop.
>>>>>>
>>>>>> Regards!
>>>>>>
>>>>>> Kenneth
>>>>>>
>>>>>>
>>>>>>
>>>>>> Sent from my Mi phone
>>>>>> On Apr 14, 2017 4:04 AM, kant kodali <kanth909@gmail.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Thanks for starting this thread. Here is my feedback.
>>>>>>
>>>>>> I somehow think the architecture is too complicated for wide adoption,
>>>>>> since it requires installing the following:
>>>>>>
>>>>>> HDFS.
>>>>>> HIVE.
>>>>>> IMPALA.
>>>>>> KAFKA.
>>>>>> SPARK (YARN).
>>>>>> YARN.
>>>>>> Zookeeper.
>>>>>>
>>>>>> Currently there are way too many dependencies, which discourages a lot of
>>>>>> users from using it because they have to go through deployment of all that
>>>>>> required software. I think for wide adoption we should minimize the
>>>>>> dependencies and have a more pluggable architecture. For example, I am not
>>>>>> sure why Hive and Impala are both required. Why not just use Spark SQL,
>>>>>> since it's already a dependency? Or, say, users may want to use their own
>>>>>> distributed query engine, such as Apache Drill or something else. We
>>>>>> should be flexible enough to provide that option.
>>>>>>
>>>>>> Also, I see that HDFS is used so that collectors can receive file paths
>>>>>> through Kafka and be able to read a file. How big are these files? Do we
>>>>>> really need HDFS for this? Why not provide more ways to send data, such as
>>>>>> sending data directly through Kafka, or just leaving it up to the user to
>>>>>> specify the file location as an argument to the collector process?
>>>>>>
>>>>>> Finally, I learnt that generating NetFlow data requires specific
>>>>>> hardware. This really means Apache Spot is not meant for everyone.
>>>>>> I thought Apache Spot could be used to analyze the network traffic of any
>>>>>> machine, but if it requires specific hardware then I think it is
>>>>>> targeted at a specific group of people.
>>>>>>
>>>>>> The real strength of Apache Spot should mainly be just analyzing network
>>>>>> traffic through ML.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <
>>>>>> nathan.l.segerlind@intel.com> wrote:
>>>>>>
>>>>>>> Thanks, Nate,
>>>>>>>
>>>>>>> Nate.
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Nate Smith [mailto:natedogs911@gmail.com]
>>>>>>> Sent: Thursday, April 13, 2017 4:26 PM
>>>>>>> To: user@spot.incubator.apache.org
>>>>>>> Cc: dev@spot.incubator.apache.org; private@spot.incubator.apache.org
>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
>>>>>>>
>>>>>>> I was really hoping it came through ok,
>>>>>>> Oh well :)
>>>>>>> Here it is in image form:
>>>>>>> http://imgur.com/a/DUDsD
>>>>>>>
>>>>>>>
>>>>>>>> On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <
>>>>>>> nathan.l.segerlind@intel.com> wrote:
>>>>>>>> The diagram became garbled in the text format.
>>>>>>>> Could you resend it as a pdf?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Nate
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Nathanael Smith [mailto:nathanael@apache.org]
>>>>>>>> Sent: Thursday, April 13, 2017 4:01 PM
>>>>>>>> To: private@spot.incubator.apache.org; dev@spot.incubator.apache.org;
>>>>>>>> user@spot.incubator.apache.org
>>>>>>>> Subject: [Discuss] - Future plans for Spot-ingest
>>>>>>>>
>>>>>>>> How would you like to see Spot-ingest change?
>>>>>>>>
>>>>>>>> A. Continue development on the Python Master/Worker with a focus on
>>>>>>>>    performance / error handling / logging
>>>>>>>> B. Develop Scala-based ingest to be in line with the code base from
>>>>>>>>    ingest, ML, to OA (UI to continue being IPython/JS)
>>>>>>>> C. Python ingest Worker with Scala-based Spark code for normalization
>>>>>>>>    and input into DB
>>>>>>>>
>>>>>>>> Including the high level diagram:
>>>>>>>> +------------------------------------------------------------------------------------------+
>>>>>>>> | +--------------------------+                                  +-----------------+        |
>>>>>>>> | | Master                   |  A. B. C.                        | Worker          |        |
>>>>>>>> | |    A. Python             +---------------+      A.          |   A. Python     |        |
>>>>>>>> | |    B. Scala              |               |    +------------->                 +----+   |
>>>>>>>> | |    C. Python             |               |    |             |                 |    |   |
>>>>>>>> | +---^------+---------------+               |    |             +-----------------+    |   |
>>>>>>>> |     |      |                               |    |                                    |   |
>>>>>>>> |     |      |                               |    |                                    |   |
>>>>>>>> |     |     +Note--------------+             |    |             +-----------------+    |   |
>>>>>>>> |     |     |Running on a      |             |    |             | Spark Streaming |    |   |
>>>>>>>> |     |     |worker node in    |             |    |      B. C.  | B. Scala        |    |   |
>>>>>>>> |     |     |the Hadoop cluster|             |    |    +--------> C. Scala        +-+  |   |
>>>>>>>> |     |     +------------------+             |    |    |        |                 | |  |   |
>>>>>>>> |   A.|                                      |    |    |        +-----------------+ |  |   |
>>>>>>>> |   B.|                                      |    |    |                            |  |   |
>>>>>>>> |   C.|                                      |    |    |                            |  |   |
>>>>>>>> | +----------------------+          +-v------+----+----+-+           +--------------v--v-+ |
>>>>>>>> | |                      |          |                    |           |                   | |
>>>>>>>> | |   Local FS:          |          |    hdfs            |           |  Hive / Impala    | |
>>>>>>>> | |  - Binary/Text       |          |                    |           |   - Parquet -     | |
>>>>>>>> | |    Log files -       |          |                    |           |                   | |
>>>>>>>> | |                      |          |                    |           |                   | |
>>>>>>>> | +----------------------+          +--------------------+           +-------------------+ |
>>>>>>>> +------------------------------------------------------------------------------------------+
>>>>>>>> Please let me know your thoughts,
>>>>>>>>
>>>>>>>> - Nathanael
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>

