spark-user mailing list archives

From madhu phatak <phatak....@gmail.com>
Subject Re: MappedStream vs Transform API
Date Tue, 17 Mar 2015 09:08:52 GMT
Hi,
Regards,
Madhukara Phatak
http://datamantra.io/

On Tue, Mar 17, 2015 at 2:31 PM, Tathagata Das <tdas@databricks.com> wrote:

> That's not super essential, and hence hasn't been done till now. Even in
> core Spark there are MappedRDD, etc., even though all of them can be
> implemented by MapPartitionedRDD (maybe the name is wrong). So it's nice to
> maintain the consistency: MappedDStream creates MappedRDDs. :)
> Though this does not eliminate the possibility that we will do it. Maybe
> in the future, if we find that maintaining these different DStreams is
> becoming a maintenance burden (it isn't yet), we may collapse them to use
> transform. We did so in the Python API for exactly this reason.
>

  OK. When I was going through the source code, I found it hard to tell
what the right extension points were. So I thought whoever goes through
the code may run into the same situation. But if it's not super essential,
then OK.

>
> If you are interested in contributing to Spark Streaming, I can point you
> to a number of issues where your contributions will be more valuable.
>

   Yes please.

>
> TD
>
> On Tue, Mar 17, 2015 at 1:56 AM, madhu phatak <phatak.dev@gmail.com>
> wrote:
>
>> Hi,
>>  Thank you for the response.
>>
>>  Can I submit a PR to use transform for all the functions like map, flatMap,
>> etc., so they are consistent with the other APIs?
>>
>> Regards,
>> Madhukara Phatak
>> http://datamantra.io/
>>
>> On Mon, Mar 16, 2015 at 11:42 PM, Tathagata Das <tdas@databricks.com>
>> wrote:
>>
>>> It's mostly for legacy reasons. First we added all the MappedDStream,
>>> etc., and then later we realized we needed to expose something that is
>>> more generic for arbitrary RDD-to-RDD transformations. It can be easily
>>> replaced. However, there is some value in having MappedDStream, for
>>> developers to learn about DStreams.
>>>
>>> TD
>>>
>>> On Mon, Mar 16, 2015 at 3:37 AM, madhu phatak <phatak.dev@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>  Thanks for the response. I understand that part. But I am asking why
>>>> the internal implementation uses a subclass when it can use an existing
>>>> API. Unless there is a real difference, it feels like a code smell to me.
>>>>
>>>>
>>>> Regards,
>>>> Madhukara Phatak
>>>> http://datamantra.io/
>>>>
>>>> On Mon, Mar 16, 2015 at 2:14 PM, Shao, Saisai <saisai.shao@intel.com>
>>>> wrote:
>>>>
>>>>>  I think both ways are OK for writing a streaming job. `transform` is
>>>>> a more general way to transform one DStream into another when there’s
>>>>> no related DStream API (but there is a related RDD API). But using map
>>>>> may be more straightforward and easier to understand.
>>>>>
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>> Jerry
>>>>>
>>>>>
>>>>>
>>>>> *From:* madhu phatak [mailto:phatak.dev@gmail.com]
>>>>> *Sent:* Monday, March 16, 2015 4:32 PM
>>>>> *To:* user@spark.apache.org
>>>>> *Subject:* MappedStream vs Transform API
>>>>>
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>>   The current implementation of the map function in Spark Streaming
>>>>> looks as below.
>>>>>
>>>>>
>>>>>
>>>>>   def map[U: ClassTag](mapFunc: T => U): DStream[U] = {
>>>>>     new MappedDStream(this, context.sparkContext.clean(mapFunc))
>>>>>   }
>>>>>
>>>>>  It creates an instance of MappedDStream which is a subclass of
>>>>> DStream.
>>>>>
>>>>>
>>>>>
>>>>> The same function can also be implemented using the transform API:
>>>>>
>>>>>
>>>>>
>>>>>   def map[U: ClassTag](mapFunc: T => U): DStream[U] =
>>>>>     this.transform(rdd => {
>>>>>       rdd.map(mapFunc)
>>>>>     })
>>>>>
>>>>>
>>>>>
>>>>> Both implementations look the same. If they are the same, is there
>>>>> any advantage to having a subclass of DStream? Why can't we just use
>>>>> the transform API?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Regards,
>>>>> Madhukara Phatak
>>>>> http://datamantra.io/
>>>>>
>>>>
>>>>
>>>
>>
>
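
The two styles compared in this thread can be illustrated with a small, self-contained sketch. It does not use Spark: ToyStream, MappedToyStream, mapViaSubclass and mapViaTransform are hypothetical names invented for illustration, a plain Seq stands in for an RDD, and an integer tick stands in for a batch time. It only suggests why the two definitions of map should yield the same batches; it says nothing about closure cleaning, naming, or other details of the real DStream classes.

```scala
// Toy model only: these are NOT Spark classes. A "batch" is a plain Seq
// standing in for an RDD, and a stream yields one batch per time tick.
abstract class ToyStream[T] { self =>
  // Batch produced at a given tick (hypothetical stand-in for a DStream's per-batch RDD).
  def compute(time: Int): Seq[T]

  // Style 1: a dedicated subclass per operation, in the spirit of MappedDStream.
  def mapViaSubclass[U](f: T => U): ToyStream[U] = new MappedToyStream(self, f)

  // Generic batch-to-batch hook, in the spirit of transform.
  def transform[U](f: Seq[T] => Seq[U]): ToyStream[U] = new ToyStream[U] {
    def compute(time: Int): Seq[U] = f(self.compute(time))
  }

  // Style 2: map expressed through transform, with no dedicated subclass.
  def mapViaTransform[U](f: T => U): ToyStream[U] = transform(batch => batch.map(f))
}

// The dedicated subclass used by style 1.
class MappedToyStream[T, U](parent: ToyStream[T], f: T => U) extends ToyStream[U] {
  def compute(time: Int): Seq[U] = parent.compute(time).map(f)
}

object ToyStreamDemo {
  // A source whose batch at tick t is Seq(t, t + 1, t + 2).
  val source: ToyStream[Int] = new ToyStream[Int] {
    def compute(time: Int): Seq[Int] = Seq(time, time + 1, time + 2)
  }

  def main(args: Array[String]): Unit = {
    val a = source.mapViaSubclass(_ * 10)
    val b = source.mapViaTransform(_ * 10)
    // Both styles produce the same batch for every tick checked.
    (0 to 3).foreach(t => assert(a.compute(t) == b.compute(t)))
    println(a.compute(1)) // List(10, 20, 30)
  }
}
```

In this toy model the subclass adds no behavior that transform cannot express, which matches the observation in the thread that the dedicated DStream subclasses exist mainly for legacy and consistency reasons.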
