spark-issues mailing list archives

From 宿荣全 (JIRA) <j...@apache.org>
Subject [jira] [Commented] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD
Date Thu, 11 Dec 2014 03:55:12 GMT

    [ https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242090#comment-14242090 ]

宿荣全 commented on SPARK-4817:
----------------------------

 [~srowen]
'Neither prints the "top" elements. Did you mean "first"?'
Yes, I mean it prints the first 'num' elements.

Both print and foreachRDD ultimately call {new ForEachDStream(this, context.sparkContext.clean(foreachFunc)).register()}.
The difference:
For print, 'foreachFunc' is defined by Spark Streaming; for foreachRDD, 'foreachFunc' is defined
by the developer. I think the suggested approach, "always call foreachRDD, and operate on all of
the RDD, and then call take on the RDD to get a few elements to print", is the same as print:
it only does the printing and does not handle all elements in the RDD.
For example:
1. val dstream = stream.map->filter->foreachRDD(rdd => {
      val result = rdd.take(11)
      result foreach println
    })
2. val dstream = stream.map->filter->print
Both examples handle only 11 elements: example 1 prints 11 elements, while example 2 prints
10 elements followed by "...".
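
To make the point about "handling 11 elements" concrete, here is a minimal sketch in plain Scala. It is not Spark code: a lazy Iterator stands in for the RDD, and the names PrintSketch, printFirst and demo are hypothetical, introduced only for illustration. It models the behavior described above (fetch num + 1 elements, print the first num, then "..." if at least one more was fetched) and counts how many elements the upstream map step actually touched.

```scala
object PrintSketch {
  // Model of the print behavior described above: fetch num + 1 elements,
  // print the first num, and print "..." if at least one more was fetched.
  // `source` is a plain Iterator standing in for an RDD (an assumption for
  // illustration only; this is not the Spark implementation).
  def printFirst[T](source: Iterator[T], num: Int): List[String] = {
    val fetched = source.take(num + 1).toList
    val lines = fetched.take(num).map(_.toString) ++
      (if (fetched.size > num) List("...") else Nil)
    lines.foreach(println)
    lines
  }

  // Runs the model on 100 elements and reports how many of them the
  // upstream map step was actually applied to.
  def demo(): (List[String], Int) = {
    var handled = 0
    val mapped = (1 to 100).iterator.map { x => handled += 1; x * 2 }
    val lines = printFirst(mapped, 10)
    (lines, handled)
  }

  def main(args: Array[String]): Unit = {
    val (_, handled) = demo()
    println(s"elements handled upstream: $handled")
  }
}
```

In this model the upstream map runs only on the elements that were fetched, not on all 100. Real RDD evaluation is partition-based, so the exact count in Spark may differ; the sketch only illustrates why a take-based print does not force processing of every element.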

So if you want to handle all elements in the RDD and print only the first 'num' of them,
I think this patch is very convenient and necessary.



> [streaming]Print the specified number of data and handle all of the elements in RDD
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-4817
>                 URL: https://issues.apache.org/jira/browse/SPARK-4817
>             Project: Spark
>          Issue Type: New Feature
>          Components: Streaming
>            Reporter: 宿荣全
>            Priority: Minor
>
> DStream.print function: prints 10 elements and handles 11 elements.
> A new function based on DStream.print is proposed:
> print the specified number of elements and handle all of the elements in the RDD.
> A working scenario:
> val dstream = stream.map->filter->mapPartitions->print
> The data after filter needs to update a database in mapPartitions, but there is no need to
> print every element; only the first 20 need to be printed to check the data processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
