spark-user mailing list archives

From Philip Ogren <philip.og...@oracle.com>
Subject Re: Where is reduceByKey?
Date Thu, 07 Nov 2013 23:46:25 GMT
Thanks - I think this would be a helpful note to add to the docs.  I 
went and read a few things about Scala implicit conversions (I'm 
obviously new to the language) and it seems like a very powerful 
language feature and now that I know about them it will certainly be 
easy to identify when they are missing (i.e. the first thing to suspect 
when you see a "not a member" compilation message.)  I'm still a bit 
mystified as to how you would go about finding the appropriate imports 
except that I suppose you aren't very likely to use methods that you 
don't already know about!  Unless you are copying code verbatim that 
doesn't have the necessary import statements....
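
A minimal, self-contained sketch of the pattern under discussion: an 
implicit conversion that adds a method to an existing type, which the 
compiler only applies when the conversion is in scope.  All names below 
are made up purely for illustration.

    // A wrapper class that adds an extra method to String.
    class RichGreeting(s: String) {
      def shout: String = s.toUpperCase + "!"
    }

    object Conversions {
      // The implicit conversion that makes the extra method available.
      implicit def stringToRichGreeting(s: String): RichGreeting =
        new RichGreeting(s)
    }

    object Demo {
      def main(args: Array[String]) {
        // Without this import, the line below fails with
        // "value shout is not a member of String" -- the same shape of
        // error as the reduceByKey case.
        import Conversions._
        println("hello spark".shout)   // prints: HELLO SPARK!
      }
    }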


On 11/7/2013 4:05 PM, Matei Zaharia wrote:
> Yeah, this is confusing and unfortunately as far as I know it’s API 
> specific. Maybe we should add this to the documentation page for RDD.
>
> The reason for these conversions is to only allow some operations 
> based on the underlying data type of the collection. For example, 
> Scala collections support sum() as long as they contain numeric types. 
> That’s fine for the Scala collection library since its conversions are 
> imported by default, but I guess it makes it confusing for third-party 
> apps.
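
A small illustration of the element-type-gated behavior Matei describes, 
using plain Scala collections (nothing Spark-specific; here the gating 
happens through implicit Numeric evidence rather than an implicit 
conversion):

    object SumDemo {
      def main(args: Array[String]) {
        val nums  = List(1, 2, 3)
        val words = List("a", "b", "c")

        // Compiles because an implicit Numeric[Int] instance is in scope.
        println(nums.sum)   // 6

        // Would not compile if uncommented:
        // "could not find implicit value for parameter num: Numeric[String]"
        // println(words.sum)
      }
    }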
>
> Matei
>
> On Nov 7, 2013, at 1:15 PM, Philip Ogren <philip.ogren@oracle.com> wrote:
>
>> I remember running into something very similar when trying to perform 
>> a foreach on java.util.List and I fixed it by adding the following 
>> import:
>>
>> import scala.collection.JavaConversions._
>>
>> And my foreach loop magically compiled - presumably due to another 
>> implicit conversion.  Now this is the second time I've run into this 
>> problem and I didn't recognize it.  I'm not sure that I would know 
>> what to do the next time I run into this.  Do you have some advice on 
>> how I should have recognized a missing import that provides implicit 
>> conversions and how I would know what to import?  This strikes me as 
>> code obfuscation.  I guess this is more of a Scala question....
>>
>> Thanks,
>> Philip
>>
>>
>>
>> On 11/7/2013 2:01 PM, Josh Rosen wrote:
>>> The additional methods on RDDs of pairs are defined in a class 
>>> called PairRDDFunctions 
>>> (https://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions).
>>> SparkContext provides an implicit conversion from RDD[(K, V)] to
>>> PairRDDFunctions[K, V] to make this transparent to users.
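
For reference, the conversion Josh describes has roughly this shape 
(paraphrased from memory of the Spark 0.8 source, so the exact name and 
signature may differ):

    // Defined on the SparkContext companion object; only applies to
    // RDDs whose elements are key/value pairs.
    implicit def rddToPairRDDFunctions[K: ClassManifest, V: ClassManifest](
        rdd: RDD[(K, V)]): PairRDDFunctions[K, V] =
      new PairRDDFunctions(rdd)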
>>>
>>> To import those implicit conversions, use
>>>
>>>     import org.apache.spark.SparkContext._
>>>
>>>
>>> These conversions are automatically imported by Spark Shell, but 
>>> you'll have to import them yourself in standalone programs.
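
A rough sketch of what that looks like in a standalone program, with the 
import in place (the application name, master URL, and the collect-and-print 
at the end are placeholders added here for illustration; the input path is 
elided just as in the original example):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // brings the RDD-to-PairRDDFunctions conversion into scope

    object WordCount {
      def main(args: Array[String]) {
        val sc = new SparkContext("local", "WordCount")
        val file = sc.textFile("hdfs://...")
        val counts = file.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)   // compiles thanks to the import above
        counts.collect().foreach(println)
        sc.stop()
      }
    }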
>>>
>>>
>>> On Thu, Nov 7, 2013 at 11:54 AM, Philip Ogren <philip.ogren@oracle.com> wrote:
>>>
>>>     On the front page <http://spark.incubator.apache.org/> of the
>>>     Spark website there is the following simple word count
>>>     implementation:
>>>
>>>     file = spark.textFile("hdfs://...")
>>>     file.flatMap(line => line.split(" "))
>>>         .map(word => (word, 1))
>>>         .reduceByKey(_ + _)
>>>
>>>     The same code can be found in the Quick Start
>>>     <http://spark.incubator.apache.org/docs/latest/quick-start.html>
>>>     guide.  When I follow the steps in my spark-shell (version
>>>     0.8.0) it works fine.  The reduceByKey method is also shown in
>>>     the list of transformations
>>>     <http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html#transformations>
>>>     in the Spark Programming Guide.  The bottom of this list directs
>>>     the reader to the API docs for the class RDD (this link is
>>>     broken, BTW). The API docs for RDD
>>>     <http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.RDD>
>>>     do not list a reduceByKey method for RDD.  Also, when I try to
>>>     compile the above code in a Scala class definition I get the
>>>     following compile error:
>>>
>>>     value reduceByKey is not a member of
>>>     org.apache.spark.rdd.RDD[(java.lang.String, Int)]
>>>
>>>     I am compiling with Maven using the following dependency definition:
>>>
>>>     <dependency>
>>>       <groupId>org.apache.spark</groupId>
>>>       <artifactId>spark-core_2.9.3</artifactId>
>>>       <version>0.8.0-incubating</version>
>>>     </dependency>
>>>
>>>     Can someone help me understand why this code works fine from the
>>>     spark-shell but doesn't seem to exist in the API docs and won't
>>>     compile?
>>>
>>>     Thanks,
>>>     Philip
>>>
>>>
>>
>

