spark-user mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: Where is reduceByKey?
Date Thu, 07 Nov 2013 23:05:28 GMT
Yeah, this is confusing and unfortunately as far as I know it’s API specific. Maybe we should
add this to the documentation page for RDD.

The reason for these conversions is to only allow some operations based on the underlying
data type of the collection. For example, Scala collections support sum() as long as they
contain numeric types. That’s fine for the Scala collection library since its conversions
are imported by default, but I guess it makes it confusing for third-party apps.
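
For example, here is a rough sketch of the same idea in plain Scala (just an illustration, not Spark code): the operation only exists for element types that have the right capability in scope.

// sum only compiles when an implicit Numeric instance exists for the element type
def sum[T](xs: Seq[T])(implicit num: Numeric[T]): T =
  xs.foldLeft(num.zero)(num.plus)

sum(Seq(1, 2, 3))     // fine: Numeric[Int] is in scope
// sum(Seq("a", "b")) // won't compile: there is no Numeric[String]

Spark does something similar: the pair-specific operations are defined in PairRDDFunctions, and an implicit conversion adds them only to RDDs of key-value pairs.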

Matei

On Nov 7, 2013, at 1:15 PM, Philip Ogren <philip.ogren@oracle.com> wrote:

> I remember running into something very similar when trying to perform a foreach on a java.util.List, and I fixed it by adding the following import:
> 
> import scala.collection.JavaConversions._
> 
> And my foreach loop magically compiled - presumably due to another implicit conversion. Now this is the second time I've run into this problem, and I didn't recognize it. I'm not sure that I would know what to do the next time I run into it. Do you have some advice on how I should have recognized that an import providing implicit conversions was missing, and how I would know what to import? This strikes me as code obfuscation. I guess this is more of a Scala question....
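> 
> For reference, here is a toy version of what I mean (not my actual code, just an illustration):
> 
> import scala.collection.JavaConversions._
> 
> val list: java.util.List[String] = new java.util.ArrayList[String]()
> list.add("hello")
> list.foreach(println)  // only compiles because of the implicit conversion brought in above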
> 
> Thanks,
> Philip
> 
> 
> 
> On 11/7/2013 2:01 PM, Josh Rosen wrote:
>> The additional methods on RDDs of pairs are defined in a class called PairRDDFunctions (https://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions). SparkContext provides an implicit conversion from RDD[(K, V)] to PairRDDFunctions[K, V] to make this transparent to users.
>> 
>> To import those implicit conversions, use
>> 
>> import org.apache.spark.SparkContext._
>> 
>> These conversions are automatically imported by Spark Shell, but you'll have to import them yourself in standalone programs.
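>> 
>> For example, a minimal sketch of a standalone program (this assumes a SparkContext named sc has already been created):
>> 
>> import org.apache.spark.SparkContext._
>> 
>> val counts = sc.textFile("hdfs://...")
>>   .flatMap(line => line.split(" "))
>>   .map(word => (word, 1))
>>   .reduceByKey(_ + _)  // resolves through the implicit conversion to PairRDDFunctions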
>> 
>> 
>> On Thu, Nov 7, 2013 at 11:54 AM, Philip Ogren <philip.ogren@oracle.com> wrote:
>> On the front page of the Spark website there is the following simple word count implementation:
>> 
>> file = spark.textFile("hdfs://...")
>> file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
>> 
>> The same code can be found in the Quick Start guide. When I follow the steps in my spark-shell (version 0.8.0) it works fine. The reduceByKey method is also shown in the list of transformations in the Spark Programming Guide. The bottom of this list directs the reader to the API docs for the RDD class (this link is broken, BTW). The API docs for RDD do not list a reduceByKey method. Also, when I try to compile the above code in a Scala class definition, I get the following compile error:
>> 
>> value reduceByKey is not a member of org.apache.spark.rdd.RDD[(java.lang.String, Int)]
>> 
>> I am compiling with maven using the following dependency definition:
>> 
>>         <dependency>
>>             <groupId>org.apache.spark</groupId>
>>             <artifactId>spark-core_2.9.3</artifactId>
>>             <version>0.8.0-incubating</version>
>>         </dependency>
>> 
>> Can someone help me understand why this code works fine from the spark-shell, but reduceByKey doesn't seem to exist in the API docs and the code won't compile?
>> 
>> Thanks,
>> Philip
>> 
> 

