spark-user mailing list archives

From Sheryl John <shery...@gmail.com>
Subject Re: about a JavaWordCount example with spark-core_2.10-1.0.0.jar
Date Thu, 26 Jun 2014 18:38:21 GMT
Make sure you have the right versions of 'mongo-hadoop-core' and
'hadoop-client'. I noticed you've used hadoop-client version 2.2.0.

The Maven repositories do not have a mongo-hadoop connector built for Hadoop
2.2.0, so you have to include the mongo-hadoop connector as an unmanaged
library (as described in http://codeforhire.com/2014/02/18/using-spark-with-mongodb/
and its example code).
I'm not sure whether this will fix the error.
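
For reference, one way to handle the unmanaged jar with Maven is to install a
locally built connector jar into the local repository and then declare it as a
normal dependency. This is only a sketch; the file path and the version string
below are placeholders for whatever connector build you end up with:

mvn install:install-file \
  -Dfile=/path/to/mongo-hadoop-core-hadoop2.2.0.jar \
  -DgroupId=org.mongodb \
  -DartifactId=mongo-hadoop-core \
  -Dversion=1.0.0-hadoop2.2.0 \
  -Dpackaging=jar

Then point the mongo-hadoop-core <dependency> in the pom.xml at that version so
the shade plugin bundles the local copy.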


On Thu, Jun 26, 2014 at 3:23 AM, Alonso Isidoro Roman <alonsoir@gmail.com>
wrote:

> Hi Sheryl,
>
> First of all, thanks for answering. The spark-core_2.10 artifact depends on
> hadoop-client-1.0.4.jar, and mongo-hadoop-core-1.0.0.jar depends on
> mongo-java-driver-2.7.3.jar.
>
> This is my current pom.xml:
>
> <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="
> http://www.w3.org/2001/XMLSchema-instance"
>
> xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
> http://maven.apache.org/xsd/maven-4.0.0.xsd">
>
> <groupId>com.aironman.spark</groupId>
>
> <artifactId>simple-project</artifactId>
>
> <modelVersion>4.0.0</modelVersion>
>
> <name>Simple Project</name>
>
> <packaging>jar</packaging>
>
> <version>1.0</version>
>
> <properties>
>
> <maven.compiler.source>1.7</maven.compiler.source>
>
> <maven.compiler.target>1.7</maven.compiler.target>
>
> <hadoop.version>1.0.2</hadoop.version>
>
> </properties>
>
>
>  <repositories>
>
> <repository>
>
> <id>Akka repository</id>
>
> <url>http://repo.akka.io/releases</url>
>
> </repository>
>
> </repositories>
>
> <dependencies>
>
> <dependency> <!-- Spark dependency -->
>
> <groupId>org.apache.spark</groupId>
>
> <artifactId>spark-core_2.10</artifactId>
>
> <version>1.0.0</version>
>
> </dependency>
>
> <dependency> <!-- Mongodb dependency -->
>
> <groupId>org.mongodb</groupId>
>
> <artifactId>mongo-hadoop-core</artifactId>
>
> <version>1.0.0</version>
>
> </dependency>
>
>
>  <dependency>
>
> <groupId>org.apache.hadoop</groupId>
>
> <artifactId>hadoop-client</artifactId>
>
> <version>2.2.0</version>
>
> </dependency>
>
> <dependency>
>
> <groupId>org.mongodb</groupId>
>
> <artifactId>mongo-java-driver</artifactId>
>
> <version>2.9.3</version>
>
> </dependency>
>
> </dependencies>
>
> <build>
>
> <outputDirectory>target/java-${maven.compiler.source}/classes</
> outputDirectory>
>
> <testOutputDirectory>target/java-${maven.compiler.source}/test-classes</
> testOutputDirectory>
>
> <plugins>
>
> <plugin>
>
> <groupId>org.apache.maven.plugins</groupId>
>
> <artifactId>maven-shade-plugin</artifactId>
>
> <version>2.3</version>
>
> <configuration>
>
>
>  <shadedArtifactAttached>false</shadedArtifactAttached>
>
>
>  <outputFile>
> ${project.build.directory}/java-${maven.compiler.source}/ConnectorSparkMongo-${project.version}-hadoop${hadoop.version}.jar
> </outputFile>
>
> <artifactSet>
>
> <includes>
>
> <include>*:*</include>
>
> </includes>
>
> </artifactSet>
>
>
>  <filters>
>
> <filter>
>
> <artifact>*:*</artifact>
>
> <!-- <excludes> <exclude>META-INF/*.SF</exclude>
> <exclude>META-INF/*.DSA</exclude>
>
> <exclude>META-INF/*.RSA</exclude> </excludes> -->
>
> </filter>
>
> </filters>
>
>
>  </configuration>
>
> <executions>
>
> <execution>
>
> <phase>package</phase>
>
> <goals>
>
> <goal>shade</goal>
>
> </goals>
>
> <configuration>
>
> <transformers>
>
> <transformer
>
> implementation=
> "org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
>
> <transformer
>
> implementation=
> "org.apache.maven.plugins.shade.resource.AppendingTransformer">
>
> <resource>reference.conf</resource>
>
> </transformer>
>
> <transformer
>
> implementation=
> "org.apache.maven.plugins.shade.resource.DontIncludeResourceTransformer">
>
> <resource>log4j.properties</resource>
>
> </transformer>
>
> </transformers>
>
> </configuration>
>
> </execution>
>
> </executions>
>
> </plugin>
>
> </plugins>
>
> </build>
>
> </project>
>
> and the error persists. This is the spark-submit command's output:
>
> MacBook-Pro-Retina-de-Alonso:ConnectorSparkMongo aironman$
> /Users/aironman/spark-1.0.0/bin/spark-submit --class
> "com.aironman.spark.utils.JavaWordCount" --master local[4]
> target/simple-project-1.0.jar
>
> 14/06/26 12:05:21 INFO SecurityManager: Using Spark's default log4j
> profile: org/apache/spark/log4j-defaults.properties
>
> 14/06/26 12:05:21 INFO SecurityManager: Changing view acls to: aironman
>
> 14/06/26 12:05:21 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(aironman)
>
> 14/06/26 12:05:21 INFO Slf4jLogger: Slf4jLogger started
>
> 14/06/26 12:05:21 INFO Remoting: Starting remoting
>
> 14/06/26 12:05:21 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://spark@192.168.1.35:49681]
>
> 14/06/26 12:05:21 INFO Remoting: Remoting now listens on addresses:
> [akka.tcp://spark@192.168.1.35:49681]
>
> 14/06/26 12:05:21 INFO SparkEnv: Registering MapOutputTracker
>
> 14/06/26 12:05:21 INFO SparkEnv: Registering BlockManagerMaster
>
> 14/06/26 12:05:21 INFO DiskBlockManager: Created local directory at
> /var/folders/gn/pzkybyfd2g5bpyh47q0pp5nc0000gn/T/spark-local-20140626120521-09d2
>
> 14/06/26 12:05:21 INFO MemoryStore: MemoryStore started with capacity
> 294.9 MB.
>
> 14/06/26 12:05:21 INFO ConnectionManager: Bound socket to port 49682 with
> id = ConnectionManagerId(192.168.1.35,49682)
>
> 14/06/26 12:05:21 INFO BlockManagerMaster: Trying to register BlockManager
>
> 14/06/26 12:05:21 INFO BlockManagerInfo: Registering block manager
> 192.168.1.35:49682 with 294.9 MB RAM
>
> 14/06/26 12:05:21 INFO BlockManagerMaster: Registered BlockManager
>
> 14/06/26 12:05:21 INFO HttpServer: Starting HTTP Server
>
> 14/06/26 12:05:21 INFO HttpBroadcast: Broadcast server started at
> http://192.168.1.35:49683
>
> 14/06/26 12:05:21 INFO HttpFileServer: HTTP File server directory is
> /var/folders/gn/pzkybyfd2g5bpyh47q0pp5nc0000gn/T/spark-e0cff70b-9597-452f-b5db-99aef3245e04
>
> 14/06/26 12:05:21 INFO HttpServer: Starting HTTP Server
>
> 14/06/26 12:05:21 INFO SparkUI: Started SparkUI at
> http://192.168.1.35:4040
>
> 2014-06-26 12:05:21.915 java[880:1903] Unable to load realm info from
> SCDynamicStore
>
> 14/06/26 12:05:22 INFO SparkContext: Added JAR
> file:/Users/aironman/Documents/ws-spark/ConnectorSparkMongo/target/simple-project-1.0.jar
> at http://192.168.1.35:49684/jars/simple-project-1.0.jar with timestamp
> 1403777122023
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> com/mongodb/hadoop/MongoInputFormat
>
> at com.aironman.spark.utils.JavaWordCount.main(JavaWordCount.java:50)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:606)
>
> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
>
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
>
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> Caused by: java.lang.ClassNotFoundException:
> com.mongodb.hadoop.MongoInputFormat
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>
> ... 8 more
>
>
> com.mongodb.hadoop.MongoInputFormat is located in mongo-hadoop-core-1.0.0.jar,
> and theoretically that jar is included in the uber jar, because I can see this
> in the mvn clean package output:
>
>
> [INFO] Including org.mongodb:mongo-hadoop-core:jar:1.0.0 in the shaded jar.
>
>
> Yet it looks like the dependency jar is not included in the uber jar, despite
> the line shown above...
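
One thing that might explain this: the shade plugin above writes its output to
${project.build.directory}/java-1.7/ConnectorSparkMongo-1.0-hadoop1.0.2.jar,
while spark-submit is pointed at target/simple-project-1.0.jar, which would be
the plain (unshaded) artifact. A quick sanity check, with the paths assumed
from the pom.xml and the command above, is to list both jars:

jar tf target/java-1.7/ConnectorSparkMongo-1.0-hadoop1.0.2.jar | grep MongoInputFormat
jar tf target/simple-project-1.0.jar | grep MongoInputFormat

If the class only shows up in the shaded jar, submitting that jar instead may
make the NoClassDefFoundError go away.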
>
>
> Thanks in advance, I really appreciate it.
>
> Alonso Isidoro Roman.
>
> Mis citas preferidas (de hoy) :
> "Si depurar es el proceso de quitar los errores de software, entonces
> programar debe ser el proceso de introducirlos..."
>  -  Edsger Dijkstra
>
> My favorite quotes (today):
> "If debugging is the process of removing software bugs, then programming
> must be the process of putting ..."
>   - Edsger Dijkstra
>
> "If you pay peanuts you get monkeys"
>
>
>
> 2014-06-25 18:47 GMT+02:00 Sheryl John <sheryljj@gmail.com>:
>
> Hi Alonso,
>>
>> I was able to get Spark working with MongoDB following the above
>> instructions and used SBT to manage dependencies.
>> Try adding 'hadoop-client' and 'mongo-java-driver' dependencies in your
>> pom.xml.
>>
>>
>>
>> On Tue, Jun 24, 2014 at 8:06 AM, Alonso Isidoro Roman <alonsoir@gmail.com> wrote:
>>
>>> Hi Yana, thanks for the answer. Of course you are right, and it works with no
>>> errors. Now I am trying to execute it and I am getting an error:
>>>
>>> Exception in thread "main" java.lang.NoClassDefFoundError:
>>> com/mongodb/hadoop/MongoInputFormat
>>>
>>> at com.aironman.spark.utils.JavaWordCount.main(JavaWordCount.java:49)
>>>
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>
>>> at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>
>>> at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>
>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>
>>> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
>>>
>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
>>>
>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>
>>> Caused by: java.lang.ClassNotFoundException:
>>> com.mongodb.hadoop.MongoInputFormat
>>>
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>>
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>>
>>> at java.security.AccessController.doPrivileged(Native Method)
>>>
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>>
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>>
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>>
>>> ... 8 more
>>>
>>>
>>> That class is located in mongo-hadoop-core-1.0.0.jar, which is declared in
>>> the pom.xml, and I am using the maven-shade-plugin. This is the current
>>> pom.xml:
>>>
>>>
>>>
>>> <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="
>>> http://www.w3.org/2001/XMLSchema-instance"
>>>
>>> xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
>>> http://maven.apache.org/xsd/maven-4.0.0.xsd">
>>>
>>> <groupId>com.aironman.spark</groupId>
>>>
>>> <artifactId>simple-project</artifactId>
>>>
>>> <modelVersion>4.0.0</modelVersion>
>>>
>>> <name>Simple Project</name>
>>>
>>> <packaging>jar</packaging>
>>>
>>> <version>1.0</version>
>>>
>>> <properties>
>>>
>>> <maven.compiler.source>1.7</maven.compiler.source>
>>>
>>> <maven.compiler.target>1.7</maven.compiler.target>
>>>
>>> <hadoop.version>1.0.2</hadoop.version>
>>>
>>> </properties>
>>>
>>>
>>> <repositories>
>>>
>>> <repository>
>>>
>>> <id>Akka repository</id>
>>>
>>> <url>http://repo.akka.io/releases</url>
>>>
>>> </repository>
>>>
>>> </repositories>
>>>
>>> <dependencies>
>>>
>>> <dependency> <!-- Spark dependency -->
>>>
>>> <groupId>org.apache.spark</groupId>
>>>
>>> <artifactId>spark-core_2.10</artifactId>
>>>
>>> <version>1.0.0</version>
>>>
>>> </dependency>
>>>
>>>
>>> <dependency>
>>>
>>> <groupId>org.mongodb</groupId>
>>>
>>> <artifactId>mongo-hadoop-core</artifactId>
>>>
>>> <version>1.0.0</version>
>>>
>>> </dependency>
>>>
>>>
>>> </dependencies>
>>>
>>> <build>
>>>
>>> <outputDirectory>target/java-${maven.compiler.source}/classes</
>>> outputDirectory>
>>>
>>> <testOutputDirectory>target/java-${maven.compiler.source}/test-classes</
>>> testOutputDirectory>
>>>
>>> <plugins>
>>>
>>> <plugin>
>>>
>>> <groupId>org.apache.maven.plugins</groupId>
>>>
>>> <artifactId>maven-shade-plugin</artifactId>
>>>
>>> <version>2.3</version>
>>>
>>> <configuration>
>>>
>>>  <shadedArtifactAttached>false</shadedArtifactAttached>
>>>
>>>  <outputFile>
>>> ${project.build.directory}/java-${maven.compiler.source}/ConnectorSparkMongo-${project.version}-hadoop${hadoop.version}.jar
>>> </outputFile>
>>>
>>> <artifactSet>
>>>
>>> <includes>
>>>
>>> <include>*:*</include>
>>>
>>> </includes>
>>>
>>> </artifactSet>
>>>
>>> <!--
>>>
>>>  <filters>
>>>
>>> <filter>
>>>
>>>  <artifact>*:*</artifact>
>>>
>>> <excludes>
>>>
>>> <exclude>META-INF/*.SF</exclude>
>>>
>>>  <exclude>META-INF/*.DSA</exclude>
>>>
>>> <exclude>META-INF/*.RSA</exclude>
>>>
>>> </excludes>
>>>
>>>  </filter>
>>>
>>> </filters>
>>>
>>>  -->
>>>
>>> </configuration>
>>>
>>> <executions>
>>>
>>> <execution>
>>>
>>> <phase>package</phase>
>>>
>>> <goals>
>>>
>>> <goal>shade</goal>
>>>
>>> </goals>
>>>
>>> <configuration>
>>>
>>> <transformers>
>>>
>>> <transformer
>>>
>>> implementation=
>>> "org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
>>>
>>> <transformer
>>>
>>> implementation=
>>> "org.apache.maven.plugins.shade.resource.AppendingTransformer">
>>>
>>> <resource>reference.conf</resource>
>>>
>>> </transformer>
>>>
>>> <transformer
>>>
>>> implementation=
>>> "org.apache.maven.plugins.shade.resource.DontIncludeResourceTransformer"
>>> >
>>>
>>> <resource>log4j.properties</resource>
>>>
>>> </transformer>
>>>
>>> </transformers>
>>>
>>> </configuration>
>>>
>>> </execution>
>>>
>>> </executions>
>>>
>>> </plugin>
>>>
>>> </plugins>
>>>
>>> </build>
>>>
>>> </project>
>>>
>>> It is confusing me because I can see this in the output of mvn package:
>>>
>>> ...
>>>
>>> [INFO] Including org.mongodb:mongo-hadoop-core:jar:1.0.0 in the shaded
>>> jar.
>>>
>>> ...
>>> In theory, the uber jar is created with all the dependencies, but if I open
>>> the jar I don't see them.
>>>
>>> Thanks in advance
>>>
>>>
>>> Alonso Isidoro Roman.
>>>
>>> Mis citas preferidas (de hoy) :
>>> "Si depurar es el proceso de quitar los errores de software, entonces
>>> programar debe ser el proceso de introducirlos..."
>>>  -  Edsger Dijkstra
>>>
>>> My favorite quotes (today):
>>> "If debugging is the process of removing software bugs, then programming
>>> must be the process of putting ..."
>>>   - Edsger Dijkstra
>>>
>>> "If you pay peanuts you get monkeys"
>>>
>>>
>>>
>>> 2014-06-23 21:27 GMT+02:00 Yana Kadiyska <yana.kadiyska@gmail.com>:
>>>
>>>  One thing I noticed around the place where you get the first error --
>>>> you are calling words.map instead of words.mapToPair. map produces
>>>> JavaRDD<R> whereas mapToPair gives you a JavaPairRDD. I don't use the
>>>> Java APIs myself but it looks to me like you need to check the types
>>>> more carefully.
>>>>
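
In the Spark 1.0.0 Java API the pair-producing steps do indeed go through
mapToPair rather than map. Below is a minimal sketch of what the two calls in
question could look like; the class and method names are purely illustrative,
and the types mirror the code quoted below:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.PairFunction;
import org.bson.BSONObject;
import org.bson.BasicBSONObject;
import scala.Tuple2;

public class WordCountPairs {

    // words -> (word, 1) pairs: mapToPair takes a PairFunction and
    // returns a JavaPairRDD, which plain map does not.
    static JavaPairRDD<String, Integer> toOnes(JavaRDD<String> words) {
        return words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        });
    }

    // (word, count) -> (null, BSONObject) pairs for MongoOutputFormat,
    // again via mapToPair, this time on the JavaPairRDD of counts.
    static JavaPairRDD<Object, BSONObject> toBson(JavaPairRDD<String, Integer> counts) {
        return counts.mapToPair(new PairFunction<Tuple2<String, Integer>, Object, BSONObject>() {
            public Tuple2<Object, BSONObject> call(Tuple2<String, Integer> tuple) {
                BSONObject bson = new BasicBSONObject();
                bson.put("word", tuple._1());
                bson.put("count", tuple._2());
                return new Tuple2<Object, BSONObject>(null, bson);
            }
        });
    }
}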
>>>> On Mon, Jun 23, 2014 at 2:15 PM, Alonso Isidoro Roman
>>>> <alonsoir@gmail.com> wrote:
>>>> > Hi all,
>>>> >
>>>> > I am new to Spark, so this is probably a basic question. I want to explore
>>>> > the possibilities of this framework, concretely using it in conjunction
>>>> > with third-party libraries, like MongoDB, for example.
>>>> >
>>>> > I have been following the instructions from
>>>> > http://codeforhire.com/2014/02/18/using-spark-with-mongodb/ in order to
>>>> > connect Spark with MongoDB. That example was made with
>>>> > spark-core_2.10-0.9.0.jar, so the first thing I did was update the pom.xml
>>>> > with the latest versions.
>>>> >
>>>> > This is my pom.xml
>>>> >
>>>> > <project xmlns="http://maven.apache.org/POM/4.0.0"
>>>> > xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>>> >
>>>> > xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
>>>> > http://maven.apache.org/xsd/maven-4.0.0.xsd">
>>>> >
>>>> > <groupId>com.aironman.spark</groupId>
>>>> >
>>>> > <artifactId>simple-project</artifactId>
>>>> >
>>>> > <modelVersion>4.0.0</modelVersion>
>>>> >
>>>> > <name>Simple Project</name>
>>>> >
>>>> > <packaging>jar</packaging>
>>>> >
>>>> > <version>1.0</version>
>>>> >
>>>> > <repositories>
>>>> >
>>>> > <repository>
>>>> >
>>>> > <id>Akka repository</id>
>>>> >
>>>> > <url>http://repo.akka.io/releases</url>
>>>> >
>>>> > </repository>
>>>> >
>>>> > </repositories>
>>>> >
>>>> > <dependencies>
>>>> >
>>>> > <dependency> <!-- Spark dependency -->
>>>> >
>>>> > <groupId>org.apache.spark</groupId>
>>>> >
>>>> > <artifactId>spark-core_2.10</artifactId>
>>>> >
>>>> > <version>1.0.0</version>
>>>> >
>>>> > </dependency>
>>>> >
>>>> >
>>>> > <dependency>
>>>> >
>>>> > <groupId>org.mongodb</groupId>
>>>> >
>>>> > <artifactId>mongo-hadoop-core</artifactId>
>>>> >
>>>> > <version>1.0.0</version>
>>>> >
>>>> > </dependency>
>>>> >
>>>> >
>>>> > </dependencies>
>>>> >
>>>> > </project>
>>>> >
>>>> > As you can see, it is a super simple pom.xml.
>>>> >
>>>> > And this is the JavaWordCount.java
>>>> >
>>>> > import java.util.Arrays;
>>>> >
>>>> > import java.util.Collections;
>>>> >
>>>> >
>>>> > import org.apache.hadoop.conf.Configuration;
>>>> >
>>>> > import org.apache.spark.api.java.JavaPairRDD;
>>>> >
>>>> > import org.apache.spark.api.java.JavaRDD;
>>>> >
>>>> > import org.apache.spark.api.java.JavaSparkContext;
>>>> >
>>>> > import org.apache.spark.api.java.function.FlatMapFunction;
>>>> >
>>>> > import org.apache.spark.api.java.function.Function2;
>>>> >
>>>> > import org.apache.spark.api.java.function.PairFunction;
>>>> >
>>>> > import org.bson.BSONObject;
>>>> >
>>>> > import org.bson.BasicBSONObject;
>>>> >
>>>> >
>>>> > import scala.Tuple2;
>>>> >
>>>> >
>>>> > import com.mongodb.hadoop.MongoOutputFormat;
>>>> >
>>>> >
>>>> > /***
>>>> >
>>>> >  * This class is supposed to connect to a MongoDB cluster and run a word
>>>> >  * count task over the words stored in the database.
>>>> >
>>>> >  * The problem is that this API is broken, I think. I am using the latest
>>>> >  * version of the framework, 1.0.0. This is not spark-streaming; the right
>>>> >  * thing would be to use a spark-streaming example connecting to a MongoDB
>>>> >  * database, or to use spark-streaming together with Spring Integration,
>>>> >  * that is, to connect Spark to a web service that would periodically feed
>>>> >  * Spark...
>>>> >
>>>> >  * @author aironman
>>>> >
>>>> >  *
>>>> >
>>>> >  */
>>>> >
>>>> > public class JavaWordCount {
>>>> >
>>>> >
>>>> >
>>>> >     public static void main(String[] args) {
>>>> >
>>>> >
>>>> >
>>>> >         JavaSparkContext sc = new JavaSparkContext("local", "Java Word
>>>> > Count");
>>>> >
>>>> >
>>>> >
>>>> >         Configuration config = new Configuration();
>>>> >
>>>> >         config.set("mongo.input.uri",
>>>> > "mongodb:127.0.0.1:27017/beowulf.input");
>>>> >
>>>> >         config.set("mongo.output.uri",
>>>> > "mongodb:127.0.0.1:27017/beowulf.output");
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >         JavaPairRDD<Object, BSONObject> mongoRDD =
>>>> > sc.newAPIHadoopRDD(config, com.mongodb.hadoop.MongoInputFormat.class,
>>>> > Object.class, BSONObject.class);
>>>> >
>>>> >
>>>> >
>>>> > //         Input contains tuples of (ObjectId, BSONObject)
>>>> >
>>>> >         JavaRDD<String> words = mongoRDD.flatMap(new
>>>> > FlatMapFunction<Tuple2<Object, BSONObject>, String>() {
>>>> >
>>>> >             @Override
>>>> >
>>>> >             public Iterable<String> call(Tuple2<Object, BSONObject> arg) {
>>>> >
>>>> >                 Object o = arg._2.get("text");
>>>> >
>>>> >                 if (o instanceof String) {
>>>> >
>>>> >                     String str = (String) o;
>>>> >
>>>> >                     str = str.toLowerCase().replaceAll("[.,!?\n]", " ");
>>>> >
>>>> >                     return Arrays.asList(str.split(" "));
>>>> >
>>>> >                 } else {
>>>> >
>>>> >                     return Collections.emptyList();
>>>> >
>>>> >                 }
>>>> >
>>>> >             }
>>>> >
>>>> >         });
>>>> >
>>>> >         // here is an error: The method map(Function<String,R>) in the type
>>>> >         // JavaRDD<String> is not applicable for the arguments
>>>> >         // (new PairFunction<String,String,Integer>(){})
>>>> >
>>>> >         JavaPairRDD<String, Integer> ones = words.map(new
>>>> > PairFunction<String, String, Integer>() {
>>>> >
>>>> >             public Tuple2<String, Integer> call(String s) {
>>>> >
>>>> >                 return new Tuple2<>(s, 1);
>>>> >
>>>> >             }
>>>> >
>>>> >         });
>>>> >
>>>> >         JavaPairRDD<String, Integer> counts = ones.reduceByKey(new
>>>> > Function2<Integer, Integer, Integer>() {
>>>> >
>>>> >             public Integer call(Integer i1, Integer i2) {
>>>> >
>>>> >                 return i1 + i2;
>>>> >
>>>> >             }
>>>> >
>>>> >         });
>>>> >
>>>> >
>>>> >
>>>> >         // another error: The method map(Function<Tuple2<String,Integer>,R>)
>>>> >         // in the type JavaPairRDD<String,Integer> is not applicable for the
>>>> >         // arguments (new PairFunction<Tuple2<String,Integer>,Object,BSONObject>(){})
>>>> >
>>>> >
>>>> > //         Output contains tuples of (null, BSONObject) - ObjectId will be
>>>> > //         generated by Mongo driver if null
>>>> >
>>>> >         JavaPairRDD<Object, BSONObject> save = counts.map(new
>>>> > PairFunction<Tuple2<String, Integer>, Object, BSONObject>() {
>>>> >
>>>> >             @Override
>>>> >
>>>> >             public Tuple2<Object, BSONObject> call(Tuple2<String, Integer> tuple) {
>>>> >
>>>> >                 BSONObject bson = new BasicBSONObject();
>>>> >
>>>> >                 bson.put("word", tuple._1);
>>>> >
>>>> >                 bson.put("count", tuple._2);
>>>> >
>>>> >                 return new Tuple2<>(null, bson);
>>>> >
>>>> >             }
>>>> >
>>>> >         });
>>>> >
>>>> >
>>>> >
>>>> > //         Only MongoOutputFormat and config are relevant
>>>> >
>>>> >         save.saveAsNewAPIHadoopFile("file:/bogus", Object.class,
>>>> > Object.class, MongoOutputFormat.class, config);
>>>> >
>>>> >     }
>>>> >
>>>> >
>>>> >
>>>> > }
>>>> >
>>>> >
>>>> > It looks like a jar dependency hell, doesn't it? Can anyone guide or
>>>> > help me?
>>>> >
>>>> > Another thing: I don't like closures; is it possible to use this framework
>>>> > without using them?
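
With the Java API you are not forced into anonymous inner classes: any ordinary
top-level class that implements the function interface works just as well. A
small illustrative sketch (the class name WordPair is made up here):

import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// Illustrative named class used in place of an anonymous inner class;
// PairFunction already extends Serializable, so nothing extra is needed.
public class WordPair implements PairFunction<String, String, Integer> {
    public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
    }
}

// usage: JavaPairRDD<String, Integer> ones = words.mapToPair(new WordPair());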
>>>> > Another question: are these objects, JavaSparkContext sc and
>>>> > JavaPairRDD<Object, BSONObject> mongoRDD, thread-safe? Can I use them as
>>>> > singletons?
>>>> >
>>>> > Thank you very much, and apologies if the questions are not a trending
>>>> > topic :)
>>>> >
>>>> > Alonso Isidoro Roman.
>>>> >
>>>> > Mis citas preferidas (de hoy) :
>>>> > "Si depurar es el proceso de quitar los errores de software, entonces
>>>> > programar debe ser el proceso de introducirlos..."
>>>> >  -  Edsger Dijkstra
>>>> >
>>>> > My favorite quotes (today):
>>>> > "If debugging is the process of removing software bugs, then
>>>> programming
>>>> > must be the process of putting ..."
>>>> >   - Edsger Dijkstra
>>>> >
>>>> > "If you pay peanuts you get monkeys"
>>>> >
>>>>
>>>
>>>
>>
>>
>> --
>> -Sheryl
>>
>
>


-- 
-Sheryl
