Hi Akhil,

This code snippet is from the link below:
https://github.com/crcsmnky/mongodb-spark-demo/blob/master/src/main/java/com/mongodb/spark/demo/Recommender.java

Here it is reading data from the HDFS file system, but in our case I need to read from MongoDB.

I tried it earlier and again just now, but it gives the error below, which is self-explanatory.

Exception in thread "main" java.io.IOException: No FileSystem for scheme: mongodb
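
For anyone hitting this: BSONFileInputFormat reads .bson dump files from a filesystem path, which is why Spark rejects a mongodb:// URI with "No FileSystem for scheme". To read a live collection, the MongoInputFormat route with mongo.input.uri and newAPIHadoopRDD should work instead. A minimal sketch, assuming sc is an existing JavaSparkContext and the URI is a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.bson.BSONObject;
    import com.mongodb.hadoop.MongoInputFormat;

    // Read a live collection over the MongoDB wire protocol instead of
    // BSON dump files, so no Hadoop filesystem scheme is involved.
    Configuration mongoConfig = new Configuration();
    mongoConfig.set("mongo.input.uri", "mongodb://localhost:27017/db.collection"); // placeholder

    JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(
            mongoConfig,             // Configuration comes first here
            MongoInputFormat.class,  // queries mongod directly
            Object.class,            // key: the document _id
            BSONObject.class);       // value: the full document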

On Mon, Aug 31, 2015 at 1:03 PM, Akhil Das <akhil@sigmoidanalytics.com> wrote:
Here's a piece of code that works well for us (Spark 1.4.1):

        // Input: read ratings from a BSON dump file (e.g. produced by mongodump)
        Configuration bsonDataConfig = new Configuration();
        bsonDataConfig.set("mongo.job.input.format", "com.mongodb.hadoop.BSONFileInputFormat");

        // Output: write predictions back to a live MongoDB collection
        Configuration predictionsConfig = new Configuration();
        predictionsConfig.set("mongo.output.uri", mongodbUri);

        // newAPIHadoopFile takes the path first and the Configuration last
        JavaPairRDD<Object, BSONObject> bsonRatingsData = sc.newAPIHadoopFile(
                ratingsUri, BSONFileInputFormat.class, Object.class,
                BSONObject.class, bsonDataConfig);
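
The write side is symmetric: predictionsConfig carries mongo.output.uri, and the RDD is saved through MongoOutputFormat, as the linked Recommender.java demo does. A sketch of that step, assuming a JavaPairRDD<Object, BSONObject> named predictions and the mongo-hadoop connector on the classpath:

    import com.mongodb.hadoop.MongoOutputFormat;

    // The path argument is required by the API but ignored by
    // MongoOutputFormat, which writes to mongo.output.uri instead.
    predictions.saveAsNewAPIHadoopFile(
            "file:///unused",
            Object.class,             // key class
            BSONObject.class,         // value class
            MongoOutputFormat.class,  // writes via the MongoDB driver
            predictionsConfig);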


Thanks
Best Regards

On Mon, Aug 31, 2015 at 12:59 PM, Deepesh Maheshwari <deepesh.maheshwari17@gmail.com> wrote:
Hi, I am using <spark.version>1.3.0</spark.version>

I am not getting a constructor for the above values.

Inline image 1

So, I tried to shuffle the values in the constructor.
Inline image 2

But it is giving this error. Please suggest.
Inline image 3
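
A likely cause: the two Java API methods take their arguments in different orders, so a newAPIHadoopFile argument list won't line up with newAPIHadoopRDD. Both signatures as they exist in Spark 1.3/1.4, with placeholder variable names:

    // newAPIHadoopFile: path first, Configuration last
    JavaPairRDD<Object, BSONObject> fromFile = sc.newAPIHadoopFile(
            ratingsUri, BSONFileInputFormat.class,
            Object.class, BSONObject.class, bsonDataConfig);

    // newAPIHadoopRDD: Configuration first, and no path at all
    JavaPairRDD<Object, BSONObject> fromMongo = sc.newAPIHadoopRDD(
            mongoConfig, MongoInputFormat.class,
            Object.class, BSONObject.class);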

Best Regards

On Mon, Aug 31, 2015 at 12:43 PM, Akhil Das <akhil@sigmoidanalytics.com> wrote:
Can you try these key/value classes and see how the performance compares?

inputFormatClassName = "com.mongodb.hadoop.MongoInputFormat"

keyClassName = "org.apache.hadoop.io.Text"
valueClassName = "org.apache.hadoop.io.MapWritable"

Taken from the Databricks blog.

Thanks
Best Regards

On Mon, Aug 31, 2015 at 12:26 PM, Deepesh Maheshwari <deepesh.maheshwari17@gmail.com> wrote:
Hi, I am trying to read MongoDB data in Spark using newAPIHadoopRDD.

/**** Code *****/

config.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
config.set("mongo.input.uri",SparkProperties.MONGO_OUTPUT_URI);
config.set("mongo.input.query","{host: 'abc.com'}");

JavaSparkContext sc=new JavaSparkContext("local", "MongoOps");

        JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
                com.mongodb.hadoop.MongoInputFormat.class, Object.class,
                BSONObject.class);
       
        long count=mongoRDD.count();

There are about 1.5 million records. I am getting the data, but the read took around 15 minutes for the whole collection.

Is this API really that slow, or am I missing something? Please suggest an alternate approach if there is a faster way to read data from MongoDB.

Thanks,
Deepesh
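
Two things worth checking here, sketched below. A master of "local" runs Spark with a single worker thread, so the whole collection is read serially; "local[*]" uses one thread per core. Beyond that, mongo-hadoop's split size controls how many partitions (and hence concurrent readers) the RDD gets; the mongo.input.split_size key is an assumption about the connector version in use, so verify it against the mongo-hadoop docs:

    // Use all local cores instead of a single worker thread.
    JavaSparkContext sc = new JavaSparkContext("local[*]", "MongoOps");

    Configuration config = new Configuration();
    config.set("mongo.input.uri", SparkProperties.MONGO_OUTPUT_URI);
    // Smaller splits -> more partitions -> more concurrent readers
    // (the value is in MB; verify the key and default for your version).
    config.set("mongo.input.split_size", "8");

    JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
            com.mongodb.hadoop.MongoInputFormat.class, Object.class,
            BSONObject.class);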