spark-user mailing list archives

From Deepesh Maheshwari <deepesh.maheshwar...@gmail.com>
Subject Re: Slow Mongo Read from Spark
Date Mon, 31 Aug 2015 07:44:47 GMT
Hi Akhil,

This code snippet is from below link
https://github.com/crcsmnky/mongodb-spark-demo/blob/master/src/main/java/com/mongodb/spark/demo/Recommender.java

Here it is reading data from the HDFS file system, but in our case I need to
read from MongoDB.

I have tried this earlier, and tried it again now, but it gives the below
error, which is self-explanatory.

Exception in thread "main" java.io.IOException: No FileSystem for scheme:
mongodb
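As far as I can tell, newAPIHadoopFile resolves its path through Hadoop's
FileSystem API, which has no handler for the mongodb: scheme (that is what
the exception is saying); BSONFileInputFormat is meant for .bson dump files
sitting on HDFS or local disk. The only pattern that reads a live mongodb://
URI for me is newAPIHadoopRDD with MongoInputFormat. A trimmed sketch of that
path, assuming the mongo-hadoop connector jar is on the classpath (the URI and
query below are placeholders for our actual values):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.BSONObject;

public class MongoRead {
    public static void main(String[] args) {
        Configuration config = new Configuration();
        // Read directly from the database, not through a FileSystem path.
        config.set("mongo.job.input.format",
                "com.mongodb.hadoop.MongoInputFormat");
        config.set("mongo.input.uri",
                "mongodb://localhost:27017/db.collection");  // placeholder
        config.set("mongo.input.query", "{host: 'abc.com'}");

        // Note: master "local" runs a single worker thread;
        // "local[*]" uses one thread per core.
        JavaSparkContext sc = new JavaSparkContext("local[*]", "MongoOps");

        JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
                com.mongodb.hadoop.MongoInputFormat.class, Object.class,
                BSONObject.class);

        System.out.println("count = " + mongoRDD.count());
        sc.stop();
    }
}
```

If the read is still slow, I believe the connector's split size setting
(mongo.input.split_size, in MB) is also worth checking, since it controls how
many input splits, and therefore how many parallel tasks, the scan gets.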

On Mon, Aug 31, 2015 at 1:03 PM, Akhil Das <akhil@sigmoidanalytics.com>
wrote:

> Here's a piece of code which works well for us (spark 1.4.1)
>
>         Configuration bsonDataConfig = new Configuration();
>         bsonDataConfig.set("mongo.job.input.format",
> "com.mongodb.hadoop.BSONFileInputFormat");
>
>         Configuration predictionsConfig = new Configuration();
>         predictionsConfig.set("mongo.output.uri", mongodbUri);
>
>         JavaPairRDD<Object,BSONObject> bsonRatingsData =
> sc.newAPIHadoopFile(
>             ratingsUri, BSONFileInputFormat.class, Object.class,
>                 BSONObject.class, bsonDataConfig);
>
>
> Thanks
> Best Regards
>
> On Mon, Aug 31, 2015 at 12:59 PM, Deepesh Maheshwari <
> deepesh.maheshwari17@gmail.com> wrote:
>
>> Hi, I am using <spark.version>1.3.0</spark.version>
>>
>> I am not getting a constructor for the above values.
>>
>> [image: Inline image 1]
>>
>> So, I tried to shuffle the values in the constructor.
>> [image: Inline image 2]
>>
>> But it is giving this error. Please suggest.
>> [image: Inline image 3]
>>
>> Best Regards
>>
>> On Mon, Aug 31, 2015 at 12:43 PM, Akhil Das <akhil@sigmoidanalytics.com>
>> wrote:
>>
>>> Can you try with these key value classes and see the performance?
>>>
>>> inputFormatClassName = "com.mongodb.hadoop.MongoInputFormat"
>>>
>>>
>>> keyClassName = "org.apache.hadoop.io.Text"
>>> valueClassName = "org.apache.hadoop.io.MapWritable"
>>>
>>>
>>> Taken from databricks blog
>>> <https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html>
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Mon, Aug 31, 2015 at 12:26 PM, Deepesh Maheshwari <
>>> deepesh.maheshwari17@gmail.com> wrote:
>>>
>>>> Hi, I am trying to read MongoDB in Spark via newAPIHadoopRDD.
>>>>
>>>> /**** Code *****/
>>>>
>>>> config.set("mongo.job.input.format",
>>>> "com.mongodb.hadoop.MongoInputFormat");
>>>> config.set("mongo.input.uri",SparkProperties.MONGO_OUTPUT_URI);
>>>> config.set("mongo.input.query","{host: 'abc.com'}");
>>>>
>>>> JavaSparkContext sc=new JavaSparkContext("local", "MongoOps");
>>>>
>>>>         JavaPairRDD<Object, BSONObject> mongoRDD =
>>>> sc.newAPIHadoopRDD(config,
>>>>                 com.mongodb.hadoop.MongoInputFormat.class, Object.class,
>>>>                 BSONObject.class);
>>>>
>>>>         long count=mongoRDD.count();
>>>>
>>>> There are about 1.5 million records.
>>>> Though I am getting the data, the read operation took around 15 minutes
>>>> to read the whole collection.
>>>>
>>>> Is this API really that slow, or am I missing something?
>>>> Please suggest if there is a faster alternate approach to read data from
>>>> Mongo.
>>>>
>>>> Thanks,
>>>> Deepesh
>>>>
>>>
>>>
>>
>
