spark-user mailing list archives

From Akhil Das <ak...@sigmoidanalytics.com>
Subject Re: Slow Mongo Read from Spark
Date Mon, 31 Aug 2015 07:33:50 GMT
Here's a piece of code that works well for us (Spark 1.4.1):

        Configuration bsonDataConfig = new Configuration();
        bsonDataConfig.set("mongo.job.input.format",
                "com.mongodb.hadoop.BSONFileInputFormat");

        Configuration predictionsConfig = new Configuration();
        predictionsConfig.set("mongo.output.uri", mongodbUri);

        JavaPairRDD<Object, BSONObject> bsonRatingsData =
                sc.newAPIHadoopFile(ratingsUri, BSONFileInputFormat.class,
                        Object.class, BSONObject.class, bsonDataConfig);
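For completeness, here is a hedged sketch (not from the original mail) of how the resulting JavaPairRDD<Object, BSONObject> might be consumed. The "rating" field name is a hypothetical assumption for illustration, and the fragment presumes Spark 1.4.x and the mongo-hadoop connector are on the classpath:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.bson.BSONObject;
import scala.Tuple2;

// Hypothetical follow-up step: extract a numeric "rating" field from each
// BSON document. The field name and numeric type are assumptions, not
// something stated in the thread.
JavaRDD<Double> ratings = bsonRatingsData.map(
        new Function<Tuple2<Object, BSONObject>, Double>() {
            @Override
            public Double call(Tuple2<Object, BSONObject> record) {
                // record._2() is the BSON document read from the .bson file
                return ((Number) record._2().get("rating")).doubleValue();
            }
        });

long count = ratings.count();  // an action like count() triggers the read
```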


Thanks
Best Regards

On Mon, Aug 31, 2015 at 12:59 PM, Deepesh Maheshwari <
deepesh.maheshwari17@gmail.com> wrote:

> Hi, I am using <spark.version>1.3.0</spark.version>
>
> I am not getting a constructor for the above values.
>
> [image: Inline image 1]
>
> So, I tried to shuffle the values in the constructor.
> [image: Inline image 2]
>
> But it is giving this error. Please suggest.
> [image: Inline image 3]
>
> Best Regards
>
> On Mon, Aug 31, 2015 at 12:43 PM, Akhil Das <akhil@sigmoidanalytics.com>
> wrote:
>
>> Can you try with these key value classes and see the performance?
>>
>> inputFormatClassName = "com.mongodb.hadoop.MongoInputFormat"
>>
>>
>> keyClassName = "org.apache.hadoop.io.Text"
>> valueClassName = "org.apache.hadoop.io.MapWritable"
>>
>>
>> Taken from databricks blog
>> <https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html>
>>
>> Thanks
>> Best Regards
>>
>> On Mon, Aug 31, 2015 at 12:26 PM, Deepesh Maheshwari <
>> deepesh.maheshwari17@gmail.com> wrote:
>>
>>> Hi, I am trying to read MongoDB in Spark via newAPIHadoopRDD.
>>>
>>> /**** Code *****/
>>>
>>> config.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
>>> config.set("mongo.input.uri", SparkProperties.MONGO_OUTPUT_URI);
>>> config.set("mongo.input.query", "{host: 'abc.com'}");
>>>
>>> JavaSparkContext sc = new JavaSparkContext("local", "MongoOps");
>>>
>>> JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
>>>         com.mongodb.hadoop.MongoInputFormat.class, Object.class,
>>>         BSONObject.class);
>>>
>>> long count = mongoRDD.count();
>>>
>>> There are about 1.5 million records.
>>> I am getting the data, but the read operation took around 15 minutes
>>> for the whole collection.
>>>
>>> Is this API really that slow, or am I missing something?
>>> Please suggest an alternate approach if there is a faster way to read
>>> data from Mongo.
>>>
>>> Thanks,
>>> Deepesh
>>>
>>
>>
>
