spark-user mailing list archives

From Akhil Das <ak...@sigmoidanalytics.com>
Subject Re: Slow Mongo Read from Spark
Date Mon, 31 Aug 2015 08:34:25 GMT
FYI, newAPIHadoopFile and newAPIHadoopRDD both use the NewHadoopRDD class
underneath, and that doesn't mean they can only read from HDFS. Give it a
shot if you haven't tried it already (only the input format and the
reader differ from your approach).

Thanks
Best Regards

On Mon, Aug 31, 2015 at 1:14 PM, Deepesh Maheshwari <
deepesh.maheshwari17@gmail.com> wrote:

> Hi Akhil,
>
> This code snippet is from the link below:
>
> https://github.com/crcsmnky/mongodb-spark-demo/blob/master/src/main/java/com/mongodb/spark/demo/Recommender.java
>
> There it reads data from the HDFS file system, but in our case I need to
> read from MongoDB.
>
> I tried it earlier and have now tried it again, but it gives the error
> below, which is self-explanatory.
>
> Exception in thread "main" java.io.IOException: No FileSystem for scheme: mongodb
>
> On Mon, Aug 31, 2015 at 1:03 PM, Akhil Das <akhil@sigmoidanalytics.com>
> wrote:
>
>> Here's a piece of code that works well for us (Spark 1.4.1):
>>
>>         Configuration bsonDataConfig = new Configuration();
>>         bsonDataConfig.set("mongo.job.input.format",
>>             "com.mongodb.hadoop.BSONFileInputFormat");
>>
>>         Configuration predictionsConfig = new Configuration();
>>         predictionsConfig.set("mongo.output.uri", mongodbUri);
>>
>>         JavaPairRDD<Object, BSONObject> bsonRatingsData = sc.newAPIHadoopFile(
>>             ratingsUri, BSONFileInputFormat.class, Object.class,
>>             BSONObject.class, bsonDataConfig);
>>
>>
>> Thanks
>> Best Regards
>>
>> On Mon, Aug 31, 2015 at 12:59 PM, Deepesh Maheshwari <
>> deepesh.maheshwari17@gmail.com> wrote:
>>
>>> Hi, I am using <spark.version>1.3.0</spark.version>
>>>
>>> I am not getting a constructor for the above values:
>>>
>>> [image: Inline image 1]
>>>
>>> So I tried reordering the values in the constructor:
>>> [image: Inline image 2]
>>>
>>> But it gives this error. Please suggest.
>>> [image: Inline image 3]
>>>
>>> Best Regards
>>>
>>> On Mon, Aug 31, 2015 at 12:43 PM, Akhil Das <akhil@sigmoidanalytics.com>
>>> wrote:
>>>
>>>> Can you try with these key value classes and see the performance?
>>>>
>>>> inputFormatClassName = "com.mongodb.hadoop.MongoInputFormat"
>>>>
>>>>
>>>> keyClassName = "org.apache.hadoop.io.Text"
>>>> valueClassName = "org.apache.hadoop.io.MapWritable"
>>>>
>>>>
>>>> Taken from the Databricks blog
>>>> <https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html>
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Mon, Aug 31, 2015 at 12:26 PM, Deepesh Maheshwari <
>>>> deepesh.maheshwari17@gmail.com> wrote:
>>>>
>>>>> Hi, I am trying to read from MongoDB in Spark using newAPIHadoopRDD.
>>>>>
>>>>> /**** Code *****/
>>>>>
>>>>> // Hadoop configuration carrying the mongo-hadoop settings
>>>>> Configuration config = new Configuration();
>>>>> config.set("mongo.job.input.format",
>>>>>     "com.mongodb.hadoop.MongoInputFormat");
>>>>> config.set("mongo.input.uri", SparkProperties.MONGO_OUTPUT_URI);
>>>>> config.set("mongo.input.query", "{host: 'abc.com'}");
>>>>>
>>>>> JavaSparkContext sc = new JavaSparkContext("local", "MongoOps");
>>>>>
>>>>> JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
>>>>>     com.mongodb.hadoop.MongoInputFormat.class, Object.class,
>>>>>     BSONObject.class);
>>>>>
>>>>> long count = mongoRDD.count();
>>>>>
>>>>> There are about 1.5 million records.
>>>>> I do get the data, but the read took around 15 minutes to complete.
>>>>>
>>>>> Is this API really this slow, or am I missing something?
>>>>> Please suggest an alternative approach if there is a faster way to
>>>>> read data from Mongo.
>>>>>
>>>>> Thanks,
>>>>> Deepesh
>>>>>
>>>>
>>>>
>>>
>>
>
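
A note on the "No FileSystem for scheme: mongodb" error quoted in this thread:
newAPIHadoopFile takes a path, and Hadoop appears to resolve that path to a
FileSystem implementation by its URI scheme, so a mongodb:// URI has nothing
to resolve to. newAPIHadoopRDD takes no path and leaves the data source to the
input format (in the original question, via mongo.input.uri). The sketch
below only illustrates that distinction in plain Java. The class
ReadCallChooser, its set of schemes, and the example URIs are hypothetical
and are not part of Spark, Hadoop, or mongo-hadoop.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ReadCallChooser {
    // Hypothetical stand-in for the schemes that have a Hadoop FileSystem registered.
    private static final Set<String> FILE_SYSTEM_SCHEMES =
            new HashSet<>(Arrays.asList("file", "hdfs"));

    /** True when the location is a plain path (no scheme) or uses a file-system scheme. */
    static boolean isFileSystemLocation(String location) {
        String scheme = URI.create(location).getScheme();
        return scheme == null || FILE_SYSTEM_SCHEMES.contains(scheme.toLowerCase());
    }

    /** Names the JavaSparkContext method that suits the given location. */
    static String chooseReadCall(String location) {
        return isFileSystemLocation(location) ? "newAPIHadoopFile" : "newAPIHadoopRDD";
    }

    public static void main(String[] args) {
        // A BSON dump on a file system: path-based call.
        System.out.println(chooseReadCall("hdfs://namenode:8020/dumps/ratings.bson"));
        // A live collection: configuration-based call.
        System.out.println(chooseReadCall("mongodb://localhost:27017/db.ratings"));
    }
}
```

Under these assumptions, a live collection goes through newAPIHadoopRDD with
MongoInputFormat, as in the original question, while newAPIHadoopFile with
BSONFileInputFormat suits BSON dump files stored on a file system.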
