hadoop-mapreduce-dev mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (MAPREDUCE-5820) Unable to process mongodb gridfs collection data in Hadoop Mapreduce
Date Wed, 09 Apr 2014 09:49:15 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran resolved MAPREDUCE-5820.

    Resolution: Invalid

Closing as invalid.
# This isn't a bug; it's a support question that should be raised on a mailing list.
# You're asking support questions about the MongoDB and Spark APIs.


> Unable to process mongodb gridfs collection data in Hadoop Mapreduce
> --------------------------------------------------------------------
>                 Key: MAPREDUCE-5820
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5820
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: task
>    Affects Versions: 2.2.0
>         Environment: Hadoop, Mongodb
>            Reporter: sivaram
>            Priority: Critical
> I saved a 2 GB PDF file into MongoDB using GridFS. Now I want to process that GridFS
> collection data using Java Spark MapReduce. Previously I successfully processed MongoDB
> collections with Hadoop MapReduce using the Mongo-Hadoop connector. Now I'm unable to handle
> the binary data coming from the input GridFS collections.
>  MongoConfigUtil.setInputURI(config, "mongodb://localhost:27017/pdfbooks.fs.chunks");
>  MongoConfigUtil.setOutputURI(config, "mongodb://localhost:27017/" + output);
>  JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
>          com.mongodb.hadoop.MongoInputFormat.class, Object.class, BSONObject.class);
>  JavaRDD<String> words = mongoRDD.flatMap(
>      new FlatMapFunction<Tuple2<Object, BSONObject>, String>() {
>          @Override
>          public Iterable<String> call(Tuple2<Object, BSONObject> arg) {
>              System.out.println(arg._2.toString());
>              ...
> In the above code I'm accessing the fs.chunks collection as the input to my mapper, so the
> mapper receives each record as a BSONObject. The problem is that the input BSONObject's data
> is in an unreadable binary format. For example, the "System.out.println(arg._2.toString());"
> statement in the program above prints the following result:
>    { "_id" : { "$oid" : "533e53048f0c8bcb0b3a7ff7"} , "files_id" : { "$oid" : "533e5303fac7a2e2c4afea08"} , "n" : 0 , "data" : <Binary Data>}
> How do I print/access that data in a readable format? Can I use the GridFS API to do that?
> If so, please suggest how to convert the input BSONObject to a GridFS object, and any other
> good ways to do it. Thank you in advance!
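For context on what the quoted question is running into: each fs.chunks document holds one slice of the file's bytes in its "data" field, ordered by the "n" ordinal, so reading fs.chunks directly hands the mapper raw binary rather than a readable file. The sketch below shows that reassembly rule in plain Java under stated assumptions: the `Chunk` class is a hypothetical stand-in for the two fields one would pull out of each BSONObject (`get("n")` and `get("data")`), not part of any Mongo or Hadoop API.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ChunkReassembly {
    // Hypothetical stand-in for one fs.chunks document, reduced to the two
    // fields that matter here: "n" (chunk ordinal) and "data" (raw bytes).
    static final class Chunk {
        final int n;
        final byte[] data;
        Chunk(int n, byte[] data) { this.n = n; this.data = data; }
    }

    // Sort chunks by their ordinal and concatenate the payloads: this is how
    // GridFS lays a single file out across the fs.chunks collection.
    static byte[] reassemble(List<Chunk> chunks) {
        chunks.sort(Comparator.comparingInt(c -> c.n));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Chunk c : chunks) {
            out.write(c.data, 0, c.data.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Illustrative chunks arriving out of order, as a mapper might see them.
        List<Chunk> chunks = new ArrayList<>(List.of(
            new Chunk(1, "world".getBytes(StandardCharsets.UTF_8)),
            new Chunk(0, "hello ".getBytes(StandardCharsets.UTF_8))));
        String text = new String(reassemble(chunks), StandardCharsets.UTF_8);
        System.out.println(text); // prints "hello world"
    }
}
```

Note that this only recovers the file's bytes; a 2 GB PDF would still need a PDF parser to become readable text, and for a single large file the driver's own GridFS download API outside the MapReduce job may be the simpler route, as the resolution suggests asking about on a mailing list.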

This message was sent by Atlassian JIRA
