I'm learning to use Spark with MongoDB, but I've encountered a problem that I think is related to the way I use Spark., because it doesn't make any sense to me.
My concept test is that I want to filter a collection containing about 800K documents by a certain field.
My code is very simple. Connect to my MongoDB, apply a filter transformation and then count the elements:
JavaSparkContext sc = new JavaSparkContext("local", "Spark Test");
Configuration config = new Configuration();
JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config, com.mongodb.hadoop.MongoInputFormat.class, Object.class, BSONObject.class);
long numberOfFilteredElements = mongoRDD
.filter(myCollectionDocument -> myCollectionDocument._2().get("site").equals("marfeel.com
System.out.format("Filtered collection size: %d%n", numberOfFilteredElements);
When I execute this code, the Mongo driver splits my collection into 2810 partitions, so equal number of tasks start to process.
About the task number 1000, I get the following error message:
ERROR Executor: Exception in task 990.0 in stage 0.0 (TID 990)
java.lang.OutOfMemoryError: unable to create new native thread
I've searched a lot about this error, but it doesn't make any sense to me. I came up the conclusion that I have a problem with my code, that I have some library versions incompatibilities or that my real problem is that I'm getting the whole Spark concept wrong, and that the code above doesn't make any sense at all.
I'm using the following library versions:
org.apache.spark.spark-core_2.11 -> 1.2.0
org.apache.hadoop.hadoop-client -> 2.4.1
org.mongodb.mongo-hadoop.mongo-hadoop-core -> 1.3.1
org.mongodb.mongo-java-driver -> 2.13.0-rc1
Thanks a lot for your help!
UK: +44 207 048 37 28 ext. 107