spark-user mailing list archives

From <Ioannis.Deligian...@nomura.com>
Subject RDD persist() not honoured
Date Fri, 25 Nov 2016 14:23:17 GMT
Hi,

I have run into a weird caching problem (using Spark 1.3.1 + Java 1.8.0) that I can only explain
as a bug.

In summary, I source the RDD from an Avro file, apply a mapToPair function, then cache and
count. However, the RDD is not cached, nor does it appear in the Spark UI Storage tab. (It is
not cached at all, not even partially.)
    JavaSparkContext ctx = …;
    JavaRDD a = …;
    JavaPairRDD b = a.mapToPair(..).cache();
    b.count(); // RDD is not cached.

I looked around but could not find any known bugs around this.

I debugged the b RDD and it is set as cached:
(80) MapPartitionsRDD[31] at mapToPair at ABC.java:684 [Memory Deserialized 1x Replicated]
|   RDD1 MapPartitionsRDD[22] at map at XXXAvroDao.java:xx [Memory Deserialized 1x Replicated]
|   MapPartitionsRDD[21] at keys at XXXAvroDao.java:xx [Memory Deserialized 1x Replicated]
|   maprfs:/mapr/XXX NewHadoopRDD[20] at newAPIHadoopFile at XXXAvroDao.java:xx [Memory Deserialized 1x Replicated]

I also checked the b RDD storage level using a debugger and it seems correctly set as well.
StorageLevel(false, true, false, true, 1)
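
As a cross-check, the five values printed above can be read against the "[Memory Deserialized 1x Replicated]" label in the debug string. The sketch below is a standalone illustration, not Spark's own StorageLevel class; the class name StorageLevelFlags and the method describe are hypothetical. It assumes the printed values are, in order, useDisk, useMemory, useOffHeap, deserialized and replication, and the off-heap label is a placeholder, since that label is not shown anywhere in this message.

```java
// Standalone illustration only: this is NOT Spark's StorageLevel class.
// The class and method names are hypothetical.
final class StorageLevelFlags {
    private StorageLevelFlags() {}

    // Assumed argument order, following the printed form:
    // StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)
    static String describe(boolean useDisk, boolean useMemory, boolean useOffHeap,
                           boolean deserialized, int replication) {
        StringBuilder sb = new StringBuilder();
        if (useDisk) sb.append("Disk ");
        if (useMemory) sb.append("Memory ");
        // Assumption: placeholder label; the real off-heap label may differ.
        if (useOffHeap) sb.append("OffHeap ");
        sb.append(deserialized ? "Deserialized " : "Serialized ");
        sb.append(replication).append("x Replicated");
        return sb.toString();
    }

    public static void main(String[] args) {
        // The values reported by the debugger for RDD b.
        System.out.println(describe(false, true, false, true, 1));
    }
}
```

Under that assumed ordering, the debugger values correspond to a memory-only, deserialized, single-replica level, which agrees with the label in the debug string.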

Now things get more interesting, as the following does result in a cached RDD:
    a.cache().count();

The following also works:
    ctx.parallelize(b.take(1000)).cache().count();

However, every attempt to "fool" b.cache() fails as well (the action completes but the data
is not cached at all). E.g.
    b.repartition(150).cache().count();
    b.values().cache().count();
    b.keys().cache().count();
    b.persist(StorageLevel.DISK_ONLY()).count();
    b.persist(StorageLevel.MEMORY_ONLY()).count();
    b.persist(StorageLevel.MEMORY_ONLY_SER()).count();
    b.unpersist().cache().count();


I haven't managed to replicate the issue without the exact data, so I cannot provide a
reproducible example: it works just fine with every other data type I have and every example
I tried.

Any ideas on where I should look?

Thanks.


