spark-user mailing list archives

From Brad Miller <>
Subject odd caching behavior or accounting
Date Mon, 30 Jun 2014 21:29:45 GMT
Hi All,

I am resending this message because I suspect the original may have been
blocked from the mailing list due to attachments.  Note that the mail does
appear on the Apache archives but not on Nabble, the online archive linked
from the Spark website.

The text of the original message appears below; the PDF and PNG
originally attached are now available as linked from the Apache archive.


---------- Forwarded message ----------
From: Brad Miller <>
Date: Mon, Jun 30, 2014 at 10:20 AM
Subject: odd caching behavior or accounting

Hi All,

I've recently noticed some caching behavior which I did not understand
and which may or may not indicate a bug.  In short, the web UI seemed
to indicate that some blocks were being added to the cache despite
already being in the cache.

As documentation, I have attached two UI screenshots.  The PNG
captures enough of the screen to demonstrate the problem; the PDF is
the printout of the full page.  Notice that:

-block rdd_21_1001 is in the cache twice, both times on the same host;
many other blocks also occur twice, on a variety of hosts.  I've not
confirmed that the duplicate block is *always* on the same host, but it
appears that way.

-the stated storage level is "Memory Deserialized 1x Replicated"

-the top left states that the "cached partitions" and "total
partitions" are 4000, but in the table where partitions are enumerated
there are 4534.
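
For anyone who wants to check the same thing on their own cluster, here is a
minimal sketch of the accounting above.  It assumes the rows of the partition
table have been copied out of the web UI by hand as (block name, host) pairs;
the function names and the host names in the sample data are hypothetical and
are not part of any Spark API.  With the numbers above, 4534 rows against
4000 partitions would mean 534 extra rows.

```python
from collections import defaultdict


def summarize_block_rows(rows):
    """Group (block_id, host) rows by block and report the duplicates.

    Returns a dict mapping each block that appears more than once to the
    list of hosts it was listed on, plus the number of rows in excess of
    the number of distinct blocks.
    """
    hosts_by_block = defaultdict(list)
    for block_id, host in rows:
        hosts_by_block[block_id].append(host)

    duplicates = {
        block_id: hosts
        for block_id, hosts in hosts_by_block.items()
        if len(hosts) > 1
    }
    extra_rows = len(rows) - len(hosts_by_block)
    return duplicates, extra_rows


def same_host_duplicates(duplicates):
    """Return the duplicated blocks whose copies are all on a single host."""
    return {
        block_id
        for block_id, hosts in duplicates.items()
        if len(set(hosts)) == 1
    }


# Hypothetical sample rows; real rows would come from the UI table.
rows = [
    ("rdd_21_1001", "host-a"),
    ("rdd_21_1001", "host-a"),
    ("rdd_21_1002", "host-b"),
]
duplicates, extra_rows = summarize_block_rows(rows)
print(duplicates, extra_rows, same_host_duplicates(duplicates))
```

If every duplicated block shows up in same_host_duplicates, that would
support the impression that the repeated entries are always on the same host.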

Although not reflected in this screenshot, I believe I have seen this
behavior occur even when double caching of blocks causes eviction of
blocks from other RDDs.  I am running the Spark 1.0.0 release and
using pyspark.

