spark-user mailing list archives

From Brad Miller
Subject odd caching behavior or accounting
Date Mon, 30 Jun 2014 17:20:12 GMT
Hi All,

I've recently noticed some caching behavior that I do not understand
and that may or may not indicate a bug.  In short, the web UI seems
to show some blocks being added to the cache despite already being
in the cache.

As documentation, I have attached two UI screenshots.  The PNG
captures enough of the screen to demonstrate the problem; the PDF is
the printout of the full page.  Notice that:

-block rdd_21_1001 is in the cache twice, both times on the same
host; many other blocks also occur twice, on a variety of hosts.  I've
not confirmed that the duplicate block is *always* on the same host,
but it appears that way.

-the stated storage level is "Memory Deserialized 1x Replicated"

-the top left states that the "cached partitions" and "total
partitions" are both 4000, but the table where partitions are
enumerated has 4534 entries.
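
For anyone who wants to check the same thing on their own cluster, here
is a minimal sketch of how the duplicates could be tallied.  It assumes
the rows of the partition table have been transcribed by hand into
(block name, host) pairs; the function name, the sample rows and the
host names are all made up for illustration and nothing here calls a
Spark API.  If the RDD really has 4000 distinct partitions, 4534 rows
would mean 534 extra entries.

```python
from collections import Counter


def find_duplicate_blocks(rows):
    """Summarize (block_name, host) pairs transcribed from the UI table.

    Hypothetical helper: it only counts rows, it does not talk to Spark.
    """
    rows = list(rows)

    # How many times each block name is listed in the table.
    per_block = Counter(block for block, _ in rows)

    # Which hosts each block name is listed on.
    hosts = {}
    for block, host in rows:
        hosts.setdefault(block, set()).add(host)

    # Blocks listed more than once, and whether every listing of a
    # duplicated block is on a single host.
    duplicates = {b: n for b, n in per_block.items() if n > 1}
    same_host = {b: len(hosts[b]) == 1 for b in duplicates}

    return {
        "listed_rows": len(rows),
        "distinct_blocks": len(per_block),
        "duplicates": duplicates,
        "same_host": same_host,
    }


# Made-up sample rows; the host names are placeholders.
sample = [
    ("rdd_21_1001", "host-a"),
    ("rdd_21_1001", "host-a"),
    ("rdd_21_1002", "host-b"),
]
summary = find_duplicate_blocks(sample)
```

Comparing "listed_rows" with "distinct_blocks" gives the number of
extra entries, and "same_host" shows whether each duplicate is listed
on one host or on several.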

Although not reflected in this screenshot, I believe I have seen this
behavior occur even when double caching of blocks causes eviction of
blocks from other RDDs.  I am running the Spark 1.0.0 release and
using pyspark.

