spark-user mailing list archives

From Brad Miller <bmill...@eecs.berkeley.edu>
Subject odd caching behavior or accounting
Date Mon, 30 Jun 2014 21:29:45 GMT
Hi All,

I am resending this message because I suspect the original may have been
blocked from the mailing list due to attachments.  Note that the mail does
appear on the apache archives
<http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3CCANR-kKeO3mxL1QuX0fnz0DEPkU4FFbXO2W_5CdmtrzYKUfhaBg%40mail.gmail.com%3E>
but not on nabble, the online archive linked from the Spark website
<http://apache-spark-user-list.1001560.n3.nabble.com/>.

The text of the original message appears below; the PDF
<http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/raw/%3cCANR-kKeO3mxL1QuX0fnz0DEPkU4FFbXO2W_5CdmtrzYKUfhaBg@mail.gmail.com%3e/2>
and PNG
<http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/raw/%3cCANR-kKeO3mxL1QuX0fnz0DEPkU4FFbXO2W_5CdmtrzYKUfhaBg@mail.gmail.com%3e/3>
files originally attached are now available as linked from the apache archive.

best,
-Brad


---------- Forwarded message ----------
From: Brad Miller <bmiller1@eecs.berkeley.edu>
Date: Mon, Jun 30, 2014 at 10:20 AM
Subject: odd caching behavior or accounting
To: user@spark.apache.org


Hi All,

I've recently noticed some caching behavior which I did not understand
and which may or may not indicate a bug.  In short, the web UI seemed
to indicate that some blocks were being added to the cache despite
already being in the cache.

As documentation, I have attached two UI screenshots.  The PNG
captures enough of the screen to demonstrate the problem; the PDF is
the printout of the full page.  Notice that:

-block rdd_21_1001 is in the cache twice, both times on
letang.research.intel-research.net; many other blocks also occur twice
on a variety of hosts.  I've not confirmed that the duplicate copy is
*always* on the same host, but it appears that way.

-the stated storage level is "Memory Deserialized 1x Replicated"

-the top left states that the "cached partitions" and "total
partitions" are both 4000, but the table where partitions are
enumerated lists 4534.
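
For anyone who wants to check the same thing against their own UI, here is a
rough sketch of how the duplicate rows could be counted once the partition
table has been copied out as (block name, host) pairs.  The function names and
the sample rows are hypothetical (only rdd_21_1001 and its host come from the
screenshot), and this is plain Python, not a Spark API.  With the figures
above, the surplus would be 4534 - 4000 = 534 rows.

```python
from collections import Counter


def find_duplicate_blocks(rows):
    """Return {(block_name, host): count} for rows listed more than once.

    `rows` is an iterable of (block_name, host) tuples, e.g. copied by hand
    from the partition table in the web UI (hypothetical input format).
    """
    counts = Counter(rows)
    return {key: n for key, n in counts.items() if n > 1}


def summarize(rows, total_partitions):
    """Compare the number of listed rows with the stated partition count."""
    rows = list(rows)
    distinct_blocks = {name for name, _host in rows}
    return {
        "listed_rows": len(rows),
        "distinct_blocks": len(distinct_blocks),
        "surplus_rows": len(rows) - total_partitions,
    }


# Hypothetical sample: only rdd_21_1001 and its host are taken from the report.
sample_rows = [
    ("rdd_21_1000", "host-a.example"),
    ("rdd_21_1001", "letang.research.intel-research.net"),
    ("rdd_21_1001", "letang.research.intel-research.net"),
    ("rdd_21_1002", "host-b.example"),
]

duplicates = find_duplicate_blocks(sample_rows)
summary = summarize(sample_rows, total_partitions=3)
```

If every duplicated block turns out to be listed exactly twice, the surplus row
count equals the number of duplicated blocks; a Counter keyed on block name
alone (instead of name and host) would show whether any copies sit on
different hosts.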

Although not reflected in this screenshot, I believe I have seen this
behavior occur even when double caching of blocks causes eviction of
blocks from other RDDs.  I am running the Spark 1.0.0 release and
using pyspark.

best,
-Brad
