tinkerpop-dev mailing list archives

From Marko Rodriguez <okramma...@gmail.com>
Subject Blade testing 3.2.0-SNAPSHOT (master/)
Date Tue, 05 Apr 2016 14:32:18 GMT
Hi,

Yesterday and this morning, ahead of our VOTE on Friday, I manually tested TinkerPop 3.2.0-SNAPSHOT
on 4 Blades using the Friendster dataset (2.5 billion edges). I noticed that Spark 1.6.1 is fickle
and Netty-based network errors occur "easily." I dropped back down to 1.5.2 and saw no errors. I
think one of the problems is GC in Spark 1.6.1 when using the MEMORY_XXX storage levels. With
DISK_ONLY the issues went away for the simple query g.V().count() (which only repartitions -- no
message passing). In 1.5.2 you still get GC stalls with the MEMORY_XXX storage levels, but no
[ERROR]s (and no stack traces with failed tasks). Next, I ran a more complex query --
g.V().out().out().count() -- and Spark 1.6.1 had failed tasks even with DISK_ONLY. Bummer. As a
last check, I changed the ratio of SPARK_WORKER_INSTANCES to SPARK_WORKER_CORES from 4/6 to 6/4
and everything started to work again with Spark 1.6.1 (rough config sketch below).
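For anyone who wants to reproduce the setup, the tuning boiled down to roughly the following. Treat
the Gremlin property name as from memory (verify it against the 3.2.0 reference docs) and the master
URL as a placeholder:

  # conf/spark-env.sh on each Blade: more worker instances, fewer cores per worker (the 6/4 split)
  export SPARK_WORKER_INSTANCES=6
  export SPARK_WORKER_CORES=4

  # SparkGraphComputer properties file: keep the graph on disk instead of a MEMORY_XXX level.
  # (gremlin.spark.graphStorageLevel is written from memory -- check the exact property name in the
  #  3.2.0 docs; spark://master-host:7077 is just a placeholder master URL.)
  spark.master=spark://master-host:7077
  gremlin.spark.graphStorageLevel=DISK_ONLY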

In short, memory management and the appropriate worker/core ratio in Spark 1.6.1 are "different"
from Spark 1.5.2. I was able to get the same speeds on 1.6.1 as with 1.5.2; I just had to do things
a little differently. In fact, 1.6.1 seems a bit faster -- a 55-minute job on 1.5.2 took 50 minutes
on 1.6.1.

I think it is safe to release TinkerPop 3.2.0 with Spark 1.6.1, but we will just have to be ready
to tell people to adjust their worker/core ratio (more worker instances with fewer cores each) and
to use DISK_ONLY if they are GC stalling a lot (see the quick console sketch below). Finally, this
testing also verified that our bump to Hadoop 2.7.2 didn't cause any problems, and I was able to
confirm that a few knick-knack bugs around FileSystemStorage no longer exist.
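If it helps when we document this, the user-facing side is just the usual SparkGraphComputer recipe
in the console -- something like the following, where hadoop-gryo.properties stands in for whatever
properties file carries the settings sketched above:

  gremlin> graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
  gremlin> g = graph.traversal().withComputer(SparkGraphComputer)
  gremlin> g.V().count()
  gremlin> g.V().out().out().count()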

Thanks,
Marko.

http://markorodriguez.com

