From Steve Loughran
Subject Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?
Date Mon, 30 Mar 2015 17:55:21 GMT

On 30 Mar 2015, at 13:27, jay vyas wrote:

Just as Spark disrupted the Hadoop ecosystem by changing the assumption that
"you can't rely on memory in distributed analytics", maybe we are challenging the assumption
that "big data analytics need to be distributed"?

I've been asking the same question lately, and have similarly seen that Spark performs quite reliably
and well on a local single-node system, even for a streaming app which
I ran for ten days in a row... I almost felt guilty that I never put it on a cluster!

Modern machines can be pretty powerful: 16 physical cores HT'd to 32, 384+ GB of RAM, a GPU, giving
you lots of compute. What you don't get is the storage capacity to match, and especially,
the IO bandwidth. RAID-0 striping 2-4 HDDs gives you some boost, but if you are reading, say,
a 4 GB file from HDFS broken into 256 MB blocks, you have that data replicated into (4*4*3)
blocks: 48. Algorithm and capacity permitting, you can read those in parallel and massively cut your load time.
Downstream, if data can be thinned down, then you can start looking more at things you can
do on a single host: a machine that can be in your Hadoop cluster. Ask YARN nicely and you
can get a dedicated machine for a couple of days (i.e. until your Kerberos tokens expire).
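
For anyone who wants to redo the (4*4*3) sum above with their own numbers, here is a rough sketch in Python. The function name and arguments are made up for illustration, and it assumes a replication factor of 3 as in the example; it only counts block replicas and says nothing about where HDFS actually places them.

```python
def hdfs_block_replicas(file_bytes, block_bytes, replication=3):
    """Return (blocks, block_replicas) for a file stored in HDFS.

    Hypothetical helper: replication=3 is the factor used in the example
    above, not something read from a real cluster's configuration.
    """
    # Ceiling division: a final partial block still counts as one block.
    blocks = -(-file_bytes // block_bytes)
    return blocks, blocks * replication


MB = 1024 ** 2
GB = 1024 ** 3

# The example from the text: a 4 GB file in 256 MB blocks, 3 replicas each.
blocks, replicas = hdfs_block_replicas(4 * GB, 256 * MB)
print(blocks, replicas)  # 16 48
```

The 48 is an upper bound on the number of places the data can be read from at once; how much of that you actually see depends on the scheduler and on how busy the disks are.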

