One issue is that 'big' becomes 'not so big' reasonably quickly. A couple of terabytes is not that challenging (depending on the algorithm) these days, whereas 5 years ago it was a big challenge. We have a bit over a petabyte (not using Spark), and using a distributed system is the only viable way to get reasonable performance for reasonable cost.


On Tue, Mar 31, 2015 at 4:55 AM, Steve Loughran <> wrote:

On 30 Mar 2015, at 13:27, jay vyas <> wrote:

Just as Spark disrupted the Hadoop ecosystem by changing the assumption that "you can't rely on memory in distributed analytics", maybe we are now challenging the assumption that "big data analytics needs to be distributed"?

I've been asking the same question lately and have similarly seen that Spark performs quite reliably and well on a local single-node system, even for a streaming app which I ran for ten days in a row... I almost felt guilty that I never put it on a cluster!

Modern machines can be pretty powerful: 16 physical cores HT'd to 32, 384+ GB of RAM, GPU, giving you lots of compute. What you don't get is the storage capacity to match, and especially, the IO bandwidth. RAID-0 striping 2-4 HDDs gives you some boost, but if you are reading, say, a 4 GB file from HDFS broken into 256 MB blocks, you have that data replicated into (4*4*3) blocks: 48. Algorithm and capacity permitting, you've just massively boosted your load speed, since those blocks can be read in parallel from many disks. Downstream, if data can be thinned down, then you can start looking more at things you can do on a single host: a machine that can be in your Hadoop cluster. Ask YARN nicely and you can get a dedicated machine for a couple of days (i.e. until your Kerberos tokens expire).
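
As a rough sketch of the block arithmetic above (the function name and its defaults are purely illustrative and not part of any HDFS API; the 256 MB block size and replication factor of 3 are the figures from the example):

```python
def hdfs_block_replicas(file_size_mb, block_size_mb=256, replication=3):
    """Return (blocks, block_replicas) for a file of the given size.

    Hypothetical helper: it only does the arithmetic and does not
    talk to HDFS at all.
    """
    # Ceiling division: a partially filled last block still counts as a block.
    blocks = -(-file_size_mb // block_size_mb)
    return blocks, blocks * replication


# The 4 GB example: 4 GB * 4 blocks per GB * 3 replicas.
blocks, replicas = hdfs_block_replicas(4 * 1024)
print(blocks, replicas)  # 16 48
```

Those 48 block replicas are what give a cluster the chance to spread the read over many disks, which a single host with 2-4 striped HDDs cannot match.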


Franc Carter | Systems Architect | RoZetta Technology





