Does anyone have guidance on when to use a broadcast variable to ship data to workers vs. keeping it in an RDD?
For example, say I'm joining web logs in an RDD with user account data. I could keep the account data in an RDD or, if it's "small", put it in a broadcast variable instead. How small is small? Small enough that I know it can easily fit in memory on a single node? Or is there some other guideline?
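
To make the broadcast option concrete, here is a rough plain-Python sketch of what I picture each worker doing: every partition of the logs is joined against a full in-memory copy of the account table, so the logs never get shuffled. The sample data and the names `join_partition` and `split_into_partitions` are made up for illustration. In Spark I would expect the dict to be shipped with `sc.broadcast(...)`, read through `.value`, and the per-partition function applied with `mapPartitions`, but this sketch does not call Spark at all.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical sample data: (user_id, url) log records and a small
# user_id -> account name lookup table.
LOGS: List[Tuple[int, str]] = [
    (1, "/home"),
    (2, "/cart"),
    (1, "/checkout"),
    (3, "/home"),
]
ACCOUNTS: Dict[int, str] = {1: "alice", 2: "bob"}


def join_partition(
    records: Iterable[Tuple[int, str]],
    accounts: Dict[int, str],
) -> Iterator[Tuple[int, Tuple[str, str]]]:
    """Inner-join one partition of logs against the in-memory account table."""
    for user_id, url in records:
        account = accounts.get(user_id)
        if account is not None:  # drop log lines with no matching account
            yield user_id, (url, account)


def split_into_partitions(rows: List[Tuple[int, str]], n: int) -> List[List[Tuple[int, str]]]:
    """Stand-in for how an RDD is split across workers (round-robin here)."""
    return [rows[i::n] for i in range(n)]


# Each "worker" joins its own partition against the same full copy of ACCOUNTS.
partitions = split_into_partitions(LOGS, 2)
joined = [row for part in partitions for row in join_partition(part, ACCOUNTS)]
```

The catch, as far as I can tell, is that every worker holds the whole account table, whereas an RDD-to-RDD `join` would partition both sides by key and move the logs over the network.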