Hi Nick,
I'm also working on benchmarking linear models in Spark. :)
One thing I saw is that for sparse features (14 million features) with
multi-depth aggregation, the final aggregation to the driver is
extremely slow. See the attachment. The amount of data exchanged
between executors is significantly larger than the amount collected
into the driver, but collecting the data back to the driver takes
4 mins while the aggregation between executors only takes 20 secs. It
seems the code paths are different, and I suspect there may be
something in Spark core that we can optimize.
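To make the depth discussion concrete, here is a toy, non-Spark sketch
of what tree aggregation does; `treeReduce`, `fanIn`, and `merge` are
made-up names for illustration, not Spark APIs:

```scala
// Toy sketch (not Spark code): tree aggregation merges partition
// results level by level, so the final "driver" step only has to
// merge a handful of partials instead of all of them at once.
def treeReduce[A](partials: Seq[A], fanIn: Int)(merge: (A, A) => A): A = {
  var level = partials
  while (level.size > fanIn) {
    // each group of `fanIn` partials is merged executor-side
    level = level.grouped(fanIn).map(_.reduce(merge)).toSeq
  }
  level.reduce(merge) // final, driver-side merge over few partials
}

// 64 partition gradients of length 4, summed element-wise
val grads = (1 to 64).map(i => Array.fill(4)(i.toDouble))
val sum = treeReduce(grads, 8)((a, b) => a.zip(b).map { case (x, y) => x + y })
```

The slow step in the attachment corresponds to that last merge, which
is why it is surprising that it dominates the runtime.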
Regarding using a sparse data structure for aggregation, I'm not so
sure how much this will improve performance. After computing the
gradient sum for all the data in one partition, the vector will no
longer be very sparse. Even if it is still sparse, after a couple of
depths of aggregation it will be very dense. Also, we perform
compression in the shuffle phase, so if there are many zeros, the
vector should take around the same size in a dense representation as
in a sparse one. I could be wrong since I have never done a study on
this, and I wonder how much performance we can gain in practice by
using sparse vectors for aggregating the gradients.
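As a rough illustration of the compression point (a toy sketch, not a
benchmark of Spark's shuffle; `compressedSize` is a made-up helper
using plain Deflate):

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.DeflaterOutputStream

// Deflate-compress a byte array and return the compressed length.
def compressedSize(bytes: Array[Byte]): Int = {
  val bos = new ByteArrayOutputStream()
  val dos = new DeflaterOutputStream(bos)
  dos.write(bytes)
  dos.close()
  bos.size()
}

// A mostly-zero dense gradient vector: 100k doubles, ~0.1% nonzero.
val n = 100000
val dense = new Array[Double](n)
(0 until n by 1000).foreach(i => dense(i) = i.toDouble)

// Serialize it the naive way (8 bytes per double) and compress.
val buf = java.nio.ByteBuffer.allocate(n * 8)
dense.foreach(d => buf.putDouble(d))
val raw = buf.array()

println(s"raw: ${raw.length} bytes, compressed: ${compressedSize(raw)} bytes")
```

The long runs of zero bytes compress to almost nothing, which is the
intuition for why a dense-but-compressed vector may end up close to a
sparse encoding on the wire.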
Sincerely,
DB Tsai

Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D
On Thu, Oct 6, 2016 at 4:09 AM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
> I'm currently working on various performance tests for large, sparse feature
> spaces.
>
> For the Criteo DAC data - 45.8 million rows, 34.3 million features
> (categorical, extremely sparse), the time per iteration for
> ml.LogisticRegression is about 20-30s.
>
> This is with 4x worker nodes, 48 cores & 120GB RAM each. I haven't yet tuned
> the tree aggregation depth. But the number of partitions can make a
> difference - generally fewer is better since the cost is mostly
> communication of the gradient (the gradient computation is < 10% of the
> per-iteration time).
>
> Note that the current impl forces dense arrays for intermediate data
> structures, increasing the communication cost significantly. See this PR for
> info: https://github.com/apache/spark/pull/12761. Once sparse data
> structures are supported for this, the linear models will be orders of
> magnitude more scalable for sparse data.
>
>
> On Wed, 5 Oct 2016 at 23:37 DB Tsai <dbtsai@dbtsai.com> wrote:
>>
>> With the latest code in the current master, we're successfully
>> training LOR using Spark ML's implementation with 14M sparse features.
>> You need to tune the depth of aggregation to make it efficient.
>>
>> Sincerely,
>>
>> DB Tsai
>> 
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0x9DCC1DBD7FC7BBB2
>>
>>
>> On Wed, Oct 5, 2016 at 12:00 PM, Yang <teddyyyy123@gmail.com> wrote:
>> > anybody had actual experience applying it to real problems of this
>> > scale?
>> >
>> > thanks
>> >
>>
>> 
>> To unsubscribe email: user-unsubscribe@spark.apache.org
>>
>
