spark-user mailing list archives

From Nick Pentreath <nick.pentre...@gmail.com>
Subject Re: can mllib Logistic Regression package handle 10 million sparse features?
Date Thu, 06 Oct 2016 11:09:55 GMT
I'm currently working on various performance tests for large, sparse
feature spaces.

For the Criteo DAC data set - 45.8 million rows and 34.3 million features
(categorical, extremely sparse) - the time per iteration for
ml.LogisticRegression is about 20-30 seconds.

This is with four worker nodes, each with 48 cores and 120 GB RAM. I haven't
yet tuned the tree-aggregation depth, but the number of partitions can make a
difference - generally fewer is better, since the cost is dominated by
communicating the gradient (the gradient computation itself is < 10% of the
per-iteration time).
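
A minimal sketch of the knobs discussed above, assuming the Spark 2.x
ml.LogisticRegression API in a spark-shell session (where the SparkSession
`spark` is predefined); the input path, column layout, and parameter values
are illustrative, not those of the runs described:

    import org.apache.spark.ml.classification.LogisticRegression

    // Hypothetical training DataFrame with "label" and sparse "features"
    // columns. Coalescing to fewer partitions reduces the number of (dense)
    // gradient vectors shuffled in each treeAggregate round.
    val train = spark.read.parquet("hdfs:///path/to/criteo-dac").coalesce(48)

    val lor = new LogisticRegression()
      .setMaxIter(100)
      .setRegParam(0.01)
      .setAggregationDepth(4) // treeAggregate depth; the default is 2

    val model = lor.fit(train)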

Note that the current implementation forces dense arrays for intermediate
data structures, which increases the communication cost significantly. See
this PR for details: https://github.com/apache/spark/pull/12761. Once sparse
data structures are supported there, the linear models will be orders of
magnitude more scalable for sparse data.
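
To make the sparsity concrete, here is what a single input row looks like as
an ml.linalg sparse vector (the dimension and indices below are made up): the
rows themselves stay tiny, but until the PR above lands the aggregated
gradient is shipped as a dense array of the full feature dimension.

    import org.apache.spark.ml.linalg.Vectors

    // One Criteo-style row: ~34.3M-dimensional, but only a few one-hot
    // categorical indices are active (indices must be in ascending order).
    val row = Vectors.sparse(34300000,
      Array(7, 1048576, 33000000), Array(1.0, 1.0, 1.0))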


On Wed, 5 Oct 2016 at 23:37 DB Tsai <dbtsai@dbtsai.com> wrote:

> With the latest code in the current master, we're successfully
> training LOR using Spark ML's implementation with 14M sparse features.
> You need to tune the depth of aggregation to make it efficient.
>
> Sincerely,
>
> DB Tsai
> ----------------------------------------------------------
> Web: https://www.dbtsai.com
> PGP Key ID: 0x9DCC1DBD7FC7BBB2
>
>
> On Wed, Oct 5, 2016 at 12:00 PM, Yang <teddyyyy123@gmail.com> wrote:
> > Has anybody had actual experience applying it to real problems of this scale?
> >
> > thanks
> >
>
