spark-user mailing list archives

From Daniel Siegmann <>
Subject Re: Spark ML - Scaling logistic regression for many features
Date Fri, 11 Mar 2016 16:35:27 GMT
On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath <> wrote:

> Would you mind letting us know the # training examples in the datasets?
> Also, what do your features look like? Are they text, categorical, etc.? You
> mention that most rows only have a few features, and all rows together have
> a few 10,000s of features, yet your max feature value is 20 million. How are
> you constructing your feature vectors to get a 20 million size? The only
> realistic way I can see this situation occurring in practice is with
> feature hashing (HashingTF).

The sub-sample I'm currently training on is about 50K rows, so ... small.

The features causing this issue are numeric (int) IDs for ... let's call it
"Thing". For each Thing in the record, we set the corresponding feature to a
value of 1.0 in our vector (which is, of course, a SparseVector). I'm not
sure how IDs are generated for Things, but they can be large numbers.

The largest Thing ID is around 20 million, so that ends up being the size
of the vector. But in fact there are fewer than 10,000 unique Thing IDs in
this data. In the set I'm currently training against, the mean number of
features per record is 41, and the maximum for any single record is 1,754.
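To make the shape of the problem concrete, here is a minimal sketch of the vector construction described above, in plain Python standing in for Spark's SparseVector (the IDs are made up for illustration): each record's Thing IDs become indices with value 1.0, and the vector size is driven by the largest possible ID, not by the number of distinct IDs actually seen.

```python
def to_sparse_features(thing_ids, size):
    """Return a (size, indices, values) triple shaped like a SparseVector."""
    indices = sorted(set(thing_ids))  # one slot per distinct Thing ID
    values = [1.0] * len(indices)     # presence indicator, not a count
    return size, indices, values

# A record with only three Thing IDs still forces a ~20M-wide vector,
# because the largest ID in the data is around 20 million:
size, indices, values = to_sparse_features([17, 4096, 19_999_999], 20_000_000)
```

The sparse representation itself is cheap; the trouble described in this thread comes from the dense structures (summarizer arrays, coefficient vectors) that are allocated at the full declared size.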

It is possible to map the features into a small, dense index space (a
zipWithIndex over the distinct IDs would do it), but this is undesirable
because of the added complexity, not just for training but also for
anything that wants to score against the model. It might be a little easier
if the mapping could be encapsulated within the model object itself
(perhaps via composition), though I'm not sure how feasible that is.
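For the record, the remapping idea sketched above looks something like this (plain Python standing in for RDD.zipWithIndex; the records are hypothetical): collect the distinct Thing IDs actually present, assign each a compact index, and rewrite every record through that index, shrinking the vector width from ~20 million to under 10,000.

```python
def build_id_index(records):
    """Map each distinct Thing ID to a compact index (like zipWithIndex)."""
    distinct_ids = sorted({tid for rec in records for tid in rec})
    return {tid: i for i, tid in enumerate(distinct_ids)}

def remap(record, id_index):
    """Rewrite one record's Thing IDs into the compact index space."""
    return sorted(id_index[tid] for tid in record)

records = [[17, 19_999_999], [4096, 17]]
id_index = build_id_index(records)
# id_index is {17: 0, 4096: 1, 19999999: 2}: vectors now need size 3, not 20M
```

The complexity cost mentioned above is visible here: `id_index` must be persisted alongside the model and applied to every record at scoring time, and unseen IDs at scoring time need an explicit policy (drop, error, or hash).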

But I'd rather not bother with dimensionality reduction at all - since we
can train using liblinear in just a few minutes, it doesn't seem necessary.

> MultivariateOnlineSummarizer uses dense arrays, but it should be possible
> to enable sparse data. Though in theory, the result will tend to be dense
> anyway, unless you have very many entries in the input feature vector that
> never occur and are actually zero throughout the data set (which it seems
> is the case with your data?). So I doubt whether using sparse vectors for
> the summarizer would improve performance in general.

Yes, that is exactly my case - the vast majority of entries in the input
feature vector will *never* occur. Presumably that means most of the values
in the aggregators' arrays will be zero.
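To illustrate the sparsity a summarizer could exploit in this situation, here is a dict-based accumulator (a hypothetical sketch, not Spark's MultivariateOnlineSummarizer implementation): it stores statistics only for feature indices that actually occur, so storage scales with the <10K active features rather than the 20M-wide vector space.

```python
from collections import defaultdict

def summarize_sparse(records):
    """Accumulate per-feature nonzero counts and sums, keyed by index.

    Each record is an (indices, values) pair in SparseVector style.
    Only indices that appear in some record ever get an entry.
    """
    count = defaultdict(int)
    total = defaultdict(float)
    for indices, values in records:
        for i, v in zip(indices, values):
            count[i] += 1
            total[i] += v
    return count, total

count, total = summarize_sparse([([17, 4096], [1.0, 1.0]),
                                 ([17], [1.0])])
# Only two keys are stored, regardless of how wide the vectors are declared.
```

As Nick notes above, this only pays off when many indices never occur at all; for statistics like the mean over all declared dimensions, the implicit zeros still have to be accounted for at read-out time.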

> LR doesn't accept a sparse weight vector, as it uses dense vectors for
> coefficients and gradients currently. When using L1 regularization, it
> could support sparse weight vectors, but the current implementation doesn't
> do that yet.

Good to know it is theoretically possible to implement. I'll have to give
it some thought. In the meantime, I guess I'll experiment with coalescing
the data to minimize the communication overhead.

Thanks again.
