mahout-user mailing list archives

From "Chandra Mohan, Ananda Vel Murugan" <>
Subject RE: significance of FEATURES in SGD
Date Thu, 04 Jul 2013 07:24:40 GMT
Is there any way to parallelize SGD to make it train faster? I have 2 million training samples,
and training takes at least 5-6 hours to complete. I tried the newsgroups training example; it takes
around 40 minutes, but I understand it has just 10,000 records. Is there any tuning parameter to
improve the performance?

-----Original Message-----
From: Ted Dunning [] 
Sent: Wednesday, July 03, 2013 11:05 PM
Subject: Re: significance of FEATURES in SGD

The dimensionality of the feature vector definitely has a large impact on
accuracy as well as on the cost of the learning process.

I would be very surprised if you get good accuracy with a feature vector
of dimension 100.  Even 10,000 may be a bit small, but with multiple
probes it may well work.
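To illustrate what multiple probes buy you, here is a minimal, self-contained sketch of the idea behind hashed feature encoding (the class name `HashedEncoder` and the hashing scheme are my own simplification, not Mahout's actual encoder API): each token is hashed into k slots of a fixed-cardinality vector, so with a larger cardinality plus extra probes, two distinct words are unlikely to collide on every slot.

```java
// Hypothetical sketch of hashed feature encoding with multiple probes.
// This is NOT Mahout's encoder; it only illustrates the concept that
// Ted describes: more probes spread each token across several slots,
// which softens the impact of hash collisions at a given cardinality.
class HashedEncoder {
    private final int cardinality; // length of the feature vector
    private final int probes;      // number of slots each token touches

    HashedEncoder(int cardinality, int probes) {
        this.cardinality = cardinality;
        this.probes = probes;
    }

    // Adds a token's weight into a dense double[] standing in for the
    // sparse feature vector. Each probe uses a distinct hash so the
    // token lands in (up to) 'probes' different slots.
    void addToVector(String token, double weight, double[] vector) {
        for (int probe = 0; probe < probes; probe++) {
            int slot = Math.floorMod((token + "#" + probe).hashCode(),
                                     cardinality);
            vector[slot] += weight;
        }
    }
}
```

With cardinality 100 and 2 million short documents, nearly every slot carries many unrelated words, which is why accuracy suffers; at 10,000 with two probes, a collision on one slot can still be disambiguated by the other.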

Your speed issues may also have to do with memory size.  Make sure you give
the process enough heap space to drive garbage collection overhead very low.
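For reference, the JVM heap is set with the `-Xmx` flag when launching the training process; the size, jar name, and main class below are illustrative placeholders, not values from this thread:

```shell
# Illustrative only: give the training JVM a generous heap so GC
# overhead stays low. Adjust 4g to fit your machine and data.
java -Xmx4g -cp mahout-job.jar your.training.MainClass
```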

On Wed, Jul 3, 2013 at 5:58 AM, Chandra Mohan, Ananda Vel Murugan <> wrote:

> Hi,
> I am experimenting with Mahout for text classification. I have 2 million
> training samples, i.e. texts of approximately 20 words each. They fall into
> 121 categories. I tried AdaptiveLogisticRegression. When I create sparse
> vectors of cardinality 10,000, training takes hours to converge, but with
> 100 it converges fast. Is this measure very significant in determining the
> accuracy of the model? Please advise.
> Regards,
> Anand.C