spark-user mailing list archives

From minerva <>
Subject Re: Out of memory exception in MLlib's naive baye's classification training
Date Thu, 20 Aug 2015 09:25:48 GMT
I used Mahout for text classification and am now trying Spark.

I had the same problem training Naive Bayes with only 569 documents.

I solved it by using htf = HashingTF(5000) instead of htf = HashingTF()
(the default feature space is 2^20). I don't know whether this can be
considered a long-term solution (what will happen when I train with many
more documents?), but I have two bigger issues at the moment.
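
A rough sketch of why the smaller feature space helps, in plain Python. It assumes the trainer keeps one dense vector of doubles of length numFeatures per class; that is an assumption about MLlib's internals and is not stated in this thread. The helper names dense_model_bytes and hashed_term_counts are hypothetical and exist only for illustration.

```python
def dense_model_bytes(num_classes, num_features, bytes_per_value=8):
    """Rough size of a dense num_classes x num_features table of doubles."""
    return num_classes * num_features * bytes_per_value


def hashed_term_counts(tokens, num_features):
    """Hashing trick: count tokens per bucket in the range [0, num_features)."""
    counts = {}
    for token in tokens:
        index = hash(token) % num_features  # distinct tokens may share a bucket
        counts[index] = counts.get(index, 0) + 1
    return counts


# Per-class footprint with the default 2^20 features versus 5000 features.
default_size = dense_model_bytes(num_classes=1, num_features=2 ** 20)
reduced_size = dense_model_bytes(num_classes=1, num_features=5000)
print(default_size, reduced_size)  # prints: 8388608 40000
```

Under that assumption the vector length is fixed by the feature count, not by the number of documents, so training on more documents should not by itself make these vectors larger. The cost of a small feature space is that a growing vocabulary causes more hash collisions, which can hurt accuracy.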

My first issue is creating the LabeledPoint objects for the Bayes model.
The TF-IDF transformation returns an RDD of sparse vectors, and I saved my
labels (categories) in another RDD.

I still haven't found a good way to combine the two when creating the
LabeledPoints.
My current solution needs a lot of collects (one per document). Each collect
takes 4 seconds (running on a VM with 16 GB RAM and 8 cores), so it takes
about 40 minutes just to create the LabeledPoints after the TF-IDF
calculation.
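
One common way to avoid the per-document collects is to pair the two RDDs on the cluster with RDD.zip. The sketch below assumes that the label RDD and the TF-IDF RDD both derive from the same parent RDD through map-style steps only, so they have the same number of partitions and the same number of elements in each partition. The function names tfidf_vectors and to_labeled_points and the variable docs are hypothetical.

```python
def tfidf_vectors(token_lists, num_features=5000):
    """token_lists: RDD of token lists. Returns an RDD of TF-IDF vectors."""
    # Imported lazily so this file can be loaded without a Spark installation.
    from pyspark.mllib.feature import HashingTF, IDF

    tf = HashingTF(num_features).transform(token_lists)
    tf.cache()  # the term frequencies are read twice: by fit() and transform()
    return IDF().fit(tf).transform(tf)


def to_labeled_points(labels, features):
    """Pair the i-th label with the i-th feature vector, with no collect()."""
    from pyspark.mllib.regression import LabeledPoint

    return labels.zip(features).map(lambda lf: LabeledPoint(lf[0], lf[1]))


# Hypothetical usage, assuming `docs` is an RDD of (numeric_label, tokens):
#   labels = docs.map(lambda d: d[0])
#   tfidf = tfidf_vectors(docs.map(lambda d: d[1]))
#   points = to_labeled_points(labels, tfidf)
```

Because zip runs inside Spark, nothing is pulled back to the driver one document at a time, which is where the 4 seconds per collect were being spent.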

My second issue is that saving labels and features separately and combining
them later could cause problems when running on more nodes (I am currently
running on a single node), because I cannot be sure that the order of the
saved labels will match the order of the features in the sparse vector RDD.
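
A possible way to make the pairing explicit rather than positional is to give both RDDs an index key and join on it. This is only a sketch: zipWithIndex numbers elements by partition order, so it still assumes that both RDDs come from the same parent through order-preserving steps such as map, with no shuffle in between. What it avoids is the requirement that both RDDs have identical partition sizes. The function name join_by_index is hypothetical.

```python
def join_by_index(labels, features):
    """Key both RDDs by element position, then join them on that key."""
    # Imported lazily so this file can be loaded without a Spark installation.
    from pyspark.mllib.regression import LabeledPoint

    # zipWithIndex yields (value, index); swap to (index, value) for the join.
    keyed_labels = labels.zipWithIndex().map(lambda li: (li[1], li[0]))
    keyed_features = features.zipWithIndex().map(lambda fi: (fi[1], fi[0]))

    # join yields (index, (label, vector)); the index is dropped afterwards.
    joined = keyed_labels.join(keyed_features)
    return joined.map(lambda kv: LabeledPoint(kv[1][0], kv[1][1]))
```

The join causes a shuffle, so this is likely slower than a plain zip. The output order is not guaranteed, which does not matter here because each LabeledPoint carries its own label.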

Is there a post or best-practice guide I can read to solve these two issues?
Thanks a lot!
