spark-dev mailing list archives

From "Ulanov, Alexander"
Subject RE: Storing large data for MLlib machine learning
Date Thu, 26 Mar 2015 21:33:08 GMT
Thanks for the suggestion, but libsvm is a text format for storing sparse data, and I have
dense vectors. In my opinion, a text format is not appropriate for storing large dense vectors:
parsing strings into numbers adds overhead, and storing numbers as strings is not space
efficient.
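
To make the text-versus-binary point concrete, here is a minimal sketch in plain Python (not a Spark or MLlib API). It encodes one labeled dense vector both as a line of space-separated decimals and as packed little-endian float64 values, then compares the sizes. The function names and the 1000-feature random vector are assumptions chosen only for illustration.

```python
import random
import struct


def encode_text(label, features):
    # One record per line: the label, then the features, as space-separated decimals.
    values = [label] + list(features)
    return (" ".join(repr(x) for x in values) + "\n").encode("ascii")


def encode_binary(label, features):
    # The label and every feature as a little-endian float64, with no delimiters.
    return struct.pack("<%dd" % (len(features) + 1), label, *features)


def decode_text(data):
    # Every token has to be parsed from a string back into a float.
    values = [float(tok) for tok in data.decode("ascii").split()]
    return values[0], values[1:]


def decode_binary(data):
    # The bytes are reinterpreted directly; no string parsing is involved.
    values = struct.unpack("<%dd" % (len(data) // 8), data)
    return values[0], list(values[1:])


# Hypothetical sample: one label and 1000 random dense features.
rng = random.Random(0)
features = [rng.random() for _ in range(1000)]

text_bytes = encode_text(1.0, features)
binary_bytes = encode_binary(1.0, features)

print("text bytes:  ", len(text_bytes))
print("binary bytes:", len(binary_bytes))
```

The binary record is always 8 bytes per value, while a full-precision decimal string for a float64 usually needs more characters than that, plus a delimiter. How large the gap is in practice depends on the data and on how many digits are written.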

From: Stephen Boesch
Sent: Thursday, March 26, 2015 2:27 PM
To: Ulanov, Alexander
Subject: Re: Storing large data for MLlib machine learning

There are some convenience methods you might consider including:


and   MLUtils.loadLabeledPoint

2015-03-26 14:16 GMT-07:00 Ulanov, Alexander:

Could you suggest a reasonable file format for storing feature vector data for
machine learning in Spark MLlib? Are there any best practices for Spark?

My data is dense feature vectors with labels. Some of the requirements are that the format
should be easy to load and serialize, randomly accessible, and compact (binary). I am
considering Parquet, HDF5, and Protocol Buffers (protobuf), but I have little to no experience
with them, so any suggestions would be really appreciated.

Best regards, Alexander
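
As an illustration of what "randomly accessible" and "binary" can mean for dense labeled vectors, here is a minimal sketch of a fixed-width record layout in plain Python. It is not Parquet, HDF5, or protobuf, and it is not a Spark API. The layout (a float64 label followed by a fixed number of float64 features) and all function names are assumptions made for the example.

```python
import io
import struct


def record_struct(num_features):
    # Assumed layout: float64 label followed by num_features float64 values,
    # little-endian, so every record has the same size in bytes.
    return struct.Struct("<%dd" % (num_features + 1))


def write_records(stream, records, num_features):
    rec = record_struct(num_features)
    for label, features in records:
        stream.write(rec.pack(label, *features))


def read_record(stream, index, num_features):
    # Fixed-width records allow jumping straight to record `index`.
    rec = record_struct(num_features)
    stream.seek(index * rec.size)
    values = rec.unpack(stream.read(rec.size))
    return values[0], list(values[1:])


# Hypothetical data: three labeled vectors with three features each.
data = [
    (0.0, [1.0, 2.0, 3.0]),
    (1.0, [4.0, 5.0, 6.0]),
    (0.0, [7.0, 8.0, 9.0]),
]

buf = io.BytesIO()  # stands in for a file opened in binary mode
write_records(buf, data, 3)

label, features = read_record(buf, 1, 3)
print(label, features)
```

Because every record has the same size, the offset of record i is simply i times the record size, which is what makes random access cheap. A layout like this carries no schema, compression, or metadata, which is part of what formats such as Parquet or HDF5 provide on top.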
