spark-dev mailing list archives

From "Ulanov, Alexander" <alexander.ula...@hp.com>
Subject RE: Storing large data for MLlib machine learning
Date Thu, 26 Mar 2015 21:51:12 GMT
Thanks, Evan. What do you think about Protobuf? Twitter has a library for managing protobuf files
in HDFS: https://github.com/twitter/elephant-bird


From: Evan R. Sparks [mailto:evan.sparks@gmail.com]
Sent: Thursday, March 26, 2015 2:34 PM
To: Stephen Boesch
Cc: Ulanov, Alexander; dev@spark.apache.org
Subject: Re: Storing large data for MLlib machine learning

On binary file formats - I looked at HDF5+Spark a couple of years ago and found it barely
JVM-friendly and very Hadoop-unfriendly (e.g. the APIs needed filenames as input, you couldn't
pass it anything like an InputStream). I don't know if it has gotten any better.

Parquet plays much more nicely, and there are lots of Spark-related projects using it already.
Keep in mind that it's column-oriented, which might impact performance - but basically you're
going to want your features in a byte array, and deserialization should be pretty straightforward.
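
A minimal sketch of the "features in a byte array" idea above, using only Python's standard library. The layout (a little-endian float64 label followed by the float64 features) and the names encode_example/decode_example are assumptions for illustration; the thread does not specify a layout, and this does not touch Parquet itself.

```python
import struct

def encode_example(label, features):
    """Pack a label and a dense feature vector into bytes (little-endian float64)."""
    return struct.pack("<%dd" % (len(features) + 1), label, *features)

def decode_example(blob):
    """Inverse of encode_example: return (label, features)."""
    count = len(blob) // 8  # each float64 occupies 8 bytes
    values = struct.unpack("<%dd" % count, blob)
    return values[0], list(values[1:])

blob = encode_example(1.0, [0.5, -2.0, 3.25])
label, features = decode_example(blob)
```

The resulting bytes value could then be stored as a single binary column per row, with the decoding done after the row is read back.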

On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch <javadba@gmail.com> wrote:
There are some convenience methods you might consider, including:

    MLUtils.loadLibSVMFile

and

    MLUtils.loadLabeledPoints
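
For context, MLUtils.loadLibSVMFile reads the LIBSVM text format, one example per line as "label index1:value1 index2:value2 ...", with one-based indices that are converted to zero-based on load. The sketch below parses a single such line in plain Python to show the format only; parse_libsvm_line is a hypothetical helper, not part of Spark, and it is not how Spark implements the loader.

```python
def parse_libsvm_line(line):
    """Parse one LIBSVM-format line into (label, zero_based_indices, values)."""
    parts = line.split()
    label = float(parts[0])
    indices = []
    values = []
    for item in parts[1:]:
        index, value = item.split(":")
        indices.append(int(index) - 1)  # LIBSVM indices are one-based
        values.append(float(value))
    return label, indices, values

parsed = parse_libsvm_line("1 1:0.5 3:2.0")
```

Note that this is a sparse text format, so for dense vectors it is convenient rather than compact.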

2015-03-26 14:16 GMT-07:00 Ulanov, Alexander <alexander.ulanov@hp.com>:

> Hi,
>
> Could you suggest what would be the reasonable file format to store
> feature vector data for machine learning in Spark MLlib? Are there any best
> practices for Spark?
>
> My data is dense feature vectors with labels. Some of the requirements are
> that the format should be easy to load and serialize, randomly accessible,
> and have a small footprint (binary). I am considering Parquet, HDF5, and
> Protocol Buffers (protobuf), but I have little to no experience with them,
> so any suggestions would be really appreciated.
>
> Best regards, Alexander
>
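
On the "randomly accessible" requirement in the question above: when every dense vector has the same dimension, each binary record has the same length, so record i starts at byte offset i * record_size. The sketch below illustrates that with an in-memory stream. DIM, RECORD, write_records and read_record are all hypothetical names, and the float64 layout is an assumption, not a format proposed in the thread.

```python
import io
import struct

DIM = 3  # assumed fixed feature dimension
RECORD = struct.Struct("<%dd" % (DIM + 1))  # label + DIM features, float64 each

def write_records(stream, examples):
    """Append (label, features) pairs as fixed-width binary records."""
    for label, features in examples:
        stream.write(RECORD.pack(label, *features))

def read_record(stream, i):
    """Seek directly to record i and decode it without scanning earlier records."""
    stream.seek(i * RECORD.size)
    values = RECORD.unpack(stream.read(RECORD.size))
    return values[0], list(values[1:])

buffer = io.BytesIO()
write_records(buffer, [(0.0, [1.0, 2.0, 3.0]), (1.0, [4.0, 5.0, 6.0])])
second = read_record(buffer, 1)
```

The same offset arithmetic applies to a regular file opened in binary mode; whether it carries over to a given container format depends on that format.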
