spark-dev mailing list archives

From "Ulanov, Alexander" <>
Subject RE: Storing large data for MLlib machine learning
Date Wed, 01 Apr 2015 20:41:22 GMT
Jeremy, thanks for the explanation!
What if you used the Parquet file format instead? You could still write a number of small files
as you do now, but you wouldn't have to implement a writer/reader, because readers and writers
for Parquet are already available in various languages.

From: Jeremy Freeman []
Sent: Wednesday, April 01, 2015 1:37 PM
To: Hector Yee
Cc: Ulanov, Alexander; Evan R. Sparks; Stephen Boesch;
Subject: Re: Storing large data for MLlib machine learning

@Alexander, re: using flat binary and metadata, you raise excellent points! At least in our
case, we decided on a specific endianness and do end up storing an extremely minimal specification
in a JSON file, and we have written importers and exporters within our library to parse it. While
it does feel a little like reinvention, it's fast, direct, and scalable, and seems pretty
sensible if you know your data will be dense arrays of numerical features.
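As a rough illustration of the flat-binary-plus-JSON-sidecar layout Jeremy describes (a sketch only: the metadata field names and file extensions here are my assumptions, not his library's actual format):

```python
import json
import struct

def write_dense(path, vectors):
    """Write dense float vectors as flat little-endian float32 binary,
    with a tiny JSON sidecar recording shape, dtype, and endianness."""
    n, d = len(vectors), len(vectors[0])
    with open(path + ".bin", "wb") as f:
        for v in vectors:
            f.write(struct.pack("<%df" % d, *v))  # fixed endianness: little
    with open(path + ".json", "w") as f:
        json.dump({"rows": n, "cols": d, "dtype": "float32", "endian": "little"}, f)

def read_dense(path):
    """Read the sidecar first, then slurp the flat binary back into lists."""
    with open(path + ".json") as f:
        meta = json.load(f)
    d = meta["cols"]
    out = []
    with open(path + ".bin", "rb") as f:
        for _ in range(meta["rows"]):
            out.append(list(struct.unpack("<%df" % d, f.read(4 * d))))
    return out
```

Because the on-disk layout is just contiguous fixed-width records, the data stays randomly accessible: row `i` starts at byte offset `4 * cols * i`.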


On Apr 1, 2015, at 3:52 PM, Hector Yee <<>> wrote:

Just using sc.textFile, then a .map(decode).
Yes, by default it is multiple files .. our training data is 1 TB gzipped
into 5000 shards.
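The load path Hector describes could look roughly like this (a sketch: his records are serialized with Thrift, so json stands in here since his schema isn't shown; in Spark the list comprehension would be sc.textFile(path).map(decode)):

```python
import base64
import json

def decode(line):
    """Reverse one text line of the scheme: base64-decode, then deserialize.
    (json is a stand-in for Hector's Thrift deserialization.)"""
    return json.loads(base64.b64decode(line))

# One fake shard line, built the same way the writer would build it.
lines = [base64.b64encode(
    json.dumps({"label": 1.0, "features": [0.1, 0.2]}).encode()).decode()]

# Locally this is a plain comprehension; on a cluster it would be
# sc.textFile(path).map(decode).
records = [decode(line) for line in lines]
```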

On Wed, Apr 1, 2015 at 12:32 PM, Ulanov, Alexander <<>> wrote:

Thanks, sounds interesting! How do you load the files into Spark? Did you
consider having multiple files instead of file lines?

*From:* Hector Yee []
*Sent:* Wednesday, April 01, 2015 11:36 AM
*To:* Ulanov, Alexander
*Cc:* Evan R. Sparks; Stephen Boesch;<>

*Subject:* Re: Storing large data for MLlib machine learning

I use Thrift and then base64 encode the binary and save it as text file
lines that are snappy or gzip encoded.
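A minimal sketch of the write side of this scheme, assuming json as a stand-in for Hector's Thrift serialization (the helper names are hypothetical; snappy would work in place of gzip but needs a third-party library):

```python
import base64
import gzip
import json

def write_shard(path, records):
    """Serialize each record, base64 it, one record per line,
    then gzip the whole shard."""
    with gzip.open(path, "wt") as f:
        for r in records:
            f.write(base64.b64encode(json.dumps(r).encode()).decode() + "\n")

def read_shard(path):
    """Undo the pipeline: gunzip, split lines, base64-decode, deserialize."""
    with gzip.open(path, "rt") as f:
        return [json.loads(base64.b64decode(line)) for line in f]
```

Keeping one base64 record per line is what makes the format line-oriented, so sc.textFile can split it and small subsets can be extracted with ordinary text tools after gunzip.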

It makes it very easy to copy small chunks locally and play with subsets
of the data, without taking a dependency on HDFS / Hadoop on the serving side.

On Thu, Mar 26, 2015 at 2:51 PM, Ulanov, Alexander <<>> wrote:

Thanks, Evan. What do you think about Protobuf? Twitter has a library to
manage protobuf files in HDFS.

From: Evan R. Sparks []
Sent: Thursday, March 26, 2015 2:34 PM
To: Stephen Boesch
Cc: Ulanov, Alexander;<>
Subject: Re: Storing large data for MLlib machine learning

On binary file formats - I looked at HDF5+Spark a couple of years ago and
found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs
needed filenames as input, you couldn't pass it anything like an
InputStream). I don't know if it has gotten any better.

Parquet plays much more nicely and there are lots of spark-related
projects using it already. Keep in mind that it's column-oriented which
might impact performance - but basically you're going to want your features
in a byte array and deser should be pretty straightforward.
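Evan's "features in a byte array" point can be sketched like this: pack each dense vector into a single bytes value to be stored in one column, and unpack on read (a sketch with an assumed little-endian float32 layout):

```python
import struct

def to_bytes(features):
    """Pack a dense float vector into one byte array (float32, little-endian),
    suitable for storing as a single binary column value."""
    return struct.pack("<%df" % len(features), *features)

def from_bytes(buf):
    """Deserialize: each float32 occupies 4 bytes."""
    return list(struct.unpack("<%df" % (len(buf) // 4), buf))
```

Storing the whole vector as one opaque value sidesteps the per-element overhead a columnar layout would otherwise impose on wide dense feature vectors.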

On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch <<>> wrote:
There are some convenience methods you might consider, including:

and MLUtils.loadLabeledPoint
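Since the MLUtils loaders came up: here is a hedged sketch of writing dense labeled vectors as LibSVM text, the format MLlib's MLUtils.loadLibSVMFile reads (this assumes LibSVM's usual sparse, 1-based index:value layout; the helper name is mine):

```python
def to_libsvm_line(label, features):
    """Format one labeled dense vector as a LibSVM text line.
    LibSVM indices are 1-based, and zero entries are omitted."""
    feats = " ".join("%d:%g" % (i + 1, v)
                     for i, v in enumerate(features) if v != 0.0)
    return "%g %s" % (label, feats)
```

Writing one such line per example yields a file that is human-readable, splittable, and loadable by the standard MLlib utilities, at the cost of a larger footprint than a binary format.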

2015-03-26 14:16 GMT-07:00 Ulanov, Alexander <<>>:


Could you suggest what would be a reasonable file format to store
feature vector data for machine learning in Spark MLlib? Are there any
best practices for Spark?

My data is dense feature vectors with labels. Some of the requirements
are that the format should be easily loaded/serialized, randomly accessible,
and have a small footprint (binary). I am considering Parquet, HDF5, protocol
buffers (protobuf), but I have little to no experience with them, so any
suggestions would be really appreciated.

Best regards, Alexander


Yee Yang Li Hector <>


