spark-issues mailing list archives

From "Theodore Vasiloudis (JIRA)" <>
Subject [jira] [Commented] (SPARK-2394) Make it easier to read LZO-compressed files from EC2 clusters
Date Mon, 23 Mar 2015 11:28:11 GMT


Theodore Vasiloudis commented on SPARK-2394:

Just adding some more info here for people who end up here through searches:

Steps 1-3 can be completed by running this script on each machine in your cluster:

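The script itself did not survive archiving. Below is only a sketch of what such a script needs to do, following steps 1-3 of the issue description; every path, version, and URL here is an assumption, not the original script:

```shell
#!/bin/bash
# Sketch (not the original script): install native LZO, install Maven
# manually, and build hadoop-lzo. Versions and paths are assumptions.
set -e

# Step 1: install the LZO release and headers
yum install -y lzo lzo-devel

# Step 3: install Maven manually (yum cannot install it on the default AMI)
cd /root
wget http://archive.apache.org/dist/maven/maven-3/3.2.5/binaries/apache-maven-3.2.5-bin.tar.gz
tar xzf apache-maven-3.2.5-bin.tar.gz
export PATH=/root/apache-maven-3.2.5/bin:$PATH

# Step 2: download and build hadoop-lzo
git clone https://github.com/twitter/hadoop-lzo.git /root/hadoop-lzo
cd /root/hadoop-lzo
C_INCLUDE_PATH=/usr/include LIBRARY_PATH=/usr/lib64 mvn clean package -DskipTests
```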
There should be an easy way to execute this script when the cluster is being launched; I tried
using the --user-data flag, but that doesn't seem to do it. Otherwise you'd have to rsync
this script to each machine (easy: use ~/spark-ec2/copy-dir after you've copied the file
to your master) and then ssh into each machine and run it (not so easy).
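The manual route might look like the following, run from the master node. The script name /root/install-lzo.sh is hypothetical, and the slaves file path is the spark-ec2 default; both are assumptions:

```shell
# Run from the master of a spark-ec2 cluster.
# /root/install-lzo.sh is a hypothetical name for the install script above.
chmod +x /root/install-lzo.sh
~/spark-ec2/copy-dir /root/install-lzo.sh    # rsync it to every slave

# Run it on every slave in parallel, then on the master itself
for host in $(cat /root/spark-ec2/slaves); do
  ssh -o StrictHostKeyChecking=no "$host" /root/install-lzo.sh &
done
wait
/root/install-lzo.sh
```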

For Step 4, make sure that core-site.xml is changed in both the hadoop config and the
spark-conf/ directory, as suggested in the hadoop-lzo docs.

Note that there seems to be a bug in /path/to/hadoop/bin/hadoop; comment out the line:



Here's how I set the vars in

export SPARK_SUBMIT_LIBRARY_PATH="$SPARK_SUBMIT_LIBRARY_PATH:/root/persistent-hdfs/lib/native/:/root/hadoop-native:/root/hadoop-lzo/target/native/Linux-amd64-64/lib:/usr/lib64/"
export SPARK_SUBMIT_CLASSPATH="$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH:/root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar"

And here's what I added to both copies of core-site.xml:



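The XML itself was stripped by the archiver; the standard hadoop-lzo codec properties (as documented in the hadoop-lzo README) look like this, though the exact codec list in the original comment may have differed:

```xml
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```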
As for the code (Step 5) itself, I've tried the different variations suggested in the mailing
list and other places and ended up using the following:

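The original snippet was lost in archiving. A sketch of the sequenceFile approach it describes is below; the S3 path, the key/value types, and the core count are assumptions based on the AWS public-dataset format for the ngrams, not the original code:

```scala
import org.apache.hadoop.io.{LongWritable, Text}

// Sketch, not the original snippet: read the LZO-compressed SequenceFiles
// of the Google Ngrams public data set. Path and writable types are
// assumptions; adjust them for the subset you want.
val numCores = 16  // total cores in your cluster (hypothetical value)
val ngrams = sc.sequenceFile(
  "s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
  classOf[LongWritable],
  classOf[Text],
  minPartitions = 3 * numCores)

println(ngrams.count())
println(ngrams.first())
```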
Note that this uses the sequenceFile reader, specifically for the Google Ngrams. Setting
minPartitions is important in order to get good parallelism for whatever you do with the
data later on (3 * the number of cores in your cluster seems like a good value).

You can run the above job using:

./bin/spark-submit --jars local:/root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar --class --master $SPARK_MASTER $SPARK_JAR dummy_arg

You should of course set the env variables for your Spark master and the location of your fat
jar. Note that I'm passing the hadoop-lzo jar as local:, which assumes that every node has
built the jar; that is done by the script given above.

Do the above and you should get the count and the first line of the data when running the job.

> Make it easier to read LZO-compressed files from EC2 clusters
> -------------------------------------------------------------
>                 Key: SPARK-2394
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: EC2, Input/Output
>    Affects Versions: 1.0.0
>            Reporter: Nicholas Chammas
>            Priority: Minor
>              Labels: compression
> Amazon hosts [a large Google n-grams data set on S3|].
This data set is perfect, among other things, for putting together interesting and easily
reproducible public demos of Spark's capabilities.
> The problem is that the data set is compressed using LZO, and it is currently more painful
than it should be to get your average {{spark-ec2}} cluster to read input compressed in this format.
> This is what one has to go through to get a Spark cluster created with {{spark-ec2}}
to read LZO-compressed files:
> # Install the latest LZO release, perhaps via {{yum}}.
> # Download [{{hadoop-lzo}}|] and build it. To build
{{hadoop-lzo}} you need Maven. 
> # Install Maven. For some reason, [you cannot install Maven with {{yum}}|],
so install it manually.
> # Update your {{core-site.xml}} and {{}} with [the appropriate configs|].
> # Make [the appropriate calls|]
to {{sc.newAPIHadoopFile}}.
> This seems like a bit too much work for what we're trying to accomplish.
> If we expect this to be a common pattern -- reading LZO-compressed files from a {{spark-ec2}}
cluster -- it would be great if we could somehow make this less painful.

This message was sent by Atlassian JIRA
