spark-user mailing list archives

From Stuart Layton <stuart.lay...@gmail.com>
Subject What are the best options for quickly filtering a DataFrame on a single column?
Date Wed, 25 Mar 2015 14:41:55 GMT
I have a SparkSQL dataframe with a few billion rows that I need to
quickly filter down to a few hundred thousand rows, using an operation like
(syntax may not be correct)

df = df.filter(df.key_col.isin(approved_keys))

I was thinking about serializing the data using parquet and saving it to
S3, however as I want to optimize for filtering speed I'm not sure this is
the best option.
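
For concreteness, here is a rough sketch of the filter and the Parquet read described above. It assumes a recent PySpark where `Column.isin` and `spark.read.parquet` are available (the Spark release current at the time of this message may spell these differently). The helper names, the `spark` session argument, and the `path` argument are hypothetical and only there for illustration.

```python
def filter_by_keys(df, approved_keys, key_col="key_col"):
    """Keep only the rows whose key_col value appears in approved_keys."""
    # DataFrame.filter expects a Column expression (or SQL string), not a
    # Python lambda. Column.isin builds an IN predicate that Spark evaluates
    # itself, so the rows never pass through Python.
    return df.filter(df[key_col].isin(list(approved_keys)))


def load_and_filter(spark, path, approved_keys, key_col="key_col"):
    """Read a Parquet dataset from path (hypothetical) and filter it by key."""
    # spark is assumed to be an existing SparkSession; path could point at
    # local storage or an object store, depending on the deployment.
    df = spark.read.parquet(path)
    return filter_by_keys(df, approved_keys, key_col)
```

With an existing DataFrame `df`, `filter_by_keys(df, approved_keys)` returns a new DataFrame holding only the approved rows. Whether Parquet on S3 makes this fast enough for a few billion rows is exactly the open question here, so treat this as the shape of the operation rather than a recommendation.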

-- 
Stuart Layton
