spark-user mailing list archives

From Brian Wylie <>
Subject pyspark histogram
Date Wed, 27 Sep 2017 15:50:26 GMT
Hi All,

My Google/SO searching is somehow failing on this: I simply want to compute
histograms for a column in a Spark DataFrame.

There are two SO hits on this question:

I've actually submitted an answer to the second one, but I realize that my
answer is wrong: even though this code looks fine/simple, in my testing the
flatMap is 'bad' because it really slows things down by pulling each value
into Python.

import matplotlib.pyplot as plt

# Show histogram of the 'C1' column (buckets come back as 21 edges for 20 bins)
bins, counts ='C1').rdd.flatMap(lambda x: x).histogram(20)
# This is a bit awkward, but I believe this is the correct way to do it
plt.hist(bins[:-1], bins=bins, weights=counts)

I've looked at QuantileDiscretizer..

from import QuantileDiscretizer

discretizer = QuantileDiscretizer(numBuckets=20, inputCol='query_length',
                                  outputCol='query_length_bin')
result =

but I feel like this might be the wrong path... so the general question is:
what's the best way to compute histograms in PySpark on columns that have a
large number of rows?
