spark-user mailing list archives

From Brian Wylie <briford.wy...@gmail.com>
Subject plotting/resampling timeseries data
Date Thu, 21 Sep 2017 21:19:29 GMT
What I want to do:
- I have a dataframe with a timestamp column and a 'bytes' column. I want
to sum() the bytes per time interval and make a temporal plot.

Example code that shows the desired output:
- Here we sample a spark_df and convert it to a pandas_df, just for
demonstration.

time_df = spark_df['ts', 'orig_bytes']
time_df.printSchema()

root
 |-- ts: timestamp (nullable = true)
 |-- orig_bytes: long (nullable = true)


time_df.count()
4863740

pandas_df = time_df.sample(False, 0.1).toPandas()
len(pandas_df)

486469


pandas_df.set_index('ts', inplace=True) # Set timestamp as index

pandas_df['orig_bytes'].resample('1Min').sum().plot()
<nice bytes/min over time plot... see attached screenshot>

Okay, so this example was probably overkill, but I'm not finding a way to do
the same thing in Spark. I'm sure it's easy and I'm just missing it. I've
looked at the spark-ts package (see links below), but it seems like a fairly
big jump in complexity. It's just one line in Pandas, so I'm hoping it's
relatively simple in Spark too.
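
To spell out what that one Pandas line computes, here is a minimal plain-Python sketch of the same idea: truncate each timestamp to the start of its minute, then sum the bytes per bucket. The function name and the sample rows are made up for illustration, and unlike resample() this sketch emits nothing for minutes that have no data.

```python
from collections import defaultdict
from datetime import datetime


def sum_bytes_per_minute(rows):
    """Sum byte counts into 1-minute buckets keyed by the start of the minute."""
    totals = defaultdict(int)
    for ts, nbytes in rows:
        # Truncate the timestamp to the start of its minute
        bucket = ts.replace(second=0, microsecond=0)
        totals[bucket] += nbytes
    # Return the buckets in time order
    return dict(sorted(totals.items()))


# Made-up rows standing in for (ts, orig_bytes) pairs
rows = [
    (datetime(2017, 9, 21, 21, 19, 5), 100),
    (datetime(2017, 9, 21, 21, 19, 50), 250),
    (datetime(2017, 9, 21, 21, 20, 10), 40),
]
per_minute = sum_bytes_per_minute(rows)
```

The shape is a group-by on a truncated timestamp followed by a sum, which is the aggregation I'd like to express on the Spark side.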

I've already computed and plotted histograms in Spark, and that was pretty
easy, basically two lines:

# Show histogram of the Spark DF query lengths
bins, counts = spark_df.select('query_length').rdd.flatMap(lambda x: x).histogram(20)

# This is a bit awkward but I believe this is the correct way to do it
plt.hist(bins[:-1], bins=bins, weights=counts)
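
As a sanity check on why that call reproduces the precomputed counts, here is a small plain-Python sketch. It assumes the usual binning rule (each bin is half-open, [left, right), except the last, which also includes its right edge), so every left edge lands in its own bin and the weights carry the counts through unchanged. The weighted_hist helper and the sample edges and counts are made up for illustration.

```python
from bisect import bisect_right


def weighted_hist(values, edges, weights):
    """Sum weights per bin; bins are [edges[i], edges[i+1]), last bin closed.

    Assumes every value lies within [edges[0], edges[-1]].
    """
    totals = [0] * (len(edges) - 1)
    for v, w in zip(values, weights):
        # Index of the bin whose left edge is the largest one <= v,
        # clamped so that v == edges[-1] falls in the last bin
        i = min(bisect_right(edges, v) - 1, len(totals) - 1)
        totals[i] += w
    return totals


# Made-up histogram output: 4 edges define 3 bins
edges = [0.0, 10.0, 20.0, 30.0]
counts = [5, 2, 7]

# Feeding the left edges back in, weighted by the counts, rebuilds the counts
rebuilt = weighted_hist(edges[:-1], edges, counts)
```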

Resources looked at so far:
- https://www.youtube.com/watch?v=tKkneWcAIqU
- https://www.slideshare.net/ilganeli/frustrationreduced-spark-dataframes-and-the-spark-timeseries-library

Any pointers/suggestions are greatly appreciated.
