spark-issues mailing list archives

From "Jeremy Freeman (JIRA)" <>
Subject [jira] [Commented] (SPARK-4727) Add "dimensional" RDDs (time series, spatial)
Date Thu, 04 Dec 2014 15:22:12 GMT


Jeremy Freeman commented on SPARK-4727:

Great to brainstorm about this, RJ!

To some extent, we've been doing this over on the [Thunder|]
project. In particular, check out the {{TimeSeries}} and {{Images}} classes [here|],
which are essentially wrappers for specialized RDDs. Our basic abstraction is RDDs of ndarrays
(1D for time series, 2D or 3D for images/volumes), with metadata (lazily propagated) for things
like dimensionality and time base, coordinates embedded in keys, and useful methods on these
objects like the ones you mention (e.g. filtering, Fourier transforms, cross-correlation). We've also
worked on transformations between representations, for the common case of sequences of images
corresponding to different time points.
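
To make the abstraction concrete, here is a minimal local sketch of the two representations and the transformation between them. It is not Thunder's API: plain Python lists stand in for RDDs and ndarrays, and the names {{images_to_series}} and {{correlate}} are hypothetical, chosen only for illustration.

```python
# Local sketch only: a list of (key, value) pairs stands in for a keyed RDD,
# and nested lists stand in for ndarrays. All names are hypothetical.

def images_to_series(images):
    """Turn a time-ordered list of 2D images into (coordinate key, time series) records."""
    n_rows = len(images[0])
    n_cols = len(images[0][0])
    records = []
    for r in range(n_rows):
        for c in range(n_cols):
            # The pixel coordinate becomes the key; its values over time become the series.
            series = [img[r][c] for img in images]
            records.append(((r, c), series))
    return records


def correlate(series, reference):
    """Pearson correlation (zero lag) between a series and an equal-length reference."""
    n = len(series)
    mean_s = sum(series) / n
    mean_r = sum(reference) / n
    cov = sum((s - mean_s) * (r - mean_r) for s, r in zip(series, reference))
    var_s = sum((s - mean_s) ** 2 for s in series)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_s == 0 or var_r == 0:
        return 0.0  # correlation is undefined for a constant signal; report 0 here
    return cov / (var_s * var_r) ** 0.5


# Three 2x2 images taken at successive time points.
images = [
    [[1, 5], [0, 2]],
    [[2, 5], [0, 1]],
    [[3, 5], [0, 0]],
]
records = images_to_series(images)
reference = [1, 2, 3]

# With a real keyed RDD this step would be a per-value map over the records.
scores = {key: correlate(series, reference) for key, series in records}
```

Keeping the coordinates in the key means each record can be processed independently, which is what lets per-series methods run as simple maps.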

We haven't worked on custom partition strategies yet; I think that will be most important
for image tiles drawn from a much larger image. There's cool work ongoing for that in GeoTrellis,
see the [repo|] and a [talk|]
from Rob.
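
As a rough illustration of what a tile-based strategy could look like, the sketch below assigns each pixel to a tile by integer division of its coordinates and groups pixels by that tile key. This is an assumption about one possible approach, not something Thunder or GeoTrellis is known to implement this way; {{tile_key}} and {{group_into_tiles}} are hypothetical names, and a plain list stands in for an RDD.

```python
from collections import defaultdict


def tile_key(row, col, tile_rows, tile_cols):
    """Map a pixel coordinate to the index of the tile that contains it."""
    return (row // tile_rows, col // tile_cols)


def group_into_tiles(pixels, tile_rows, tile_cols):
    """Group ((row, col), value) records by tile so nearby pixels stay together."""
    tiles = defaultdict(list)
    for (row, col), value in pixels:
        tiles[tile_key(row, col, tile_rows, tile_cols)].append(((row, col), value))
    return dict(tiles)


# A 4x4 image flattened into keyed records, split into 2x2 tiles.
pixels = [((r, c), r * 4 + c) for r in range(4) for c in range(4)]
tiles = group_into_tiles(pixels, tile_rows=2, tile_cols=2)
```

In Spark itself, a key like this could be fed to a custom partitioner so that all pixels of one tile land in the same partition; the grouping above only mimics that locally.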

FWIW, when we started it seemed more appropriate to build this into a specialized library,
rather than Spark core. It's also something that benefits from using Python, due to a bevy
of existing libraries for temporal and image data (though there are certainly analogs in
Java/Scala). But it would be great to probe the community for general interest in these kinds
of abstractions and methods.

> Add "dimensional" RDDs (time series, spatial)
> ---------------------------------------------
>                 Key: SPARK-4727
>                 URL:
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: Spark Core
>    Affects Versions: 1.1.0
>            Reporter: RJ Nowling
> Certain types of data (time series, spatial) can benefit from specialized RDDs.  I'd
> like to open a discussion about this.
> For example, time series data should be ordered by time and would benefit from operations such as:
> * Subsampling (taking every n data points)
> * Signal processing (correlations, FFTs, filtering)
> * Windowing functions
> Spatial data benefits from ordering and partitioning along a 2D or 3D grid.  For example,
> path finding algorithms can be optimized by only comparing points within a set distance, which
> can be computed more efficiently by partitioning data into a grid.
> Although the operations on time series and spatial data may be different, there is some
> commonality in the sense of the data having ordered dimensions and the implementations may
