spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrew Duffy <>
Subject Re: Leveraging S3 select
Date Fri, 08 Dec 2017 17:05:58 GMT
Hey Steve,

Happen to have a link to the TPC-DS benchmark data w/random S3 reads? I've done a decent amount
of digging, but all I've found is a reference in a slide deck and some jira tickets.

From: Steve Loughran <>
Date: Tuesday, December 5, 2017 at 09:44
To: "Lalwani, Jayesh" <>
Cc: Apache Spark Dev <>
Subject: Re: Leveraging S3 select

On 29 Nov 2017, at 21:45, Lalwani, Jayesh <<>>

AWS announced at re:Invent that they are launching S3 Select. This can allow Spark to push
down predicates to S3, rather than read the entire file in memory. Are there any plans to
update Spark to use S3 Select?

  1.  ORC and Parquet don't read the whole file in memory anyway, except in the special case
that the file is gzipped
  2.  Hadoop's s3a <= 2.7 doesn't handle the aggressive seeks of those columnar formats
that well, as it does a GET pos-EOF & has to abort the TCP connection if the seek is backwards
  3.  Hadoop 2.8+ with spark.hadoop.fs.s3a.experimental.fadvise=random switches to random
IO and only does smaller GET reads of the data requested (actually min(min-read-length, buffer-size).
This delivers ~3x performance boost in TCP-DS benchmarks

I don't yet know how much more efficient the new mechanism will be against columnar data,
given those facts. You'd need to do experiments

The place to implement this would be though predicate push down from the file format to the
FS. ORC & Parquet support predicate pushdown, so they'd need to recognise when the underlying
store could do some of the work for them, open the store input stream differently, and use
a whole new (undefined?) API to the queries. Most likely: s3a would add a way to specify a
predicate to select on in open(), as well as the expected file type. This would need the underlying
mechanism to also support those formats though, which the announcement doesn't/

Someone could do something more immediately though some modified CSV data source which did
the pushdown. However, If you are using CSV for your datasets, there's something fundamental
w.r.t your data storage policy you need to look at. It works sometimes as an exchange format,
though I prefer Avro there due to its schemas and support for more complex structures.  As
a format you run queries over? No.
View raw message