jclouds-user mailing list archives

From Andrew Gaul <g...@apache.org>
Subject Re: aws-s3 etag when using multipart
Date Tue, 29 Sep 2015 19:01:11 GMT
S3 emits different ETags for single- and multi-part uploads.  You can
use both types of ETags for future conditional GET and PUT operations,
but only a single-part upload returns an MD5 hash of the content.  A
multi-part upload returns an opaque token which is likely a hash of the
per-part hashes combined with the number of parts.
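
To make the "hash of hashes" guess concrete, here is a minimal sketch of
the commonly observed scheme: MD5 each part, MD5 the concatenation of
those raw digests, then append "-" and the part count.  S3 does not
document this as a contract, so treat it as an assumption; the class and
method names below are hypothetical and are not part of jclouds.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class MultipartEtagSketch {
    // MD5 of data[off, off + len) as raw bytes.
    static byte[] md5(byte[] data, int off, int len) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(data, off, len);
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Lowercase hex encoding, the form used in ETags.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Single-part ETag: plain MD5 of the whole payload.
    static String singlePartEtag(byte[] data) {
        return hex(md5(data, 0, data.length));
    }

    // Assumed multi-part ETag: MD5 over the concatenated raw part
    // digests, followed by "-" and the number of parts.
    static String multipartEtag(byte[] data, int partSize) {
        if (partSize <= 0 || data.length == 0) {
            throw new IllegalArgumentException("need data and a positive part size");
        }
        int parts = (data.length + partSize - 1) / partSize;
        byte[] joined = new byte[parts * 16]; // an MD5 digest is 16 bytes
        for (int i = 0; i < parts; i++) {
            int off = i * partSize;
            int len = Math.min(partSize, data.length - off);
            System.arraycopy(md5(data, off, len), 0, joined, i * 16, 16);
        }
        return hex(md5(joined, 0, joined.length)) + "-" + parts;
    }

    public static void main(String[] args) {
        byte[] data = "0123456789".getBytes(StandardCharsets.US_ASCII);
        System.out.println(singlePartEtag(data));
        System.out.println(multipartEtag(data, 4)); // three parts: 4 + 4 + 2 bytes
    }
}
```

If the assumption holds, a client that knows the part size used for the
upload can recompute the same token locally, but it cannot be compared
against a plain MD5 of the whole payload.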

You can ensure data integrity in transit by comparing the ETag or by
providing a Content-MD5 for single-part uploads.  Multi-part is more
complicated; each upload part call can have a Content-MD5 and each call
returns the MD5 hash.  jclouds supplies the per-part ETag hashes to the
final complete multi-part upload call but does not provide a way to
check the results of per-part calls or a way to supply a Content-MD5 for
each.
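
As a rough illustration of the two per-part checks mentioned above, the
sketch below builds a Content-MD5 header value (base64 of the raw MD5
digest) for a part and compares a returned per-part ETag (hex MD5,
usually wrapped in double quotes) against a locally computed hash.  The
class and method names are hypothetical and not part of jclouds, and it
assumes the per-part ETag really is a plain MD5, which may not hold for
every provider or encryption mode.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.Locale;

class PartIntegritySketch {
    // Raw MD5 digest of one upload part.
    static byte[] md5(byte[] part) {
        try {
            return MessageDigest.getInstance("MD5").digest(part);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Content-MD5 header value: base64 of the raw digest, not hex.
    static String contentMd5Header(byte[] part) {
        return Base64.getEncoder().encodeToString(md5(part));
    }

    // Lowercase hex encoding, the form used in ETags.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Compare the ETag returned by an upload-part call with the local
    // MD5, ignoring surrounding double quotes and letter case.
    static boolean partEtagMatches(byte[] part, String returnedEtag) {
        String normalized = returnedEtag.replace("\"", "").toLowerCase(Locale.ROOT);
        return normalized.equals(hex(md5(part)));
    }
}
```

Sending the header lets the server reject a corrupted part, while the
ETag comparison lets the client detect the same problem after the call
returns; either one alone covers the in-transit case.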

Fixing this requires calculating the MD5 in
BaseBlobStore.putMultipartBlob.  We could either calculate it beforehand
for repeatable Payloads or compare afterwards for InputStream payloads.
There is some subtlety to this for providers like Azure which do not
return an MD5 ETag.  We would likely want to guard this with a property
since not every caller wants to pay the CPU overhead.  Would you like to
take a look at this?

If you want a purely application-level fix, look at calling the BlobStore
methods initiateMultipartUpload, uploadMultipartPart, and
completeMultipartUpload.  jclouds internally uses these to implement
putBlob(container, blob, PutOptions.Builder.multipart()).

On Tue, Sep 22, 2015 at 05:10:18PM +0200, Veit Guna wrote:
> Hi.
>  
> We're using jclouds 1.9.1 with the aws-s3 provider. Until now, we have
> used the returned etag of blobStore.putBlob() to manually verify
> against a client provided hash. That worked quite well for us. But
> since we are hitting the 5GB limit of S3, we switched to the
> multipart() upload that jclouds offers. But now, putBlob() returns
> something like <md5-hash>-<number>, e.g.
> 90644a2d0c7b74483f8d2036f3e29fc5-2, that of course fails with our
> validation.
>  
> I guess this is due to the fact that each chunk is hashed separately
> and sent to S3. So there is no complete hash over the whole payload
> that could be returned by putBlob() - is that correct?
>  
> During my research I stumbled across this:
>  
> https://github.com/jclouds/jclouds/commit/f2d897d9774c2c0225c199c7f2f46971637327d6
>  
> Now I'm wondering what the contract of putBlob() is. Should it only
> return valid etag/hashes, otherwise return null?
>  
> I'm asking that because otherwise I would have to start parsing and
> validating the returned value by myself and skip any validation when
> it isn't a normal md5 hash. My guess is that this is the hash from the
> last transferred chunk plus the chunk number?
>  
> Maybe someone can shed some light on this :).
>  
> Thanks
> Veit
>  

-- 
Andrew Gaul
http://gaul.org/
