libcloud-users mailing list archives

From Chris Richards <ch...@infiniteio.com>
Subject Re: How to shorten file download time?
Date Tue, 02 Sep 2014 19:20:33 GMT
Thanks. I am doing something similar to #1, but since I'm using Python 3,
I'm using concurrent.futures. I can't say how it compares to gevent, but it
scales out to at least 20 threads in my limited testing (meaning, the time
to download 20 files is the same as the time to download 1).
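
In rough outline, the concurrent.futures side looks something like this
(driver, container_name, and object_names are placeholders from my own
setup; depending on the driver, you may want one connection per thread):

    import os
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def download_one(container, name, dest_dir):
        # One get_object() + download() per object; each call runs in its
        # own worker thread.
        obj = container.get_object(name)
        obj.download(os.path.join(dest_dir, name), overwrite_existing=True)
        return name, obj.size

    container = driver.get_container(container_name)
    with ThreadPoolExecutor(max_workers=20) as pool:
        futures = [pool.submit(download_one, container, name, '/tmp/dl')
                   for name in object_names]
        for future in as_completed(futures):
            name, size = future.result()
            print("Downloaded %s (%d bytes)" % (name, size))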

I explored #2 with success, but I'm unsure about all those 'None' params
I'm passing in. :)  I did figure out that I have to store off the obj.size
for the download to complete (it is checked at the end of _save_object()). My
simple code (for others who may googly-search for similar questions) is:

    print ("Timing via direct object download")
    with timer.Timer(verbose=True, name='obj.download()'):
        size = data.size  # Need the object size; we would store this.

        # Object (name, size, hash, extra, meta_data, container, driver)
        con = Container (container_name, extra=None, driver=driver)
        obj = Object (object_name, size=size, hash=None, extra=None,
meta_data=None, container=con, driver=driver)
        ok = obj.download (target, overwrite_existing=True)
    print ("Complete [ok=%s], downloaded %s\n" % (ok, b2h (obj.size)))


Does this pretty much cover it? In your experience, is the extra data
filled in by container.get_object() not needed?


Cheers,

Chris



On Tue, Sep 2, 2014 at 12:29 PM, Tomaz Muraus <tomaz@apache.org> wrote:

> It depends on your use-case, but in general:
>
> 1. Downloading multiple files
>
> If you want to download multiple files / objects, you can parallelize this
> process. You can either do this by downloading each object in a separate
> thread or process and / or by utilizing a thread or process pool.
>
> If you want to speed things up and reduce thread / process overhead, you
> should also have a look at gevent (http://www.gevent.org/).
>
> That's the approach I use in file_syncer where a common case is that
> multiple independent operations are performed in parallel (downloading /
> uploading files) -
>
> https://github.com/Kami/python-file-syncer/blob/master/file_syncer/syncer.py#L143
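>
> A minimal sketch of that pattern (assuming gevent is installed; container
> and object_names would come from your own code):
>
>     import gevent
>     from gevent import monkey
>     monkey.patch_all()  # patch sockets first so HTTP calls yield to other greenlets
>
>     def fetch(container, name):
>         obj = container.get_object(name)
>         obj.download('/tmp/' + name, overwrite_existing=True)
>
>     jobs = [gevent.spawn(fetch, container, name) for name in object_names]
>     gevent.joinall(jobs)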
>
> 2. Downloading a single file / container and object ID is known in advance
>
> If you know the container and object ID in advance, you can avoid 2 HTTP
> requests (get_container, get_object) by manually instantiating Container
> and Object class with the known IDs. There are some examples of how to do
> that at
> https://libcloud.readthedocs.org/en/latest/other/working-with-oo-apis.html
>
> In this case, using gevent wouldn't really speed things up much since you
> are only issuing one HTTP request (unless an object is composed of multiple
> chunks and the provider allows you to retrieve chunks independently...).
>
>
> On Tue, Sep 2, 2014 at 5:30 PM, Chris Richards <chris@infiniteio.com>
> wrote:
>
> > Howdy. I've noticed a variance in the download time of a file depending
> > on the method of download, and I'm hoping to shave off overhead. I'm
> > using the standard S3 provider. The stats I present are consistent
> > between my office and my home within +/-100 ms.  In shortened form:
> >
> > -
> > Timing via driver.get_container().get_object().download()
> > get_container: 431.4596652984619 ms
> > get_object: 808.0205917358398 ms
> > download: 8257.043838500977 ms
> > Complete, downloaded 8.15 MB
> >
> > Timing via driver.get_object().download()
> > get_object: 811.8221759796143 ms
> > download: 4801.661729812622 ms
> > Complete, downloaded 8.15 MB
> > -
> >
> >
> > In the first case, it appears that getting the container has significant
> > overhead and should be avoided if possible--which I can do--and trims
> > 400-500 ms per download (for small files, this is significant). Is my
> > observation and conclusion correct?
> >
> > In the second case, what I want to examine is the .get_object() requirement
> > to download a file. This adds another significant overhead on the order of
> > 700-900 ms. Is there a way to bypass this?  I have many small files where
> > the .get_object() time exceeds that of the .download() time!
> >
> > import std.newbie.disclaimer
> >
> > Thanks!
> > Chris
> >
>
