Concurrent downloads with Python using asyncio or thread pools


When downloading a large number of files with Python, you are I/O bound. A vanilla implementation with requests like the one below would yield sequential, blocking calls with files downloaded one at a time.

import time
import requests

start = time.perf_counter()
urls = [
    "https://i.imgur.com/AD3MbBi.jpeg", 
    "https://i.imgur.com/zYhkOrM.jpeg",
    "https://i.imgur.com/LRoLTlK.jpeg",
    "https://i.imgur.com/gtWsPu9.jpeg",
    "https://i.imgur.com/jDimNTZ.jpeg",
]
for url in urls:
    print(f"Downloading {url}")
    resp = requests.get(url)
    with open(url.split("/")[-1], 'wb') as f:
        f.write(resp.content)
        print(f"Done downloading {url}")

print(f"Total time: {time.perf_counter() - start}")

In the snippet above, when requests.get(url) is called, the calling thread blocks and the CPU sits mostly idle while it waits for the server's response. Once the response is received, the script saves the file and moves on to the next URL, as the output below shows.

Downloading https://i.imgur.com/AD3MbBi.jpeg
Done downloading https://i.imgur.com/AD3MbBi.jpeg
Downloading https://i.imgur.com/zYhkOrM.jpeg
Done downloading https://i.imgur.com/zYhkOrM.jpeg
Downloading https://i.imgur.com/LRoLTlK.jpeg
Done downloading https://i.imgur.com/LRoLTlK.jpeg
Downloading https://i.imgur.com/gtWsPu9.jpeg
Done downloading https://i.imgur.com/gtWsPu9.jpeg
Downloading https://i.imgur.com/jDimNTZ.jpeg
Done downloading https://i.imgur.com/jDimNTZ.jpeg
Total time: 1.2269745559970033

The core idea to speed this up is to fire the next request without waiting for the previous response, which pushes data throughput closer to server or channel capacity. In Python, this can be done in a few ways and I'll cover two popular ones: process/thread pools and asyncio. Also, while the examples here are about downloads, the same techniques apply to any I/O bound task.

1. Process/Thread pools

One familiar approach is to create multiple processes/threads and fire requests concurrently. While you can do this with the multiprocessing or threading modules directly, you should probably use the concurrent.futures module instead since it provides a nicer interface. Here's a simple example using ThreadPoolExecutor:

import time
from concurrent.futures import ThreadPoolExecutor
import requests

start = time.perf_counter()

urls = [
    "https://i.imgur.com/AD3MbBi.jpeg", 
    "https://i.imgur.com/zYhkOrM.jpeg",
    "https://i.imgur.com/LRoLTlK.jpeg",
    "https://i.imgur.com/gtWsPu9.jpeg",
    "https://i.imgur.com/jDimNTZ.jpeg",
]
def download_image(url):
    print(f"Downloading {url}")
    resp = requests.get(url)
    with open(url.split("/")[-1], 'wb') as f:
        f.write(resp.content)
    print(f"Done downloading {url}")

with ThreadPoolExecutor(max_workers=5) as executor:
    executor.map(download_image, urls)

print(f"Total time: {time.perf_counter() - start}")

Here, a pool of up to 5 threads is created and each thread picks up a URL to download. The requests are now in flight concurrently, and the total time is much lower than in the sequential version. Process pools (ProcessPoolExecutor) work in a similar way: they are heavier to start and have to pickle the data passed between processes, but each process has its own interpreter and GIL, so they offer more isolation and can run CPU-bound work on several cores.

Downloading https://i.imgur.com/AD3MbBi.jpeg
Downloading https://i.imgur.com/zYhkOrM.jpeg
Downloading https://i.imgur.com/LRoLTlK.jpeg
Downloading https://i.imgur.com/gtWsPu9.jpeg
Downloading https://i.imgur.com/jDimNTZ.jpeg
Done downloading https://i.imgur.com/AD3MbBi.jpeg
Done downloading https://i.imgur.com/zYhkOrM.jpeg
Done downloading https://i.imgur.com/gtWsPu9.jpeg
Done downloading https://i.imgur.com/jDimNTZ.jpeg
Done downloading https://i.imgur.com/LRoLTlK.jpeg
Total time: 0.4605532810019213
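
One thing to watch in the thread pool example: the iterator returned by executor.map is never consumed, so an exception raised inside download_image would pass silently. Below is a minimal sketch of an alternative that uses submit() and as_completed() to collect results and errors per URL. To keep it runnable offline, fake_download is a hypothetical stand-in for the real requests.get call (it just sleeps), and the example.com URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(url, delay=0.1):
    """Hypothetical stand-in for requests.get(url): sleeps to mimic latency."""
    if "bad" in url:
        raise ValueError(f"cannot fetch {url}")
    time.sleep(delay)
    return len(url)  # pretend this is the number of bytes written


def download_all(urls, max_workers=5):
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # submit() returns one Future per URL; keep a map back to the URL.
        futures = {executor.submit(fake_download, url): url for url in urls}
        # as_completed() yields futures in the order they finish.
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()  # re-raises worker exceptions
            except Exception as exc:
                errors[url] = exc
    return results, errors


urls = [f"https://example.com/{i}.jpeg" for i in range(5)]
urls.append("https://example.com/bad.jpeg")  # placeholder that always fails

results, errors = download_all(urls)
print(f"{len(results)} succeeded, {len(errors)} failed")
```

In a real script you would swap fake_download for the download_image function above and decide what to do with the entries in errors (retry, log, or abort).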

When to use it:

  • You don't want to deal with async code and prefer a more familiar interface.
  • You want to process files after downloading them and have multiple cores available (a process pool can spread CPU-bound processing across cores, which asyncio alone cannot do).

Gotchas:

  • You might be limited by the number of processes/threads you can create on your system.
  • If you're using some sort of shared state, this would require some manual synchronization.
  • If you're using a large number of processes/threads, you might run into connection limits & memory issues.
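
To make the shared-state gotcha concrete, here is a small sketch of manual synchronization with threading.Lock. The DownloadStats class and the fake_download function are hypothetical names made up for this illustration; the worker only records a byte count instead of hitting the network.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class DownloadStats:
    """Hypothetical shared counter updated by every worker thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self.files = 0
        self.total_bytes = 0

    def record(self, n_bytes):
        # "+=" on shared attributes is not guaranteed to be atomic,
        # so both updates happen while holding the lock.
        with self._lock:
            self.files += 1
            self.total_bytes += n_bytes


stats = DownloadStats()


def fake_download(size):
    # Stand-in for a real download: only reports how many bytes it "wrote".
    stats.record(size)


sizes = [1024] * 1000
with ThreadPoolExecutor(max_workers=8) as executor:
    # list() consumes the iterator so worker exceptions would surface here.
    list(executor.map(fake_download, sizes))

print(stats.files, stats.total_bytes)
```

With process pools this approach does not carry over directly, since each process gets its own copy of stats; returning values from the worker and aggregating them in the parent is usually simpler there.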

2. Asyncio

Another way to achieve this is to use asyncio, a library for writing concurrent code that ships with Python's standard library (3.4+; the async/await syntax and asyncio.run() used below need Python 3.7+). This lets you side-step threads if you don't want to deal with them. Note that you can use asyncio within processes/threads as well, which might be useful in some cases. Also, the third-party requests library has no async support, so you'll need to use aiohttp instead if you want pure async, or httpx if you want a client that offers both sync and async APIs.

Here's a simple example using asyncio:

import time
import asyncio
import aiohttp

start = time.perf_counter()

urls = [
    "https://i.imgur.com/AD3MbBi.jpeg", 
    "https://i.imgur.com/zYhkOrM.jpeg",
    "https://i.imgur.com/LRoLTlK.jpeg",
    "https://i.imgur.com/gtWsPu9.jpeg",
    "https://i.imgur.com/jDimNTZ.jpeg",
]

async def download_image(session, url):
    print(f"Downloading {url}")
    async with session.get(url) as resp:
        data = await resp.read()
    with open(url.split("/")[-1], 'wb') as f:
        f.write(data)
    print(f"Done downloading {url}")

async def main():
    # Share one ClientSession so connections are pooled across requests
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*[download_image(session, url) for url in urls])

asyncio.run(main())

print(f"Total time: {time.perf_counter() - start}")

The above snippet behaves much like the thread pool example: all requests are started up front and they finish in whatever order the responses arrive.

Downloading https://i.imgur.com/AD3MbBi.jpeg
Downloading https://i.imgur.com/zYhkOrM.jpeg
Downloading https://i.imgur.com/LRoLTlK.jpeg
Downloading https://i.imgur.com/gtWsPu9.jpeg
Downloading https://i.imgur.com/jDimNTZ.jpeg
Done downloading https://i.imgur.com/AD3MbBi.jpeg
Done downloading https://i.imgur.com/jDimNTZ.jpeg
Done downloading https://i.imgur.com/zYhkOrM.jpeg
Done downloading https://i.imgur.com/gtWsPu9.jpeg
Done downloading https://i.imgur.com/LRoLTlK.jpeg
Total time: 0.4095943249994889

Pros:

  • The number of concurrent requests is not tied to the number of processes/threads you can create: each request is a lightweight coroutine on a single thread, so you can have far more of them in flight.
  • You can always use asyncio within processes/threads if you need to.
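
As a rough sketch of what "asyncio within threads" can look like: each worker thread below starts its own event loop with asyncio.run() and downloads one batch of URLs. The names fake_download, download_batch and run_batch are hypothetical, the network call is simulated with asyncio.sleep, and the example.com URLs are placeholders.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def fake_download(url):
    # Stand-in for an aiohttp request: yields to the event loop briefly.
    await asyncio.sleep(0.05)
    return url


async def download_batch(batch):
    # Run every download in this batch concurrently on one event loop.
    return await asyncio.gather(*(fake_download(url) for url in batch))


def run_batch(batch):
    # asyncio.run() creates a fresh event loop in the calling thread.
    return asyncio.run(download_batch(batch))


urls = [f"https://example.com/{i}.jpeg" for i in range(12)]
batches = [urls[i::3] for i in range(3)]  # three interleaved batches of four

with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(run_batch, batches))

downloaded = [url for batch in results for url in batch]
print(f"{len(downloaded)} downloads across {len(batches)} event loops")
```

With threads this buys little over a single event loop. The pattern is more interesting with a process pool, where each process could download and then post-process its own batch on a separate core, though that variant is not shown here.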

Gotchas:

  • Connection limits & memory issues are still a concern.
  • Not very intuitive if you're not familiar with async code.
  • You'll need to switch to aiohttp or httpx if you want to use async requests.
  • If you're using Jupyter notebooks, an event loop is already running, so asyncio.run() raises a RuntimeError. Either await the coroutine directly in a cell (await main()) or use nest_asyncio.
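
For the connection-limit gotcha, one common approach is to cap the number of in-flight requests with asyncio.Semaphore. The sketch below assumes a cap of 3 (MAX_CONCURRENT is an arbitrary value chosen for illustration) and again uses a hypothetical fake_download that sleeps instead of calling aiohttp, so it runs offline.

```python
import asyncio

MAX_CONCURRENT = 3  # assumed cap, tune it for the server you are hitting


async def fake_download(url, semaphore, state):
    # Only MAX_CONCURRENT coroutines can be inside this block at once.
    async with semaphore:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.05)  # stand-in for the actual request
        state["active"] -= 1
    return url


async def main(urls):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    state = {"active": 0, "peak": 0}  # tracks how many run at the same time
    done = await asyncio.gather(
        *(fake_download(url, semaphore, state) for url in urls)
    )
    return done, state["peak"]


urls = [f"https://example.com/{i}.jpeg" for i in range(10)]
done, peak = asyncio.run(main(urls))
print(f"{len(done)} downloads, at most {peak} in flight at once")
```

If you are on aiohttp, the connector can also enforce a cap for you: aiohttp.TCPConnector accepts a limit argument that bounds the number of simultaneous connections per session.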

References
  • SuperFastPython, an excellent source to get a deep understanding of concurrency in Python.