
Friday, July 25, 2025

Ruminating on FastAPI’s Speed and how to scale it with multiple Uvicorn workers

Python’s Global Interpreter Lock (GIL) often raises questions about concurrency and performance, especially for web frameworks like FastAPI. How does FastAPI stay so fast despite the GIL, and how can you run it with multiple workers to fully leverage multi-core CPUs? 

Let’s explore these concepts clearly.

The Global Interpreter Lock, or GIL, is a mutex that ensures only one thread executes Python bytecode at any given moment inside a single process. This simplifies memory management and protects Python objects from concurrent access issues. However, it means pure Python threads cannot run code in parallel on multiple CPU cores, limiting how multi-threaded Python programs handle CPU-bound tasks.
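
To see the effect concretely, here is a minimal sketch (plain CPython, standard library only; the loop size is an arbitrary choice). Two threads doing pure-Python CPU work take roughly as long as doing the same work sequentially, because the GIL serializes bytecode execution:

import threading
import time

def cpu_bound(n: int) -> int:
    # A pure-Python loop holds the GIL while it executes bytecode.
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 5_000_000

# Sequential: two calls, one after the other.
start = time.perf_counter()
cpu_bound(N)
cpu_bound(N)
print(f"sequential:  {time.perf_counter() - start:.2f}s")

# Two threads: on CPython this takes roughly the same wall-clock time,
# because only one thread can execute Python bytecode at any moment.
start = time.perf_counter()
threads = [threading.Thread(target=cpu_bound, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"two threads: {time.perf_counter() - start:.2f}s")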

This sounds like bad news for a web framework that needs to handle many requests simultaneously, right? Not entirely.

How Does FastAPI Achieve High Performance Despite the GIL?

FastAPI is designed to handle many simultaneous requests efficiently by leveraging Python’s asynchronous programming capabilities, specifically the async/await syntax.

  • Asynchronous I/O: FastAPI endpoints can be defined as async functions. When these functions perform I/O operations such as waiting for a database query, network response, or file access, they yield control (using await) back to an event loop. This means that while one request is waiting, the server can start working on other requests without needing multiple threads running in parallel (see the sketch after this list).
  • Single-threaded event loop: FastAPI runs on ASGI servers like Uvicorn that manage an event loop in a single thread. This avoids the overhead and complexity of thread locking under the GIL because only one thread executes Python code at a time, but efficiently switches between many tasks waiting for I/O.
  • Ideal for I/O-bound tasks: Web APIs typically spend a lot of time waiting for I/O operations, so asynchronous concurrency lets FastAPI handle many requests without needing multiple CPU cores or threads.
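
To make the bullets above concrete, here is a minimal, hypothetical endpoint (the route path and the simulated one-second wait are illustrative assumptions, not part of any real API). While one request is suspended at the await, the single-threaded event loop is free to serve other requests:

import asyncio
from fastapi import FastAPI

app = FastAPI()

@app.get("/orders/{order_id}")
async def get_order(order_id: int):
    # Stands in for a non-blocking I/O call (database query, HTTP call, etc.).
    # While this coroutine is suspended here, the event loop handles other requests.
    await asyncio.sleep(1)
    return {"order_id": order_id, "status": "shipped"}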

But What If Your Application Is CPU-bound or You Need More Parallelism?

For CPU-bound workloads (heavy calculations) or simply to better utilize multi-core CPUs for handling many requests in parallel, you need multiple processes. This is where Uvicorn’s worker processes come in. 

Uvicorn, the ASGI server often used to run FastAPI, supports spawning multiple worker processes via the --workers option. Each worker is a separate process with its own Python interpreter and GIL. Workers run independently and can handle requests concurrently across different CPU cores. The master Uvicorn process listens on a port and delegates incoming requests to the worker processes.

This model effectively bypasses the single-threaded GIL limitation by scaling the workload horizontally across processes rather than threads (unlike the multi-threading model of Java or .NET frameworks such as Spring Boot or ASP.NET MVC).

Set the number of workers roughly equal to your CPU cores for optimal utilization. Each worker is a separate process, so memory usage will increase.
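
As a rough sketch (the module name main, the port, and the worker count are assumptions, not prescriptions), the same thing can be done from the command line with 'uvicorn main:app --workers 4' or programmatically:

# main.py
import os

import uvicorn
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
async def health():
    return {"status": "ok"}

if __name__ == "__main__":
    # Each worker is a separate OS process with its own interpreter and GIL.
    # One worker per CPU core is a common starting point.
    uvicorn.run("main:app", host="0.0.0.0", port=8000, workers=os.cpu_count() or 1)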

When deploying with containers or orchestration tools like Kubernetes, it’s common to run one worker per container and scale containers horizontally.

Please NOTE that the vast majority of web applications and REST APIs are NOT CPU-bound, but I/O-bound. So even a single FastAPI server with async programming should more than suffice. Throw in an additional server behind a load balancer for high availability.

But what if you depend on synchronous libraries and cannot use async in FastAPI? Well, FastAPI can handle sync routes too, as follows:

When FastAPI routes are defined as synchronous functions (def), the framework handles them by running the route handlers in an external thread pool instead of the main event loop thread. This approach prevents blocking the server's event loop, allowing requests to be processed concurrently despite the synchronous code. The synchronous route is effectively executed on a worker thread managed by the thread pool executor in the underlying Starlette framework. 

Each thread releases the Global Interpreter Lock (GIL) when performing blocking I/O operations. This allows other threads to acquire the GIL and run concurrently during I/O waits, improving efficiency in I/O-bound tasks. 
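
A tiny illustration of that point (standard library only; the one-second sleep stands in for any blocking I/O call). Ten threads that merely wait on I/O finish in about one second overall, not ten, because each releases the GIL while it waits:

import threading
import time

def io_bound():
    # time.sleep releases the GIL while waiting, just as a blocking socket
    # read or database call would, so the other threads can keep running.
    time.sleep(1)

start = time.perf_counter()
threads = [threading.Thread(target=io_bound) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"elapsed: {time.perf_counter() - start:.2f}s")  # roughly 1 second, not 10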

While this allows concurrent execution, blocking I/O operations in sync routes still consume a thread each and can reduce scalability under heavy load. Therefore, sync routes in FastAPI run concurrently, but they rely on thread-based concurrency rather than the true asynchronous non-blocking concurrency of async def routes. The default number of threads in FastAPI's thread pool for handling synchronous routes is 40.
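
Here is a minimal sketch of such a sync route (the endpoint name and the blocking sleep are illustrative; the optional thread-limit tweak uses anyio, which Starlette relies on under the hood, and should be verified against the versions you run):

import time

import anyio.to_thread
from fastapi import FastAPI

app = FastAPI()

@app.get("/sync-report")
def sync_report():
    # A plain 'def' route: Starlette runs it on a worker thread from its
    # thread pool, so this blocking call does not stall the event loop.
    time.sleep(2)
    return {"generated": True}

@app.on_event("startup")
async def widen_thread_pool():
    # Optional: raise the default limit of 40 concurrent threads for sync routes.
    anyio.to_thread.current_default_thread_limiter().total_tokens = 80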

Saturday, September 08, 2018

Ruminating on Kafka consumer parallelism

Many developers struggle to understand the nuances of parallelism in Kafka, so I am jotting down a few points from the Kafka documentation site that should help.
  • Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
  • Publishers can publish events into different partitions of a Kafka topic. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load, or according to some semantic partition function (say, based on some key in the record) - see the producer sketch below.
  • The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second, they act as the unit of parallelism.
Unlike other messaging middleware, parallel consumption of messages (aka load-balanced consumers) in Kafka is ONLY POSSIBLE using partitions. 
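
As an illustration (assuming the confluent-kafka Python client, a broker on localhost, and a hypothetical 'orders' topic), records published with the same key always land in the same partition, which is what preserves per-key ordering:

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

# Keyed records: both "order-1" events go to the same partition; records
# without a key are spread across partitions by the default partitioner.
events = [("order-1", b"created"), ("order-2", b"created"), ("order-1", b"shipped")]
for order_id, payload in events:
    producer.produce("orders", key=order_id, value=payload)

producer.flush()  # block until all outstanding messages are delivered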

Kafka keeps one offset per [consumer-group, topic, partition]. Hence there cannot be more consumer instances within a single consumer group than there are partitions.
So if you have only one partition, you can have only one consumer (within a particular consumer group). You can of course have consumers across different consumer groups, but then the messages would be duplicated and not load-balanced.
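
And a matching consumer sketch (same assumptions as above): every instance started with the same group.id splits the topic's partitions with its peers, and any instance beyond the partition count simply sits idle.

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "order-processors",   # instances with this group.id share the partitions
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        print(f"partition={msg.partition()} key={msg.key()} value={msg.value()}")
finally:
    consumer.close()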

Wednesday, April 25, 2012

'volatile' keyword in Java

Found this excellent article on the web explaining the 'volatile' keyword in Java and how it can be used for concurrency. The tutorial also explains the changes to how the volatile keyword functions since Java 5.

Also found it interesting to understand what 'livelock' is. We often encounter deadlock and thread starvation in parallel programming, but livelock is also possible :)

Difference between Concurrent Collections and Synchronized Collections in JDK

Traditionally, we have used object locks (monitors) and synchronized methods to make our collections thread-safe. But holding an exclusive lock on an object brings in scalability issues.

Hence the newer versions of the JDK have a package called "java.util.concurrent". This package contains many new collection classes that are thread-safe, but not by virtue of a single exclusive lock :)

More details at this link: http://docs.oracle.com/javase/1.5.0/docs/api/java/util/concurrent/package-summary.html

Snippet from the above link:

The "Concurrent" prefix used with some classes in this package is a shorthand indicating several differences from similar "synchronized" classes. For example java.util.Hashtable and Collections.synchronizedMap(new HashMap()) are synchronized. 

But ConcurrentHashMap is "concurrent". A concurrent collection is thread-safe, but not governed by a single exclusion lock. In the particular case of ConcurrentHashMap, it safely permits any number of concurrent reads as well as a tunable number of concurrent writes.

 "Synchronized" classes can be useful when you need to prevent all access to a collection via a single lock, at the expense of poorer scalability. In other cases in which multiple threads are expected to access a common collection, "concurrent" versions are normally preferable. And unsynchronized collections are preferable when either collections are unshared, or are accessible only when holding other locks. 

Most concurrent Collection implementations (including most Queues) also differ from the usual java.util conventions in that their Iterators provide weakly consistent rather than fast-fail traversal. A weakly consistent iterator is thread-safe, but does not necessarily freeze the collection while iterating, so it may (or may not) reflect any updates since the iterator was created. 

Also, a good post on concurrency basics is available at: http://docs.oracle.com/javase/tutorial/essential/concurrency/memconsist.html (all chapters are a must-read :)

Another good blog explains how ConcurrentHashMap maintains several locks instead of one single mutex to deliver better performance.

Monday, March 05, 2012

How to ensure that IOCP is used for async operations in .NET?

In my last post, I had blogged about IO Completion Ports and how they work at the OS kernel level to provide for non-blocking IO.

But how can the 'average Joe' developer ensure that IOCP is being used when he uses async operations in .NET?

Well, the good news is that a developer need not worry about the complexities of IOCP as long as he is using the BeginXXX and EndXXX methods of all objects that support async operations. For example, SqlCommand has BeginExecuteReader/EndExecuteReader that you can use to asynchronously read data from a database. The FileStream and Socket classes also have BeginXXX/EndXXX methods that use IOCP in the background. Under the bonnet, these methods use IO completion ports, which means that the thread handling the request can be returned to the thread pool while the IO operation completes.

Some versions of the Windows OS may not support IOCP on all devices, but the developer need not worry about this. Depending on the target platform, the .NET Framework decides whether or not to use the IOCompletionPorts API, maximizing performance and minimizing resource usage.

An important caveat is to avoid using general-purpose async mechanisms for non-blocking IO - such as ThreadPool.QueueUserWorkItem or Delegate.BeginInvoke - because these do not use IOCP; they just pick up another thread from the managed thread pool. This defeats the very purpose of non-blocking IO, because the async work is then drawn from the same process-wide CLR thread pool.

Non blocking IO in .NET (Completion Ports)

Non blocking IO is implemented in Windows by a concept called 'IO Completion Ports' (IOCP).
Using IOCP, we can build highly scalable server side applications that can perform asynchronous IO to deliver maximum throughput for large workloads.

Traditionally server side applications were written by assigning one thread to a socket connection. But this approach seriously limited the number of concurrent connections that a server can handle. By using IOCP, we can overcome the "one-thread-per-client" problem, because 'worker' threads are not blocked for IO. Rather there is a separate pool of IO threads called 'Completion Port Threads' that wait on a special kernel level object called 'Completion Port'.

A completion port is a kernel level object that you can bind with a file handle - either a file stream, database connection or a socket stream. Multiple file handles can be bound to a single completion port. The .NET CLR maintains its own completion port and can bind any file handle to it. Each completion port has a queue associated with it. Once an IO operation completes, a message (completion packet) is posted to the queue. IO threads block or 'wait' on this completion port queue till a message is posted. The waiting IO threads (a.k.a. completion port threads) pick up the messages in the queue in FIFO order. Hence any thread may handle any completion packet. It is important to note that the threads themselves are 'woken' in LIFO order, so chances are that their caches are still warm.

The following links throw more light on this:
http://blog.stevensanderson.com/2008/04/05/improve-scalability-in-aspnet-mvc-using-asynchronous-requests/
http://www.codeproject.com/Articles/1052/Developing-a-Truly-Scalable-Winsock-Server-using-I

Why does the .NET Thread Pool have a separate worker thread pool and a Completion Port pool?
I believe that technically there is no fundamental difference in the nature of the threads associated with each pool. Worker threads are meant to do active work, whereas Completion Port threads are meant to wait on completion ports. Since IO threads wait on CPs, they may block for longer periods of time. Hence the .NET Framework has created separate categories for them. If there were a single pool, a high demand on worker threads could exhaust all the threads available to dispatch native I/O callbacks, potentially leading to deadlock.

Looks like in IIS 7, the threading model has undergone drastic changes. More info available here

Wednesday, February 15, 2012

Using Parallelism in .NET WinForm applications

We all have gone through the travails of multi-threaded programming in WinForm applications. The challenge in WinForm applications is that the UI controls are bound to the thread that created/rendered them; i.e. a UI control can only be updated by the main GUI thread that created it.

But to keep the UI responsive, we cannot execute any long running task (>0.5 sec) on the UI thread, else the GUI would hang or freeze. If we run the business logic asynchronously on another thread, then how do we pass the results back to the main GUI thread to update the UI?

Traditionally this has been done using the Control.Invoke() methods. More details on this approach are available at this link: http://msdn.microsoft.com/en-gb/magazine/cc300429.aspx

But with the introduction of TPL, there is another alternative way of doing this. We can use the TaskScheduler and SynchronizationContext classes to call heavy lifting work and then pass the results to the main GUI thread.

For example:
TaskScheduler uiScheduler =
           TaskScheduler.FromCurrentSynchronizationContext();
Task.Factory.StartNew(() => { /* heavy lifting here */ })               // runs on a thread-pool thread
    .ContinueWith(t => { /* update UI controls here */ }, uiScheduler); // continuation runs on the UI thread

Given below are two excellent articles elaborating on this in detail:
http://www.codeproject.com/Articles/152765/Task-Parallel-Library-1-of-n

http://reedcopsey.com/2010/03/18/parallelism-in-net-part-15-making-tasks-run-the-taskscheduler/

Sacha Barber has an excellent six-part article series on the intricacies of TPL, which I loved reading.

Parallelism in .NET

In one of my previous blogs, I had pointed to an interesting article that shows how TPL controls the number of threads in the Thread Pool using hill-climbing heuristics.

In order to understand why the TPL (Task Parallel Library) is far superior to simple multi-threading, we need to understand the concepts of the global queue, the per-thread local queues, work-stealing algorithms, etc.
Given below are some interesting links that explain these concepts with good illustrations.

http://www.danielmoth.com/Blog/New-And-Improved-CLR-4-Thread-Pool-Engine.aspx

http://blogs.msdn.com/b/jennifer/archive/2009/06/26/work-stealing-in-net-4-0.aspx

http://udooz.net/blog/2009/08/net-4-0-work-stealing-queue-plinq/

A few important points to remember:
  • There is one global queue for the default Thread Pool in .NET 4.0
  • There is also a local queue for each Thread. The Task Scheduler distributes the tasks from the global queue to the local queues on each Thread. Even sub-tasks created by each Thread get queued on the local queue. This improves the performance, as there is no contention to pick up work items (tasks) from the global queue; especially in a multi-core scenario.
  • If a thread is free and there are no tasks in its local queue and also global queue, then it will steal work from other threads. This ensures that all cores are optimally utilized. This concept is called 'work stealing'.
  • Tasks from the global queue are picked up in 'FIFO' order. Tasks from the local queue are picked up in 'LIFO' order based on the assumption that the last-in is still hot in the cache. Work stealing again happens in 'FIFO' order.
There is a wonderful book on parallel computing available on MSDN that is a must read for everyone.

Thursday, August 11, 2011

How does .NET TPL control the number of threads

I often wondered what heuristics the Task Parallel Library (TPL) in .NET uses to control the number of threads for optimal utilization on multi-core machines.
Found a great discussion thread on Stack Overflow explaining the details.