Friday, July 25, 2025

Ruminating on FastAPI’s Speed and how to scale it with multiple Uvicorn workers

Python’s Global Interpreter Lock (GIL) often raises questions about concurrency and performance, especially for web frameworks like FastAPI. How does FastAPI stay so fast despite the GIL, and how can you run it with multiple workers to fully leverage multi-core CPUs? 

Let’s explore these concepts clearly.

The Global Interpreter Lock, or GIL, is a mutex that ensures only one thread executes Python bytecode at any given moment inside a single process. This simplifies memory management and protects Python objects from concurrent access issues. However, it means pure Python threads cannot run code in parallel on multiple CPU cores, limiting how multi-threaded Python programs handle CPU-bound tasks.
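You can see this with a tiny (illustrative) benchmark: counting down in two threads takes roughly as long as doing the work serially, because only one thread can hold the GIL at any instant.

```python
import time
from threading import Thread

def count_down(n: int) -> None:
    # Pure-Python, CPU-bound loop: it never releases the GIL voluntarily.
    while n > 0:
        n -= 1

N = 20_000_000

# Serial baseline.
start = time.perf_counter()
count_down(N)
count_down(N)
print(f"serial : {time.perf_counter() - start:.2f}s")

# Two threads: roughly the same (or slightly worse) wall-clock time,
# since the threads take turns holding the GIL instead of running in parallel.
start = time.perf_counter()
threads = [Thread(target=count_down, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"threads: {time.perf_counter() - start:.2f}s")
```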

This sounds like bad news for a web framework that needs to handle many requests simultaneously, right? Not entirely.

How Does FastAPI Achieve High Performance Despite the GIL?

FastAPI is designed to handle many simultaneous requests efficiently by leveraging Python’s asynchronous programming capabilities, specifically the async/await syntax.

  • Asynchronous I/O: FastAPI endpoints can be defined as async functions. When these functions perform I/O operations such as waiting for a database query, a network response, or file access, they yield control back to an event loop (using await). While one request is waiting, the server can make progress on other requests, without needing multiple threads running in parallel (see the sketch after this list).
  • Single-threaded event loop: FastAPI runs on ASGI servers like Uvicorn, which manage an event loop in a single thread. This avoids the overhead and complexity of thread locking under the GIL: only one thread executes Python code at a time, but it switches efficiently between the many tasks that are waiting on I/O.
  • Ideal for I/O-bound tasks: Web APIs typically spend most of their time waiting on I/O, so asynchronous concurrency lets FastAPI handle many requests without needing multiple CPU cores or threads.
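Here is a minimal sketch of such an async endpoint (the route and the simulated database wait are purely illustrative):

```python
import asyncio
from fastapi import FastAPI

app = FastAPI()

@app.get("/orders/{order_id}")
async def get_order(order_id: int):
    # Stand-in for an awaitable I/O call (e.g. asyncpg, httpx.AsyncClient).
    # While this request is waiting, the event loop serves other requests.
    await asyncio.sleep(0.5)
    return {"order_id": order_id, "status": "shipped"}
```

While one request is parked on the await, the same single thread is free to make progress on other requests.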

But What If Your Application Is CPU-bound or You Need More Parallelism?

For CPU-bound workloads (heavy calculations) or simply to better utilize multi-core CPUs for handling many requests in parallel, you need multiple processes. This is where Uvicorn’s worker processes come in. 

Uvicorn, the ASGI server often used to run FastAPI, supports spawning multiple worker processes via the --workers option. Each worker is a separate process with its own Python interpreter and GIL. Workers run independently and can handle requests concurrently across different CPU cores. The master Uvicorn process listens on a port and delegates incoming requests to the worker processes.

This model effectively sidesteps the single-threaded GIL limitation by scaling the workload horizontally across processes rather than threads (unlike the multi-threaded model of Java or .NET frameworks such as Spring Boot or ASP.NET MVC).

Set the number of workers roughly equal to your CPU cores for optimal utilization. Each worker is a separate process, so memory usage will increase.
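A minimal sketch of starting Uvicorn with one worker per core, either from the command line or programmatically (the module path main:app is an assumption about your project layout):

```python
# run.py
# CLI equivalent: uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4
import os

import uvicorn

if __name__ == "__main__":
    uvicorn.run(
        "main:app",                   # must be an import string when workers > 1
        host="0.0.0.0",
        port=8000,
        workers=os.cpu_count() or 1,  # roughly one worker process per CPU core
    )
```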

When deploying with containers or orchestration tools like Kubernetes, it’s common to run one worker per container and scale containers horizontally.

Please NOTE that the vast majority of web applications and REST APIs are NOT CPU-bound, but I/O-bound. So even a single FastAPI server with async programming should more than suffice. Throw in an additional server behind a load balancer for high availability.

But what if you depend on synchronous libraries and cannot write async code? Well, FastAPI can handle sync routes too:

When FastAPI routes are defined as synchronous functions (def), the framework handles them by running the route handlers in an external thread pool instead of the main event loop thread. This approach prevents blocking the server's event loop, allowing requests to be processed concurrently despite the synchronous code. The synchronous route is effectively executed on a worker thread managed by the thread pool executor in the underlying Starlette framework. 
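For example (the blocking sleep below is a stand-in for whatever synchronous library call you are stuck with):

```python
import time
from fastapi import FastAPI

app = FastAPI()

@app.get("/report")
def get_report():          # plain "def": Starlette runs this in a worker thread
    time.sleep(2)          # blocking call, e.g. a sync database driver
    return {"status": "done"}

@app.get("/ping")
async def ping():          # async route: stays on the event loop
    return {"pong": True}
```

While /report blocks one worker thread for two seconds, the event loop remains free, so /ping (and other requests) keep getting served.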

Each thread releases the Global Interpreter Lock (GIL) when performing blocking I/O operations. This allows other threads to acquire the GIL and run concurrently during I/O waits, improving efficiency in I/O-bound tasks. 

While this allows concurrent execution, blocking I/O operations in sync routes still tie up a thread each and can reduce scalability under heavy load. Therefore, sync routes in FastAPI run concurrently, but they rely on thread-based concurrency rather than the non-blocking, event-loop concurrency of async def routes. The default size of the thread pool used for synchronous routes is 40 threads (this is AnyIO's default thread limiter, which Starlette uses under the hood).
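If 40 threads is not enough for your workload, the limit can be raised at startup by resizing AnyIO's default thread limiter. A minimal sketch, assuming a recent FastAPI with lifespan support (the value 100 is just illustrative):

```python
from contextlib import asynccontextmanager

from anyio.to_thread import current_default_thread_limiter
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Starlette runs sync routes via anyio.to_thread.run_sync(), which is
    # governed by this capacity limiter (40 tokens by default).
    current_default_thread_limiter().total_tokens = 100  # illustrative value
    yield

app = FastAPI(lifespan=lifespan)
```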

Friday, July 04, 2025

Ruminating on Continued Pre-Training and Fine-Tuning of LLMs

In my previous blogpost, we discussed the differences between RAG and Fine-Tuning. Besides fine-tuning, there is another technique called "Continued Pre-Training" that can be used to improve the performance of LLMs.

Continued pre-training involves taking a pre-trained model (typically trained on a large, general dataset) and further training it on a new, often domain-specific dataset. The goal is to adapt the model's general knowledge to a specific domain, such as medical texts, legal documents, or scientific literature, without starting from scratch. This enhances the model's understanding of the specific domain while retaining its general knowledge.

Suppose you have a pre-trained language model like BERT, originally trained on a general corpus like Wikipedia and BookCorpus. You want to use it for analyzing medical research papers. Since BERT’s general training may not capture medical jargon or context, you perform continued pre-training. 

To do this, you gather a large corpus of medical texts, such as PubMed articles or clinical notes, and continue training BERT's weights on it with the same self-supervised objective (masked language modeling), allowing the model to pick up medical terminology and context. The new model (call it "MedicalBERT") has adapted to medical terminology and can better understand domain-specific texts.
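A minimal sketch of what continued pre-training looks like with the Hugging Face transformers library; the corpus path and hyperparameters are illustrative assumptions, not a prescription:

```python
# Continued pre-training of BERT on a domain corpus via masked language modeling.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Illustrative: a folder of plain-text medical documents.
corpus = load_dataset("text", data_files={"train": "medical_corpus/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

# The collator randomly masks 15% of tokens, the same self-supervised
# objective BERT was originally pre-trained with.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="medical-bert",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
trainer.save_model("medical-bert")   # the adapted "MedicalBERT"
```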

Other examples of continued pre-training:

  • Adapting a Language Model for Legal Documents: You have a pre-trained model like RoBERTa, trained on general web data, but you need it to understand legal terminology and context for analyzing contracts or court documents.
  • Adapting a Vision Model for Satellite Imagery: A pre-trained vision model like ResNet, trained on ImageNet (general images like animals and objects), needs to be adapted for analyzing satellite imagery for urban planning or environmental monitoring.

Fine-tuning takes a pre-trained model (or a model after continued pre-training) and trains it on a smaller, task-specific dataset to optimize it for a particular task, such as classification, translation, or question answering. Fine-tuning adjusts the model’s weights to improve performance on the target task while leveraging the general knowledge learned during pre-training.

Examples of fine-tuning:

  • Fine-Tuning for Object Detection in Medical Imaging: You want to use a pre-trained vision model like YOLOv5, adapted for medical imaging (e.g., via continued pre-training on X-ray images), to detect specific abnormalities like tumors in chest X-rays.
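As a code-level illustration, here is a minimal sketch of supervised fine-tuning for text classification, starting from the domain-adapted checkpoint produced in the earlier example (the CSV files and the two-label setup are illustrative assumptions):

```python
# Fine-tuning a (continued-pre-trained) checkpoint for text classification.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Illustrative: start from the domain-adapted checkpoint saved above.
checkpoint = "medical-bert"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Illustrative labeled dataset: CSV files with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="medical-bert-classifier",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()
print(trainer.evaluate())
```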

Given below is a comparison of RAG vs Continued Pre-Training vs Fine-Tuning, aspect by aspect:

Definition
  • RAG: Combines a pre-trained language model with a retrieval mechanism that fetches relevant external documents to generate contextually accurate responses.
  • Continued Pre-Training: Further trains a pre-trained model on a large, domain-specific dataset to adapt it to a particular domain.
  • Fine-Tuning: Optimizes a pre-trained model for a specific task using a smaller, labeled dataset in a supervised manner.

Objective
  • RAG: Enhance model responses by incorporating external knowledge dynamically during inference.
  • Continued Pre-Training: Adapt a model to understand domain-specific patterns, terminology, or context.
  • Fine-Tuning: Optimize a model for a specific task, such as classification or translation.

Data Requirement
  • RAG: Requires a large corpus of documents for retrieval (often unstructured) and a pre-trained model.
  • Continued Pre-Training: Requires a large, domain-specific dataset, typically unlabeled or weakly labeled.
  • Fine-Tuning: Requires a smaller, task-specific, labeled dataset.

Learning Type
  • RAG: Combines generative modeling with retrieval; no additional training required during inference.
  • Continued Pre-Training: Self-supervised or unsupervised learning (e.g., masked language modeling).
  • Fine-Tuning: Supervised learning with task-specific objectives (e.g., classification loss).

Process
  • RAG: Retrieves relevant documents from an external knowledge base and uses them as context for the model to generate responses.
  • Continued Pre-Training: Continues training the model on domain-specific data to update its weights broadly.
  • Fine-Tuning: Updates model weights specifically for a target task using labeled data.

Computational Cost
  • RAG: Moderate; requires efficient retrieval systems but no additional training during inference.
  • Continued Pre-Training: High; involves training on large datasets, requiring significant compute resources.
  • Fine-Tuning: Moderate to low; uses smaller datasets, but may require careful tuning to avoid overfitting.

Data Availability
  • RAG: Needs a well-curated, accessible knowledge base for retrieval (e.g., Wikipedia, company documents).
  • Continued Pre-Training: Requires a large, domain-specific corpus, which may be hard to obtain for niche domains.
  • Fine-Tuning: Needs labeled data, which can be costly or time-consuming to annotate.

Model Modification
  • RAG: No modification to model weights; relies on external knowledge for context.
  • Continued Pre-Training: Broad updates to model weights to capture domain-specific knowledge.
  • Fine-Tuning: Targeted updates to model weights for task-specific performance.

Scalability
  • RAG: Scales well with large knowledge bases, but retrieval quality affects performance.
  • Continued Pre-Training: Scales with data and compute resources; time-consuming for large datasets.
  • Fine-Tuning: Scales with labeled data availability; risk of overfitting with small datasets.