By Nikos Vaggalis - journalist at i-programmer and software engineer. Read more of his work here.
This article explains concurrency in Python, covering topics like multithreading, multiprocessing, race conditions, and synchronization mechanisms such as locks. We’ll then take a deep dive into switching off the GIL to enable real multithreading in Python, highlighting the differences, the benefits and the gotchas with clear code examples.
Introduction
You might be wondering why you’d need concurrency at all. In most cases you won’t; but you will if you are involved with (for example):
- Data processing and ETL - parsing massive text files, cleaning up messy data, or applying complex regular expressions to millions of rows;
- Cryptography and hashing - reading files and calculating a cryptographic hash (like SHA-256) for every single one;
- Data science - running Monte Carlo simulations requiring heavy math operations such as those provided by NumPy, Pandas, and Scikit-Learn; or
- Network operations - downloading files, scraping web sites or calling REST APIs.
But first, let’s get our terminology straight: concurrency, parallelism and multithreading. These terms are similar in nature and are therefore easily confused, so let’s look at them in simple terms.
Sequential - a single process doing one thing at a time, waiting for each task to complete before starting the next.
Concurrency - a single process managing multiple things at once, but not necessarily doing them simultaneously. For example, a chef (the process) chops onions, puts the soup on to boil, switches to seasoning the meat while the soup boils, starts frying the onions, and then cuts the meat while the onions are frying and the soup is boiling.
Event Loop - A control structure that continuously waits for events (such as I/O completion, timers, or user actions), dispatches the associated tasks, and then repeats.
Parallelism/Multiprocessing - doing multiple things at the exact same time. e.g. two chefs (processes), one chops onions and the other chops tomatoes at the same time.
Multithreading - a programming model where a single process spawns multiple separate threads of execution to achieve concurrency. All threads share the exact same memory space and resources. Thinking about it, when threads run on different cores this is also parallelism, but within the same process (sharing memory) rather than in separate processes.
Thread safety is the property of a program or system that allows multiple threads to concurrently access and modify shared memory and resources without causing data corruption, memory leaks, or fatal program crashes.
When multiple threads operate within the same memory space, there is a risk that they will attempt to read and write to the same data simultaneously. If this access is not carefully synchronized, it leads to a race condition, meaning the final result depends unpredictably on the exact timing of when each thread executes its tasks.
To achieve thread safety and prevent these race conditions, developers and programming languages rely on various synchronization mechanisms like mutexes and locks and atomic operations.
The Global Interpreter Lock (GIL) - Python’s default mechanism: a lock that allows only one thread to execute Python bytecode at a time, which prevents threads from running in parallel.
Free-threading - Removing the GIL to enable true parallel multithreading in Python.
Multiple CPU cores - A CPU core is an individual processing unit within a computer’s processor (CPU) that reads and executes instructions independently. Modern CPUs usually feature multiple cores, which enable parallelism.
Example: Parallelism/Multiprocessing
Just run this command in two terminals:
python -c "while True: pass"
(using ctrl-C to stop it when the experiment is over)
If your computer has multiple CPU cores, then each of these separate processes will use a different core. This isn’t a very interesting example of parallel concurrency because there is nothing shared between the two processes.
Example: Non-parallel Concurrency
Let’s step back and find a more practical use of concurrency where a function's code should be run concurrently - in this case, the worker function.
import asyncio

async def worker(name, delay):
    print(f"{name} starting")
    await asyncio.sleep(delay)
    print(f"{name} finished after {delay} seconds")

async def main():
    task1 = asyncio.create_task(worker("task1", 1))
    task2 = asyncio.create_task(worker("task2", 1))
    task3 = asyncio.create_task(worker("task3", 1))
    await task1
    await task2
    await task3

asyncio.run(main())
The asyncio package runs entirely on a single thread, so it can use at most one CPU core. It relies on the fact that the tasks involve I/O operations (like waiting for a network, a database, or a timer like asyncio.sleep()). It works because the CPU is essentially doing nothing during that waiting period, allowing the event loop to feed it another task. Therefore, under asyncio, when a blocking I/O operation is hit, the code explicitly yields control back to the event loop, allowing the loop to efficiently run other tasks rather than sitting idle.
If you removed asyncio, each task would run sequentially, one after the other, taking roughly three times as long as the asyncio version.
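For comparison, here is a minimal sequential sketch of the same program, with time.sleep() standing in for the awaited asyncio.sleep(); it takes about three seconds instead of about one:
import time

def worker(name, delay):
    print(f"{name} starting")
    time.sleep(delay)  # blocking sleep: nothing else runs while we wait
    print(f"{name} finished after {delay} seconds")

start_time = time.time()
for name in ("task1", "task2", "task3"):
    worker(name, 1)
print(f"Total: {time.time() - start_time:.2f} seconds")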
At this point it’s important to note that the GIL severely restricts CPU-bound code, but has very little negative impact on I/O-bound code like the example above. CPU-bound tasks are those limited by the raw speed of your processor, such as heavy mathematical computations, image manipulation, or complex data processing. Because the GIL ensures only one thread can execute Python bytecode at a time, it completely prevents CPU-bound threads from running in parallel across multiple CPU cores. Attempting to use multithreading for CPU-bound tasks (with the GIL switched on, the standard setup) can actually degrade your program’s performance. This slowdown occurs because the threads continuously fight for control of the GIL, leading to significant context-switching overhead and resource contention.
Conversely, I/O-bound tasks spend the vast majority of their time waiting for external operations to finish, such as downloading data from the internet, querying a database, or reading and writing files to a hard drive. The GIL does not prevent these operations from occurring concurrently. Whenever a Python thread initiates a blocking I/O operation, it voluntarily releases the GIL. This allows the interpreter to hand the lock over to another thread, which can actively execute Python code while the first thread waits in the background for its data to arrive. Because of this cooperative multitasking, multithreading is a highly effective strategy for speeding up I/O-bound Python programs.
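To illustrate, here is a minimal sketch of I/O-bound multithreading. The time.sleep() call stands in for a blocking network request (it releases the GIL just as real I/O does), and the URLs are placeholders:
import threading
import time

def download(url):
    time.sleep(1)  # stands in for a blocking network call; it releases the GIL
    print(f"fetched {url}")

start_time = time.time()
threads = [
    threading.Thread(target=download, args=(f"https://example.com/{i}",))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Total: {time.time() - start_time:.2f} seconds")
The total comes out close to one second rather than four, even with the GIL enabled, because the threads overlap their waiting.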
Free-Threaded Concurrency
Free-threaded concurrency means running multiple threads truly in parallel by removing the Global Interpreter Lock (GIL).
By default the GIL prevents threads from executing Python code in parallel on multi-core processors, forcing developers to use multiprocessing or multithreaded C extensions for performance-heavy tasks. Removing the GIL enables true multithreaded parallelism in Python.
The Global Interpreter Lock (GIL)
But let’s take it from the beginning. What is the Global Interpreter Lock, and why is it there?
The Global Interpreter Lock is a mutex (mutual exclusion lock) that allows only one native thread to execute Python bytecode at a time within a single process. Because of the GIL, Python threads cannot achieve true parallel execution on multiple CPU cores; instead, they take turns sharing the processor through cooperative or preemptive multitasking.
The GIL was originally implemented in the early 1990s because it provided a straightforward way to ensure thread safety for Python’s internal memory management (specifically reference counting) and made it much easier to integrate with C libraries that were not thread-safe. By not requiring multiple granular locks for every data structure, the GIL also kept single-threaded programs running extremely fast.
Despite its early benefits, the GIL can now (optionally) be switched off because it has become a massive bottleneck for high-performance computing, artificial intelligence, and machine learning. The primary reasons driving its removal are:
- Inability to leverage modern hardware. Modern computers rely heavily on multi-core processors, but the GIL prevents CPU-bound Python programs from fully utilizing these cores. When multiple threads try to perform heavy computations in Python, they end up fighting for the GIL. This causes significant context-switching overhead and can actually degrade performance compared to using just a single thread.
- The severe drawbacks of multiprocessing. To work around the GIL, developers have historically relied on parallelism/multiprocessing, spawning entirely separate system processes instead of threads (a minimal sketch of this follows the list). However, multiprocessing comes with severe penalties; creating processes consumes far more memory than threads, and sharing data between processes requires expensive data serialization and inter-process communication.
- Code complexity and C++ rewrites. The inability to easily run parallel threads has forced developers to maintain complex, brittle workarounds. In many cases, organizations are forced to translate large portions of their Python codebases into C or C++ just to achieve the necessary performance.
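As a point of reference, the classic multiprocessing workaround looks something like this minimal sketch, where each worker is a separate OS process carrying its own interpreter and its own GIL:
from multiprocessing import Pool

def fib(n):
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

if __name__ == "__main__":
    # separate OS processes, so the GIL is no obstacle, but inputs
    # and results must be serialized between the processes
    with Pool(processes=8) as pool:
        results = pool.map(fib, [30] * 8)
    print(results)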
Through PEP 703, Python has officially introduced an experimental “free-threaded” build starting in Python 3.13, which allows developers to run Python with the GIL disabled. To make this possible without crashing the interpreter, Python’s internals are being completely overhauled with new thread-safe mechanisms. These include a new memory allocator called mimalloc, “immortalizing” certain objects so they don’t require reference counting, and using advanced techniques like biased and deferred reference counting to prevent threads from locking each other up.
Installing the free-threaded Python (i.e. Python without the GIL)
Here’s a quick and easy installation using the uv package manager, done in a virtual environment to avoid making changes to your normal setup.
On MacOS or Linux:
# install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# logout and login again
# check that uv has been installed
uv --version
# install your free threaded Python in a virtual environment
uv venv --python 3.14t
# activate the virtual environment
source .venv/bin/activate
On Windows:
# install uv
winget install --id=astral-sh.uv -e
# or if you don't have winget installed:
powershell -ExecutionPolicy ByPass -c `
"irm https://astral.sh/uv/install.ps1 | iex"
# logout and login again
# check that uv has been installed
uv --version
# install your free threaded Python in a virtual environment
uv venv --python 3.14t
# activate the virtual environment
.venv\Scripts\activate
To check that you have the right Python installed, run:
python -VV
The output should be something like:
Python 3.14.4 free-threading build (main, Apr 14 2026, 14:35:29)
Inside your script, you can call sys._is_gil_enabled(). It will return False if the GIL is turned off.
$ python -c 'import sys; print(sys._is_gil_enabled())'
False
You can switch GIL back on with the -X command-line parameter like this:
$ python -X gil=1 -c 'import sys;
print(sys._is_gil_enabled())'
True
And you can see whether the version of Python you’ve installed is capable of switching the GIL on or off by checking whether it has the config variable Py_GIL_DISABLED:
$ python -c 'import sysconfig;
print(sysconfig.get_config_var("Py_GIL_DISABLED"))'
1
$ python -X gil=1 -c 'import sysconfig;
print(sysconfig.get_config_var("Py_GIL_DISABLED"))'
1
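If you want a script to verify its own environment at startup, here is a minimal sketch; the getattr guards against interpreter versions that predate sys._is_gil_enabled():
import sys

is_gil_enabled = getattr(sys, "_is_gil_enabled", None)
if is_gil_enabled is None or is_gil_enabled():
    print("Warning: the GIL is active; threads will not run in parallel")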
Note: If you want all the Python code on the server to be using the free threaded Python, you can do a system level installation following the instructions here.
Liability Waiver
Before we start multi-threading with GIL-free Python, please keep the following caveats in mind:
Race conditions. If you do not carefully synchronize the order in which threads read and write to shared data, the final result will depend entirely on the unpredictable timing of the threads. These race conditions lead to incorrect outputs and are notoriously difficult to reproduce and debug.
Memory leaks and interpreter crashes. Python internally relies on reference counting to manage memory. If multiple threads concurrently increment and decrement an object’s reference count without thread-safe mechanisms, the count can become corrupted, leading to memory leaks or sudden program crashes.
Deadlocks. To prevent race conditions, developers rely on locks. However, this introduces the risk of lock ordering deadlocks. This happens when multiple threads attempt to acquire the same set of locks but in different orders, causing them to freeze indefinitely as they wait on one another to release a lock.
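A minimal sketch of the lock-ordering trap: if the two functions below run in separate threads, each can grab its first lock and then wait forever for the other’s.
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

def thread_one():
    with lock_a:        # holds A...
        with lock_b:    # ...waits for B
            pass

def thread_two():
    with lock_b:        # holds B...
        with lock_a:    # ...waits for A (deadlock if thread_one holds it)
            pass

# The fix: agree on one global lock order (always A before B) in every thread.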
Unsafe iterators. Specifically in Python’s new free-threaded builds, concurrently accessing the same iterator from multiple threads is not thread-safe and may cause your program to duplicate or miss the processing of elements.
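If multiple threads must consume a single stream of items, a thread-safe hand-off such as queue.Queue avoids the problem; a minimal sketch:
import queue
import threading

q = queue.Queue()
for item in range(100):
    q.put(item)

def consume():
    while True:
        try:
            item = q.get_nowait()  # thread-safe: each item goes to exactly one thread
        except queue.Empty:
            return
        # ... per-item work goes here ...

threads = [threading.Thread(target=consume) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()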
C-extension vulnerabilities. For developers using C-extensions, relying on borrowed references (temporarily using an object from a list or dictionary without taking formal ownership of it) is highly dangerous. Without the GIL, a separate thread could delete or modify the object inside the collection while the first thread is still trying to read it. Additionally, improperly sharing raw C-pointers between threads can easily trigger segmentation faults and data corruption. Examples of C-extension packages with which you should be careful of running in free-threaded Python include NumPy, Pandas, and Scikit-Learn.
Number of CPU cores. The examples below are tailored to a computer with 8 CPU cores. To get an accurate measurement of performance of the example scripts below, modify THREADS to match the number of cores on the machine where you’re running them.
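Rather than hard-coding it, you can size THREADS from the machine itself, for example:
import os

THREADS = os.cpu_count() or 1  # cpu_count() can return None on some platforms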
Example: Embarrassingly Parallel Multithreading
The great thing about Python’s transition to a GIL-free architecture is that pure Python code (code that does not extend to C libraries) does not need to be modified or rewritten to run on the free-threaded version. The exact same multithreaded code works in both environments. The difference lies entirely in which Python interpreter you use, and the environment flags you set when executing the script.
Here is a standard multithreaded Python example that uses threading.Thread to execute a CPU-bound task (calculating Fibonacci numbers) across eight threads:
import threading
import time

THREADS = 8

def fib(n):
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

def main():
    start_time = time.time()
    threads = []
    # Create and start 8 threads for a CPU-bound task
    for i in range(THREADS):
        thread = threading.Thread(target=fib, args=(35,))
        thread.start()
        threads.append(thread)
    # Wait for all threads to finish
    for thread in threads:
        thread.join()
    print(f"Completed: {time.time() - start_time:.2f} seconds.")

if __name__ == "__main__":
    main()
The benchmarks are eye-opening. Running the same code under a free-threaded build yields:
$ python fib.py
Completed: 0.81 seconds.
while with the GIL enabled:
$ python -X gil=1 fib.py
Completed: 4.13 seconds.
Due to the GIL, performance is roughly five times slower. This is because the GIL prevents multiple threads from executing Python bytecode at the same time, hence the threads continuously interrupt each other and fight for the lock. The program runs serially and utilizes only about 12.5% of an 8-core CPU’s capacity, acting exactly like a single-threaded program.
Without the GIL, the CPU cores will immediately run the threads in parallel. You will see a massive drop in execution time on an 8-core machine, because the threads are no longer waiting on a global lock to execute their instructions.
That said, a warning on thread safety. Even though the code syntax doesn’t change, the safety of your existing code might. This is because there’s a common misconception that the GIL had made all Python code inherently thread-safe. In reality, the GIL primarily protected Python’s internal memory management and state, but it did not guarantee that your high-level application operations were atomic or safe from race conditions.
Let’s revisit the Fibonacci code example. The recursive fib(n) function is purely a computationally expensive mathematical operation used to benchmark CPU-bound performance. Each thread executes this function completely independently, relying only on its own isolated call stack and local inputs, and therefore it is thread safe. Because the threads do not modify any shared data or state, it does not require any synchronization.
Synchronization primitives (such as locks) are specifically required to coordinate access to shared resources, preventing race conditions that occur when multiple threads try to read and write to the same memory space simultaneously.
Because the threads in the example above do not exchange data, write to shared global variables, or merge partial results, they do not step on each other’s toes.
When tasks can be performed completely independently of one another without requiring any data exchange, they are often referred to as “embarrassingly parallel”. Since there is no shared state to protect in this scenario, adding a lock is entirely unnecessary.
Example: Parallel Multithreading with Shared Resources - Buggy
Let’s now take a look at where things can go awry, using a simple parallel counter.
counter_buggy.py
import threading
import time

COUNT = 1_000_000
THREADS = 8

counter = 0

def worker():
    global counter
    for _ in range(COUNT):
        counter += 1

threads = [
    threading.Thread(target=worker)
    for _ in range(THREADS)
]

start_time = time.time()

for t in threads:
    t.start()
for t in threads:
    t.join()

print("Time:", time.time() - start_time)
print("Counter:", counter)
print("Expected:", COUNT * THREADS)
What the code is doing
We spin up 8 threads, and each thread increments counter 1,000,000 times:
counter += 1
What should happen (correct result) if done safely:
counter = 0
+ 1 000 000 (Thread A)
+ 1 000 000 (Thread B)
+ ...
+ 1 000 000 (Thread H)
= 8 000 000
The final result should be 8 000 000.
Let’s check that. First of all, with GIL switched on:
$ python -X gil=1 counter_buggy.py
Time: 0.17042112350463867
Counter: 8000000
Expected: 8000000
Now, letting the threads run in parallel:
$ python counter_buggy.py
Time: 0.27542805671691895
Counter: 1105222
Expected: 8000000
The counter is wrong!!
What actually happens (race condition):
The key is that:
counter += 1
is NOT one step. It expands to:
1. Read counter
2. Add 1
3. Write counter back
Interleaving is what causes the bug:
Step-by-step execution:
Initial state: counter = 0
Thread A reads counter : 0
Thread B reads counter : 0 (both read same old value)
Thread A computes 0 + 1 = 1
Thread B computes 0 + 1 = 1
Thread A writes counter = 1
Thread C reads counter = 1
Thread C computes 1 + 1 = 2
Thread C writes counter = 2
Thread B writes counter = 1
We expected three increments to leave the counter at 3 (0 + 1 + 1 + 1), but instead it ends up at 1, because Thread B’s late write overwrote the updates made by Threads A and C.
This happens because both threads read before either writes, as there’s no synchronization. This is exactly what a race condition is; the outcome depends on the timing (the “race”) between threads.
Cool, so the GIL protects me from those conditions?
The GIL sometimes hides this because in CPython threads don’t run truly in parallel and context switches happen less aggressively, so this exact interleaving happens less often, but it is still possible.
In no-GIL Python, threads run simultaneously therefore this interleaving becomes common and easy to reproduce.
The mental model to remember is to think of counter += 1 as:
“Read → Compute → Write”
If two threads do that at the same time, they can both read the same old value and one update gets overwritten. Hence the Golden Rule to stick by is:
If multiple threads access shared mutable data and at least one writes, then use synchronization.
Example: Parallel Multithreading with Shared Resources - Simple Fix
With a multi-threaded program updating a shared counter, you need to use synchronization primitives like threading.Lock to protect your application’s shared data and logic, just as you did when the GIL was present. If your code required locks or queues for thread safety before, it will still require them in the free-threaded version. Let’s now check an example:
import threading
import time

COUNT = 1_000_000
THREADS = 8

counter = 0
lock = threading.Lock()  # FIX A: create a lock object

def worker():
    global counter
    for _ in range(COUNT):
        # FIX B: lock this block so only 1 thread can run it
        with lock:
            counter += 1

threads = [
    threading.Thread(target=worker)
    for _ in range(THREADS)
]

start_time = time.time()

for t in threads:
    t.start()
for t in threads:
    t.join()

print("Time:", time.time() - start_time)
print("Counter:", counter)
print("Expected:", COUNT * THREADS)
Running it now, you see that the Counter is correct both with GIL switched on and off.
$ python -X gil=1 counter_fixed.py
Time: 0.5505349636077881
Counter: 8000000
Expected: 8000000
$ python counter_fixed.py
Time: 0.8671579360961914
Counter: 8000000
Expected: 8000000
But wait! The GIL-free version took almost twice as long to execute! What’s happening? Where’s the promise of ultra-fast speeds? This is completely counterintuitive!
It makes perfect sense to expect that turning off the GIL would instantly make your multi-threaded code run faster. However, this specific code highlights one of the biggest paradoxes in parallel programming: removing the GIL does not magically make serialized code parallel, and it can actually make it worse.
The reason the no-GIL version is slower comes down to two main factors: extreme lock contention and CPU cache bouncing.
Here is exactly what is happening under the hood.
1. The Code is Inherently Sequential.
Take a close look at the core loop:
for _ in range(COUNT):
    with lock:
        counter += 1
Because of the with lock: statement, only one thread can ever modify counter at a time. Even if you have 100 CPU cores, 99 of them will be paused, waiting in line for the 1 core that currently holds the lock. There is actually zero parallel work happening in this specific task.
2. Execution with the GIL (not parallel, but fast)
When the GIL is enabled, Python acts as a central traffic cop. The GIL ensures that only one thread executes Python bytecode at any given moment. Because the GIL is already preventing multiple threads from running simultaneously on multiple cores, the contention for the lock is actually quite low. Python’s internal thread switching handles the hand-offs relatively smoothly. The threads aren’t fiercely fighting over the lock at the operating system level because the GIL is keeping them somewhat orderly.
3. Execution without the GIL (parallel, but slow)
When you disable the GIL, Python takes the training wheels off and hands control directly to your Operating System. Now, the OS puts your 8 threads onto 8 separate, physical CPU cores and yells “Go!”
All 8 cores instantly rush to acquire the lock at the exact same fraction of a millisecond. Because the lock is highly contested (acquired 8 million times in total), the operating system has to constantly intervene, putting threads to sleep and waking them up. This OS-level context switching is incredibly heavy and expensive compared to Python’s internal GIL management.
The other reason is the cache bouncing occurring at the hardware level: the CPU cores spend more time aggressively syncing and invalidating each other’s memory caches (cache bouncing, or ping-ponging) than actually doing the math.
Example: Parallel Multithreading with Shared Resources - Proper Fix
Disabling the GIL is not a silver bullet for performance. The “No-GIL” version shines in CPU-bound tasks that do not share state (the “embarrassingly parallel” workloads). If your threads are doing heavy math on independent variables, the No-GIL version will be significantly faster. But if your threads are constantly fighting over a single, shared, locked variable, the No-GIL version will suffer from severe hardware-level traffic jams. As such, the secret to making multi-threaded code fly—especially without the GIL—is to eliminate shared state.
Having looked at this issue, let’s rewrite the code so that it actually leverages multiple cores and runs dramatically faster without the GIL. To do so, instead of having all eight threads fight over a single, locked counter, we will give each thread its own local counter. They will do their work completely independently, and we will simply add up their final tallies at the very end.
We’re going to do so by utilizing Python’s concurrent.futures module, which is the cleanest way to handle this:
from concurrent.futures import ThreadPoolExecutor
import time

COUNT = 1_000_000
THREADS = 8

def worker():
    local_counter = 0
    for _ in range(COUNT):
        local_counter += 1
    return local_counter

start_time = time.time()

with ThreadPoolExecutor(max_workers=THREADS) as executor:
    results = list(executor.map(
        lambda _: worker(), range(THREADS)
    ))

counter = sum(results)

print("Time:", time.time() - start_time)
print("Counter:", counter)
print("Expected:", COUNT * THREADS)
Not only is it correct, but it’s about five times as fast as the buggy code!
$ python counter_fixed_fast.py
Time: 0.05626487731933594
Counter: 8000000
Expected: 8000000
$ python -X gil=1 counter_fixed_fast.py
Time: 0.10117506980895996
Counter: 8000000
Expected: 8000000
Wow, talk about improvement! Why is this version so drastically better? Firstly, because of zero lock contention: there’s no threading.Lock() (as we used in the quick fix), so the operating system never has to pause a thread to wait for another one. Secondly, in comparison with the buggy code, there’s zero cache bouncing: each thread updates a variable (local_counter) that lives in its own isolated CPU cache, so the CPU cores don’t waste time synchronizing memory for a shared global variable.
Running this specific code in a free-threaded Python environment gives you true parallelism and scales beautifully: the cores sprint through their individual loops simultaneously, and the execution time drops significantly.
The Golden Rule of Parallelism
Whenever you are trying to speed up code using multiple cores, always ask yourself: “Do these threads need to talk to each other right now?” If the answer is yes, it will be slow. The best parallel code splits a big job into completely isolated chunks, processes them separately, and merges the results at the finish line.
Most pure Python code will not need to be rewritten or touched. The free-threaded architecture was specifically designed to be highly compatible with existing code. Operations that were considered atomic under the GIL continue to be atomic in the free-threaded version. Therefore, as long as a pure Python library was already properly thread-safe (using standard synchronization tools like threading.Lock where appropriate), it will remain thread-safe without the GIL.
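For example, a single operation on a built-in container, such as list.append, remains safe to call from many threads at once; it is compound read-modify-write sequences, like the counter += 1 above, that still need a lock. A minimal sketch:
import threading

items = []

def producer():
    for i in range(10_000):
        items.append(i)  # one built-in list operation: atomic with or without the GIL

threads = [threading.Thread(target=producer) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(items))  # 80000 every time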
But what about Python code and libraries that rely on underlying C libraries? Shouldn’t they be rewritten in order to become thread-safe and be safely used on a GIL-disabled Python?
Historically, the Python C API gave developers direct control over the GIL, and many native extensions implicitly relied on the GIL to protect global data structures and object states within their C code. Libraries that rely on native C extensions—which includes most data science, AI, and performance-critical libraries like NumPy, Pandas and scikit-learn — are not yet fully thread-safe.
What happens if a library hasn’t been rewritten yet? Python includes a built-in safety net. If you import a third-party C-API extension package that has not been explicitly marked as supporting free-threading, the Python interpreter will automatically re-enable the GIL at runtime and print a warning. This prevents legacy packages from crashing your program, though it temporarily removes your multi-core performance gains.
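A minimal sketch of how you might detect this at runtime (the imported package name is purely hypothetical):
import sys

# import some_legacy_extension  # hypothetical C extension without free-threading support

if sys._is_gil_enabled():
    print("An import re-enabled the GIL; multi-core gains are gone for this run")
If you have verified that such an extension is actually safe, you can override the safety net and force the GIL back off with -X gil=0 or the PYTHON_GIL=0 environment variable.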
The good news is that the transition is already underway. Organizations like Meta and Quansight are actively trying to add free-threading compatibility to the most popular packages in the Python ecosystem, while websites like Python Free-Threading Guide have even been set up to track the compatibility status of popular libraries.
To conclude, GIL-free Python can truly improve performance many-fold, but it’s not a one-size-fits-all solution; removing the GIL does not magically make serialized code parallel, and it can actually make things worse. You’ve got to understand how to structure your code to take advantage of it.
Further Exploration
If you want to stretch the capabilities of the new free-threading system, the Python Free-Threading Guide has some outstanding examples where the absence of the GIL works wonders. One particular example, on web scraping, blends multithreading with asyncio, two concepts that initially look entirely orthogonal!
“Web scraping is the process of extracting useful data from websites, and it becomes especially challenging and time-consuming when dealing with hundreds or thousands of pages. The traditional synchronous approach scrapes only one page at a time and is slow. With asyncio, we can leverage asynchronous I/O to scrape multiple pages concurrently, which significantly speeds up the process; however, asyncio can only utilize a single CPU core. Modern computers have multiple CPU cores; yet, asyncio only takes advantage of a single core. However, with free-threaded Python, we can run multiple asyncio workers in threads to utilize all available cores.”
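Here is a minimal sketch of that pattern, with asyncio.sleep() standing in for real HTTP requests and placeholder URLs: each thread runs its own private event loop via asyncio.run(), and the URL list is split into one chunk per thread.
import asyncio
import threading

async def fetch(url):
    await asyncio.sleep(0.1)  # stands in for an async HTTP request
    return url

async def scrape(urls):
    return await asyncio.gather(*(fetch(u) for u in urls))

def run_worker(chunk, results, i):
    results[i] = asyncio.run(scrape(chunk))  # a private event loop per thread

urls = [f"https://example.com/page{i}" for i in range(100)]
chunks = [urls[i::8] for i in range(8)]  # one slice of the work per thread
results = [None] * 8
threads = [
    threading.Thread(target=run_worker, args=(chunk, results, i))
    for i, chunk in enumerate(chunks)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(len(r) for r in results), "pages scraped")
On a free-threaded build the eight event loops genuinely run in parallel, one per core; with the GIL enabled they merely interleave on a single core.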
