
The Secret Life of Memory: Linux Kernel Memory Management Explained Simply

Introduction: Welcome to the City of Linux

Imagine a sprawling, ever-growing city. Millions of citizens (processes) live and work in it at any given time. Some are small-time shopkeepers using a tiny room. Others are massive corporations occupying entire skyscrapers. Every one of them needs space — land, offices, storage.

But here’s the thing: the city doesn’t have infinite land.

The physical land is limited. It’s finite. It’s real. And someone — someone very smart — has to decide who gets what piece of land, when they get it, what happens when the land runs out, and what to do when a citizen starts hoarding space they don’t need.

That “someone” is the Linux Kernel’s Memory Management subsystem.

It is, without exaggeration, one of the most sophisticated resource managers ever designed by human beings. And today, we’re going to walk through its streets, peek behind its curtains, and understand — in plain, human language — how it works.

Grab your coffee. Let’s go.

Chapter 1: The Illusion — Virtual Memory (The City’s Greatest Magic Trick)

Before we dive into the nuts and bolts, let’s start with the kernel’s greatest trick: making every process believe it has the entire city to itself.

The Grand Illusion

When a process starts on Linux — say, your web browser — it doesn’t see the real physical memory (RAM). Instead, the kernel hands it a virtual address space. On a 64-bit system, this virtual space is astronomically large: 128 terabytes on each side (user space and kernel space).

Think of it this way:

Every citizen in the city is given a personal, private map. On this map, the city looks enormous — endless streets, unlimited land plots, and every address neatly numbered from 0 to some incomprehensibly large number. The citizen believes they own the entire city.

But the map is a fiction. A beautiful, useful fiction.

Behind the scenes, the kernel is the cartographer. It draws these maps, and it decides which “virtual addresses” on the map correspond to actual, physical plots of land (RAM). Many addresses on the map point to nothing at all — they’re blank, unallocated. Some addresses share the same physical land with other citizens (shared memory). And some addresses point to land that’s been temporarily “moved” to a faraway warehouse (swap space on disk).

Why the Illusion?

Three critical reasons:

  1. Isolation & Safety: Process A can never accidentally (or maliciously) read or write Process B’s memory. Their maps are completely separate. A rogue process can trash its own space, but it can never touch another citizen’s home.
  2. Simplification: Every process sees a clean, contiguous address space starting from zero. No process needs to worry about where it physically lives in RAM, or about working around other processes’ memory.
  3. Overcommitment: The kernel can promise more memory than physically exists. Just as a city can sell more land leases than it has physical land — betting that not everyone will build on their plots at the same time — the kernel can hand out virtual addresses generously and only allocate physical RAM when it’s actually used. This is called overcommitting, and it’s both a superpower and a source of drama (more on that later).

The Data Structure: mm_struct

Every process in Linux has a structure called mm_struct — its personal memory blueprint. This structure holds everything the kernel needs to know about that process’s virtual address space:

  • Where its code (text segment) lives
  • Where its data lives
  • Where the heap starts and ends
  • Where the stack is
  • A tree of all memory regions (VMAs — Virtual Memory Areas)
```text
Process (task_struct)
   └── mm_struct (memory descriptor)
         ├── mmap (linked list of VMAs)
         ├── pgd (pointer to page table)
         ├── total_vm (total pages mapped)
         ├── start_code, end_code
         ├── start_data, end_data
         ├── start_brk, brk (heap boundaries)
         ├── start_stack
         └── ... and much more
```

When the kernel context-switches from one process to another, one of the key things it does is swap out the old process’s page tables and load the new ones. The CPU’s view of memory changes entirely. The new process wakes up, looks at its personal map, and sees its own private city — unaware that the map just changed.


Chapter 2: The Real Estate — Physical Memory and Pages

Now let’s talk about the actual land — the physical RAM.

Pages: The Atomic Unit of Memory

The kernel doesn’t manage memory byte by byte. That would be like a city planner managing land one square inch at a time — absurdly tedious and inefficient.

Instead, it divides all physical memory into fixed-size chunks called pages. On most systems, a page is 4 KB (4,096 bytes). Some architectures support larger pages — huge pages of 2 MB or even 1 GB — but 4 KB is the standard building block.

Think of a page as a standard-sized plot of land in our city. Every building, park, or warehouse must occupy one or more of these standard plots. You can’t own half a plot. You can’t own 5.3 plots. It’s always whole numbers.

Every physical page in the system is tracked by a structure called struct page. This is one of the most important data structures in the entire kernel. Each struct page contains:

  • Flags: Is this page free? Dirty? Locked? Being written back to disk? Part of a slab cache?
  • Reference count (_refcount): How many things are using this page?
  • Mapping information: What (if anything) is this page mapped to?
  • LRU list pointers: For the page reclamation system (we’ll get to this)

The kernel maintains an array of these structures — one for every physical page in the system. On a machine with 16 GB of RAM, that’s roughly 4 million page structures.

```c
struct page {
    unsigned long flags;           // Page status flags
    atomic_t _refcount;            // Usage reference count
    atomic_t _mapcount;            // Number of page table mappings
    struct list_head lru;          // For page replacement lists
    struct address_space *mapping; // Associated address space
    pgoff_t index;                 // Offset within mapping
    // ... many more fields
};
```

Memory Zones: Not All Land is Created Equal

Here’s a complication: not all physical memory is the same. Due to hardware constraints (especially on older or specialized hardware), the kernel divides physical memory into zones:

  • ZONE_DMA (0–16 MB): The “historic district.” Some ancient hardware (old ISA devices) can only access the first 16 MB of memory for Direct Memory Access. This tiny zone is reserved for them.
  • ZONE_DMA32 (0–4 GB): For 32-bit devices that can address up to 4 GB but nothing beyond. On 64-bit systems, this is a compatibility zone.
  • ZONE_NORMAL (16 MB–896 MB on 32-bit; much larger on 64-bit): The “main residential area.” This is where most kernel data structures live and where the kernel can directly map physical memory.
  • ZONE_HIGHMEM (above 896 MB, 32-bit only): The “outskirts.” On 32-bit systems, the kernel can’t directly address memory above ~896 MB without special tricks (temporary mappings). On 64-bit systems, this zone doesn’t exist — everything is directly addressable.

Imagine the city sits on varied terrain. The waterfront zone (ZONE_DMA) is prime real estate but tiny — reserved for old, important institutions that can’t relocate. The main downtown (ZONE_NORMAL) is where most of the action happens. And the distant highlands (ZONE_HIGHMEM) are accessible but require a special shuttle bus to reach.

Each zone has its own free page lists, its own watermarks (minimum thresholds of free pages), and its own statistics. When the kernel needs to allocate memory, it considers which zone the allocation should come from based on the requester’s constraints.


Chapter 3: The Map Room — Page Tables and Address Translation

So we have virtual addresses (the fictional map) and physical pages (the real land). Something needs to connect them. That something is the page table.

The Translation Process

When a process accesses memory address 0x00007fff5a3b2000, the CPU doesn’t go directly to physical RAM at that address. Instead, it says:

“I need to translate this virtual address into a physical one.”

The CPU consults the page table — a hierarchical data structure maintained by the kernel and stored in physical memory. The CPU has a special register (called CR3 on x86) that points to the top-level page table of the currently running process.

The Multi-Level Hierarchy

On a modern 64-bit x86 Linux system, page tables have five levels (since Linux 4.11, though most systems use four):

```text
Virtual Address (57 bits with 5-level paging; with the common
4-level setup the P4D is folded away and 48 bits are used)
┌─────────┬─────────┬─────────┬─────────┬─────────┬────────────┐
│  PGD    │  P4D    │  PUD    │  PMD    │  PTE    │  Offset    │
│ (9 bits)│ (9 bits)│ (9 bits)│ (9 bits)│ (9 bits)│ (12 bits)  │
└─────────┴─────────┴─────────┴─────────┴─────────┴────────────┘
```

Let me tell this as a story:

The Library Analogy

Imagine a massive library with billions of books (memory locations). You have a book code: A-3-7-2-5-42.

  • First, you go to Wing A (PGD — Page Global Directory). The lobby guard checks the first part of your code and points you to a wing.
  • In that wing, you find a Floor Directory (P4D). It tells you which floor to go to.
  • On that floor, you find a Section Directory (PUD — Page Upper Directory). It points to a section.
  • In that section, there’s a Shelf Directory (PMD — Page Middle Directory). It points to a specific shelf.
  • At the shelf, you find a Book Index (PTE — Page Table Entry). It gives you the exact physical shelf location where the book (data) is stored.
  • The last part, 42, is the offset — how far along the shelf (within the 4 KB page) your specific byte lives.

Each level is a table with 512 entries (because 2^9 = 512), and each entry is 8 bytes. So each page table level itself fits neatly in one 4 KB page (512 × 8 = 4096).

The TLB: The Cache That Saves Everything

Walking through 4–5 levels of page tables for every single memory access would be catastrophically slow. The CPU would spend most of its time just doing translations.

Enter the TLB — Translation Lookaside Buffer. This is a small, extremely fast cache inside the CPU that stores recent virtual-to-physical translations.

Think of the TLB as your personal notebook where you jot down shortcuts. “Last time I needed book A-3-7-2-5, it was on Physical Shelf 8,423.” Next time you need it, you don’t walk through all five directories — you just check your notebook.

The TLB is astonishingly effective. Hit rates of 99%+ are common, meaning the CPU almost never has to do a full page table walk. But when it misses (a TLB miss), the cost is significant — potentially hundreds of CPU cycles for a single translation.

This is also why context switches are expensive. When the kernel switches from Process A to Process B, it often has to flush the TLB (though modern CPUs with ASIDs — Address Space IDs — can avoid full flushes), because Process B’s virtual addresses map to completely different physical pages.

Page Table Entries: More Than Just Addresses

Each PTE (Page Table Entry) doesn’t just contain a physical address. It also contains flags that the CPU checks on every access:

  • Present bit: Is this page actually in RAM? (If not → page fault)
  • Read/Write bit: Is writing allowed?
  • User/Supervisor bit: Can user-space code access this?
  • Accessed bit: Has this page been read recently? (Used for page replacement decisions)
  • Dirty bit: Has this page been modified? (Critical for knowing what needs to be written back to disk)
  • NX (No Execute) bit: Can code be executed from this page? (Security feature to prevent code injection attacks)

These flags are the access control system of our city. Every plot of land has a sign out front: “Residents only,” “Read-only archives,” “No construction allowed,” or “Condemned — do not enter.”


Chapter 4: The Great Fault — What Happens When You Touch Unmapped Memory

This is where things get really interesting. Let’s talk about page faults — the kernel’s most elegant error-handling mechanism.

The Setup

Remember, the kernel is generous with virtual addresses but stingy with physical RAM. When a process calls malloc(1000000) to allocate a million bytes, here’s what actually happens:

  1. The C library’s malloc adjusts the process’s heap (or calls mmap for large allocations).
  2. The kernel updates the process’s virtual memory regions (VMAs) to say: “Yes, this range of virtual addresses is valid.”
  3. No physical RAM is allocated. Not a single byte.

It’s like getting a lease for a plot of land, but when you drive to the address, there’s… nothing there. Just an empty field with a “Coming Soon” sign.

So what happens when the process actually tries to use that memory — reads from or writes to it?

The Page Fault Dance

BOOM. The CPU tries to translate the virtual address, walks the page tables, and finds… the Present bit is not set. The page doesn’t exist in physical memory.

The CPU generates a page fault exception and hands control to the kernel.

Now the kernel’s page fault handler kicks in. This is one of the most performance-critical code paths in the entire kernel. It has to figure out, very quickly, what kind of fault this is:

1. Valid Fault — Demand Paging (The “Oh, Right, Let Me Build That For You” Fault)

The kernel checks: “Is this address within a valid VMA?” If yes, the process is allowed to access this address — the kernel just hasn’t backed it with physical RAM yet.

The kernel then:

  • Finds a free physical page
  • Zeroes it out (for security — you don’t want to hand a process a page with another process’s leftover data!)
  • Updates the page table to map the virtual address to this new physical page
  • Sets the Present bit
  • Returns to the process, which retries the instruction — and this time it works

This is demand paging in action. The city didn’t build the house until you showed up at the front door. It’s lazy — beautifully, efficiently lazy.

This is why malloc is so fast. It barely does anything. The real work happens later, spread out over time, only when each page is actually touched.

2. Valid Fault — File-Backed Page (The “Let Me Fetch That From the Archives” Fault)

Sometimes the page fault occurs because a process is accessing a memory-mapped file (via mmap). The VMA is valid, but the data needs to be read from disk.

The kernel:

  • Finds a free page
  • Reads the corresponding file data from disk (or from the page cache if it’s already cached)
  • Maps it into the process’s address space
  • Returns to the process

This is how executables are loaded. When you run /usr/bin/python3, the kernel doesn’t read the entire binary into memory. It maps the file and sets up VMAs. Then, as the CPU executes code, page faults fire and individual pages are read from disk on demand. If you never call a certain function, its code may never be loaded into RAM at all.

3. Valid Fault — Swapped Out Page (The “It’s In Storage, Let Me Retrieve It” Fault)

The page was once in RAM, but the kernel reclaimed it and wrote its contents to swap space (on disk). The PTE still exists but has the Present bit cleared, with the swap location encoded in its remaining bits.

The kernel:

  • Reads the page from swap
  • Allocates a new physical page
  • Copies the data in
  • Updates the page table
  • Returns to the process

4. INVALID Fault — Segmentation Fault (The “You Shall Not Pass” Fault)

The virtual address doesn’t fall within any valid VMA. The process is trying to access memory it was never given.

The kernel delivers the dreaded SIGSEGV — the segmentation fault signal. The process is usually killed.

This is the city’s police showing up. “Sir, this address doesn’t exist on any map. You don’t own property here. Please leave.”

```c
// Simplified page fault handling logic (illustrative, not real kernel code)
void handle_page_fault(unsigned long address) {
    struct vm_area_struct *vma = find_vma(current->mm, address);
    if (!vma || address < vma->vm_start) {
        // Invalid access — SIGSEGV!
        send_signal(SIGSEGV, current);
        return;
    }
    if (!(vma->vm_flags & VM_WRITE) && fault_is_write) {
        // Writing to read-only memory — SIGSEGV!
        send_signal(SIGSEGV, current);
        return;
    }
    // Valid fault — allocate a page, zero it, and map it
    struct page *page = alloc_page(GFP_HIGHUSER);
    clear_page(page);
    install_page_table_entry(address, page);
    // Process resumes, instruction retries
}
```

How Many Page Faults Are Normal?

A lot. Page faults aren’t errors — they’re a fundamental mechanism. A typical process startup involves thousands of page faults. They’re so common and so fast (when they don’t involve disk I/O) that they’re called minor page faults. When disk I/O is needed (reading from a file or swap), they’re major page faults — much more expensive.

You can see your system’s page fault stats with:

```bash
# Per-process
ps -o min_flt,maj_flt -p <PID>
# System-wide
vmstat 1
```

Chapter 5: The Buddy System — How the Kernel Manages Free Pages

Now, let’s zoom into how the kernel tracks which physical pages are free and how it allocates them efficiently.

The Problem

Imagine you’re the city planner, and you have millions of free plots of land. Processes come to you constantly asking for pages — sometimes one page, sometimes 16 contiguous pages, sometimes 1,024 contiguous pages. You need to:

  1. Find free pages fast
  2. Keep memory unfragmented — you want contiguous blocks available for large allocations
  3. Return freed pages to the pool efficiently

The Buddy Allocator

Linux uses a brilliant algorithm called the Buddy System. Here’s how it works:

The allocator organizes free pages into lists by order, where each order represents a power-of-two block size:

```text
Order 0:  blocks of 1 page     (4 KB)
Order 1:  blocks of 2 pages    (8 KB)
Order 2:  blocks of 4 pages    (16 KB)
Order 3:  blocks of 8 pages    (32 KB)
...
Order 10: blocks of 1024 pages (4 MB)   ← Maximum
```

Think of it like a lumber yard. You have stacks of pre-cut wood in standard sizes: 1-foot, 2-foot, 4-foot, 8-foot, and 16-foot lengths. When someone needs a 4-foot piece, you grab one from the 4-foot stack. Simple.

Allocation: Splitting

What if someone needs a block of order 2 (4 pages), but the order-2 list is empty? The allocator goes up:

  1. Check order 2 — empty.
  2. Check order 3 — found an 8-page block!
  3. Split the 8-page block into two buddies of 4 pages each.
  4. Give one to the requester.
  5. Put the other on the order-2 free list.

If order 3 was also empty, it would check order 4, split a 16-page block in half, put one half on order 3’s list, then split the other half, and so on. It’s recursive splitting.

Back to the lumber yard: no 4-foot pieces left? Grab an 8-foot piece and saw it in half. No 8-foot pieces? Grab a 16-foot piece, saw it in half to get two 8-footers, then saw one of those in half to get your 4-footer.

Deallocation: Coalescing

When a block is freed, the allocator checks: is my buddy also free? If yes, they merge back into a larger block. Then that larger block checks its buddy, and so on.

“My buddy” means the other half of the block I was split from. Two 4-page blocks that were split from the same 8-page parent are “buddies.”

```text
Allocation:
  [████████████████]           Order 4 (16 pages)
  [████████][████████]         Split → two Order 3 blocks
  [████████][████][████]       Split one → two Order 2 blocks
  [████████][████][██][██]     Split one → two Order 1 blocks
  Return one Order 1 block ──────────────┘

Deallocation (coalescing):
  Free [██] → check buddy [██] → buddy is free!
  Merge → [████] → check buddy [████] → buddy is free!
  Merge → [████████] → check buddy [████████] → buddy is NOT free.
  Stop. Place [████████] on the Order 3 free list.
```

This system is beautiful in its simplicity and does an excellent job of combating external fragmentation — the situation where you have enough total free memory but no contiguous block large enough for a request.

You can see the buddy system’s current state in:

```bash
cat /proc/buddyinfo
```

Output might look like:

```text
Node 0, zone   Normal   1204  789  543  312  198   87   42   18    6    2    1
```

Each number represents how many free blocks exist at that order level. The last column (order 10) shows you have 1 block of 4 MB free.


Chapter 6: The Slab Allocator — The Kernel’s Small Object Specialist

The buddy system works in whole pages (4 KB minimum). But the kernel constantly needs to allocate much smaller objects: a 192-byte struct dentry, a few-hundred-byte struct inode, a tiny 64-byte buffer.

Allocating a full 4 KB page for a 200-byte object wastes 3,896 bytes. That’s a 95% waste rate. Unacceptable.

Enter the Slab Allocator

The slab allocator sits on top of the buddy system. It grabs whole pages from the buddy allocator, then carves them up into small, fixed-size objects.

Imagine the buddy system is a wholesale warehouse that sells lumber in 4-foot minimum lengths. The slab allocator is a specialty shop that buys 4-foot pieces from the warehouse and cuts them into precisely-sized pegs, dowels, and brackets that customers actually need.

How It Works

The slab allocator creates caches — one for each type of object:

```text
kmem_cache: "task_struct"     → objects of a few KB each
kmem_cache: "inode_cache"     → objects of ~600 bytes each
kmem_cache: "dentry"          → objects of 192 bytes each
kmem_cache: "vm_area_struct"  → objects of ~200 bytes each
kmem_cache: "kmalloc-64"      → objects of 64 bytes each
kmem_cache: "kmalloc-128"     → objects of 128 bytes each
...
```

(Exact sizes vary with kernel version and configuration.)

Each cache contains one or more slabs — a slab is one or more contiguous pages divided into equal-sized slots:

textSlab (one or more pages)
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ obj  │ obj  │ obj  │ obj  │FREE  │FREE  │FREE  │FREE  │
│ used │ used │ used │ used │      │      │      │      │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘

When the kernel needs a new task_struct, it goes to the “task_struct” cache, finds a slab with a free slot, and returns a pointer. When the object is freed, the slot is simply marked available. No fragmentation, because every object in a cache is the same size.

The Three Implementations

Linux has actually had three slab allocator implementations over the years:

  1. SLAB — The original implementation. Feature-rich but complex. Removed in Linux 6.8.
  2. SLUB — The default since 2.6.23 and, as of Linux 6.8, the only remaining implementation. Simpler, faster, better scaling on multi-CPU systems.
  3. SLOB — A stripped-down version for embedded systems with tiny memory. Removed in Linux 6.4.

You can see slab statistics with:

```bash
cat /proc/slabinfo
# or, more readably:
slabtop
```

kmalloc — The Kernel’s General-Purpose Allocator

When kernel code needs memory of an arbitrary size (not a specific object type), it calls kmalloc(). This function uses a set of generic slab caches with power-of-two sizes:

```c
void *ptr = kmalloc(300, GFP_KERNEL);
// Allocates from the "kmalloc-512" cache (next power of 2)
// Wastes 212 bytes, but that's acceptable
kfree(ptr);  // Returns it to the cache
```

The waste from rounding up to a power of two is called internal fragmentation. It’s the trade-off for the speed and simplicity of the slab system.

vmalloc — When You Need a Lot But Don’t Care About Physical Contiguity

kmalloc gives you memory that’s physically contiguous — useful for DMA and hardware interactions. But for large allocations where physical contiguity doesn’t matter, there’s vmalloc.

vmalloc allocates individual pages (which may be scattered all over physical RAM) and then maps them into a contiguous range of virtual addresses in the kernel’s address space.

It’s like renting ten apartments scattered across the city but giving them all the same building name on your map. Logically, they feel like one big office — but physically, they’re all over the place.

```c
void *big_buffer = vmalloc(1024 * 1024);  // 1 MB
// Physically scattered pages, virtually contiguous
// Slower to allocate and access (TLB pressure)
vfree(big_buffer);
```

Chapter 7: Copy-on-Write — The Kernel’s Cleverest Optimization

This one is a personal favorite. It’s so clever it almost feels like cheating.

The Scenario

When a process calls fork() to create a child process, the child is supposed to be an exact copy of the parent — same code, same data, same stack, same everything.

The naïve approach: copy all of the parent’s memory to the child. If the parent uses 500 MB of RAM, allocate 500 MB for the child and copy every byte.

But this is absurdly wasteful. In most cases, fork() is immediately followed by exec(), which replaces the child’s entire memory with a new program anyway. All that copying? Completely pointless.

The Trick: Copy on Write (CoW)

Instead of copying, the kernel does something clever:

  1. Share everything. Both parent and child point to the same physical pages.
  2. Mark all shared pages as read-only in both processes’ page tables.
  3. Wait.

Both parent and child are given the same apartment keys. The kernel puts a sign on every room: “Look, don’t touch.”

Now, when either process tries to write to a shared page, a page fault occurs (because the page is marked read-only). The kernel’s fault handler recognizes this as a CoW fault and:

  1. Allocates a new physical page
  2. Copies the contents of the shared page to the new page
  3. Maps the new page into the writing process’s page table (now with read-write permissions)
  4. The other process keeps the original page

One roommate wants to repaint a wall. Fine — the kernel builds them their own copy of that room. The other roommate’s room stays untouched.

```text
Before fork():
  Parent: VAddr 0x1000 → Physical Page #42 (R/W)

After fork():
  Parent: VAddr 0x1000 → Physical Page #42 (R/O)  ← shared!
  Child:  VAddr 0x1000 → Physical Page #42 (R/O)  ← shared!

After child writes to 0x1000:
  Parent: VAddr 0x1000 → Physical Page #42 (R/W)  ← keeps original
  Child:  VAddr 0x1000 → Physical Page #99 (R/W)  ← new copy
```

This means fork() is nearly instant, regardless of how much memory the parent uses. The cost is deferred and spread out — and if the child never modifies a page, no copy is ever made.

CoW is used extensively throughout the kernel — not just for fork(), but for file mappings, zero pages, and more.


Chapter 8: The Page Cache — Befriending the Disk

Disk I/O is slow. Even SSDs are thousands of times slower than RAM. So the kernel does everything it can to avoid touching the disk.

How It Works

The page cache is a region of RAM that holds copies of data from files on disk. When you read a file, the data goes through the page cache:

```text
First read of file.txt:
  Process → Kernel → Disk → Page Cache → Process

Second read of file.txt:
  Process → Kernel → Page Cache → Process  ← Disk not touched!
```

The page cache is the city’s central library. The first time someone requests a book from the distant archives, a copy is made and placed in the library. The next time anyone requests that same book, they get the library copy — instant.

Here’s the beautiful part: the page cache is dynamic. It grows to fill all “unused” RAM. On a system with 16 GB of RAM and 4 GB used by processes, the page cache might use 10 GB or more. This is why your Linux system often shows very little “free” memory — it’s being used productively as cache.

```bash
free -h
#               total    used    free   shared  buff/cache  available
# Mem:           16G     3.8G    512M    256M      11.7G       11.5G
```

That “512M free” might alarm you, but “11.5G available” means the kernel can reclaim cache memory instantly if processes need it. Cache memory is not wasted memory — it’s memory being useful while it waits.

Dirty Pages and Writeback

When a process writes to a file, the write goes to the page cache (fast!) and the page is marked dirty. The actual disk write happens later, asynchronously, by kernel threads called kworker or flush threads.

This is called write-back caching, and it makes writes appear near-instantaneous. The downside? If the system crashes before dirty pages are written back, data can be lost. This is why fsync() exists — it forces dirty pages to disk.


Chapter 9: Memory Reclamation — When the City Runs Out of Land

Everything we’ve described so far works beautifully when there’s plenty of free memory. But what happens when RAM gets scarce?

This is where the kernel becomes a ruthless but fair landlord.

The Watermark System

Each memory zone has three watermark levels:

```text
HIGH ─── ✅ "Plenty free. kswapd can go back to sleep."
         │
LOW  ─── ⚠️ "Getting low. Wake kswapd to reclaim in the background."
         │
MIN  ─── 🚨 "Critical. Allocations stall and reclaim directly."
```

When free memory drops below the LOW watermark, the kernel wakes up kswapd, a kernel thread (one per NUMA node) whose sole purpose is to reclaim memory. If free memory falls all the way below MIN, allocating processes must pause and reclaim memory themselves (direct reclaim).

What Gets Reclaimed?

The kernel uses several strategies:

1. Reclaiming Clean Page Cache Pages

Pages that are copies of disk data and haven’t been modified. They can be dropped instantly — the data can always be re-read from disk.

“This library book hasn’t been annotated. We can return it and borrow it again if needed.”

2. Writing Back and Reclaiming Dirty Page Cache Pages

Modified file data must be written to disk first, then the pages can be freed. This is slower.

3. Swapping Out Anonymous Pages

Pages that belong to process heaps and stacks (not file-backed) can be written to swap space — a partition or file on disk that acts as overflow RAM.

“We’re running out of apartments. Some residents who haven’t been seen in a while are being moved to temporary housing outside the city (swap). If they come back, we’ll move them back in.”

The LRU Lists: Who Gets Evicted?

How does the kernel decide which pages to reclaim? It uses an approximation of LRU — Least Recently Used. Pages that haven’t been accessed recently are more likely to be evicted.

The kernel maintains two sets of LRU lists:

  • Active list: Pages that have been accessed recently. These are “hot.”
  • Inactive list: Pages that haven’t been accessed in a while. These are “cold” and are candidates for reclamation.

Pages flow between these lists:

```text
               accessed (promote)
New page → [Inactive List] ──────────→ [Active List]
                │   ↑                       │
                │   └── demoted when the ───┘
                ↓       active list grows too big
        Reclaimed / Evicted
```

A page starts on the inactive list. If it’s accessed, it gets promoted to the active list. If the active list gets too big, its oldest pages are demoted back to the inactive list. Pages that sit on the inactive list without being accessed eventually get reclaimed.

The kernel also distinguishes between file-backed pages (page cache) and anonymous pages (heap, stack). The swappiness parameter (default: 60) controls the balance:

```bash
cat /proc/sys/vm/swappiness
# 60  = balanced (default)
# 0   = strongly prefer reclaiming file pages; avoid swapping
# 100 = treat file and anonymous pages equally
```

Chapter 10: The OOM Killer — The Kernel’s Nuclear Option

What happens when all reclamation strategies fail? When there’s no more cache to drop, swap is full, and processes keep demanding more memory?

The kernel reaches for its most controversial tool: the OOM Killer (Out-Of-Memory Killer).

The Story

It’s a dark day in the city. Every plot of land is occupied. The warehouses (swap) are overflowing. Citizens are lined up demanding space, and there’s nowhere to put them. The city is about to grind to a halt.

The city has a last-resort protocol: identify the citizen causing the most strain and… evict them. Permanently.

The OOM Killer selects a process and kills it to free memory. It’s brutal, but the alternative — a completely frozen, unresponsive system — is worse.

How It Chooses a Victim

The kernel calculates an OOM score for every process, based on:

  1. Memory usage: Processes using more memory get higher scores (more likely to be killed).
  2. The process’s oom_score_adj: A tunable value from -1000 to +1000. Critical services can set this to -1000 (never kill me) while unimportant processes can be set to higher values.
  3. Root processes get a slight discount (they’re assumed to be more important).
  4. Kernel threads are never killed.

You can see any process’s OOM score:

cat /proc/<PID>/oom_score
cat /proc/<PID>/oom_score_adj

To protect a critical process (like your database):

echo -1000 > /proc/<PID>/oom_score_adj
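Combining the two files gives a quick triage tool. This sketch ranks processes by the kernel's own badness score, i.e. the order in which the OOM Killer would consider them:

```shell
# List the five processes the OOM Killer would target first, by their
# kernel-computed oom_score (higher score = more likely victim).
for pid in /proc/[0-9]*; do
  score=$(cat "$pid/oom_score" 2>/dev/null) || continue  # process may have exited
  comm=$(cat "$pid/comm" 2>/dev/null)
  printf '%8s  %6s  %s\n' "$score" "${pid#/proc/}" "$comm"
done | sort -rn | head -5
```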

The OOM Killer’s Log

When the OOM Killer strikes, it leaves evidence in dmesg:

[  372.145] Out of memory: Killed process 8423 (chrome)
            total-vm:4235680kB, anon-rss:2148364kB, file-rss:0kB

The police report reads: “Process 8423 (chrome), consuming 2 GB of RAM, was terminated to restore system stability.”

Overcommit: The Root Cause

Remember how the kernel can overcommit — promise more memory than exists? This is governed by:

cat /proc/sys/vm/overcommit_memory
# 0 = Heuristic (default): kernel guesses if it's safe to overcommit
# 1 = Always overcommit: never say no to malloc (dangerous!)
# 2 = Never overcommit: strict accounting, malloc fails if not enough

In mode 2 (strict), the kernel will never promise more than swap + (RAM × overcommit_ratio/100). This means malloc can fail (returning NULL), but the OOM Killer should almost never need to fire.

Most systems use mode 0, which is a reasonable balance. The kernel allows some overcommit (because most allocated memory is never fully used) but tries to avoid extreme situations.
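You can reproduce the strict-mode arithmetic and compare it with the kernel's own figure. Note the kernel also subtracts explicit hugetlb reservations from RAM before applying the ratio, so on systems with reserved huge pages the two numbers differ slightly:

```shell
# Recompute the mode-2 commit limit: swap + RAM * overcommit_ratio / 100.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
swap_kb=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
ratio=$(cat /proc/sys/vm/overcommit_ratio)
echo "computed limit: $(( swap_kb + mem_kb * ratio / 100 )) kB"
# The kernel's own accounting: the limit, and what is currently promised.
grep -E '^(CommitLimit|Committed_AS):' /proc/meminfo
```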


Chapter 11: Huge Pages — When 4 KB Isn’t Enough

For processes with massive memory footprints — databases, scientific computing, virtual machines — the overhead of managing millions of 4 KB pages becomes significant:

  • Page table bloat: A 1 GB dataset needs 262,144 page table entries.
  • TLB pressure: The TLB can only cache a few thousand entries. With tiny pages, TLB misses are frequent.

The Solution: Huge Pages

Linux supports huge pages — 2 MB (and on some systems, 1 GB) pages. A single 2 MB huge page replaces 512 regular pages, dramatically reducing:

  • Page table entries (512× fewer)
  • TLB misses (each TLB entry covers 512× more memory)
  • Page fault overhead (fewer faults needed)

Regular pages:   1 GB = 262,144 × 4 KB pages
Huge pages:      1 GB = 512 × 2 MB pages
Giant pages:     1 GB = 1 × 1 GB page
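The page counts above check out with simple shell arithmetic (all sizes in KB):

```shell
echo $(( 1024 * 1024 / 4 ))     # 4 KB pages in 1 GB  -> 262144
echo $(( 1024 * 1024 / 2048 ))  # 2 MB pages in 1 GB  -> 512
echo $(( 2048 / 4 ))            # 4 KB pages replaced by one 2 MB page -> 512
```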

Transparent Huge Pages (THP)

Linux has two huge page mechanisms:

  1. Explicit Huge Pages (hugetlbfs): Reserved ahead of time (at boot, or at runtime via /proc/sys/vm/nr_hugepages) and managed manually. Used by databases like Oracle and PostgreSQL.
  2. Transparent Huge Pages (THP): The kernel automatically promotes groups of regular pages into huge pages, without application changes. Enabled by default on most systems.

cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never
# Check huge page stats
grep -i huge /proc/meminfo

THP is controversial. It improves performance for many workloads but can cause latency spikes during compaction (when the kernel rearranges memory to create contiguous 2 MB blocks). Some database administrators disable it for this reason.
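To check whether THP is actually helping a given process, sum the AnonHugePages counters in its smaps file; each VMA line there reports how much of that region is backed by huge pages. A read-only sketch against the reading process itself:

```shell
# Total anonymous memory of this process currently backed by
# transparent huge pages, summed across all of its VMAs.
awk '/^AnonHugePages:/ {sum += $2} END {print sum + 0, "kB"}' /proc/self/smaps
```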


Chapter 12: NUMA — When Memory Has Geography

On modern multi-socket servers, not all RAM is equal. In a NUMA (Non-Uniform Memory Access) system, each CPU socket has its own local memory, and accessing remote memory (attached to another socket) is slower.

┌────────────────────────────────────┐
│            NUMA System             │
│                                    │
│  ┌──────────┐     ┌──────────┐    │
│  │  CPU 0   │     │  CPU 1   │    │
│  │ Socket 0 │     │ Socket 1 │    │
│  └────┬─────┘     └────┬─────┘    │
│       │                │          │
│  ┌────┴─────┐     ┌────┴─────┐    │
│  │ Memory   │─────│ Memory   │    │
│  │ Node 0   │ QPI │ Node 1   │    │
│  │ (local)  │     │ (remote) │    │
│  └──────────┘     └──────────┘    │
└────────────────────────────────────┘
CPU 0 accessing Node 0: ~70 ns (fast - local)
CPU 0 accessing Node 1: ~130 ns (slow - remote, via QPI)

It’s like a city with two neighborhoods separated by a river. You can cross the bridge, but it takes twice as long as walking to your local store.

The kernel’s memory allocator is NUMA-aware. When a process running on CPU 0 requests memory, the kernel tries to allocate from Node 0 first. This is called local allocation policy.

Linux provides tools and system calls to control NUMA behavior:

numactl --hardware    # Show NUMA topology
numactl --membind=0 ./myapp   # Force memory allocation on node 0
numastat              # NUMA statistics
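The same topology is visible directly in sysfs, with no extra tools installed: each NUMA node gets its own directory containing a per-node meminfo. On a single-socket machine you will see only node0:

```shell
# Per-node memory totals straight from sysfs.
for node in /sys/devices/system/node/node*; do
  echo "=== ${node##*/} ==="
  grep -E 'MemTotal|MemFree' "$node/meminfo"
done
```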

Chapter 13: Memory Compaction — Defragmenting the City

Over time, even with the buddy system, physical memory can become fragmented. You might have 100 MB of free memory, but no contiguous block larger than 4 KB. This makes it impossible to satisfy requests for huge pages or large kernel allocations.

The Compactor

The kernel has a compaction mechanism that works like a defragmenter:

  1. Start from both ends of a memory zone
  2. From the bottom: find movable pages (user-space pages whose page table entries can be updated)
  3. From the top: find free pages
  4. Move the movable pages to the free spaces at the top
  5. The bottom now has large contiguous free blocks

Before compaction:
  [USED][FREE][USED][FREE][USED][FREE][USED][FREE]
After compaction:
  [USED][USED][USED][USED][FREE][FREE][FREE][FREE]
                          ↑
                   Large contiguous block!

The city planner politely asks some residents to move to different apartments (same quality, same size) so that a large empty plot forms for a new park.

Compaction runs automatically when the kernel needs huge pages but can’t find contiguous blocks. You can also trigger it manually:

echo 1 > /proc/sys/vm/compact_memory
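You can gauge fragmentation before and after compaction from /proc/buddyinfo: each zone's row lists how many free blocks exist at every order, from order 0 (4 KB) on the left up to order 10 (4 MB) on the right. Many small blocks and zeros in the right-hand columns mean the zone is fragmented. The column position used below assumes the default maximum order of 10:

```shell
# Free-block counts per order for every zone.
cat /proc/buddyinfo
# Are any order-9 (2 MB) blocks left for huge pages? With the default
# max order of 10, order 9 is the second-to-last column.
awk '{print $2, $4, "order-9 free blocks:", $(NF-1)}' /proc/buddyinfo
```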

Chapter 14: Memory cgroups — Resource Control in Containers

In the era of containers and cloud computing, you can’t let one container consume all the host’s memory. Linux uses memory cgroups (cgroup v2) to set limits:

# Create a memory-limited cgroup
mkdir /sys/fs/cgroup/mygroup
echo "500M" > /sys/fs/cgroup/mygroup/memory.max
echo "400M" > /sys/fs/cgroup/mygroup/memory.high
# Add a process to it
echo $PID > /sys/fs/cgroup/mygroup/cgroup.procs

  • memory.max: Hard limit. If the group exceeds this, the OOM killer targets processes within the group — not system-wide.
  • memory.high: Soft limit. The kernel throttles allocations and applies memory pressure.
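Once limits are in place, the group's memory.events file shows how often they actually bit: the high and max counters increment each time a limit is hit, and oom_kill counts processes killed inside the group. The mygroup path follows the example above:

```shell
# Per-group pressure history (cgroup v2). Fields include:
#   high     - times allocations were throttled at memory.high
#   max      - times the group hit memory.max
#   oom_kill - processes the OOM Killer killed inside this group
cat /sys/fs/cgroup/mygroup/memory.events
```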

Each neighborhood in the city gets a land budget. If the residents of neighborhood A get greedy, only their own citizens get evicted — not the entire city.

This is the backbone of Docker and Kubernetes memory limits.


Chapter 15: Observing Memory — The Kernel’s Open Books

One of Linux’s greatest strengths is its transparency. The kernel exposes an extraordinary amount of memory information:

/proc/meminfo — The Master Dashboard

cat /proc/meminfo

MemTotal:       16384000 kB    # Total physical RAM
MemFree:          524288 kB    # Truly free (unused)
MemAvailable:   12000000 kB    # Available for new allocations
Buffers:          256000 kB    # Disk buffer cache
Cached:          9500000 kB    # Page cache
SwapCached:        45000 kB    # Swap pages also in RAM
Active:          6000000 kB    # Recently accessed pages
Inactive:        5500000 kB    # Not recently accessed
SwapTotal:       8000000 kB    # Total swap space
SwapFree:        7800000 kB    # Free swap
Dirty:              1200 kB    # Pages waiting for disk write
AnonPages:       3500000 kB    # Non-file-backed pages
Mapped:           800000 kB    # mmap'd file pages
Slab:             600000 kB    # Kernel slab allocator
SReclaimable:     450000 kB    # Reclaimable slab memory
PageTables:        50000 kB    # Page table memory
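The most common reading mistake is treating MemFree as "memory left." MemAvailable is the number that matters: it adds back the page cache and reclaimable slab that the kernel would drop on demand. A sketch of the gap:

```shell
# The difference between MemAvailable and MemFree is memory that looks
# "used" but is reclaimable on demand (mostly page cache and slab).
awk '/^MemFree:/      {free = $2}
     /^MemAvailable:/ {avail = $2}
     END {print "free:", free, "kB   available:", avail, "kB   gap:", avail - free, "kB"}' /proc/meminfo
```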

/proc/<PID>/maps — A Process’s Memory Map

cat /proc/self/maps

55a8c1000000-55a8c1002000 r--p 00000000 08:01 1234  /usr/bin/bash
55a8c1002000-55a8c10a0000 r-xp 00002000 08:01 1234  /usr/bin/bash
55a8c10a0000-55a8c10d0000 r--p 000a0000 08:01 1234  /usr/bin/bash
55a8c1200000-55a8c1280000 rw-p 00000000 00:00 0     [heap]
7f8a20000000-7f8a24000000 rw-p 00000000 00:00 0     [anon]
7ffee8800000-7ffee8821000 rw-p 00000000 00:00 0     [stack]

Every line is a VMA — a region of the process’s virtual address space. You can see the permissions (r/w/x), whether it’s private or shared (p/s), and what file it maps.
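Since each line is one VMA, counting lines counts VMAs, and a filter picks out the named special regions. Run against the reading process itself:

```shell
# How many VMAs this process has, and its named special regions.
wc -l < /proc/self/maps
grep -E '\[(heap|stack|vdso)\]' /proc/self/maps
```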

Other Useful Tools

vmstat 1          # System-wide VM statistics, updated every second
slabtop           # Real-time slab cache usage
pmap <PID>        # Process memory map with sizes
smem              # Per-process memory with shared memory accounting
perf stat -e page-faults ./myapp   # Count page faults

Epilogue: The Symphony of Memory

If you’ve made it this far, you’ve journeyed through one of computing’s most sophisticated systems. Let’s zoom out and see it all working together:

  1. A process calls malloc(4096).
  2. The C library extends the process’s heap (or calls mmap).
  3. The kernel creates a VMA but allocates no physical memory (lazy allocation).
  4. The process writes to the new address.
  5. The CPU can’t translate it → page fault.
  6. The kernel’s fault handler checks: valid VMA? Yes.
  7. The kernel asks the buddy system for a free page from the appropriate zone.
  8. If no free pages → kswapd wakes up, checks LRU lists, reclaims page cache or swaps out anonymous pages.
  9. If still no memory → direct reclaim, then compaction, and in the worst case, OOM killer.
  10. The page is found/freed, zeroed out for security, and mapped into the page table.
  11. The TLB is updated.
  12. The process continues, oblivious to the thousands of lines of kernel code that just executed.
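Lazy allocation is visible from the outside: a process's virtual size (VmSize, everything its VMAs promise) is normally much larger than its resident size (VmRSS, the pages it has actually faulted in). Checking the shell that reads this:

```shell
# Promised (virtual) vs. actually faulted-in (resident) memory.
grep -E '^Vm(Size|RSS):' /proc/self/status
```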

All of this happens in microseconds. Thousands of times per second. For hundreds of processes simultaneously. On systems running for years without rebooting.


Summary: Key Takeaways

Concept             Simple Explanation
Virtual Memory      Every process gets its own private, fake address space
Pages               Memory is managed in 4 KB chunks
Page Tables         Multi-level lookup tables translating virtual → physical
TLB                 CPU cache for fast address translation
Page Faults         Not errors — they’re how memory gets allocated on demand
Buddy System        Free pages organized in power-of-two blocks
Slab Allocator      Efficient allocation of small kernel objects
Copy-on-Write       Share pages until someone writes; only then copy
Page Cache          RAM used to cache file data from disk
LRU Lists           Track which pages were used recently (for eviction decisions)
Swap                Overflow memory on disk for when RAM is full
OOM Killer          Last resort: kill a process to free memory
Huge Pages          2 MB/1 GB pages to reduce overhead for large workloads
NUMA                Memory has locality; local access is faster
Memory cgroups      Per-container/per-group memory limits

The Linux kernel’s memory management system is a masterpiece of engineering — a system that juggles scarcity, security, performance, and fairness, all while remaining invisible to the applications it serves. It’s the silent city planner that never sleeps, never takes a day off, and somehow keeps millions of demanding citizens happy.

And now, you know how it works.
