x86 4-level and 5-level pagetable on Linux

Hongyi Lu (jwnhy) Aug 03, 2023

Theory

Intel supports 5-level paging, supporting over 128 PB of memory. However, this makes implementation of Linux a bit werid. So I write this to help myself remember. The kernel I use is Linux 6.1.38.

Overall, there is also a 5-level structure in Linux as follows.

CR3 (128PB) -> pgd (256TB) -> p4d (512GB) -> pud (1GB) -> pmd (2MB) -> pte (4KB)

Interestingly, structures like struct pgd_t are in fact entries inside pagetables. For instance, the following code returns the corresponding pgd entry for a specific address.

static inline pgd_t *pgd_offset_pgd(pgd_t *pgd, unsigned long address)
{
    return (pgd + pgd_index(address));
};

However, this causes a divergence in terms of semantics of *_offset functions.

For pgd_t and pgd_t only, the pgd_offset_* functions perform same level offseting (pgd_t -> pgd_t).

For other levels, the *_offset function return the pagetable for next level (e.g., pgd_t -> p4d_t for p4d_offset())

Subject to the exact configuration, p4d and pud may not exist, but pgd always exists. In the cases, where p4d or pud do not exist. Their macros are replaced with dummy implementation.

For example, the macro p4d_offset is as follows. When only 4-level paging is activated, it directly returns pgd (converted to p4d);

static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
{
    if (!pgtable_l5_enabled())
        return (p4d_t *)pgd;
    // Note `*pgd` is used here, extracting the `pgd` entry.
    return (p4d_t *)pgd_page_vaddr(*pgd) + p4d_index(address);
}

Similarly for macros like p4d_index, these dummy macros just simply return 0 when p4d is folded. This is presumably for a unified implementation of pagetable walking (same implementation for {3,4,5}-level paging).

The following picture shows how Linux uses different structures to handle pagetable hierarchy, epspecially p4d = (p4d_t*)(*pgd) + p4d_index(addr).

Particularly, when only 4-level paging is enabled, the p4d_index(*) always returns 0, that is pgd directly points to different pud.

Practice

The reason I dig into this is that I need to map an unused part of virtual address space of Linux and use it for my own purpose.

Specifically, I want to use the 2TB hole from fffffc0000000000 to fffffdffffffffff, and this should be shared between all kernel thread, meaning it should be injected into init_mm, the address space of init process.

This design only works for 4-level paging w.o. Kernel Pagetable Isolation (KPTI).

pagetable.drawio.png

In practice here, we only care about 4-level paging, meaning p4d’s are always folded as shown in above figure.

void init_x(void) {
  /* other code */
  top_pgd = init_mm.pgd;
  pgd = pgd_offset_pgd(top_pgd, addr);
  p4d = p4d_offset(pgd, addr);   // dummy transition

  /* we assume 4 page level, pgd = p4d*/
  BUG_ON(p4d != (p4d_t *)pgd);
  nr_pgd = (MOAT_END - MOAT_START) >> 39;
  for (i = 0; i < nr_pgd; i++) {
    BUG_ON(!moat_pud_alloc(&init_mm, p4d, addr));
    addr = pgd_addr_end(addr, MOAT_END);
    pgd = pgd_offset_pgd(top_pgd, addr);
    p4d = p4d_offset(pgd, addr); // dummy transition.
  }
}

After this function is executed, four additional pud are allocated, pgd[504-507] pointing to four different pud table.