
Virtualizing Memory

The OS ultimately lets user programs pretend they each have one contiguous, large piece of memory all to themselves, called an address space. Say this space spans [START, END]. Programs (roughly) store code & static data at the front of this address space [START, CODE_END], followed by a heap region [CODE_END, ..] that grows dynamically toward higher addresses, and a stack region [.., END] that grows dynamically toward lower addresses.

Importantly, only the OS and hardware know the actual physical addresses; if you print pointers from a C program, you will only ever see virtual addresses. This is an important part of isolation.
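
You can see this layout (virtually, at least) by printing a few pointers. A minimal sketch; the exact values vary per run and platform (especially with ASLR, discussed later), but all of them are virtual addresses:

#include <stdio.h>
#include <stdlib.h>

int main() {
	int stack_var;                             // lives on the stack
	int *heap_var = (int *) malloc(sizeof(int)); // lives on the heap
	printf("code : %p\n", (void *) main);      // code region
	printf("heap : %p\n", (void *) heap_var);  // heap region
	printf("stack: %p\n", (void *) &stack_var); // stack region
	free(heap_var);
	return 0;
}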

Memory API

Stack memory is allocated (and deallocated) automatically:

void func() {
	int x; // allocates int on the stack
} // deallocates after function call return

While the heap is allocated explicitly:

void func() {
	// malloc: allocate memory on the heap
	// sizeof: compile-time operator that returns size in bytes
	// (int *): cast from void pointer to integer pointer
	// int *x =: allocate pointer on the stack
	int *x = (int *) malloc(sizeof(int));
	...
	// free takes the exact pointer returned from `malloc`
	// and deallocates the entire region
	free(x);
}

Random tips:

  1. Don't forget the null terminator at the end of strings: use malloc(strlen(s) + 1). (Or better yet, use strdup.)
  2. Use calloc to allocate and also automatically initialize to zeroes.
  3. Use realloc to resize an existing memory region. (Tips 2 and 3 are sketched below.)
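
A small sketch of tips 2 and 3 together, growing a zero-initialized array (note that realloc may move the region, so always use the pointer it returns):

#include <stdlib.h>

int *grow_array(void) {
	int *a = (int *) calloc(10, sizeof(int));      // 10 ints, all zeroed
	if (a == NULL) return NULL;
	int *b = (int *) realloc(a, 20 * sizeof(int)); // resize; contents preserved
	if (b == NULL) { free(a); return NULL; }       // on failure, `a` is still valid
	return b;                                      // the region may have moved
}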

Under the hood, the C library uses system calls like brk and sbrk to change the program's break: the location of the end of the heap. Don't call these yourself as a programmer; you'll interfere with the malloc library's bookkeeping.

You can also use mmap to create anonymous memory regions (backed by swap space rather than by a file).

Homework

Check out notes on gdb and valgrind.

Address Translation

Our first attempt at virtualizing memory will be a bit primitive, but analogous to LDE (limited direct execution). It's called dynamic relocation. In this technique, each CPU needs two hardware registers:

  • base register: the offset at which the process's memory starts
  • bounds register: the limit at which the process's memory ends (or equivalently, the size of the region)

The part of the processor chip that helps with address translation is called the memory management unit (MMU). With these registers set, address translation happens purely in hardware while executing the process's instructions: when the CPU sees some address X in user mode, it automatically uses BASE + X. The CPU also needs to be able to raise out-of-bounds exceptions when a process attempts to access memory outside its range (i.e. if X >= BOUNDS).

The base/bounds registers are also something that gets stored in the process struct, and they need to be stored/restored upon context switches.
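
What the MMU does on every user-mode access can be sketched in a few lines of C; `base` and `bounds` stand in for the hardware registers, and exit() stands in for raising an exception:

#include <stdio.h>
#include <stdlib.h>

// Sketch of per-access dynamic relocation. In real hardware this
// check-and-add is done by the MMU, not software.
unsigned long translate(unsigned long virt,
                        unsigned long base, unsigned long bounds) {
	if (virt >= bounds) {
		// hardware raises an exception; the OS handler
		// usually kills the offending process
		fprintf(stderr, "out of bounds: 0x%lx\n", virt);
		exit(1);
	}
	return base + virt; // the physical address
}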

To summarize, the hardware requirements are:

  1. a privileged mode
  2. base/bounds registers
  3. ability to translate virtual addresses & check within bounds
  4. privileged instructions to update base/bounds
  5. privileged instructions to register exception handlers
  6. ability to raise exceptions

And the operating system requirements are:

  1. memory management: need to allocate & reclaim memory for new & terminated processes
  2. base/bounds management: setting base/bounds on context switch
  3. exception handling: code to run when exceptions arise

Unfortunately this primitive idea, while efficient from a processing perspective (most of the work is offloaded to hardware), is not an efficient use of memory. Specifically it suffers from internal fragmentation: there is wasted memory inside each process's allocated unit (all the unused space between each process's heap & stack).

Segmentation

The main advance of segmentation is to generalize base & bounds so that we store one base/bounds pair per segment. Now instead of a fixed memory allocation per process, a process can take up an arbitrarily small or large piece of memory, and each of its code/stack/heap segments can be placed anywhere in physical memory. Note that malloc may now be harder to service if the heap segment ends up backed up against another region.

  • The virtual address can use its top two bits to select the segment (00 code, 01 heap, 11 stack), with the remaining bits giving the offset into that segment (see the sketch after this list).
  • The main problem with this segmentation approach is allocating memory in ways that don't result in external fragmentation: small holes of free memory between allocated segments, too small to be useful.
  • When this happens, the OS can compact the memory, rearranging the pieces so that you get a larger unused region of memory, but this is very costly.
  • This is a general problem, and many algorithms exist for free-list management that try their best to keep large regions of memory available for allocation.
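
A sketch of that translation, assuming 14-bit virtual addresses with the top two bits selecting the segment; the base/bounds values are illustrative, and the extra handling a downward-growing stack segment needs is ignored:

#include <stdio.h>
#include <stdlib.h>

#define SEG_SHIFT 12
#define OFF_MASK  0x0FFF  // bottom 12 bits: offset within the segment

// Illustrative per-segment registers (index 0 = code, 1 = heap,
// 3 = stack; 2 is unused in this encoding).
unsigned long seg_base[4]   = {0x1000, 0x4000, 0x0, 0x8000};
unsigned long seg_bounds[4] = {0x0800, 0x0800, 0x0, 0x0800};

unsigned long translate(unsigned long virt) {
	unsigned long seg = virt >> SEG_SHIFT; // top two bits (virt fits in 14 bits)
	unsigned long off = virt & OFF_MASK;
	if (off >= seg_bounds[seg]) {
		fprintf(stderr, "segmentation fault\n"); // hence the name
		exit(1);
	}
	return seg_base[seg] + off;
}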

Free Memory Management

  • Basic mechanisms include a linked list of free regions storing {offset, size} and operations like split (dividing a node in two when memory is allocated) and coalesce (merging adjacent free regions when freeing); see the sketch after this list.
  • Basic strategies include best fit (smallest region that fits the request), worst fit (the largest region), first fit (the first region that fits), and next fit (like first fit, but starting where the previous search left off)
  • More interesting techniques:
    • Segregated lists: useful for popular-sized requests, the allocator can maintain two lists, one that's pre-split into the fixed size, and a general list for other requests.
    • Buddy allocation: keep dividing the space by two, building a binary tree of memory regions, until another division would be too small. Freeing is easy too; check if the buddy region (the binary splitting naturally makes adjacent pairs) is free, and if so merge them both together. Keep doing this recursively for the bigger buddy chunks.
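
A minimal free-list sketch with both operations; a real allocator also stashes each allocated chunk's size in a header so free can find it later:

#include <stddef.h>
#include <stdlib.h>

typedef struct node {
	size_t offset, size;   // a free region: [offset, offset + size)
	struct node *next;
} node_t;

// split: carve `want` bytes off the front of a free region,
// shrinking the node in place. Returns the allocated address.
size_t split(node_t *n, size_t want) {
	size_t addr = n->offset;
	n->offset += want;
	n->size   -= want;
	return addr;
}

// coalesce: merge a node with its successor when they are adjacent.
void coalesce(node_t *n) {
	if (n->next && n->offset + n->size == n->next->offset) {
		node_t *dead = n->next;
		n->size += dead->size;
		n->next  = dead->next;
		free(dead);
	}
}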

Paging

  • Instead of the variable-sized pieces of memory used in segmentation, paging uses fixed-size pieces called pages. A process's virtual address space is made of contiguous virtual pages, while physical memory is split into same-sized slots called frames.
  • A process's pages can now be scattered all over physical memory, and the OS keeps a per-process page table mapping virtual pages to physical frames.
  • The page table takes up a lot of memory (at least naively), so it is stored in memory as well, not in the MMU or anywhere else on chip.
  • Because it's in memory, address translation is now more costly, requiring a memory lookup for each translation (as opposed to the simple base/bounds translation that was purely in hardware).
  • Again, the top bits of the virtual address determine the page (the VPN), and the rest determine the offset into that page (see the sketch after this list).
  • The page table entry (PTE) contains not only the physical frame number, but also a number of other useful bits (dirty, present-in-memory, read/write/execute permissions, a valid bit specifying whether the page is actually allocated and accessible).
  • This leaves two issues to solve: 1) expensive memory lookups for translations and 2) large memory footprint of the page table.
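
The per-access translation referenced above, as a sketch: 32-bit addresses, 4KB pages, and a linear page table indexed by VPN. The PTE bit layout here is illustrative:

#include <stdint.h>

#define PAGE_SHIFT  12          // 4KB pages
#define OFFSET_MASK 0xFFFu
#define PTE_VALID   (1u << 0)   // illustrative bit positions
#define PFN_SHIFT   12

// Returns the physical address, or 0 to stand in for raising a fault.
uint32_t translate(const uint32_t *page_table, uint32_t virt) {
	uint32_t vpn    = virt >> PAGE_SHIFT;  // top 20 bits
	uint32_t offset = virt & OFFSET_MASK;  // bottom 12 bits
	uint32_t pte    = page_table[vpn];     // the extra memory access!
	if (!(pte & PTE_VALID))
		return 0;                          // fault: page not mapped
	uint32_t pfn = pte >> PFN_SHIFT;
	return (pfn << PAGE_SHIFT) | offset;
}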

TLB (Translation Lookaside Buffer)

  • As mentioned above, going to memory for each VPN->PFN translation is extremely costly. So instead, we use the TLB as an address translation cache directly on the MMU hardware. When a process accesses a page whose translation is in the TLB, there's no page-table memory access and the translation is very fast.
  • Since the VPN->PFN translation is different per process, we need to manage context switches. A simple solution is to just flush the cache on a switch (by setting all valid bits to 0).
  • To avoid this context-switch inefficiency, TLB entries can contain an ASID (Address Space Identifier) so that the TLB can serve multiple processes at a time (see the lookup sketch after this list). There can even be multiple entries pointing at the same frame, e.g. for shared libraries.
  • CISC (Complex Instruction Set Computing) instruction sets have lots of powerful instructions, and these systems can manage TLB misses directly in hardware. In contrast, RISC (Reduced Instruction Set Computing) instruction sets have a small number of simple instructions, and leave TLB miss handling to the OS.
  • Cache replacement policies like LRU (Least Recently Used) can be good, but have degenerate edge cases:
    • Repeatedly loop over an array of size n+1 with a cache of size n: every access will be a TLB miss.
  • Random replacement is much simpler to implement, and avoids degenerate edge cases like that.
  • Frequently exceeding the TLB coverage can cause big performance problems. To mitigate this, the OS can support larger page sizes, and programs can increase their effective TLB coverage by mapping key data structures into regions of their address space backed by those larger pages. This is commonly done by DBMSs (Database Management Systems).
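
The ASID-aware lookup referenced above, as a sketch (the structure and names are illustrative, not a real hardware interface): an entry hits only if both the VPN and the current ASID match, which is what lets entries from multiple processes coexist.

#include <stdbool.h>
#include <stdint.h>

#define TLB_SIZE 64

typedef struct {
	uint32_t vpn, pfn;
	uint16_t asid;  // which address space the entry belongs to
	bool valid;
} tlb_entry_t;

tlb_entry_t tlb[TLB_SIZE];

// Hit: translation found with zero memory accesses.
// Miss: walk the page table in memory, then install an entry.
bool tlb_lookup(uint32_t vpn, uint16_t asid, uint32_t *pfn_out) {
	for (int i = 0; i < TLB_SIZE; i++) {
		if (tlb[i].valid && tlb[i].vpn == vpn && tlb[i].asid == asid) {
			*pfn_out = tlb[i].pfn;
			return true;
		}
	}
	return false;
}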

Paging: Smaller Tables

  • As discussed earlier, a naive linear page table can take up a lot of memory.
    • E.g. a 32-bit address space with 4KB pages and a 4-byte page table entry results in a 4MB page table. And remember we need one per process!
  • Some solutions:
    • Bigger pages: simple but suffers again from internal fragmentation, as a large page may be allocated when only a small piece of it was needed.
    • Hybrid: paging and segmentation, with base/bounds registers per segment (code, heap, stack) and paging within each. Complicated, and it reintroduces problems: the per-segment page tables are now variable-sized, bringing back external fragmentation.
    • Multi-level page tables: Instead of a linear page table, use a 2-level (or more) tree, i.e. page the page table itself. The top level is the page directory, made of page directory entries (PDEs). Each PDE points to a page of page table entries. If a PDE is valid, at least one PTE in the page it points to is valid. If not, no valid pages exist under that address range and that page of the page table does not need to be allocated at all.
      • Pro: way better memory footprint.
      • Con: a TLB miss now needs two memory lookups, one for the PDE and one for the PTE, plus one more per additional level of the tree (see the sketch below).
      • Used in x86.
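
A sketch of a two-level walk on a TLB miss, assuming 32-bit addresses split 10/10/12 (directory index / table index / offset); note the two memory accesses:

#include <stdint.h>

#define VALID_BIT (1u << 0)
#define ADDR_MASK (~0xFFFu)  // entries hold page-aligned addresses

// Returns the physical address, or 0 to stand in for raising a fault.
uint32_t walk(const uint32_t *page_dir, uint32_t virt) {
	uint32_t pd_index = (virt >> 22) & 0x3FF; // top 10 bits
	uint32_t pt_index = (virt >> 12) & 0x3FF; // next 10 bits
	uint32_t offset   = virt & 0xFFFu;        // bottom 12 bits

	uint32_t pde = page_dir[pd_index];        // memory access #1
	if (!(pde & VALID_BIT))
		return 0; // nothing allocated anywhere under this range

	const uint32_t *pt = (const uint32_t *)(uintptr_t)(pde & ADDR_MASK);
	uint32_t pte = pt[pt_index];              // memory access #2
	if (!(pte & VALID_BIT))
		return 0;

	return (pte & ADDR_MASK) | offset;
}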

Beyond Physical Memory: Mechanisms

  • Turns out there's an even unhappier path than a simple TLB miss.
  • It is possible for the processes on a machine to allocate more memory than the RAM can hold. In this case, pages can be swapped out to disk and brought back in on demand.
  • We simply add a present bit to the PTE; if that bit is 0, the PTE holds a disk address instead of a PFN, and the OS runs its page fault handler (sketched after this list). The process is blocked so another process can run during the expensive I/O that fetches the page from disk. When the process runs again, the page is in memory and translation proceeds as described above.
  • Typically a background page daemon swaps pages out in bulk when available memory falls below a threshold. If a process needs to load a page from disk and there isn't a free frame in RAM, it signals to the daemon that it needs room, goes to sleep, and trusts that when it runs again there will be room for the page it needs.
  • If we evict pages naively, we'll end up doing disk I/O for tons of simple memory fetches, which can make programs run 100,000 times slower! So, a lot of thought needs to go into the page replacement policy.
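
The happy path of the fault handler referenced above, as a heavily simplified sketch; the helpers here are hypothetical stubs standing in for the real machinery:

#include <stddef.h>

typedef struct {
	size_t pfn_or_disk_addr; // PFN if present, disk address if not
	int present;
} pte_t;

// Hypothetical stubs for the real machinery:
static size_t find_free_frame(void) { return 42; } // may wake the page daemon
static void disk_read(size_t disk_addr, size_t pfn) { (void) disk_addr; (void) pfn; }

// The faulting process blocks during disk_read so others can run;
// afterwards, the faulting instruction is simply retried.
void page_fault_handler(pte_t *pte) {
	if (!pte->present) {
		size_t pfn = find_free_frame();
		disk_read(pte->pfn_or_disk_addr, pfn);
		pte->pfn_or_disk_addr = pfn;
		pte->present = 1;
	}
}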

Beyond Physical Memory: Policies

  • The optimal (but impossible to implement) page replacement policy is to evict the page that will be accessed the farthest in the future.
  • A simple policy that performs poorly is FIFO.
  • A simple policy that performs somewhat poorly is random page eviction. It at least avoids degenerate corner cases.
  • LRU (Least Recently Used) policy, or an approximation of it, is common in modern systems. It gets pretty close to optimal when workloads exhibit locality. When workloads do not exhibit locality (e.g. loop with randomized array accesses), this performs just as poorly as FIFO/Random.
    • Approximate LRU implementation (sketched after this list): hardware maintains a use bit and a dirty bit, setting the use bit on any access and the dirty bit on writes. Given some candidate page to evict, check whether its use bit is 0; if so, evict it. Otherwise, clear the bit and check the next page, and so on.
      • Optional: check the dirty bit, too. If the page is dirty, it needs to be written to disk prior to eviction, which is expensive.
  • Thrashing is the term for when physical memory is simply oversubscribed and the system must page constantly.
    • Some versions of Linux have an OOM (out-of-memory) killer, which kills memory-hungry processes to free up memory. If it takes out your graphical display, good luck!
    • Another older technique is admission control: block a subset of processes in the hopes that the unblocked ones fit in memory and finish quickly. (Moral: it's better to do a few things well than all things poorly.)
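
The clock-style scan referenced above, as a sketch (names illustrative); the hand sweeps circularly, giving recently used pages a second chance:

#include <stdbool.h>

#define NUM_FRAMES 256

typedef struct {
	bool use_bit;   // set by hardware on any access
	bool dirty_bit; // set by hardware on writes
} frame_t;

frame_t frames[NUM_FRAMES];
static int hand = 0; // the "clock hand"

// Evict the first frame whose use bit is 0, clearing use bits
// as we sweep past recently used frames.
int choose_victim(void) {
	for (;;) {
		if (!frames[hand].use_bit) {
			int victim = hand;   // if dirty, this page must be
			                     // written back to disk first
			hand = (hand + 1) % NUM_FRAMES;
			return victim;
		}
		frames[hand].use_bit = false; // second chance
		hand = (hand + 1) % NUM_FRAMES;
	}
}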

The Linux Virtual Memory System

An overview of Linux on Intel x86.

  • Virtual addresses have both a user portion and a kernel portion, which stays the same across context switches. Of course, the kernel portion cannot be accessed unless the processor has trapped into privileged mode.
  • Linux has two types of kernel virtual addresses:
    • Kernel logical addresses
      • Normal kernel virtual memory, reserved with kmalloc (an in-kernel allocator, not a system call).
      • Contains: page tables, kernel stacks, etc.
      • This memory cannot be swapped to disk.
      • Maps directly to the start of physical memory. E.g. logical address 0xC0000000 -> physical 0x00000000, 0xC00000FF -> 0x000000FF.
        • Translations between the two are very simple, so this is sometimes treated as physical memory.
        • Contiguous chunks in kernel logical address space are also contiguous in physical.
    • Kernel virtual addresses
      • Reserved with vmalloc (also an in-kernel allocator).
      • Returns a pointer to a virtually contiguous chunk of memory of the requested size (similar to plain C malloc).
      • Because the memory isn't necessarily physically contiguous, large allocations are easier to satisfy than with the logical addresses above.
  • x86_64 uses hardware managed 4-level page tables:
    • Only the bottom 48 bits of the address are currently in use.
    • Each level requires 9 bits, so 36 bits for the four levels, plus a 12-bit offset for 4KB pages (see the sketch at the end of these notes).
  • Linux has evolved to support huge pages such as those allowed on Intel x86.
    • When programs need a lot of memory, 4KB pages will easily exhaust the TLB, which causes drastic slowdowns. This is greatly mitigated by 2MB or 1GB pages.
    • Downside: internal fragmentation if the full page isn't used.
    • Downside: swapping huge pages can be terrible for performance, requiring large I/O operations.
  • Linux protects against buffer overflow attacks via:
    • an NX bit to signify that certain regions of memory are not executable.
      • However, return-oriented programming (ROP) gets around this by overwriting the stack to issue returns into code that is already loaded and executable, such as the linked C standard library. By cleverly stringing together snippets of existing procedures ("gadgets"), an attack can be assembled.
    • Address Space Layout Randomization (ASLR) to place stack, heap, code at random places in the virtual address space, to mitigate against ROP.
  • Linux more recently uses Kernel Page Table Isolation (KPTI) to remove most kernel mappings from each process's page table, minimizing the kernel memory visible to each process, in an attempt to protect against attacks like Spectre and Meltdown.
    • Such attacks capitalize on weaknesses brought on by speculative execution, which can expose memory that we thought was MMU-protected.
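
As a concrete illustration of that 48-bit layout (using Linux's level names PGD/PUD/PMD/PTE), here's how the four 9-bit indices and the 12-bit offset fall out of an example address; the address itself is arbitrary:

#include <stdint.h>
#include <stdio.h>

int main(void) {
	uint64_t va = 0x00007f1234567abcULL; // an example user-space address
	printf("PGD index: %lu\n", (unsigned long)((va >> 39) & 0x1FF)); // bits 47-39
	printf("PUD index: %lu\n", (unsigned long)((va >> 30) & 0x1FF)); // bits 38-30
	printf("PMD index: %lu\n", (unsigned long)((va >> 21) & 0x1FF)); // bits 29-21
	printf("PTE index: %lu\n", (unsigned long)((va >> 12) & 0x1FF)); // bits 20-12
	printf("offset   : %lu\n", (unsigned long)(va & 0xFFF));         // bits 11-0
	return 0; // 9 + 9 + 9 + 9 + 12 = 48 bits total
}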