
TDT4186: Operativsystemer

Taken in the spring of 2023. Notes from all the lectures + cram sheet. The book the chapters refer to is https://pages.cs.wisc.edu/~remzi/OSTEP/.

Lecture 01 - Introduction

In its simplest form, an operating system is an abstraction of computer resources. Resources are things like CPU time (processes and threads), memory allocation, or disk access. The operating system provides common facilities on top of the same abstractions (e.g., libc).

The operating system provides a few nice things for programs running on the system such as:

  • Resource management - allocating and giving memory, CPU time, etc to the programs
  • Protection - prevent programs from attacking or interacting with each other
  • Fair access - give fair access to resources
  • System calls: access to different hardware features for programmers

The goals of the operating system are to use the hardware available as efficiently as possible while making it easy and convenient for programmers to write programs to run on the systems. It provides three key things:

  • Virtualization is the abstraction of the hardware.
  • Concurrency is solving multiple tasks at once
  • Persistence is storing data safely and persistently

Lecture 02 - Process

A process is a program that has been loaded into memory to be executed on the CPU. Processes are usually represented by a process control block (PCB), which is an in-memory representation of the process. All processes have a parent, a process id, and other state.

The different states a process may be in are:

  • Ready: waiting to be executed by the CPU
  • Running: currently executed by the CPU
  • Blocked: execution paused while waiting for I/O to complete
  • Zombie: execution completed, but the entry is kept around for a reason, e.g. so the parent can still read its PID and exit status

Spawning processes is done with two main functions: exec() and fork().

  • fork(): clone the currently running process, and have the clone execute from the next instruction
  • exec(): replace the current process image with a different program

This combination is powerful, and the reason we use both exec() and fork() is to enable us to pipe and redirect the outputs and inputs of the programs.
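
A minimal sketch of how these calls fit together (my own example, not from the lecture): the parent forks, the child execs a different program, and the parent waits for it.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    pid_t pid = fork();                         // clone the current process
    if (pid < 0) {                              // fork failed
        perror("fork");
        exit(1);
    } else if (pid == 0) {                      // child: replace the process image
        execlp("ls", "ls", "-l", (char *)NULL);
        perror("execlp");                       // only reached if exec fails
        exit(1);
    } else {                                    // parent: wait for the child to finish
        int status;
        waitpid(pid, &status, 0);
        printf("child %d finished\n", (int)pid);
    }
    return 0;
}

In a shell, the file-descriptor juggling for pipes and redirects happens in the child between fork() and exec(), which is exactly why the two calls are kept separate.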

LDE is limited direct execution and is about sharing execution time on the CPU fairly between multiple processes. The operating system creates a virtual CPU per process, appearing to the program as if it has the entire CPU to itself. In between, the OS performs context switches between these virtual CPUs.

Of course, direct execution on the CPU would be the fastest, but this has some scary drawbacks:

  • Hard to protect processes from each other
  • Harder to context switch
  • Permission management becomes a huge security problem

Besides, the OS wouldn’t be able to guarantee that it gets control of the CPU back. What if we allowed a program to run, but that program executes forever?

Processes start in user mode, a restricted environment where permission to do things is very limited. Once something must be done in kernel mode, such as reading from I/O, the process traps into the kernel: the OS takes a snapshot of the process state, switches to kernel mode, performs the operation, returns to user mode, restores the snapshot, and yields the result.
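
A tiny Linux-specific illustration (my own example): library calls like write() are thin wrappers around such traps, which the raw syscall(2) interface makes explicit.

#define _GNU_SOURCE              // needed for the syscall() wrapper in glibc
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    const char msg[] = "written in kernel mode on our behalf\n";
    // The process runs in user mode; the system call traps into the kernel,
    // which performs the write in kernel mode and returns the result.
    syscall(SYS_write, 1, msg, sizeof(msg) - 1);
    return 0;
}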

Lecture 03 - Scheduling

Scheduling is the act of determining when, and which processes should get executed on the CPU. There are different policies that can be employed to determine execution order. The most important part is that the OS should be able to guarantee it can get control over the CPU again.

There are also a handful of metrics that are useful when deciding which strategy to use:

  • Utilization is the percentage of time the CPU is being utilized
  • Throughput is the number of processes completed over time
  • Turnaround time is the time from a process arriving (being queued) until its completion
  • Waiting time is the total time spent waiting in the ready queue, from arrival to completion
  • Response time is the time from a process arriving until it is scheduled for the first time

There are a few strategies, most notably:

  • FIFO: The first process into queue runs until completion - no guarantee OS gets CPU back
  • SJF: Shortest job first puts the processes with shortest estimated execution time first - no guarantee OS gets CPU back
  • STCF: Shortest time to completion puts processes expected to complete first - still no guarantee of OS return
  • RR: Round-robin gives each process a time slice, and if it doesn’t finish, puts it back into a FIFO queue. This is great for response time, but it gives bad turnaround time

Since RR is starvation-free (the OS is guaranteed to get the CPU back), and because it’s easy to implement, it is very popular. XV6, MacOS, and Linux all have a variation of RR.

Lecture 04 - Scheduling 2

All kinds of programs use I/O so we need to adapt our scheduling algorithms to account for the fact that processes may not need the CPU at all times.

  • Let’s treat every CPU burst as a mini-job, and an I/O job to run after each mini job
  • Use some priority level for the processes (this on its own might cause starvation, but let’s handle that too). Solaris for example uses 60 priority levels. Priority is determined by internal or external factors such as memory requirements, how long the process has run, etc.

Starvation is now handled by increasing the priority of waiting processes over time. Smart algorithms will optimize turnaround time for computation-intensive programs, and minimize response time for interactive programs.

Multi-level Feedback Queue is an implementation that does just this.

  • Work with multiple queues, each representing a priority level. Jobs on a given queue are equal in priority
  • Varies priority of the job based on the behavior of the process. Jobs will move between queues over their lifetime.
  • New jobs are always placed in the highest priority level
  • If they take an entire time slice and are still runnable, we drop their priority one level. If the process gives up the CPU (by having to perform some I/O) then it’ll stay on the same level.
  • After some time period S, all processes are moved to the highest priority level.

Alternatively, we can also use some other proportional share scheduler.

  • Fair share: guarantees that all processes get a share of the CPU
  • Lottery: each ticket is worth \frac{1}{n_{tickets}} of the CPU, and processes can exchange tickets
  • Completely-Fair-Scheduler (CFS):
    • Choose the process with the lowest vruntime and run it for a non-fixed time slice
    • Priority using a user-space “nice value”, in other words, how much the user values the process
    • Efficient implementations can use a red-black tree. More on CFS in Lecture 5

vruntime accumulates in proportion to real run time, scaled by priority. Interactive processes, which sleep often, accumulate little vruntime, while computation-intensive processes accumulate it quickly. sched_latency is the period within which every runnable process should get a turn; a typical value is 48 ms. The time slice then becomes \max(\frac{sched\_latency}{n_{proc}}, min\_granularity), where min_granularity is typically 6 ms.
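
A quick sanity check of the formula with the values above (sched_latency = 48 ms, min_granularity = 6 ms):

time\_slice = \frac{48\,ms}{4} = 12\,ms \qquad time\_slice = \max(\frac{48\,ms}{10}, 6\,ms) = 6\,ms

With 4 runnable processes each gets 12 ms; with 10 processes the raw share of 4.8 ms falls below the floor, so each gets 6 ms.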

When a process wants to do I/O, it puts itself to sleep and does not accumulate vruntime. To prevent a waking process from then monopolizing the CPU, its vruntime is set to the minimum value currently in the RB-tree.

This boosts interactivity because interactive processes often have short CPU bursts, resulting in a low vruntime.

Lecture 05 - Memory

To run a program we load it from disk into memory. For each process, the operating system has to keep track of the memory allocated to the process. Each process needs memory allocated for its code, some stack memory, and heap memory.

  • Uniprogramming: only one process at a time is stored (not used)
  • Multiprogramming: multiple processes are stored at a time

Multiprogramming uses virtualization of memory to let programs run consistently at different physical addresses. Compilers assume that memory starts at 0x0000, and the OS then offsets the processes’ memory accesses based on where in memory they actually are. This is called relocation. Relocation comes in a few forms:

  • Static relocation: addresses are simply offset by n bytes at load time. This is very simple, but the allocation can never be moved afterwards. How do you give a process more heap?
  • Dynamic relocation: give the process a position in memory that can be grown and moved. This has the problem of possible fragmentation, and it requires hardware support
    • Fixed partition: Memory is broken up into fixed-size partitions, and as the process needs more, it can receive more partitions. Can cause internal fragmentation.
    • Variable partition: Two registers per process hold the start and end of its memory segment. Requires an MMU on the CPU to do the offsetting (see the sketch below). Can cause external fragmentation.
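
A minimal sketch of what the MMU conceptually does for a variable partition (the base/bound values here are made up for illustration):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical per-process relocation registers. In reality they live in the
// MMU and are reloaded by the OS on every context switch.
static uint32_t base  = 0x4000;   // where the process actually starts in physical memory
static uint32_t bound = 0x1000;   // size of the process' address space

uint32_t translate(uint32_t vaddr) {
    if (vaddr >= bound) {                       // out-of-bounds access -> fault
        fprintf(stderr, "fault at 0x%x\n", vaddr);
        exit(1);
    }
    return base + vaddr;                        // physical = base + virtual
}

int main(void) {
    printf("vaddr 0x0100 -> paddr 0x%x\n", translate(0x0100));   // prints 0x4100
    return 0;
}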

Lecture 06 - Page

To combat fragmentation we can divide the contiguous address space into logical segments, which enables us to have the code, stack, and heap at completely different places in physical memory. This strategy is called segmentation, and by encoding which segment an address belongs to into the virtual address, we can easily locate the physical address.

There are, however, some concerns related to growing memory and to processes sharing memory.

  • Heap usually grows down
  • Stack usually grows “up”, which (according to lecture) requires hardware support
  • Segments might be shared across processes, such as code segment. We would want some hardware-level protection to prevent abuse.

Segmentation does, however, still not fix external fragmentation, since segments are variable-sized.

Paging divides the address space into pages, all of the same size. Each page maps to a physical frame, and by tracking everything in a page table we enable easy lookup as well as tracking of available frames. Paging requires hardware and OS support. We use paging to allow a process’ address space to be non-contiguous.

A virtual address is split into an offset and a page number which our translation unit will be able to use to look up the physical address. The process uses six steps:

  1. Get the page number from virtual address
  2. Calculate address of the page table entry using the page number
  3. Get the page table entry’s physical address
  4. Get physical frame address from PTE address
  5. Determine real address by adding offset
  6. Copy or write to said memory
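
A toy sketch of steps 1-5, assuming 4 KB pages and a made-up single-level page table (real PTEs also carry valid, protection, present and dirty bits):

#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 12u                          // assumed 4 KB pages -> 12 offset bits
#define PAGE_SIZE   (1u << OFFSET_BITS)

// Toy page table: index = virtual page number (VPN), value = physical frame number (PFN).
static uint32_t page_table[16] = { [0] = 3, [1] = 7, [2] = 5 };

uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> OFFSET_BITS;      // step 1: page number from the virtual address
    uint32_t offset = vaddr & (PAGE_SIZE - 1);   // offset within the page
    uint32_t pfn    = page_table[vpn];           // steps 2-4: look up the PTE, read the frame number
    return (pfn << OFFSET_BITS) | offset;        // step 5: physical address = frame + offset
}

int main(void) {
    // vaddr 0x1ABC: VPN 1 -> PFN 7, offset 0xABC, so the physical address is 0x7ABC
    printf("0x1ABC -> 0x%X\n", translate(0x1ABC));
    return 0;
}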

A concern here is that each read from memory requires two loads, because we first have to look up into the page table. Additionally, there isn’t much to do about internal fragmentation.

Lecture 07 - Page 2

To reduce the number of loads required to locate the physical address from a virtual one, we can use a Translation Lookaside Buffer (TLB), which acts as a cache layer for the translation unit. The TLB is usually a fully associative cache, requiring the hardware to scan through the entire cache, typically done in parallel, but in return it’s very easy to insert entries into it. Each entry typically has the following properties:

  • Valid: is the entry valid?
  • Protection level: may I read/write/exec this memory?
  • Dirty: have we changed the memory at this position after fetching it?

Process switching messes with the TLB because the OS and hardware need extra work to determine what parts of the cache are suddenly invalid, and what process each entry belongs to. There are a few ways to approach this:

  • Flush the entire TLB, can be expensive with lots of context switches
  • Address space identifier (ASID), allowing the OS to determine which process owns the entry.

Additionally, there is still a major problem with memory paging. What if the page table becomes too big? Having one page table per process ends up costly in terms of memory, and the cost scales with the size of the virtual address space and the number of processes. Having 100 processes, each with a page table that takes 4 MB, suddenly requires 400 MB of physical memory.

  • Naive fix: increase the page size, reducing the page table size. Problem here is more internal fragmentation
  • Hybrid: combine paging and segments. Keep one page table per logical segment, each having base and limit registers to determine how many valid pages a segment has. There will still be a problem with external fragmentation, and it’s a waste when the heap is sparsely used.
  • Multi-level page table: Keep a tree of page tables, where the top level acts as a directory of page-table pages. If a directory entry is invalid, we don’t have to allocate its children at all. This reduces the memory footprint drastically, but also requires a more complex implementation and more bookkeeping.
  • Inverted table entries: single table for the entire system, where each entry has a PID that owns the memory.
  • Page swapping: only keep in-use pages in memory. Move unreferenced pages to disk until they’re going to be used again. The 90/10 rule claims that processes spend 90% of their time accessing the same 10% of their memory. Processes also typically only access a small part of their address space at a time, thus we don’t need to have everything in memory.

Lecture 08 - Page 3?

I don’t remember what happened here; I think the lecturer was a week behind the curriculum or something like that..?

Swapping also allows programs to use more memory than what is physically available on the system. The OS reserves some pre-defined, user-controllable space on disk as swap space. Swap space is split into segments the size of a page on the system.

A PTE can encode whether the page is in physical memory or on disk (the present bit), in the same way the PTE encodes other things such as user/supervisor and validity in its remaining bits.

Page faults are the errors that occur when the present bit of a PTE is zero. We want to reduce the number of page faults, because each one requires fetching the page from disk. If we can determine what memory to evict and what to keep in memory, we’ll greatly increase performance. For this, we need a good strategy for loading.

  • On-demand paging: load page when requested. Cheap, but will result in a lot of page faults
  • Prefetch: load pages before they are referenced. Use some heuristics to predict which pages will be requested. This is good for patterns such as sequential access, but is difficult to implement in reality.

Similarly, we need a strategy for evicting pages out of memory.

  • Lazy: Evict pages only when the memory is full. This is unrealistic because the OS needs to reserve space for itself and new processes, and it requires some very precise bookkeeping.
  • Page daemon: a background daemon that evicts pages when fewer than k pages are available, and keeps evicting until k_2 pages remain free.

Average memory access time is used to measure memory access performance, and to improve this metric, reducing page faults is the way to go. Here we describe some policies to replace pages:

  • Optimal replacement: replace the page that will not be used for the longest time in the future. This is impractical because you’d have to predict future accesses, but it makes a good benchmark.
  • FIFO: Evict the oldest pages first. This is a fair algorithm, but it’s not uncommon to evict pages that will be used right after eviction.
  • LRU: Evict the least recently used page. Track usage and evict what was used longest ago. This is close to optimal, but tricky to implement; you might also need hardware support to get a good metric.
  • Random: Evict a random page. Can prove pretty good, but is bad for memory-intensive workloads since you’ll likely throw out something useful over and over.
  • LFU: Evict the least frequently used page. Requires some heuristics like LRU.
  • Clock algorithm (second chance): each page has a reference bit that is set to 1 when the page is used, and the clock hand clears it to 0 as it sweeps past (the second chance); a page whose bit is already 0 is evicted (see the sketch after this list). This works similarly to FIFO, but it keeps recently used pages alive a little longer, mitigating some of the drawbacks of FIFO.
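
A minimal sketch of the clock hand described above, with made-up reference bits:

#include <stdio.h>

#define NFRAMES 4

// ref[i] is frame i's reference bit: set to 1 whenever the page in it is used.
static int ref[NFRAMES] = {1, 0, 1, 1};
static int hand = 0;                          // the "clock hand"

int pick_victim(void) {
    for (;;) {
        if (ref[hand] == 0) {                 // not used since the last sweep -> evict
            int victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
        ref[hand] = 0;                        // used recently: clear the bit (second chance)
        hand = (hand + 1) % NFRAMES;
    }
}

int main(void) {
    printf("evict frame %d\n", pick_victim());   // frame 1 in this example
    return 0;
}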

Lecture 09 - Concurrency

Threads are an abstraction for parallelism at the user-space level. They work in a similar fashion to processes at the kernel level. Each thread has a stack and a set of registers which are swapped when executing. Threading APIs usually expose the following:

  1. Thread creation/joining/exiting entrypoints
  2. Mutually exclusive locking mechanisms
  3. Conditional variables to perform more complex synchronization

On a multi-core system, code could run in parallel and we might have race conditions. To combat this we need mutually exclusive access to memory, which is where locks are an ideal abstraction.

A good lock implementation provides mutual exclusion with a fairness guarantee that every waiter eventually gets access to the resource behind the lock. Additionally, hardware-specific mechanisms, such as disabling interrupts, can be used to implement or speed up locking.

We have a few primitives, that are typically implemented on hardware as architecture instructions.

  • Test and set: return current value and set new
  • Compare and swap: compare the current value with an expected value; if they match, write the new value
  • Fetch and add: return current value and increment

All three implementations struggle with spinning, a term used when a listener has to constantly poll to listen for changes, but unlike the first two, fetch-and-add is capable of acting as a fair lock.

GCC implementations: https://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Atomic-Builtins.html
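
A minimal test-and-set spinlock sketch using the legacy __sync builtins linked above (a toy: a real lock would yield or sleep instead of spinning forever):

#include <pthread.h>
#include <stdio.h>

static volatile int lock = 0;      // 0 = free, 1 = held
static int counter = 0;

void spin_lock(volatile int *l)   { while (__sync_lock_test_and_set(l, 1)) ; }  // old value 1 -> someone holds it, keep spinning
void spin_unlock(volatile int *l) { __sync_lock_release(l); }                   // store 0 again

void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_lock(&lock);
        counter++;                 // critical section
        spin_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("%d\n", counter);       // always 200000, unlike the unsynchronized version
    return 0;
}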

Lecture 10 - Lock, Condition Variable & Semaphore

Locks can solve our mutual exclusion problem, but we might want waiters of the lock to proceed in a specific order. For this use case, we can use semaphores or condition variables.

Condition variables are synchronization primitives that track waiting threads in a queue. Threads can put themselves in the queue, and when the condition variable is signalled, the first thread in the queue is woken up and allowed to re-acquire the mutex.

Dijkstra introduced semaphores which are condition variables with a counter. This is often used in a producer/consumer pattern where actors need to determine when they’re allowed access to a resource.

Lecture 11 - I/O Devices

There are plenty of I/O devices that vary greatly, thus it is imperative that the OS can handle and interact with them. Below are examples of common I/O devices:

  • Human interfaces (camera, mouse, keyboard)
  • Transfer devices (bluetooth, network)
  • Storage (disk, hdd)

These I/O devices may also work differently, or use different ways of interacting with the rest of the computer.

  • Bus: SATA, PCIe, etc.
  • Controller: electronic device to manage ports/bus for a type of device
  • Drivers: interfaces, typically standardized to interact and access the device from the OS

I/O devices typically require input or commands on what to do from the operating system, and return their data or results in a bus or a register that the rest of the system can access.

When discussing scheduling, we mentioned that we might want to wait for I/O operations to complete before we continue running a process. This can be done neatly by having devices send interrupts to the CPU. Using interrupts brings a major benefit: listeners no longer have to poll, they can react to interrupts sent by the devices.

One problem with interrupts is that the OS can become too busy handling them, entering what is called a livelock. This can be mitigated by batching the interrupts.

A Direct Memory Access (DMA) controller can further reduce load on the CPU by directly moving the data from the I/O devices into memory address space. This can be particularly useful when reading from files or other heavy load operations. When the DMA completes, it can interrupt the CPU to let it know it is finished.

Lecture 12 - File Systems

Communication methods for I/O devices include, among others: I/O-specific instructions with IN/OUT port mapping, memory-mapped device registers, and direct memory access (DMA), where data is moved directly into the memory address space.

Device drivers typically follow the same standard protocols, which allows most computer mice, keyboards, and so on to be compatible with most computers. Some vendors still require proprietary drivers, such as NVIDIA graphics drivers or other hardware-specific ones. Over 70% of the Linux codebase is driver code.

A Hard-Disk Drive (HDD) consists of a magnetic disk with a sector-addressable address space. In practice, this means that we have a metal arm that is capable of targeting different parts of the platter. The disk is essentially a list of sectors that the arm can access to read or write data. Each platter has a set of tracks, each containing sectors. Each sector is typically a set size, such as 512 or 4096 bytes.

We can calculate the latency of a disk operation using the following equation.

T_{total} = T_{seek} + T_{rotate} + T_{transfer}
  • T_{seek} is the time the drive needs to move the arm into position over the right track. Seek latency is typically in the 4-10 ms range.
  • T_{rotate} is the time required to spin the platter so that the arm points at the right sector. Rotational latency is typically in the 0-8 ms range, depending on how far it has to rotate.
  • T_{transfer} is the delay when transferring data to or from the disk. This depends on the disk's rotation and transfer speed, but is typically around a few microseconds per sector.

This leaves the physical movement of the disk (seeking and rotation) as the main cause of latency, meaning the best case is to reduce the movement required. Disk drives are thus excellent for sequential reads and writes, but terrible for random access.

File systems are collections of files, and vary slightly depending on implementation. Linux systems usually run an ext-family file system, whereas Windows has FAT, and MacOS uses APFS. File systems map paths to actual data. We have a few terms for naming files:

  • Inode, the unique identifier of the file. Simply just a number
  • Path, a textual path to where we expect the file to be
  • File descriptor, an in-memory reference to an inode that can be constructed from a path.

When we fork a process, all the open files in the process’ file table are shared with the new process, and their reference counts are increased. The dup function creates a new file descriptor in the same process that refers to the same open file entry (and thus the same inode), also increasing the refcount.
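
A small sketch of the shared offset after dup() (the file name is made up):

#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd  = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    int fd2 = dup(fd);               // new descriptor, same open-file entry (same offset)

    write(fd,  "hello ", 6);         // advances the shared offset
    write(fd2, "world\n", 6);        // continues where fd left off

    close(fd);
    close(fd2);
    return 0;                        // out.txt now contains "hello world\n"
}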

Lecture 13 - File System Implementation

Writing to a file typically goes through an in-memory buffer to reduce the number of actual disk writes. We can use fsync to force a buffer flush. Renaming a file does not move any physical data, it only changes the directory entry. rename is atomic, so overwriting a file safely is often done by creating a new file and renaming it over the old one. A typical file system looks like this on a disk:

  • The first block is typically a superblock, containing metadata about the number of inodes, the block size, and the data block count. Useful for any system interacting with it.
  • Inode bitmap tracks the free inodes
  • Data bitmap tracks free data blocks
  • The inode table stores information about each inode on the disk.
  • Data blocks have the actual file data on them

Directories are also files, they just have a type flag. Directory data is a list of (path, inode) tuples. Inodes have a type, an identifier indicating the file owner, permissions, size, time created and edited, and the link count, as well as the addresses of the corresponding data blocks. We can calculate the disk address and sector of an inode from its inode number.

To handle large files, we can create indirections in a data block, making a data block point to other data blocks. This has the benefit that we can create infinitely large files (as long as we have enough data blocks), but it also creates a real cost to navigate through indirections.
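
As a rough illustration of the scale, assuming 4 KB blocks and 4-byte block pointers (not numbers from the lecture):

\frac{4096\,B}{4\,B} = 1024 \text{ pointers} \Rightarrow 1024 \times 4\,KB = 4\,MB

so a single indirect block covers 4 MB of file data, and a double-indirect block covers 1024 \times 4\,MB = 4\,GB.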

Cram Sheet

1. Process (ch. 4, 5, 6)

💡 What is a process? And what is the main difference between a program and a process?

Processes are programs that have been loaded into memory and made executable. A program on its own does nothing; it is just a passive set of instructions on disk. Each process has a set of registers used for context switching, address space allocated for heap, stack, code, data, etc., an open file table, and some other properties.

💡 What do fork(), exec() and wait() do? Given a code snippet with these functions, you should know how to analyze the code and decide what is the output

  • Forking duplicates the currently running process; the child process continues execution from the point where fork() was called. Because the OS “copies” over all the data from the original process to the child, both processes will run the same code. Typically one uses the return value of fork() to do different things depending on whether you’re the child or the parent: the parent is returned the PID of the child, and the child is returned 0.
  • Exec replaces the current process image with a new process image. This is useful when you want to run a different program in the child subprocess.
  • Wait suspends execution of the calling process until one of its children has terminated. It’s an alias of waitpid, which takes a PID. A parent can keep calling wait() in a loop to wait for all children to have terminated.

💡 What are the two modes of OSes? Why does an OS need the two modes? How does a process use the two modes?

The operating system usually makes a distinction between user mode and kernel mode. Kernel mode has direct access to everything on the system, which can pose a security risk. This is why programs typically run in user mode and perform pre-defined system calls (syscalls) to access kernel resources. The OS uses trap + return-from-trap to raise and lower the privilege level of the calling process. LDE (limited direct execution) is the term used when the OS virtualizes the CPU, making it appear to a process that it has all the CPU time, while the OS switches between processes transparently.

💡 What are the two approaches OS can use to switch processes?

A naive (cooperative) approach lets processes perform a system call to give up their CPU time, but this can easily lead to starvation. The better approach is using timer interrupts to preempt running processes, allowing others to run, which gives a fair CPU share to the processes on the system. This is not a perfect approach on its own, which is why scheduling algorithms have been implemented.


2. Scheduling (ch. 7, 8, 9)

💡 What are turnaround time, waiting time, and response time?

  • The turnaround time is the total time the process used from entering the system until termination: T_{turnaround} = T_{completion} - T_{arrival}
  • Waiting time is the total amount of time the process spent waiting from when it got queued until it finished. This means all the time waiting in queue, not just the time until the first run.
  • Response time is the amount of time it waited in the queue until the first time it got scheduled: T_{response} = T_{first\_run} - T_{arrival}

💡 What is starvation in scheduling algorithms?

Starvation in the context of scheduling algorithms is when a scheduling algorithm allows a process to completely starve the remaining processes for execution time. An example of this is FIFO with a process that runs forever. This way there is no guarantee that the other processes will ever run.

💡 What are FIFO, SJF, STCF, RR, MLFQ? What are pros and cons of different scheduling algorithms? Given a job set with all needed information, you should be able to compute average turnaround time and average response time/waiting time for different algorithms

  • First In First Out (FIFO), or First Come First Served (FCFS) uses a simple FIFO queue to determine which process to run first. Just about the only benefit is that it’s very easy to implement. It can cause starvation and the convoy effect, where smaller workloads are waiting on one gigantic one. If the CPU is using a preemptive strategy starvation may be avoided.
  • Shortest Job First (SJF) prioritizes the shortest jobs first, solving the issue FIFO has with the convoy effect. It still has the fault of starvation with a non-preemptive implementation. This is also heavily unrealistic as the OS cannot really determine the running time of a program before it runs. Another concern is if more processes arrive after a big job has started.
  • Shortest Time to Completion First (STCF) resolves the late-arrival fault of SJF by using a preemptive scheduling algorithm that will allow the order of jobs to be changed. This means a late-arriving job can take priority over a running long job. We still live with the assumption that processes always exit, and that we can predict how long a process will run. It also means that three jobs arriving at the same time with the same execution time will still convoy.
  • Round Robin (RR) uses time slices to give a fair share of CPU to each process in the queue. We need to be careful selecting the time slice length, as too-short slices cause too many context switches, and too-long slices hurt response time. Unfortunately RR suffers on turnaround time, because every job is stretched out while all the other processes get their slices in between.
  • Multi-Level Feedback Queue (MLFQ) combines all of the aspects we’ve mentioned above, and uses different tiers or priority levels to re-arrange the priority processes should have in the queue. It tries to optimize turnaround time by reducing the priority of tasks that appear to take long. The system assumes interactive tasks use short bursts of interactivity, while longer running tasks just run for as long as they can.
    • If two processes are in the same priority level, run them as Round-Robin
    • If two processes are unequal in priority, run the highest priority first.
    • Processes enter the queue at highest priority
    • The process is allotted an amount of time at each level. Once this time has been used, drop its priority one level, regardless of whether the process gave up the CPU or not.
    • After a given amount of time, move all processes to highest priority. This avoids starvation, and if a process has become interactive after some time, it will receive the priority boost of interactive programs
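
A small made-up example to make the metrics concrete: three jobs A, B and C, each needing 10 ms of CPU, all arriving at t = 0.

  • FIFO (and SJF, since the jobs are equal): completions at 10, 20 and 30 ms, so average turnaround = (10 + 20 + 30) / 3 = 20 ms and average response time = (0 + 10 + 20) / 3 = 10 ms.
  • RR with a 1 ms slice (ignoring context-switch cost): average response time = (0 + 1 + 2) / 3 = 1 ms, but the jobs finish around 28, 29 and 30 ms, so average turnaround ≈ 29 ms.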

💡 What is the default scheduling algorithm of Linux? How does it work?

Linux uses its Completely Fair Scheduler (CFS). CFS is based on a counting technique called vruntime. As a process runs, it accumulates vruntime, and when the next scheduling decision occurs, the scheduler picks the process with the lowest vruntime. Determining the time slice is done with sched_latency: \max(\frac{sched\_latency}{n_{proc}}, min\_granularity), with min_granularity typically 6 ms. After each switch, the previously running process has its vruntime increased. Using the weighted priority of a process in the nice table, processes can be given priority. Linux uses a red-black tree to easily determine which process to run next. When a process comes back from sleep, its vruntime is set equal to the lowest value in the tree.


3. Memory (ch. 13, 15, 16, 17)

💡 What is the address space of a process? What does an address space of a process usually consist of?

The address space is the memory space and its addresses that are allocated (dedicated) to a given process. The address space can grow, and typically contains a segment for code (the program instructions), a stack for local values, a heap for dynamically allocated values, and most often a data segment for global/static data. Stack grows upwards, heap downwards.

💡 What are virtual address and physical address?

Memory is typically virtualized by the operating system for both ease of use and security reasons. This way arbitrary processes cannot read memory belonging to other processes or, even worse, the kernel itself. The virtual address is the address the process sees, which points to some location in memory. The virtual address is translatable to the real physical address where the data is located. The OS, often in combination with hardware, is responsible for translating the virtual address to the physical one. Efficiency is achieved using TLBs, etc.

💡 What are static relocation and dynamic relocation? How do they work respectively?

  • Static relocation adds an offset of n bytes to the virtual address to calculate the physical address. It is impossible to resize growable memory like the heap and stack with static relocation.
  • Dynamic relocation uses a base and a bound register on the CPU, set per process, allowing us to place the address space anywhere in physical memory within those bounds. Upon program load, the OS sets the base/bound registers accordingly. This of course requires hardware (MMU) support. Can cause internal fragmentation.

💡 What are the two memory allocation methods when allocating contiguous memory space? What are their respective pros and cons? (This is not in the textbook, but you can find this part from the lecture slides.)

These are both strategies that can be used with dynamic relocation to allocate contiguous virtual memory.

  • Variable Partition uses the strategy described by dynamic relocation. The major problem with variable-size partitions is external fragmentation. A terminated big process might have allocated more space than a new smaller process would require, leading to fragmentation.
  • Fixed Partition breaks up physical memory into fixed-size partitions, requiring only one base register for each process. This is similar to static allocation, but this also means that there is a hard limit on how much the contiguous space can grow. In cases where the memory segment would require a re-size into another process’ memory we might have to re-allocate.
    • Additionally there is the risk of internal fragmentation because partitions are of fixed size
    • There is also a case of external fragmentation when processes are removed. Perhaps the freed-up space is not big enough for a new process? Though this is often solvable through re-allocation.

Compaction can solve external fragmentation by moving allocations together, but it is very expensive and heavy.

💡 How do different allocation algorithms, such as FF, BF, and WF, work when allocating memory space? Given a set of memory partitions and a set of processes with different memory requirements, you should know what would be the final allocation using a certain algorithm?

These are all algorithms for when you allocate a space.

  • First-fit allocates and acquires the first memory hole that is big enough. May still cause external fragmentation, but shouldn’t cause internal fragmentation.
  • Best-fit allocates and acquires the smallest hole that is big enough. This hopefully avoids waste, but might leave an even smaller hole behind, still causing external fragmentation.
  • Worst-fit allocates and acquires the largest hole, aiming to leave a leftover hole that is still big enough to be useful. This reduces tiny unusable fragments, but still suffers from fragmentation and wastes the large holes.

💡 What is segmentation? How does it work?

Segmentation splits up the address space into contiguous blocks of memory, each with a base and a size register. This allows us to store the heap, code, and stack at different places in physical memory. We find the real address by using the first few bits of the virtual address to denote the segment number.

  • This also has the benefit of code/data sharing where two processes running the same program might want to share the same address space for code/data.
  • The stack grows backwards, so the hardware needs an extra bit per segment to signal which way the offset should run.
  • Still suffers from external fragmentation (segments are variable-sized), but avoids the internal waste of a single contiguous allocation, such as the unused space between stack and heap.

4. Page (ch. 18, 19, 20, 21, 22)

💡 What is page in OS? How does paging work in OS?

Paging is similar to segmentation, but in this case, all partitions of memory are the same size. This way we can completely eliminate external fragmentation. Physical memory is split into frames, each of which can hold a single virtual-memory page. It also makes memory easier to manage, as the OS keeps a free list of frames. Each process typically has its own page table, which describes which of its virtual pages are mapped to which physical frames.

💡 How to use a virtual address to conduct page translation? Given a virtual address space and page size, how many bits do we need for page number and offsets, respectively? Given a memory system with all paging information, you should be able to translate some given virtual addresses to corresponding physical addresses.

The virtual page number is encoded into the virtual memory address. We typically encode the page number in the highest bits. To translate, you also need the mapping between Virtual Page Number (VPN) and Physical Frame Number (PFN).

  • The number of bits required for the page number is enough to index every page in the virtual address space: \log_2(\text{number of pages})
  • The number of bits required for the offset is enough to address every byte within a page: \log_2(\text{page size})
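
A quick worked example, assuming a 32-bit virtual address space and 4 KB pages: the offset needs \log_2(4096) = 12 bits, leaving 32 - 12 = 20 bits for the page number, i.e. 2^{20} (about one million) pages per process.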

💡 What is page table?

The page table contains a mapping from the virtual pages of a process to the physical frames in memory. Descriptions often say a mapping between virtual addresses and physical addresses, but the page table doesn’t need to hold full addresses, since the offset carries over unchanged and the rest can easily be calculated.

Page Table Entries (PTE) are entries in the page table that encode information about the physical frame number, and often a few other data points such as user/supervisor and permissions onto the entry.

  • It also has a valid bit to indicate whether the particular translation is valid.
  • A present bit is used to determine if it’s physical in-memory or if it’s swapped onto disk.
  • Dirty bit used to represent whether the page has been written since it was brought into memory.
  • A referenced bit can be useful in page replacement policies

💡 What are the two problems of paging system without hardware assistance and solutions for the two problems?

  1. Each data access requires two memory reads, because we first need to access the page table, and then the actual data we wanted to read. This can be mitigated by introducing a cache such as a TLB.
  2. The page table becomes too big. Since each process has a page table, we can quickly end up consuming a lot of memory for page tables. There are two main approaches here:
    • Inverted page tables: keep only one page table for the entire system, and store which process each entry belongs to in the entry itself.
    • Multi-level page tables: split the table into a two-tiered tree where the first level holds pointers to pages of PTEs. This means we don’t need to allocate page-table pages for unused regions until they’re used, drastically reducing the amount of memory required. Assuming a fanout of 10 at each level, one invalid top-level entry can stand in for 100 pages’ worth of entries.

💡 What is TLB? Why do we need TLB in paging system?

Translation Lookaside Buffer (TLB) is a memory cache, typically implemented in hardware, that caches virtual-to-physical translations. Without it, every memory access would require an extra load to fetch the page table entry.

💡 What is page replacement? What are page hit and page miss?

Page replacement comes into play when we introduce page swapping, which allows the system to keep more virtual memory than there is physically available memory. Page replacement happens when we exhaust the physical memory and need the swapping system.

  • Page hits are when the page we’re trying to access is in-memory (not evicted to swap)
  • Page misses are when the page is already evicted, and we need to replace an existing page with the evicted one. This is also called a page fault.

💡 How do different page replacement policies/algorithms work, e.g., OPTIMAL, FIFO, LRU, etc? Given a page replacement policies and a page access sequence, you should be able to calculate the average memory access time.

Most systems keep a high/low watermark to determine how many pages should be free at any time. Page eviction is usually done by a daemon process, which starts evicting once the number of free pages drops below the low watermark and continues until the high watermark is reached. Eviction comes in a few forms:

  • Optimal replacement: Replace the page that will not be used for the longest time in the future. Unrealistic, but a good benchmark
  • FIFO: The oldest page goes first; very fair, but can often evict pages that are about to be used
  • LRU: The least recently used page goes first; very tricky to implement, and might require some hardware support for good metrics
  • Random: Evict a random page; can perform well, but is bad for memory-intensive workloads
  • LFU: Evict the least frequently used page; again tricky to implement due to needed heuristics, like LRU
  • Clock Algorithm: Give pages a second chance. A page’s reference bit is set to 1 when it is used, and the clock hand clears it to 0 as it sweeps past; a page whose bit is already 0 is evicted. Mitigates the FIFO flaw.

5. Concurrency (ch. 26, 27)

💡 What is thread? What are the differences between thread and process?

A thread is typically a user-space abstraction over concurrency and parallel programming. Processes are managed by the operating system, and in some cases enable concurrency, but this is outside the user’s control. A thread works similarly to a process, but all threads within a process share the same address space (code, heap, open files) while having their own stack and registers. A lot of the mechanisms are similar.

💡 What is POSIX? How do some common thread APIs from POSIX work, like pthread_create()?

POSIX is a set of standard APIs for operating systems, heavily based on Unix-like systems. pthread is a threading library in the POSIX standard.

  • pthread_create() creates a new thread, taking a function and arguments to pass to the function
  • pthread_join() waits for a thread to finish, returning the result
  • pthread_exit() terminates the calling thread, optionally passing back a return value

💡 What is the concurrency issue? Given a code snippet with threads, you should be able to analyze the output and identify the possible issue.

The concurrency issue is that two threads write to the same data at the same time, causing the behavior of the program to be inconsistent. The following program illustrates this problem. Running this program multiple times will print out different values for x.

#include <stdio.h>
#include <pthread.h>

volatile int x = 0; // note: not _Atomic

void *run_thread(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; i++)
        x++;                       // read-modify-write, not atomic: the race is here

    return NULL;
}

int main() {
    pthread_t a, b;
    pthread_create(&a, NULL, &run_thread, NULL);
    pthread_create(&b, NULL, &run_thread, NULL);

    pthread_join(a, NULL);
    pthread_join(b, NULL);

    printf("%d\n", x);
}

💡 What is the critical-section problem? What should a solution for the critical section problem contain?

The above code demonstrates a race condition, where two pieces of code race against each other to modify the x variable. The code touching the shared data is the critical section; a solution to the critical-section problem must provide mutual exclusion (only one thread inside at a time), and ideally also progress and bounded waiting, which in practice means using a form of synchronization such as locking.


6. Locks (ch. 28)

💡 What is lock in OS?

Locks are a system to ensure that only one accessor (consumer) can access or use the resource hidden behind the lock at once. Once a consumer acquires the lock, nobody else is allowed to touch it until the original consumer releases it, at which point another consumer may acquire it. POSIX pthread library provides some mutex locks. The goal is to provide mutual exclusion.

💡 Why do locks need hardware instructions, like Test-And-Set() and Compare-And-Swap(), to implement locks?

A mutually exclusive lock needs to be atomic. Simple loads and stores are insufficient, because if an interrupt happens between the check and the set, both threads can see the flag as free, both set it to one, and both enter the critical section.

  • Test-And-Set (AtomicXchg) gives you the old value while setting the new one. This has to happen atomically and is implemented as a hardware instruction. It prevents the above flaw, because it cannot be interrupted; it’s one operation.
  • Compare-And-Swap (AtomicCmpXchg) gives you the old value, and only writes the new value if the current value equals the expected one. This is very similar to Test-And-Set, except the write is conditional on the comparison.

💡 What is spinlock? What are pros and cons of spinlocks? What is the method to address the performance issue of spinlocks?

Spinlocks just “spin”, in other words, run a while loop with an atomic instruction such as Test-And-Set or Compare-And-Swap until the lock is available. The performance loss here is that the spinning thread eats its whole time slice doing nothing useful. The method to address this is to yield the CPU (or sleep on a queue) until the lock is available.

💡 What are ticket lock and queue lock?

A ticket lock uses another atomic primitive, Fetch-And-Add, which, instead of assigning a new value, returns the current value and increments it. Each consumer gets a ticket, which is a number. Each time the lock is released, the lock’s turn number increases, allowing the consumer holding that ticket to run (see the sketch below). A queue lock instead puts waiters to sleep on a queue and wakes the next one when the lock is released. A problem with ticket counters is that they are susceptible to integer overflow.
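
A minimal ticket lock sketch using the __sync_fetch_and_add builtin from GCC (a fragment for illustration, not a production lock):

typedef struct {
    volatile unsigned int ticket;    // next ticket number to hand out
    volatile unsigned int turn;      // ticket number currently holding the lock
} ticket_lock_t;

void ticket_lock(ticket_lock_t *l) {
    // atomically grab a ticket; threads are then served in ticket (FIFO) order
    unsigned int my = __sync_fetch_and_add(&l->ticket, 1);
    while (l->turn != my)
        ;                            // spin (or yield) until it is our turn
}

void ticket_unlock(ticket_lock_t *l) {
    l->turn++;                       // hand the lock to the next ticket holder
}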


7. Condition Variables (ch. 30)

💡 What is condition variable? Why do we need condition variable?

Condition variables are used to synchronize between threads in cases where threads want to await a condition to be true before continuing its execution. A parent thread might want to wait for a child to complete before continuing (in the form of a join). A condition variable is a queue, waiting for a signal.

  • pthread_cond_init: create a new condition variable
  • pthread_cond_destroy: free a condition variable
  • pthread_cond_wait: wait on a condition variable, registering yourself on its “waitlist”
  • pthread_cond_signal: release one thread from the “waitlist” signaling it can continue

💡 What is producer/consumer problem? How to implement a solution for P/C problem using condition variable? Given a code snippet, you should be able to analyze it

The bounded-buffer or producer/consumer problem is when one producer and two or more consumers attempt to synchronize over a shared bounded buffer. How do we arrange things so the consumers know when to wake up and when to sleep while waiting for data from the producer? The answer is to use two condition variables, one for filling and one for emptying. This way consumers are never able to wake up other consumers. Always re-check the condition in a while loop (not an if) after waking, as shown in the sketch below.
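
A minimal bounded-buffer sketch along those lines (buffer size and iteration counts are made up):

#include <pthread.h>
#include <stdio.h>

#define MAX 8                                       // assumed buffer size

static int buffer[MAX];
static int fill = 0, use = 0, count = 0;

static pthread_mutex_t m     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  empty = PTHREAD_COND_INITIALIZER;   // signalled when a slot frees up
static pthread_cond_t  full  = PTHREAD_COND_INITIALIZER;   // signalled when data is available

static void put(int v) { buffer[fill] = v; fill = (fill + 1) % MAX; count++; }
static int  get(void)  { int v = buffer[use]; use = (use + 1) % MAX; count--; return v; }

void *producer(void *arg) {
    (void)arg;
    for (int i = 0; i < 100; i++) {
        pthread_mutex_lock(&m);
        while (count == MAX)                        // always re-check in a while loop
            pthread_cond_wait(&empty, &m);
        put(i);
        pthread_cond_signal(&full);                 // wakes a consumer, never another producer
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

void *consumer(void *arg) {
    (void)arg;
    for (int i = 0; i < 50; i++) {
        pthread_mutex_lock(&m);
        while (count == 0)
            pthread_cond_wait(&full, &m);
        int v = get();
        pthread_cond_signal(&empty);                // wakes the producer
        pthread_mutex_unlock(&m);
        printf("got %d\n", v);
    }
    return NULL;
}

int main(void) {
    pthread_t p, c1, c2;
    pthread_create(&p,  NULL, producer, NULL);
    pthread_create(&c1, NULL, consumer, NULL);
    pthread_create(&c2, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c1, NULL);
    pthread_join(c2, NULL);
    return 0;
}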


8. Semaphores (ch. 31)

💡 What is semaphore? What are the operations semaphores have?

A semaphore is an object with an integer attached to it. The value is used to achieve synchronization.

  • sem_init() creates a new semaphore
  • sem_post() increments the semaphore value by one, waking up a waiting thread if there is one
  • sem_wait() decrement the value of the semaphore by one, wait if the value of the semaphore is now negative

The initial value passed to the semaphore can be thought of as how many units of resource you are willing to give up right away.

💡 How to use semaphore to implement solutions for mutual exclusion, ordering, and P/C problem? Given a code snippet, you should be able to analyze it.

  • Mutual exclusion between two threads is acquired with a semaphore of initial value 1. When entering the critical section, call sem_wait() reducing it to 0. If the second thread attempts to enter, it will wait because the semaphore is now negative. Upon first thread’s call to sem_post(), the second thread can begin work.
  • Ordering can be implemented similarly to condition variables. Letting a parent wait for a child to complete is as simple as a semaphore with initial value zero. The parent calls sem_wait() on the semaphore, and when the child completes, it calls sem_post(), allowing the parent to continue (see the sketch after this list)
  • P/C or bounded buffer problem can be solved similarly using semaphores as it could with condition variables. The added benefit is that semaphores already include a parameter for how many resources we are willing to track at once. This means we can track the entire buffer at once.
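
A minimal ordering sketch with POSIX semaphores (thread-based here for brevity):

#include <semaphore.h>
#include <pthread.h>
#include <stdio.h>

static sem_t done;                      // initialised to 0: "nothing has happened yet"

void *child(void *arg) {
    (void)arg;
    printf("child: working\n");
    sem_post(&done);                    // 0 -> 1, wakes the waiting parent
    return NULL;
}

int main(void) {
    sem_init(&done, 0, 0);              // shared between threads, initial value 0
    pthread_t t;
    pthread_create(&t, NULL, child, NULL);
    sem_wait(&done);                    // blocks until the child has posted
    printf("parent: child is done\n");
    pthread_join(t, NULL);
    return 0;
}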

💡 What is deadlock? What are the four conditions for deadlocks?

Deadlocking happens when every thread in a set is waiting for an event that can only be caused by another thread in the set. There are four conditions that need to hold for a deadlock to occur:

  • Mutual exclusion: a thread grabs exclusive control over a resource, in other words, a thread acquires the lock.
  • Hold and wait: threads hold resources already allocated to them (e.g. locks they have acquired) while waiting for additional resources
  • No preemption: resources cannot be forcibly removed from threads that are holding them
  • Circular wait: there is a circular chain of threads where each thread holds a resource that is being requested by the next thread in the chain

In other words, there is a circular dependency amongst the threads where no thread is willing to give up the resource because it’s waiting on something else.


9. I/O Devices (ch. 36, 37)

💡 Why do I/O systems deploy a hierarchical structure?

To allow devices that demand higher performance to be physically located closer to the CPU. Nowadays that doesn’t matter as much, with a lot of I/O being point-to-point, but it’s also more logical that the demanding units have easier access to the CPU and system. It’s rather unimportant that your USB drive has high priority in the system.

The faster a bus can be, the shorter it must be.

💡 What are port, bus, controller, and device driver?

  • A port is a connection point for a single I/O device (e.g. IN/OUT port-mapped I/O).
  • A bus is a pathway for one or multiple devices to communicate with the rest of the system.
  • Bus controllers manage the participants on a bus for a certain type of bus/device. The SATA Controller is an example controller for a SATA bus.
  • Device drivers are software that provide a programmable interface to the physical device, whether that be directly over a port or through a bus. We create standard protocols so that as many devices as possible can use the same drivers.

💡 What are I/O interrupt and polling? Pros and cons.

Polling requires near-constant CPU usage, because we’re repeatedly checking whether the awaited state has changed. Interrupts give the benefit that you can put the listener to sleep until the interrupt signal arrives. Be careful of interrupt livelocking; batching interrupts can help mitigate this.

This difference is significant in the context of I/O devices, because this means the CPU doesn’t constantly have to wait for an I/O operation to complete, it can simply be notified once it’s done.

💡 What are I/O instructions and memory-mapped I/O?

  • I/O instructions are special instructions the CPU can issue to an I/O device to perform an I/O task, such as reading from a disk. When performing these instructions, the CPU writes to the target device’s registers.
  • Memory-mapped I/O is an approach where the hardware makes device registers available as if they were memory, and when performing load/stores the hardware routes this to the device instead of main memory.

Both approaches are still in use today, though I/O instructions are an older concept.

💡 What are programmed I/O and DMA?

  • Programmed I/O is a term describing the CPU itself interacting with a device, typically via the device’s status/data registers, communicating over a set protocol. Here the CPU will either poll or wait for an interrupt for the I/O to complete
  • Direct Memory Access (DMA) is an additional hardware device that can improve the speed of I/O operations: it writes the I/O data directly into memory and interrupts the CPU when done, so the CPU instantly has access to the data and can do other things while the I/O is in flight.

💡 What are the key components of hard disk drives? Given a hard disk specification, you should be able to calculate average access time.

A hard disk is a set of magnetic platters that spin, with a head per platter that reads/writes at the location it’s pointing at. Each platter consists of many tracks, each of which has a set of sectors. A sector is the smallest addressable unit on the disk. The total access time for a given request is found by

T_{total} = T_{seek} + T_{rotate} + T_{transfer}

Where the variables are defined as:

  1. T_{seek} is the time required for the head to move into position over the right track, typically in the 4-10 ms range. The drive specification should say.
  2. T_{rotate} is the time required for the disk to spin so that the right sector sits under the head. Typically between 1-8 ms, depending on how many degrees the disk has to rotate. The time for a single full rotation is found by dividing 60 seconds by the RPM: 60s / 7200 \approx 8.3\,ms.
  3. T_{transfer} is the delay transferring data from or to the disk, depending on the transfer speed of the medium. Find the time per megabyte and multiply by the amount of data transferred (e.g. the sector size).
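
A small made-up example: for a 7200 RPM disk with a 4 ms average seek, a full rotation takes 60s / 7200 \approx 8.3\,ms, so the average rotational delay is about 4.2 ms. Transferring a single 4 KB request at roughly 125 MB/s takes about 0.03 ms, giving

T_{total} \approx 4\,ms + 4.2\,ms + 0.03\,ms \approx 8.2\,ms

which shows why the mechanical movement dominates for small random requests.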

10. File Systems (ch. 39, 40)

💡 What are file and file system?

A file is in its simplest form, just a segment of bytes associated with an inode in a file system. The file system also keeps metadata about the inode, such as permissions, creation and update time, size and file owner.

The file system tracks a hierarchy/tree of inode entries (files). Different operating systems use different file systems: Linux uses the ext family, Windows uses FAT, and MacOS uses APFS. File systems map paths to inodes.

💡 What are the three names OS uses for a file?

  • Inode: unique identifier across the entire file system
  • Path: textual description of where the file lies in the tree
  • File descriptor: in-memory reference to an inode in the file system, constructable from a path.

💡 What do the common file operations, e.g., open(), read(), etc, perform? What is the offset of a file and how do different file operations change the offset? Given a code snippet, you should be able to analyze it.

  • open() creates a file descriptor for the inode at the given path. You specify which access mode you want (read/write/etc.)
  • read() reads n bytes from the file, advancing the file descriptor’s offset
  • write() writes to an in-memory buffer before the data reaches the file; flushable with fsync()
  • rename() changes the directory entry, but does not physically move the file data
  • dup() creates another file descriptor referring to the same open file entry. It shares flags and offset with the original one.

💡 How is the on-disk structure organized?

  1. Superblock comes first, containing metadata about all the inodes and data blocks available on the system, as well as data block size.
  2. Inode bitmap tracks all the free inodes
  3. Data bitmap tracks free data blocks
  4. The inode table stores information about each inode on disk, and points to a set of data blocks
  5. Data blocks have the actual data in them, but can also act as a proxy/indirection for large files.

💡 What is directory?

A directory is a file with a certain flag enabled. Directory data is a set of tuples of (path, inode) which tells us which inode to visit to traverse further down the tree.

💡 How do different file operations access the on-disk structure?

  • open() requires you to traverse down the path, reading the permissions of each directory inode, plus the file’s inode itself, to ensure you may access the file with the wanted permission level.
  • read() will consult the inode to find the data block for a given offset, then read the data block at the address given by the inode, and update the last-access time of the inode upon completion
  • create() first traverses like open(), then allocates a new inode: it reads and updates the inode bitmap, writes the new inode, and updates both the parent directory’s data and its inode (access/creation times). If file data is written, data blocks are allocated via the data bitmap as for write().
  • write() needs to read the inode to find out where to write; if a new block is needed, it consults the data bitmap to find a free block, updates the bitmap, writes the actual data, and finally updates the inode