Exam 2 study guide
The one-hour study guide for exam 2
Latest update: Thu Feb 23 19:03:38 EST 2017
Disclaimer: This study guide attempts to touch upon the most important topics that may be covered on the exam but does not claim to necessarily cover everything that one needs to know for the exam. Finally, don't take the one hour time window in the title literally.
Linking and Libraries
A CPU executes a process by reading instructions. These instructions reside in memory. They, in turn, may request the processor to read and write other memory locations. Each instruction and each datum is read from a memory location, an address. Addresses may be either absolute or relative. An absolute address is the actual memory location that is being referenced. A relative address is expressed as an offset to the contents of another register. The most common relative addressing mode is PC-relative, where the address is expressed as an offset to the program counter (also called instruction pointer).
An executable program is typically built from several source files and makes references to library functions. A linker is responsible for combining these separately-compiled files and filling in undefined symbols in one module with their definitions from another module. Static linking is the case where a linker resolves all symbols and includes all the code that is needed to run the program within the executable file. The operating system's program loader can load it into memory and run it.
Dynamic linking is linking where certain libraries are not incorporated into the executable file but are loaded on demand when the program is run. The linker in this case links the program with stub libraries — replacement libraries whose only task is to load the actual library. Dynamic linking enables the use of shared libraries. The same block of code containing the library can be referenced by multiple processes that need it. If a process requires a shared library and it is not present, the program loader will allocate memory and load it from the library file. If, on the other hand, it is already loaded into memory, the program loader will patch the process' references to the library that is already in memory. Because a shared library does not know where it will be loaded in memory, it has to be compiled to use relative addressing (called position independent code). The operating system keeps track of the number of references to each shared library. When the last process that uses a shared library terminates, the library is unloaded from memory.
The benefit of dynamically-linked shared libraries is twofold: the executable program file shrinks because it does not need to contain code for all the libraries that it uses, and memory is used more efficiently in the system because each process does not need to have its own copy of the library code in memory.
The most rudimentary approach to dealing with memory is single partition multiprogramming, where we allow only a single program to be loaded in memory at any given time. The program shares its memory with the operating system and nothing else. This was the approach adopted in early computers and early personal computers prior to time sharing.
Since we want to be able to switch between multiple executing programs, we need to have those programs loaded into memory. The more programs that are loaded in memory, the greater the chance that one of them is ready to run, as opposed to being blocked waiting on I/O. CPU utilization is the percentage of time that the CPU is not idle. The number of processes in memory at one time – and candidates for running – is known as the degree of multiprogramming.
If multiple programs are loaded into memory at once, they will, of course, occupy different memory locations. Absolute code uses real memory addresses and expects to be loaded into and run from a specific memory location. Position independent code uses relative addresses throughout and may be loaded into any memory location. Of course, there's a risk that some absolute memory references may have been used, in which case the process may accidentally or maliciously access data that belongs to another process. Another approach is not to rely on the compiler to generate position independent code but instead leave memory references blank and generate a relocation table that will allow those references to be filled in when the program is loaded.
A completely different approach to this problem is to rely on hardware to dynamically transform address references made by the process into actual memory locations. This is known as logical, or virtual addressing. A memory management unit (MMU) performs this translation. The simplest approach to logical addressing is to add a base address to each memory reference made by a process. This address will be different for each process and will be based on where in memory that process was loaded. The MMU can also check the virtual address against a limit to ensure that the process is accessing only memory within the process. This is base & limit addressing. Some hardware may support breaking up a process's memory into several logical segments (code, data, stack) and each such segment is assigned its own register containing the base offset address for that segment. A logical address is effectively a combination of a segment identifier and an offset within that segment. The Intel memory model refers to the combined segment value and offset as a far pointer. This system is known as segmentation.
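A minimal sketch of base & limit translation in Python (the base, limit, and addresses here are hypothetical values, not taken from any real system):

```python
# Base & limit address translation as an MMU might perform it.
# Every logical address is checked against the limit, then offset by the base.

def base_limit_translate(logical_addr, base, limit):
    """Return the physical address for a logical address, enforcing the limit."""
    if logical_addr >= limit:
        # Out-of-range access: the MMU would raise a trap to the OS
        raise MemoryError("address outside the process's memory")
    return base + logical_addr

# A process loaded at physical address 0x40000 with a 64 KB limit:
print(hex(base_limit_translate(0x1234, base=0x40000, limit=0x10000)))  # 0x41234
```

With segmentation, the same idea applies per segment: each segment register supplies its own base, and the offset portion of the far pointer plays the role of the logical address.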
The base & limit addressing approach requires each process to occupy contiguous memory. Segmentation allows several variable-sized chunks of contiguous memory but each segment (e.g., stack) must occupy a contiguous region. With multiple fixed partitions, the operating system divides memory into a bunch of partitions ahead of time. When a program needs to be loaded, it is loaded into an available partition that can hold it. If no partition is available, the loading will have to be delayed in a queue of requests. Unused space within a partition remains unused. This is internal fragmentation. If any partitions have no programs to run then their memory is always wasted. This is called external fragmentation. With variable partitions, the operating system creates a partition for a process on demand, using a block of contiguous memory that is big enough to hold it. As processes come and go, holes (unallocated regions) develop and there may not be new processes that fit into any available hole. Memory compaction can relocate processes in memory but is a time-consuming approach. If a process grows, it will need more memory and the operating system will need to find a bigger unallocated block and move the process, relocate other processes, or save the process's memory onto disk and wait until a big block frees up. All of these hugely impact the performance and responsiveness of a multitasking system.
Page-based virtual memory
With a page-based memory management unit, physical memory is divided into equal-sized chunks whose size is a power of two (that way, we can use a fixed number of bits to address a byte offset within the chunk). These chunks are called page frames. A logical memory address is divided into two components. The high-order bits of the address identify a page number and the low-order bits identify a page offset, the byte offset within a page. For example, a 32-bit address with 4 KB pages will use the lower 12 bits (2^12 = 4096) to identify the page offset and the 20 high-order bits to identify the page number. For each memory reference made by the processor, the memory management unit takes the logical address, extracts the page number, and uses that as an index into a page table. Each page table entry (PTE) contains a page frame number. The physical memory address generated by the MMU has the page number from the logical address replaced with the corresponding page frame number from the PTE. A direct mapping paging system contains a page table per process. When the operating system performs a context switch to run a specific process, it loads a page table base register with the address of that process' page table. Each PTE in a page table contains a page residence bit that indicates whether that specific page is mapped onto a page frame. If it is, then the PTE also contains the page frame number that corresponds to that page. If the page is not mapped onto a page frame, then a page fault is generated. This is an interrupt that is handled by the page fault handler of the operating system. The fault handler will examine the memory address of the instruction that faulted and decide whether the process needs to be terminated because of an invalid memory access or whether the access was indeed valid but the requested page was not mapped onto a page frame.
The PTE also contains other flags, such as whether the page is writable, executable, and whether it has been accessed or modified.
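The lookup described above can be sketched as follows; the page table contents are fabricated for illustration, and a real MMU performs this in hardware:

```python
# Page-based address translation with 4 KB pages (12 offset bits).
PAGE_SIZE = 4096
OFFSET_BITS = 12

# Per-process page table: page number -> page frame number (None = not resident)
page_table = {0: 7, 1: 3, 2: None}

def page_translate(vaddr):
    page = vaddr >> OFFSET_BITS        # high-order bits: page number
    offset = vaddr & (PAGE_SIZE - 1)   # low-order bits: offset within the page
    frame = page_table.get(page)
    if frame is None:
        raise LookupError("page fault")  # would trap to the OS page fault handler
    return (frame << OFFSET_BITS) | offset

print(hex(page_translate(0x1ABC)))   # page 1 maps to frame 3, giving 0x3ABC
```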
If we had to consult a memory-based table for every memory operation, the memory system would effectively be twice as slow since each memory access would require another memory access to look up the location of the page in the page table. To improve performance, the MMU employs an associative cache known as a translation lookaside buffer (TLB) to store recently-accessed page numbers and their corresponding PTEs. Since the corresponding PTE is a function of the per-process page table that we used to do the lookup, most MMUs employ an address space identifier (ASID) in their TLB to identify the process to which an entry belongs. During a context switch, the operating system must set the ASID so that only cached values for that specific ASID will be used. The hit ratio is the percentage of memory references that can be satisfied from the TLB without resorting to a page table lookup.
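As a worked example of why the hit ratio matters, we can compute an effective access time. The 100 ns memory access cost and the negligible TLB lookup time are assumptions for illustration:

```python
# Effective memory access time with a TLB. A hit costs one memory access;
# a miss costs an extra access to fetch the PTE from the in-memory page table.

def effective_access_time(hit_ratio, mem_ns=100):
    hit_cost = mem_ns
    miss_cost = 2 * mem_ns
    return hit_ratio * hit_cost + (1 - hit_ratio) * miss_cost

print(effective_access_time(0.95))   # about 105 ns with a 95% hit ratio
```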
A page table allows the process to potentially be able to address all system memory. Much of that memory will not be used by the process but we still pay the penalty of having to maintain a possibly large page table (our earlier example of 20-bit page numbers means that we need a table with 1,048,576 entries). To save on memory, we can use a multilevel page table. The virtual address is split into three or more parts. The lowest bits still give us the offset within a page and page frame. The very highest bits serve as an offset into a top-level index table. Each entry here contains the base address of a partial page table. The next set of bits of the virtual address serves as an offset into this partial page table. Entries in this table are PTEs that include the page frame address and various access flags.
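For a concrete split — 10 bits of top-level index, 10 bits of second-level index, and a 12-bit offset, an assumed layout that happens to match IA-32's two-level table — the address decomposition looks like this:

```python
# Decompose a 32-bit virtual address for a two-level page table:
# 10-bit top-level index | 10-bit second-level index | 12-bit page offset

def split_vaddr(vaddr):
    top    = (vaddr >> 22) & 0x3FF   # index into the top-level index table
    second = (vaddr >> 12) & 0x3FF   # index into the partial page table
    offset =  vaddr        & 0xFFF   # byte offset within the page
    return top, second, offset

# top-level entry 1, partial-table entry 3, offset 0xABC (2748):
print(split_vaddr(0x00403ABC))
```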
One approach to deal with very large address spaces is the inverted page table. Instead of having a per-process table that maps logical pages to physical page frames, we have a single table with one entry per page frame. For a given virtual address, we search this table for a page frame that contains that virtual address. Clearly, searching a table is not good for memory access performance. We need to rely on an associative cache and the use of hashing to perform a lookup.
A paging system has a number of attractive features:
- It does not let the amount of physical memory limit the size of our process.
- It allows the process to feel like it owns the entire address space of the processor.
- It allows memory allocation to be non-contiguous and has no external fragmentation problems, simplifying memory management and obviating any need for compaction.
- It lets us keep more processes in memory than the sum of their memory needs so that we can keep the CPU utilization as high as possible.
- It lets us keep just the parts of a process that we're using in memory; the rest may be stored on a disk.
Systems such as the Intel IA-32 architecture employ a combined segmentation and paging system. A logical address is offset by a segment base register. This resultant address is a linear, virtual address and is looked up in a two-level page table to generate a physical address.
Demand paging refers to the technique of allocating and loading memory pages on demand; that is, when a process makes a request for that region of memory. When a process makes a memory reference to a page that is not mapped onto a page frame, the MMU generates a page fault. The page fault handler checks whether the reference is a valid memory reference or not. If it is not valid then the process is terminated. If the address references executable code or static data, then those contents are present in the program file on the disk and are read from there. If the memory reference is to a stack or heap that grows then a new frame is allocated and initialized. Finally, if the memory reference refers to stack or heap data that was written out to a page file on the disk because the memory manager needed a free page frame, then a page frame is allocated, the disk read is scheduled, and the operating system context switches to some other process while this one waits for the memory contents it needs to be loaded into memory.
Page replacement is the process of finding a page to save out onto disk to create a free page frame. Since it requires disk activity, which is much slower than memory, the hope of an operating system is to minimize paging — shuffling page contents between memory and the disk. Ideally, we would like to use a least recently used (LRU) algorithm to select a page for replacement. The snag is that the MMU gives us no way to track when each page was last referenced, so we need to find an efficient algorithm that comes close to LRU. Memory management units have a "referenced" bit that tells us if a page was referenced. The clock algorithm (also known as the second chance algorithm) looks at page frames as a circular list. When a page needs to be selected for eviction, the algorithm starts at its current position and looks for page frames whose referenced bits are 0 (they have not been referenced in a while). As it does this, it sets the referenced bits of pages it skipped over to 0.
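A sketch of one eviction pass of the clock algorithm; the frame contents and referenced bits are made-up values:

```python
# Clock (second chance) page replacement. Frames are [page, referenced] pairs
# treated as a circular list; `hand` is the algorithm's current position.

def clock_evict(frames, hand):
    """Return (victim_index, new_hand), clearing referenced bits passed over."""
    while True:
        page, referenced = frames[hand]
        if referenced:
            frames[hand][1] = 0               # give the page a second chance
            hand = (hand + 1) % len(frames)
        else:
            return hand, (hand + 1) % len(frames)

frames = [["A", 1], ["B", 0], ["C", 1]]
victim, hand = clock_evict(frames, 0)
print(frames[victim][0])   # B: the first page found with a clear referenced bit
```

Note that page A survives this pass but has its referenced bit cleared, so it will be evicted on the next pass unless it is referenced again in the meantime.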
The nth chance replacement algorithm is similar except that it keeps a counter per entry. When we search for a page, if a page frame's referenced bit is 1, we set it to 0 and clear the counter. If the page frame's referenced bit is 0, we increment the counter. We continue until we find a page whose referenced bit is 0 and whose counter is greater than or equal to some value N. This algorithm effectively counts the number of times that we examined the page without it having been referenced.
As a process executes, it accesses certain memory locations. Over some small interval of time, a process is unlikely to reference all of its memory. Instead, there's a principle of locality at work: a process tends to reference the same pages over and over before moving on to a few other pages. This set of pages is known as the process' working set. Some processes may have bigger working sets than others. In all cases, for good performance, we want to have each process' working set in memory. If this is not the case, the system will exhibit thrashing: constantly paging to and from the disk. A process' resident set is the set of pages that is currently mapped to page frames. The working set model approximates the locality characteristics of a process to try to identify the set of pages that a process needs in memory to have its working set resident and avoid thrashing. One way to manage the resident set is by monitoring page fault frequency. If a process does not have its working set in memory, it will generate page faults. This is an indication that it needs more memory. Conversely, if a process is never generating page faults, that could be an indication that it has too much memory allocated to it. We can set thresholds of page faults to decide what percentage of available page frames should be allocated to a specific process.
A block device provides structured access to the underlying hardware. Block devices are devices that can host a file system. They provide addressable (block numbers) I/O of fixed, equal-sized chunks of bytes. Since block data is persistent, it is suitable for caching in memory. The buffer cache is a pool of kernel memory that is allocated to hold frequently used blocks from block devices. By using the buffer cache, we can minimize the number of I/O requests that actually require a device I/O operation.
A character device provides unstructured access to the underlying hardware. Examples of character devices include printers, scanners, mice, and a video frame buffer. Unlike a block device, there is no concept of seeking: addressing data by its location. Whereas in a block device, we may have a concept of, say, block 1,523 containing a chunk of data that will never change unless we change it, a character device presents a continuous sequence of bytes.
A network device is similar to a character device except that instead of sending and receiving a stream of bytes, it sends and receives bytes in chunks (packets).
On POSIX systems, the kernel maintains two tables to keep track of devices: a block device table and a character device table. The major number is an index into either the block or character device table and allows the kernel to find the set of functions that are associated with that device: the device driver. The minor number is interpreted within the device driver. It usually identifies which specific instance of a device is being accessed. For example, SATA disks will have one functional interface (driver) but multiple disks. The major number will identify the SATA driver and the minor number will identify a specific disk.
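The dispatch can be sketched like this. The SataDriver class and the table contents are hypothetical stand-ins for illustration, not a real kernel interface:

```python
# Major number -> driver lookup; minor number selects the device instance.

class SataDriver:
    """Stand-in for a driver's set of functions (its functional interface)."""
    def read(self, minor, block):
        return f"read block {block} from SATA disk {minor}"

block_device_table = {8: SataDriver()}   # assume major number 8 is our driver
character_device_table = {}              # separate table for character devices

def device_read(major, minor, block):
    driver = block_device_table[major]   # major number indexes the device table
    return driver.read(minor, block)     # minor number picks the specific disk

print(device_read(8, 1, 1523))
```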
Devices, particularly on POSIX-based systems, are usually presented as file names via the hierarchical file system name space. A device appears as a file within the file system. It can have an arbitrary name and there is no data associated with this file but the file's metadata identifies the type of device (block or character) and the device's major and minor numbers, which identify the specific instance of the device.
By placing devices in the same hierarchical name space as files and ensuring that each device implements a well-defined set of I/O operations, the operating system can provide a degree of transparency where applications can often be unaware of whether they are interfacing with a file or a device. For example, the shell can just as easily read commands from a terminal window (device) as it can from a file containing a script. When a file is first opened, the kernel can check the file's attributes to determine if it is a device file. If so, it reads the type of device (block or character) and its major and minor numbers from the file's metadata. The kernel then sends operations to the driver for that device. The underlying file system where that device file resided does not get involved.
Any kernel code is running in one of three contexts:
- interrupt context: this is invoked by the spontaneous change in the flow of execution when a hardware interrupt takes place. Any actions performed in this context cannot block because there is no process that requested the action that could be put to sleep and scheduled to wake up when an operation is complete.
- user context: this is the flow of control when a user process invokes a system call. The mode switch to kernel mode took place but the context is still that of a user process. If needed, the kernel may schedule an I/O operation, put the process in a blocked state, and context switch to another process.
- kernel context: the kernel itself has one or more worker threads that it schedules just like any other process. Even though these have no relation to user threads, they have a context and may block. Note: the kernel context is not the context of a process that has just made a system call; that is user context running in kernel mode.
Device drivers are modular and can be compiled into the kernel, loaded at initialization, or dynamically loaded at any time in the future. Upon initialization, they register themselves with the kernel's interrupt handler. When a specific interrupt occurs, the kernel handler calls the appropriate interrupt handling function in the driver. To ensure that the interrupt context does not take too long (since it cannot block), interrupt handling is split into two parts:
- The top half of a driver is the interrupt service routine that is registered with the interrupt handler. It tries to do as little as possible — generally grabbing data, placing it into a work queue (a buffer), and scheduling bottom half activity.
- The bottom half of a driver is the part that is scheduled by the top half for later execution and does much of the real work. Because it runs in a kernel thread (that handles work queues), it may perform blocking operations.
The top and bottom halves of a driver communicate via I/O queues. A device status table keeps track of devices and the current status of each device. Each device has an I/O queue associated with it. Any received I/O (reads from the device or writes by the user) is placed on the work queue and scheduled by a kernel thread.
Even though solid state flash storage is becoming increasingly popular, spinning magnetic disks are still the dominant device for storing large amounts of information. Compared to memory, a disk is an incredibly slow device. The buffer cache helps with the performance of frequently-accessed blocks. Beyond that, we rely on disk scheduling algorithms to optimize the flow of data between memory and the disk.
The simplest algorithm is First Come, First Served (FCFS), which processes requests in the order in which they are generated. The elevator algorithm (SCAN) takes advantage of the fact that a disk has a head that moves back and forth on the disk. It relies on the system to keep track of the current head position and direction. Requests are sorted in the order in which they will be encountered as the head moves from track 0 to the highest track on the disk. The head movement then reverses and any outstanding requests are processed from highest to lowest as the head seeks back to track 0.
The LOOK algorithm simply reverses the disk head's direction if there are no blocks scheduled beyond the current point. The Circular SCAN (C-SCAN) algorithm is the SCAN algorithm but schedules disk requests only when the head is moving in one direction to avoid preferential treatment of blocks in the middle. The Circular LOOK (C-LOOK) algorithm is like LOOK but schedules requests in only one direction. In the Shortest Seek Time First (SSTF) algorithm, the queue of current requests is sorted by proximity to the current head position and requests are serviced in that order.
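The service orders these algorithms produce can be compared with a small simulation; the request queue and head position are illustrative values (LOOK services requests in the same order as SCAN — only the head travel differs — so one function covers both orderings):

```python
# Disk scheduling: FCFS, SSTF, and the SCAN/LOOK service order.

def fcfs(requests):
    return list(requests)                 # serve in arrival order

def sstf(requests, head):
    pending, order = list(requests), []
    while pending:                        # always pick the closest cylinder
        nearest = min(pending, key=lambda c: abs(c - head))
        pending.remove(nearest)
        order.append(nearest)
        head = nearest
    return order

def scan_order(requests, head):
    going_up = sorted(c for c in requests if c >= head)
    going_down = sorted((c for c in requests if c < head), reverse=True)
    return going_up + going_down          # sweep up, then reverse direction

queue = [98, 183, 37, 122, 14, 124, 65, 67]
print(sstf(queue, head=53))         # [65, 67, 37, 14, 98, 122, 124, 183]
print(scan_order(queue, head=53))   # [65, 67, 98, 122, 124, 183, 37, 14]
```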
File systems provide us with the abstraction of persistent and protected objects sitting in a hierarchical name space. The actual data resides on block devices: disks or flash memory. The specific file system that is used on that device defines the data structure on the disk blocks that defines how directories, files, free space information, and permissions are stored.
Multiple individual file systems can be combined into an abstract view of a single hierarchical name space via a mount mechanism. By executing a mount system call and providing it with the block device, underlying file system type, and a mount point in the current file system, the operating system will place the root of the file system on that device right at the given mount point. The operating system will maintain a list of mount points so that attempts to traverse directories will switch to the appropriate device when a mount point is encountered.
In the past, operating systems would support a single implementation of a file system. To support multiple file systems transparently, a layer of abstraction called the Virtual File System (VFS) is used. This is an object-oriented approach to looking at file systems. System calls that interact with files communicate with the VFS layer. This layer implements high-level operations related to managing files and directories that are common to all file systems. The VFS layer then interfaces with one or more file system modules that implement the specific underlying file system. This file system module interacts with the buffer cache and with block device drivers to read and write data to the device containing the file system. The layers are:
- system call interface: APIs for user programs
- virtual file system: manages the namespace, keeps track of open files, reference counts, file system types, mount points, pathname traversal.
- file system module (driver): understands how the file system is implemented on the disk. Can fetch and store metadata and data for a file, get directory contents, create and delete files and directories
- buffer cache: no understanding of the file system; takes read and write requests for blocks or parts of a block and caches frequently used blocks.
- device drivers: the components that actually know how to read and write data to the disk.
At the VFS layer, the crucial components are:
- An inode uniquely identifies a file. In the file system, it stores key information about the attributes of a file and the location of its data. At the VFS layer, it contains functional interfaces for creating files, directories, symbolic links, and reading and setting attributes. In short, it contains directory and file operations that do not relate to the file's data.
- A dentry, short for directory entry, is an abstract representation of directory contents. It contains the filename as a string, its inode (so we can access its contents), and a pointer to its parent dentry. A directory file can be read and converted into a set of dentry structures.
- The VFS interface keeps track of open files and contains a set of interfaces to open, close, read, and write files as well as to map them to memory or lock them. The file object represents an open file and keeps track of the access mode and reference counts.
- The superblock contains interfaces to get information about the file system, read/write/delete inodes, and lock the file system.
Managing files on disks
The lowest level of working with file systems and storage media is managing and allocating free disk blocks to files. For flash memory, there are no performance implications of using one block versus another one. For disks, however, that is not the case. Seek times greatly affect performance and the goal is to keep frequently-used or related blocks together. Ideally, files would use contiguous allocation. That is, if a file's data requires multiple disk blocks, the blocks would be contiguous. Unfortunately, as files get deleted and new ones get created, we'll find that we get external fragmentation as regions of free blocks too small for most files get scattered between existing files. Moreover, we usually don't know ahead of time how much space a file needs.
A compromise to contiguous allocation is to use extents. An extent is a contiguous set of blocks. A file will comprise one or more extents. File systems that use extents generally refer to extents (as a <block number, size> tuple) instead of block numbers when managing a file's data. A cluster is a logical block used throughout the file system that is a grouping of multiple physical blocks. For example, many disks provide 512 byte blocks but a file system may use 4096-byte clusters as its basic allocation unit, reading and writing eight blocks at a time. This results in better performance but also increases the amount of internal fragmentation in the file system.
The technique of linked allocation uses a few bytes of each block to store the block number of the next block in a file. An alternate implementation of this is a file allocation table (FAT). Here, a table of block numbers is created. The entry at table[block_num] contains the block number that follows the block block_num in the allocation list for a file. The directory, in addition to containing the file name, will also store the first block number that is allocated to the file. The entire chain of blocks can be read from the file allocation table.
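Following a file's chain through the FAT can be sketched as below; the table contents and the end-of-chain marker are fabricated for the example:

```python
# File allocation table: FAT[block] is the next block in the file's chain.
# -1 marks the end of a chain.

FAT = {4: 7, 7: 2, 2: 10, 10: -1, 5: 9, 9: -1}

def file_blocks(first_block):
    chain, block = [], first_block
    while block != -1:
        chain.append(block)
        block = FAT[block]        # follow the allocation list
    return chain

# A directory entry stores the file name and its first block, say block 4:
print(file_blocks(4))   # [4, 7, 2, 10]
```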
For FAT to be efficient, the entire table must be kept in memory. As an alternative approach, we can read in just the list of blocks used by a specific file when we open the file. What this means is that with each file we associate the entire list of blocks that the file uses. This approach is called indexed allocation but it isn't practical since these will be varying-size lists and, for big files, may be extremely long. A variation of this, combined indexing, uses fixed-length structures. The information structure for a file, the inode, contains a list of just the first several blocks of the file. These are direct block pointers. If a file needs to use more blocks, the inode contains an indirect block pointer. This is the number of a block that contains a list of direct block pointers (block numbers for data). If a file needs even more blocks, a double indirect block pointer in the inode will point to a block whose contents are a list of indirect block pointers. This is the approach used in the traditional UNIX file system as well as Berkeley's fast file system (FFS) and Linux's ext2, ext3, and ext4 file systems. A directory is just a data file that contains a series of names, each with its associated inode number. The inode contains file permission information, creation/modification/access times, an identification of whether the file is a regular file or a device, and direct, indirect, double, and (in some cases) triple indirect block numbers.
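As a worked example of the reach of combined indexing, assume (ext2-style; the exact numbers vary by file system) 4096-byte blocks, 4-byte block pointers, 12 direct pointers, and single, double, and triple indirect pointers:

```python
# How much data the block pointers in an inode can address, under the
# assumed parameters above.

BLOCK_SIZE = 4096
PTRS_PER_BLOCK = BLOCK_SIZE // 4       # 1024 pointers fit in one block

direct = 12
single_indirect = PTRS_PER_BLOCK       # one block full of direct pointers
double_indirect = PTRS_PER_BLOCK ** 2  # pointers to blocks of pointers
triple_indirect = PTRS_PER_BLOCK ** 3

max_blocks = direct + single_indirect + double_indirect + triple_indirect
print(max_blocks * BLOCK_SIZE)         # 4402345721856 bytes, about 4 TB
```

Note how quickly the indirection levels grow: the 12 direct pointers cover only 48 KB, but a single indirect block covers 4 MB and a double indirect block 4 GB.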
File system implementation case studies
Unix File System (UFS)
The Unix File System is an implementation of the inode-based design just described. When laid out on the disk, it contains three sections: a super block, inode blocks, and data blocks. The super block contains information about the file system, including a pointer to a linked list of free blocks. Directories are files that contain a list of names and corresponding inodes. An inode identifies a file as being a regular file, directory, or device file.
The performance of UFS was generally quite poor (2-4% of raw disk bandwidth). Several factors contribute to this inefficiency. The linked list approach to free block allocation leads to a high degree of fragmentation over time. In addition to our earlier encounters with the terms internal fragmentation and external fragmentation, we have another use of the term within file systems. Fragmentation is when a file's data is not allocated among contiguous blocks. The small block sizes typically used by UFS make that degree of fragmentation even higher than it would be otherwise and necessitate the use of double and triple indirect blocks for even not-very-large files. Finally, the fact that inodes reside at one end of the disk (beginning) and much of the data resides at the other end (middle to end) leads to large disk seeks between reading a file's inode and that file's data.
Berkeley Fast File System (FFS)
The Berkeley Standard Distribution variant of UNIX redesigned the file system to improve performance in several areas:
- Larger clusters. FFS chose a larger cluster size: at least 4 KB instead of the 512 bytes or 1 KB of UFS. This brings about an eight-fold improvement in contiguous allocation. Also, each block pointer in an inode or disk block now addresses much more data, so a file has to be far bigger before indirect, double indirect, and triple indirect blocks have to be used. To avoid too much internal fragmentation from large cluster sizes, FFS allowed the last cluster of a file to be a fragment, part of a cluster.
- Cylinder groups. Instead of having one region for inodes and another for data, multiple such regions were created. FFS looks like lots of little file systems laid out on one disk. In most cases, the data that corresponds to an inode will be located in the same cylinder group, minimizing seek times.
- Bitmap allocation. Instead of using a linked list of free blocks, a bitmap is used to keep track of which blocks are in use within each cylinder group. This makes it easy to find contiguous or nearby clusters for a file's data rather than just using any available free cluster. Moreover, FFS will preallocate adjacent clusters to a file on the chance that the file may need them as it grows.
- Prefetch. If two or more sequential blocks are read from a file, FFS assumes that file access is sequential and will prefetch additional blocks from that file.
These changes resulted in a 5-10 times performance improvement over UFS.
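The advantage of bitmap allocation is that it makes searching for adjacent free clusters cheap, something a linked list of free blocks cannot offer. A sketch of the idea (not actual FFS code):

```python
def find_contiguous(bitmap, n):
    """Return the index of the first run of n free (0) clusters, or -1."""
    run_start, run_len = 0, 0
    for i, used in enumerate(bitmap):
        if used:
            run_start, run_len = i + 1, 0   # run broken; next candidate start
        else:
            run_len += 1
            if run_len == n:
                return run_start
    return -1

# 1 = cluster in use, 0 = cluster free
bitmap = [1, 1, 0, 1, 0, 0, 0, 1, 0]
print(find_contiguous(bitmap, 3))   # → 4 (clusters 4, 5, 6 are free)
```

With a free list, the allocator simply takes whatever block happens to be at the head of the list, regardless of where it sits on the disk; the bitmap scan above is what lets FFS place a file's data in nearby clusters and preallocate adjacent ones.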
Linux's ext2 file system is an adaptation of Berkeley's FFS. With bigger disks, the space wasted by internal fragmentation was no longer considered precious enough to save, so fragment management was abandoned to simplify cluster allocation. Instead of cylinder groups, ext2 uses the term block group, since modern disks make it impossible to identify cylinders; we just have the abstraction of a long linear list of blocks. To improve performance over FFS at the expense of possible disk corruption, ext2 caches all data aggressively in the buffer cache. FFS, by contrast, would flush metadata changes aggressively to minimize corruption of the file structure, even though the data was still at risk.
Journaling and Linux ext3
A file system is consistent if all free block and inode bitmaps are accurate, all allocated inodes are referenced by files or directories, and each data block is either marked free or referenced by an inode. Whenever a file is modified, a disk block may be allocated, the free block bitmap modified, the inode modified, and a disk block written. If anything happens to the system, such as a spontaneous loss of power, only some of these changes may have been written out to the disk, resulting in an inconsistent state. The consistent update problem is the challenge of performing all updates atomically: either all of them should be applied to the file system or none should be, ensuring that the file system always remains in a consistent state. Journaling is a method of providing this guarantee of atomicity. All changes associated with a file modification are first written as a log to a journal (a dedicated area on the disk). The file system itself is not touched. When all the changes have been logged, the journal is marked with a transaction-end record. Only when the transaction has been made permanent on the disk is the same sequence of changes applied to the file system. When that is complete, the transaction entry is deleted from the journal. Should anything happen to the system, upon restart any complete transactions in the journal are applied to the file system to bring it to a consistent state.
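The key to atomicity is the transaction-end marker: during recovery, a transaction without one is simply discarded, so the file system sees either all of a transaction's writes or none of them. A simplified in-memory sketch of this write-ahead logging and replay (hypothetical structures, not any real journal format):

```python
journal = []      # list of journal records, written before the file system
disk = {}         # the "file system": block number -> data

def begin():               journal.append(("BEGIN",))
def log_write(block, data): journal.append(("WRITE", block, data))
def commit():              journal.append(("END",))   # transaction-end marker

def recover():
    """Replay only transactions that reached their END marker."""
    pending = []
    for rec in journal:
        if rec[0] == "BEGIN":
            pending = []
        elif rec[0] == "WRITE":
            pending.append(rec)
        elif rec[0] == "END":
            for _, block, data in pending:   # apply the complete transaction
                disk[block] = data
            pending = []
    # any unfinished transaction (no END record) is discarded

begin(); log_write(5, "inode"); log_write(9, "data"); commit()
begin(); log_write(7, "half-done")     # "crash" before commit()
recover()
print(disk)    # → {5: 'inode', 9: 'data'}; block 7 was never applied
```

The same replay can be run any number of times with the same result, which is why recovery after a crash is safe even if it is itself interrupted.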
The downside of journaling is performance: all disk operations are applied to the journal and then applied again to the file system. By giving up some integrity, there are several ways that performance can be improved.
- Full data journaling
- As described above, all disk operations are written to the journal first and then applied to the file system.
- Ordered journaling
- This writes data blocks first, then journals only the metadata (inode information, bitmaps, and superblock changes), terminates the transaction, and then applies the journaled metadata to the file system. The file system is always kept in a consistent state, but this may in some cases lead to partially modified files: for example, if data was written to both existing and new blocks of a file and the system died before the end-of-transaction record was written.
- Writeback journaling
- This also journals only the metadata, but without writing the data blocks first. Recently modified files can be corrupted after a crash. It is the fastest of the journaling options and a weak form of journaling since it can lead to data corruption. However, it leads to no more corruption than non-journaled file systems do, and it can save a lot of time in file system recovery.
Linux's ext3 file system is the same as ext2 but with the addition of a journal in each block group.
The ext4 file system is an enhancement to ext3 and can support far bigger files. Two key enhancements were:
- Extent-based allocation. Recall that an extent is a set of contiguous clusters. Instead of listing block numbers in inodes, ext4 uses extents. An extent contains the starting cluster number of a group of clusters along with a cluster count. In cases where contiguous blocks can be allocated, this allows much more data to be addressed from the list of blocks in an inode. Since computing the block that holds a specific byte offset within a file is difficult with extents (you have to iterate through the list of extents), the use of indirect blocks has been replaced with an Htree (similar to a B-tree), a data structure that makes it easy to traverse a tree to search for a block representing a specific offset within a file.
- Delayed allocation of disk space. Instead of allocating data clusters as a file grows, the blocks are kept in the buffer cache and allocated to physical blocks only when they are ready to be flushed. This increases the likelihood that we can allocate contiguous blocks for a file.
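The lookup problem mentioned above is easy to see with a small sketch (hypothetical structures, not ext4's on-disk format): mapping a byte offset to a physical cluster means finding which extent covers the offset's logical cluster, shown here with a linear scan where ext4 would use its tree.

```python
CLUSTER = 4096   # assumed cluster size in bytes

# Each extent: (logical_start, physical_start, count)
extents = [
    (0, 1000, 8),   # logical clusters 0-7  live at physical 1000-1007
    (8, 5000, 4),   # logical clusters 8-11 live at physical 5000-5003
]

def physical_cluster(offset):
    """Map a byte offset within the file to a physical cluster number."""
    logical = offset // CLUSTER
    for lstart, pstart, count in extents:   # linear scan; ext4 uses a tree
        if lstart <= logical < lstart + count:
            return pstart + (logical - lstart)
    raise ValueError("offset beyond end of file")

print(physical_cluster(9 * CLUSTER))   # → 5001
```

Note how two extent records address 12 clusters here; a block-list inode would need 12 separate pointers for the same data, which is the space advantage extents provide when allocation is contiguous.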
NTFS was the successor to Microsoft's FAT family of file systems and has been the file system of choice on every version of Windows Server and every consumer version since Windows Vista. NTFS allocates data in extents on top of clusters and is designed such that just about every structure in the file system appears as a file. Whereas Unix-derived file systems had distinct areas for inodes, journals, and bitmaps, under NTFS they are all just files.
The MFT, or Master File Table, is the structure that keeps track of file records (similar to an inode table). It is structured as a B-tree to facilitate searching for a specific file record. Each file record is roughly similar to an inode in that it contains information about the file. A file record contains an arbitrarily long list of attributes. If the list is too long to fit within a single file record, additional extents are allocated, with an attribute in the record identifying their location. The data associated with a file record is just an attribute. Very small files or directories may contain all their data within a file record and not require any additional disk blocks.
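The "everything is an attribute" idea, including a small file's data living inside its own file record, can be sketched as follows (hypothetical structures and attribute names patterned after NTFS, not the actual on-disk layout):

```python
# A file record is just an ordered list of (attribute name, value) pairs.
record = [
    ("$STANDARD_INFORMATION", {"created": "2017-02-23"}),
    ("$FILE_NAME", "notes.txt"),
    ("$DATA", b"tiny file contents"),   # resident: the data fits in the record
]

def get_attribute(record, name):
    """Return the value of the named attribute, or None if absent."""
    for attr_name, value in record:
        if attr_name == name:
            return value
    return None

print(get_attribute(record, "$DATA"))   # → b'tiny file contents'
```

For a larger file, the `$DATA` attribute would instead hold a list of extents pointing at clusters outside the record; reading a tiny file like this one touches no data blocks at all.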