Papers we love – Summary of Design and implementation of log-structured file system

disk

Here are my takeway points from the papers we love meetup on – Design and implementation of log-structured file system.

Here is the link to the video recording of the meetup held at Shopify Toronto on October 5th, 2016 –

  • Introduction : White paper link
  • What is a Disk
  • How is a disk accessed
    • Seek delay
    • Rotational delay
    • Why is random access on disk slow
  • File System
    • Problems solved by File systems
      • Logical organization of data : What data structure can be used to organize all the file/data
      • Physical layout of data : How the logical data structures map to physical data/files on the disk
    • Input and Output of file system
      • Input : Disk and files
      • Output : Interface for files, how to access files via Operating System
    • Logical Layout
      • inode : The Index node is the basic constituent of any file system.
      • We have one inode per file which points to multiple physical file blocks
      • inode has multiple pointers for the physical addresses
      • We also have multi level redirects where an inode can point to another set of inodes which may in turn point to other inodes or physical data blocks
      • Directories also have similar inodes with dir name and i node addresses of files present in the directory path
      • More info on inodes : http://www.grymoire.com/Unix/Inodes.html
    • Why was Unix slow
      • While the logical layout was well thought, the physical layout of the data was not thought. There was no effort in keeping the data (that’s accessed together more frequently) close to each other. Hence we got a hit in the seek time every time when the data was accessed and there was a lot of physical movement in the disk.
      • So where should the data be stored on the disk?
    • Free BSD Fast File System
      • How to store the physical data smartly
      • Place same file blocks and directory contents on the same track or same cylinder or adjacent cylinders
    • Log Structured File Systems
      • Make file system Append-Only
      • No in-place edits : In place edits require multiple disk seeks and writes, and are more error prone. Think crash scenarios where few of the intermediate steps crash and leave the file system in an inconsistent state.
      • Use sequential access rather than random access
      • Create a journal of operations rather than actually making changes in place in the data and files.
      • Behavior is similar to Write ahead logs used in databases
      • cache smaller writes always in buffer to combine and write in a bulk to the disk
      • create new inodes and new file blocks always whenever there is a add/update request
      • the inode map is then updated with the address of the new inodes/data block addresses
        • the inode map is a huge map of each file and directory
        • any disk block not present in the inode map is either free or garbage
        • inode map is entirely cached in memory all the time
        • inode map is also store in disk for reliability
        • inode map follows all the rules of log structured file system
          • whole map is append only
          • is written to disk every time an update to map id required
          • inode map is apparently not tremendously huge as it looks like
      • Need Garbage collection unless we run out of disk
    • Garbage Collection
      • Needed in intervals to free the old inodes (garbage) which are replaced by the newly created inodes, since we don’t ever overwrite those old inodes, and we need to have some cleaning on the disk
      • blocks are not uniformly distributed
      • Compaction and GC is difficult in general and there are few difficult to answer questions
        • When to do GC
        • How often should GC be performed
        • How much performance degradation occurs while GC: Usually about 40% performance hit with different GC techniques
        • Disk GC is way more slow than the memory GC for lot of factors/seeks/updates/copies etc
    • Checkpointing and Rollbacks
      • Its comparatively easy as long we have not already GC’d the old inodes and data blocks
      • since no data is overwritten anytime, all the inodes and data are present and we just have to find the garbage mappings
    • SSD’s
      • Uses flash memory of Nand’s
      • No mechanical moving parts : So that also means that data location is no longer relevant now, and all the data location based optimizations introduced in log-structured file system are not relevant anymore
      • Much better read and writes compared to the moving arm disks (~100x better)
      • Random write is still bad : solution – avoid random writes
      • to write something to a cell the entire page with the cell has to be re-programmed : this is bad since we have to reprogram the entire page, but as a side effect the GC comes free of cost since we will also free the garbage cells from the page all at once
      • each cell has a limit on the number of times it can be re-programmed: this limit is typically 1k – ~ 100k. To avoid this we should try to spread our workload across the whole ssd to avoid repeated re-programming of fewer cells

Thats all for this post, hope its helpful. Keep in touch for summaries of other white papers and meetups i come across.

Cheers

Leave a Reply

Your email address will not be published. Required fields are marked *