Disk Management
---------------

  Disk Formatting
  - Low-level formatting: writes a data structure for each sector
    consisting of a header, a data area, and a trailer
  - Included in the header and trailer is an Error Correcting Code (ECC)
  - ECC is calculated on a write, checked on a read (and after write)
  - The ECC contains enough info that the correct data can be reconstructed
    if just a few bits are corrupted (see the sketch below)
  - Low-level formatting is done at the factory to test the drive
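
  A minimal sketch (in C) of the ECC idea: a check value is computed on write,
  stored with the sector, and re-checked on read.  A plain XOR checksum is used
  here only for illustration - real drives use error-correcting codes that can
  also repair a few corrupted bits - and the sizes and names are made up:

      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      #define SECTOR_SIZE 512

      /* compute a check value over the data area (stored in the trailer on write) */
      static uint8_t compute_check(const uint8_t data[SECTOR_SIZE]) {
          uint8_t c = 0;
          for (int i = 0; i < SECTOR_SIZE; i++)
              c ^= data[i];
          return c;
      }

      int main(void) {
          uint8_t sector[SECTOR_SIZE];
          memset(sector, 0xAB, sizeof sector);
          uint8_t trailer = compute_check(sector);   /* calculated on write    */

          sector[100] ^= 0x01;                       /* simulate a flipped bit */
          if (compute_check(sector) != trailer)      /* checked on read        */
              printf("ECC check detected a read error\n");
          return 0;
      }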

  - Blocks are grouped into larger chunks called clusters
  - Disk I/O is done via blocks, file system I/O is done via clusters

  - A disk may be formatted as a raw disk (no file system) so that an
    application can place records at specific disk locations itself

  Boot Record
  - a small bootstrap program, located in the master boot record, does this:
    1. locates the OS kernel on disk
    2. loads the kernel into memory
    3. jumps to an address in memory to start the kernel
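
  A purely illustrative sketch (in C) of the bootstrap's control flow.  The
  "disk", "memory", addresses, and helper names are all invented for the
  example; a real boot record is a tiny piece of machine code that uses
  firmware services to carry out the same three steps:

      #include <stdio.h>
      #include <string.h>

      #define KERNEL_LBA  128              /* assumed location of the kernel image */
      #define KERNEL_SIZE (64 * 1024)      /* assumed size of the kernel image     */

      static char disk[1 << 20];           /* simulated disk */
      static char memory[1 << 20];         /* simulated RAM  */

      static void kernel_main(void) { printf("kernel running\n"); }

      int main(void) {
          /* 1. locate the OS kernel on disk (here: a fixed, known block address) */
          const char *kernel_on_disk = &disk[KERNEL_LBA * 512];

          /* 2. load the kernel into memory */
          memcpy(memory, kernel_on_disk, KERNEL_SIZE);

          /* 3. jump to an address to start the kernel (simulated here by
                calling a function through a pointer)                       */
          void (*entry)(void) = kernel_main;
          entry();
          return 0;
      }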

  Bad Blocks
  - Blocks go bad - disks have moving parts
  - Detecting and handling:
    * On boot, fsck can find bad blocks - they can then be marked and
      taken out of use, so the logical block addresses that map to bad
      blocks are no longer used
    * Maintain spare blocks - remap block addresses accordingly
      Name: sector sparing
      This is done by the disk controller
      Yikes! - what happens to OS disk scheduling algorithms?
      The controller and the OS need to talk
      Kind controllers will at least try to get a spare from the same cylinder
    * Better yet - sector slipping 
      example:
        logical block 17 becomes defective 
        first available spare is sector 203
        remap all the sectors from 17 to 202 down one spot
          (202 is copied into spare, 201 into 202, ..., 18 into 19)
        18 is free and 17 can be mapped to it.
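
  A minimal sketch (in C) of both remapping schemes, using an in-memory model
  of the disk and a logical-to-physical remap table; real controllers do this
  in firmware, and the array sizes here are made up:

      #include <stdio.h>
      #include <string.h>

      #define NSECTORS     256
      #define SECTOR_SIZE  512

      static char disk[NSECTORS][SECTOR_SIZE];   /* simulated physical sectors     */
      static int  remap[NSECTORS];               /* remap[logical] = physical slot */

      /* sector sparing: just point the bad logical sector at a spare slot */
      static void spare_out(int bad, int spare) {
          remap[bad] = spare;
      }

      /* sector slipping: copy each sector from spare-1 down to bad+1 one slot
         toward the spare, then map the bad logical sector to the freed slot  */
      static void slip(int bad, int spare) {
          for (int s = spare - 1; s > bad; s--) {
              memcpy(disk[s + 1], disk[s], SECTOR_SIZE);
              remap[s] = s + 1;
          }
          remap[bad] = bad + 1;
      }

      int main(void) {
          for (int i = 0; i < NSECTORS; i++) remap[i] = i;   /* identity mapping */
          slip(17, 203);                      /* the example above: 17 bad, 203 spare */
          printf("logical 17 -> physical %d\n", remap[17]);  /* prints 18 */
          spare_out(42, 204);                 /* sparing, by contrast, moves no data  */
          return 0;
      }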

  Swap Space Management
  - Modern systems combine swapping with virtual memory - pages get swapped
  - Virtual memory uses disk space as an extension of main memory
  - Goal: provide the best throughput for the virtual memory system

  - Linux: a separate partition with a different disk format
  - Other: a file on a normal partition (inefficient because the directory
           structure must be navigated to find the appropriate blocks,
           and if the blocks are not contiguous things get worse)

  - On separate partition
    * raw file system, no directory structure
    * separate swap manager is used to allocate and deallocate
    * internal fragmentation may increase, but this is acceptable
      because the life of data in the swap space is much shorter than
      that of files
    * plus swap space is reinitialized at boot time (frag probs wiped out)

  - Linux:
    * swap space is only used for anonymous memory (heap and stack) or 
      regions of memory shared by several processes
    * swap areas are divided into 4KB page slots that hold swapped-out pages
    * each swap area has an array of counts, one for each slot (see the
      sketch at the end of this section)
      - a count of 0 means the slot is available
      - otherwise the count is the number of processes referencing the slot
    * try 'swapon -s'

    * As a process runs, anonymous pages to support it may be
      allocated in RAM
    * Once modified, an anonymous page must remain in RAM for the
      duration of the program **unless** there is secondary storage to
      write it to - that is swap space
    * Hence! swap space does not inherently slow down a system
      - not having swap space doesn't mean pages will never be evicted from
        RAM, only that Linux has fewer choices about which RAM can be reused
        when demand for pages occurs - thus, it is possible for the
        throughput of a system that has no swap space to be lower than
        that of a system that has some.
      - programs, file system caches, and shared libraries are never 
        put into swap space
      - minimizing swap space really boils down to minimizing wasted
        space, not so much trading off performance
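
  A minimal sketch (in C) of the per-slot reference counts described above;
  the names and sizes are illustrative, not the kernel's actual swap-map
  structures:

      #include <stdio.h>

      #define NSLOTS 1024          /* 4KB page slots in this (tiny) swap area */

      /* count[i] == 0: slot i is free; otherwise it is the number of
         processes referencing the page stored in slot i                */
      static unsigned count[NSLOTS];

      static int alloc_slot(void) {
          for (int i = 0; i < NSLOTS; i++)
              if (count[i] == 0) { count[i] = 1; return i; }
          return -1;               /* swap area is full */
      }

      static void add_ref(int s) { count[s]++; }                  /* shared by another process */
      static void release(int s) { if (count[s] > 0) count[s]--; }

      int main(void) {
          int s = alloc_slot();
          add_ref(s);              /* e.g. a second process maps the same page */
          release(s); release(s);  /* slot is free again once the count hits 0 */
          printf("slot %d count = %u\n", s, count[s]);
          return 0;
      }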


RAID - Redundant Array of Inexpensive Disks
-------------------------------------------

Motivation: 
  assume the MTTF of a single disk is 100,000 hours; for a 100-disk array,
  there will be a failure every 1,000 hours = about 41 days!
  assume some data is lost on a disk failure - 41 days is unacceptable
  but using the disks in a redundant array improves this significantly

  - Mirroring: duplicate all data N times 
    * if all the data fits on M disks, replicate over (N+1)*M disks
      In case of failure - use a good set, then copy after repair
    * expensive
    * Example:
      assume: MTTF = 100,000 hours per disk, N=1, M=1, repair time = 10 hours
      (which includes recovery time)
      prob(a given disk fails in a 10-hour period) = 10/100,000 = 1/10,000
      prob(2 failures in the same 10-hour period) = (1/10,000)*(1/10,000)
                                                  = 1/100,000,000

      let p = prob(data loss in a given 10-hour period); the expected number
      of periods until data loss is:

      ave = p + 2p(1-p) + 3p(1-p)^2 + 4p(1-p)^3 + ...
          =   p(1 + (1-p) + (1-p)^2 + (1-p)^3 + ...)   = 1
            + p(1-p)(1 + (1-p) + (1-p)^2 + ...)        = (1-p)
            + p(1-p)^2(1 + (1-p) + (1-p)^2 + ...)      = (1-p)^2
              ...
            (each sum s = 1 + (1-p) + (1-p)^2 + ... satisfies s = 1 + (1-p)s,
             so s = 1/p, giving the totals on the right)
                                                        ----------------
                                                          1/p

       MTTF (mean time to data loss) = 100,000,000 * 10 hours
                                     = 1 billion hours!
       However, disk failures are not independent due to power failures etc.
       The rate at which read requests can be handled is multiplied by (N+1);
       the transfer rate of each individual read is the same as for one disk
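
    A small sketch (in C) that just reproduces the arithmetic above, with the
    same assumed numbers (per-disk MTTF 100,000 hours, 10-hour repair window,
    two mirrored disks, independent failures):

        #include <stdio.h>

        int main(void) {
            double mttf_disk = 100000.0;  /* hours, per disk                     */
            double repair    = 10.0;      /* hours to replace and re-copy a disk */

            /* prob. a given disk fails during one repair-length window */
            double p_one  = repair / mttf_disk;        /* 1/10,000      */
            /* prob. both disks fail in the same window                 */
            double p_loss = p_one * p_one;             /* 1/100,000,000 */

            /* expected number of windows until data loss is 1/p_loss, so: */
            double hours = (1.0 / p_loss) * repair;
            printf("mean time to data loss = %.0f hours\n", hours);   /* 1e9 */
            return 0;
        }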

  - Bit level disk striping
    * with M disks, put 1/M of each byte onto a disk, ex: M=8, 1 bit
      to each disk.
    * now each access effectively sees sectors M times larger, and the data
      is transferred M times faster since the disks work in parallel; seeks
      are also shorter because each disk holds only 1/M of the data (fewer
      cylinders to traverse to get to the sector to read)
  - Block level disk striping
    * blocks are striped across disks in the same way
    * this is the most common
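
    A minimal sketch (in C) of block-level striping: mapping a logical block
    number to a (disk, block-on-that-disk) pair.  Simple round-robin placement
    with a stripe unit of one block is assumed here; real arrays may use
    larger stripe units:

        #include <stdio.h>

        #define NDISKS 4

        /* with round-robin striping, logical block b lives on disk b % NDISKS
           at offset b / NDISKS on that disk                                   */
        static void locate(long block, int *disk, long *offset) {
            *disk   = (int)(block % NDISKS);
            *offset = block / NDISKS;
        }

        int main(void) {
            for (long b = 0; b < 8; b++) {
                int d; long off;
                locate(b, &d, &off);
                printf("logical block %ld -> disk %d, block %ld\n", b, d, off);
            }
            return 0;
        }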

  Performance improvements
    * increases the throughput of multiple small accesses (that is, page 
      accesses) by load balancing - avoid lots of I/O requests to a
      single hard drive or a few hard drives
    * reduce the access time and transfer time of large accesses -
      the data to transfer is on fewer cylinders

  - RAID Levels
    * Level 0 – non-redundant striping
      Data spread across all disks – no redundancy
      improves performance as noted above

    * Level 1 – mirroring
      Have duplicate copies of all disks
      improves reliability but expensive

    * Level 2 – memory-style ECC organization
      Store two or more extra bits to detect and correct errors

    * Level 3 – bit interleaved parity organization (reconstructs sectors)
      Parity on one disk, no striping of data disks
      Takes advantage of disk controller being able to detect errors on read
      If one of the sectors is damaged, we know which one; to rebuild it,
       compute the parity of the corresponding bits over the corresponding
       sectors of the good disks - if that equals the stored parity bit,
       the missing bit is 0, otherwise it is 1 (see the sketch below)
      Less expensive than Level 2 because fewer "overhead" disks are used
      Every disk has to participate in a single I/O op so # I/O accesses/sec
       is lower than, for example, Level 1
      But transfer rate is higher since reads are spread out over all disks
      If the CPU computes parity, I/O is slowed; some controllers compute
       parity in hardware
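
      A minimal sketch (in C) of the reconstruction just described: XORing the
      corresponding bytes of the surviving disks and the parity disk yields the
      bytes of the failed disk (bytewise XOR is the per-bit rule above applied
      eight bits at a time).  Sector and array sizes are made up:

        #include <stdio.h>
        #include <string.h>

        #define NDATA       4           /* data disks                   */
        #define SECTOR_SIZE 16          /* tiny sectors for the example */

        int main(void) {
            unsigned char data[NDATA][SECTOR_SIZE], parity[SECTOR_SIZE];

            /* build some data and the parity sector (XOR across the data disks) */
            memset(parity, 0, sizeof parity);
            for (int d = 0; d < NDATA; d++)
                for (int i = 0; i < SECTOR_SIZE; i++) {
                    data[d][i] = (unsigned char)(d * 31 + i);
                    parity[i] ^= data[d][i];
                }

            /* disk 2 fails; rebuild it from the good disks and the parity disk */
            int failed = 2;
            unsigned char rebuilt[SECTOR_SIZE];
            memcpy(rebuilt, parity, sizeof rebuilt);
            for (int d = 0; d < NDATA; d++)
                if (d != failed)
                    for (int i = 0; i < SECTOR_SIZE; i++)
                        rebuilt[i] ^= data[d][i];

            printf("rebuild %s\n",
                   memcmp(rebuilt, data[failed], SECTOR_SIZE) == 0 ? "ok" : "failed");
            return 0;
        }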

    * Level 4 – block interleaved parity organization
      Parity on one disk, block level striping across M disks
      If a disk fails, parity is used to correct blocks of the failed disk
      A block read involves only one disk, allowing other disks to serve
       other reads at the same time
      hence - a higher I/O rate, although a slower transfer rate per block
      However, for large reads, if the data is spread over the disks, the
        transfer is fast since all the blocks are read simultaneously
      Write a portion of a block (read-modify-write; see the sketch below)
       - read block
       - modify portions that change
       - write the block
       - read the parity block
       - modify parity block
       - write the parity block
      Disks can easily be added - all blocks initialize to 0 and
         parity is unchanged
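
      A minimal sketch (in C) of the read-modify-write steps above.  It assumes
      the standard shortcut that the new parity can be computed from the old
      data, the new data, and the old parity alone (new parity = old parity XOR
      old data XOR new data), so the other data disks need not be read:

        #include <string.h>

        #define BLOCK 4096

        /* old_data and parity are the blocks just read from the two disks;
           after this call the caller writes both of them back out          */
        static void update_block(unsigned char *old_data,
                                 unsigned char *parity,
                                 const unsigned char *new_data)
        {
            for (int i = 0; i < BLOCK; i++)
                parity[i] ^= old_data[i] ^ new_data[i];  /* modify parity block */
            memcpy(old_data, new_data, BLOCK);           /* modify data block   */
        }

        int main(void) {
            static unsigned char data[BLOCK], parity[BLOCK], newdata[BLOCK];
            memset(newdata, 0x5A, BLOCK);
            update_block(data, parity, newdata);  /* parity now reflects newdata */
            return 0;
        }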

    * Level 5 – block interleaved distributed parity
      Parity and data spread across all disks – quite common
      (the levels above store data on M disks and parity on 1 disk)
      By distributing the parity blocks over the entire set, overuse
        of a single parity disk is avoided
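
      A minimal sketch (in C) of spreading the parity around.  One common
      rotation is assumed here - parity for stripe s sits on disk s % NDISKS,
      data on the rest - but real implementations vary in their exact layout:

        #include <stdio.h>

        #define NDISKS 5

        int main(void) {
            /* rotating the parity disk from stripe to stripe avoids making
               any single disk a hot spot for parity updates                */
            for (int stripe = 0; stripe < 10; stripe++)
                printf("stripe %d: parity on disk %d\n", stripe, stripe % NDISKS);
            return 0;
        }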

    * Level 6 – P+Q redundancy
      Stores extra parity to protect against multiple failures
      

    * Level 0 + 1
      Spread data across all disks and mirror

             D1--D2--D3--D4  <- striped  (two equivalent stripes)

                 Mirror

             D5--D6--D7--D8  <- striped

      Level 0 provides performance, Level 1 provides reliability
      But it is still expensive

      A single disk failure knocks out an entire stripe

    * Level 1 + 0

             D1       D2       D3       D4
              | mirror | mirror | mirror |  <- (a single stripe)
              |        |        |        |
             D5       D6       D7       D8

      A single disk failure brings down a pair but the other pairs can
      still function

    Levels 0, 1, & 5 most common, 6 starting to be used


Additional Data Protection Options
----------------------------------
  - redundancy through clustering
    * cluster = group of computers that are closely coupled
    * maintain a redundant copy of data on another system in cluster
  - use remote journaling or similar sort of replication
  - use remote mirroring
  - when primary copy of data fails, switch to redundant copy on
    different system in the cluster
  - can also use clustering to protect against system failure (not just
    disk failure)


Stable-Storage Implementation
-----------------------------
  - information in Stable Storage is never lost
  - stable storage is useful or necessary for a number of scenarios
    * write-ahead logging discussed earlier requires it
  - to implement stable storage:
    * replicate information on more than one nonvolatile storage media
      with independent failure modes.
    * update information in a controlled manner to ensure that the
      stable data can be recovered after any failure during data transfer
      or recovery.
    * a disk write has one of three outcomes:
      - Successful completion - all of the data was written correctly
      - Partial failure - the failure occurred mid-transfer, so some sectors
        were written and the sector being written at failure time may be
        corrupted
      - Total failure - the failure occurred before the write began, so the
        previous data remains intact

  - detection and recovery:
    maintain two physical blocks for each logical block
    an output operation is executed as follows:
      1. write the information onto the first physical block.
      2. when the first write completes successfully, write the same 
         information onto the second physical block,
      3. declare the operation complete only after the second write 
         completes successfully.

    recovery:
      1. each pair of physical blocks is examined.
      2. if both are the same and no detectable error exists, 
         then no further action is necessary
      3. if one block contains a detectable error, then its contents are
         replaced with the contents of the other block.
      4. if neither block contains a detectable error, but the blocks 
         differ in content, then replace the content of the first block
         with that of the second
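
  A minimal sketch (in C) of the two-block scheme above, as an in-memory
  simulation; real stable storage keeps the two copies on devices with
  independent failure modes, and a simple checksum stands in here for
  "detectable error":

      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      #define BLOCK 64

      struct phys { unsigned char data[BLOCK]; uint32_t sum; };  /* one physical copy */
      static struct phys copy1, copy2;        /* the pair backing one logical block   */

      static uint32_t checksum(const unsigned char *d) {
          uint32_t s = 0;
          for (int i = 0; i < BLOCK; i++) s = s * 31 + d[i];
          return s;
      }

      /* output operation: write the first copy, then the second, then declare
         the operation complete                                                */
      static void stable_write(const unsigned char *d) {
          memcpy(copy1.data, d, BLOCK); copy1.sum = checksum(copy1.data);  /* 1 */
          memcpy(copy2.data, d, BLOCK); copy2.sum = checksum(copy2.data);  /* 2 */
          /* 3. only now is the operation declared complete */
      }

      /* recovery: examine the pair and make the two copies agree */
      static void recover(void) {
          int ok1 = (checksum(copy1.data) == copy1.sum);
          int ok2 = (checksum(copy2.data) == copy2.sum);
          if (ok1 && ok2 && memcmp(copy1.data, copy2.data, BLOCK) == 0)
              return;                                 /* 2. identical, no error  */
          if (!ok1 && ok2) { copy1 = copy2; return; } /* 3. copy 1 damaged       */
          if (ok1 && !ok2) { copy2 = copy1; return; } /* 3. copy 2 damaged       */
          if (ok1 && ok2)  copy1 = copy2;             /* 4. both good but differ */
      }

      int main(void) {
          unsigned char buf[BLOCK] = "hello stable storage";
          stable_write(buf);
          copy1.data[0] ^= 0xFF;                      /* simulate a partial failure */
          recover();
          printf("recovered: %s\n", (const char *)copy1.data);
          return 0;
      }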