Performance wrt file allocation/directory implementation:
---------------------------------------------------------
  Recall: 
    Contiguous - sequential, random access :-) fragmentation :-( simple :-)
    Linked - sequential access :-) random access :-( fragmentation :-)
    Indexed - sequential, random access :-| fragmentation :-) simple :-(

  Hence:
    A system with mostly sequential access should not use
    the same method as a system with mostly random access.

  So some systems support direct-access files with contiguous allocation 
  and sequential-access files with linked allocation. 
   - the type of access to be made must be declared when the file is created
   - for contiguous allocation size must be known in advance
   - OS needs appropriate data structures and algorithms to support both 
     allocation methods.  
   - files can be converted from one type to another by the creation
     of a new file of the desired type, into which the contents of the 
     old file are copied. The old file may then be deleted and the new file
     renamed.

  Some systems combine contiguous allocation and indexed allocation
   - contiguous allocation for small files (up to 4 blocks) 
   - automatically switch to indexed allocation if the file grows large.
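
  A minimal sketch of such a hybrid policy in C (the 4-block threshold
  comes from the bullets above; the names are invented for illustration
  and not taken from any particular file system):

    #include <stdio.h>

    #define DIRECT_LIMIT 4   /* small files stay contiguous up to 4 blocks */

    enum alloc_kind { ALLOC_CONTIGUOUS, ALLOC_INDEXED };

    /* Decide (or re-decide) the allocation method as the file grows. */
    static enum alloc_kind choose_method(unsigned nblocks)
    {
        return (nblocks <= DIRECT_LIMIT) ? ALLOC_CONTIGUOUS : ALLOC_INDEXED;
    }

    int main(void)
    {
        for (unsigned n = 1; n <= 6; n++)
            printf("%u block(s) -> %s\n", n,
                   choose_method(n) == ALLOC_CONTIGUOUS ? "contiguous"
                                                        : "indexed");
        return 0;
    }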

  Adding instructions to the execution path to save one disk I/O is 
  reasonable
   - Intel Core i7 Extreme Edition 990x (2011) at 3.46 GHz = 159,000 MIPS
     http://en.wikipedia.org/wiki/Instructions_per_second
   - Typical disk drive at 250 I/Os per second ->
       159,000 MIPS / 250 = 636 million instructions during one disk I/O
   - Fast SSD drives provide 60,000 IOPS ->
       159,000 MIPS / 60,000 = 2.65 million instructions during one disk I/O

Free Space Management:
----------------------
  Since disk space is limited -> need to reuse the space from deleted
    files for new files, if possible.
  To keep track of free disk space, OS maintains a free-space list (FSL). 
    FSL records all free disk blocks: those not allocated to a file or directory

  FSL Implementations:
    Bit vector:  (Page 35)
       Each block represented by 1 bit, 1=free, 0=allocated
         
       The Intel family (starting with the 80386) and the Motorola
       family (starting with the 68020) have instructions that return
       the offset within a word of the first bit with the value 1,
       so the first free block can be found quickly.

       But the entire vector must be kept in main memory, or scanning
       it is too slow - and it can be large (64 MB in the example on
       Page 36).
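
       A sketch of that scan in C, assuming GCC/Clang's __builtin_ffsl
       as the "find first set bit" instruction (block count and word
       size here are only illustrative):

        #include <limits.h>
        #include <stdio.h>

        #define NBLOCKS       4096
        #define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
        #define NWORDS        ((NBLOCKS + BITS_PER_WORD - 1) / BITS_PER_WORD)

        static unsigned long free_map[NWORDS];   /* 1 = free, 0 = allocated */

        /* First free block number, or -1 if no block is free:
         * block = bits-per-word * number-of-all-zero-words
         *         + offset of the first 1 bit in the next word.        */
        static long first_free_block(void)
        {
            for (unsigned long w = 0; w < NWORDS; w++)
                if (free_map[w] != 0)
                    return (long)(w * BITS_PER_WORD)
                           + __builtin_ffsl((long)free_map[w]) - 1;
            return -1;
        }

        int main(void)
        {
            free_map[2] = 1UL << 5;   /* mark block 2*BITS_PER_WORD+5 free */
            printf("first free block: %ld\n", first_free_block());
            return 0;
        }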

    Linked List: (Page 37)
       Free blocks keep the pointer to the next free block

       Traversal is terribly inefficient: to find enough space for a
       file, each free block must be read in turn just to follow its
       pointer to the next one.

       However, traversing the list is not something that is done
       often - only when creating, deleting or extending a file.

       Cannot get contiguous space easily.

    Grouping:
       The linked list can be modified by storing the addresses of,
       say, 10 free blocks in a free block (typically the last of
       these addresses names the block holding the next group) - the
       number of accesses needed to find free space is then reduced
       roughly by a factor of 10.  See the sketch below.
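
       A toy, in-memory sketch of grouping in C (the "disk" array,
       group size and block numbers are invented for illustration):

        #include <stdio.h>

        #define NGROUP 4    /* small for the example; ~10 in the text above */

        /* One group block: the first NGROUP-1 addresses are free blocks,
         * the last address names the block holding the next group.       */
        struct free_group { unsigned addr[NGROUP]; };

        /* toy disk: blocks 7 and 20 each hold a group of addresses */
        static struct free_group disk[32] = {
            [7]  = { { 8, 9, 10, 20 } },   /* 8,9,10 free; 20 = next group */
            [20] = { { 21, 22, 23, 0 } },  /* 21,22,23 free; 0 = end       */
        };

        int main(void)
        {
            unsigned head = 7;                 /* block holding current group */
            while (head != 0) {
                for (int i = 0; i < NGROUP - 1; i++)
                    printf("free block %u\n", disk[head].addr[i]);
                head = disk[head].addr[NGROUP - 1];  /* follow to next group */
            }
            return 0;
        }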

    Buddy:
       Use a buddy system to keep track of free blocks of all sizes


Improving the efficiency of disk drives:
----------------------------------------
  Try to keep a file's data blocks close to its inode block.
  Hence, spread inodes out over the entire disk.

  Vary the cluster size as the file grows - larger clusters make
  sequential traversal more efficient; internal fragmentation per file
  is greater in absolute terms, but as a percentage of the file's size
  it stays small.

  Pre-allocation or as-needed allocation of metadata structures
   
  All of the following have an effect on efficiency:
    Keep last access date stamp?
    Pointer size?
    Block size?

  Disk controllers have an on-board cache large enough to store entire
  tracks.  On a seek, the whole track is read into this disk cache, and
  the controller then transfers the requested sectors to the OS.
  When blocks arrive in main memory, the OS may cache them there as well.

  Some systems maintain a buffer cache where blocks are kept under 
  the assumption that they will be used again shortly. 

  Other systems cache file data using a page cache. The page cache
  uses virtual memory techniques to cache file data as pages rather
  than as file-system-oriented blocks. 

  Caching file data using virtual addresses is far more efficient than
  caching through physical disk blocks, as accesses interface with
  virtual memory rather than the file system. 

  Linux, Windows use page caching to cache both process pages and file data. 
  This is known as *unified virtual memory*
                    ----------------------
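
  For reference, the two access paths compared below look like this from
  user level, in C (standard POSIX calls; "data.bin" is an illustrative
  file name):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Path 1: read() - data is copied out of the kernel's cache
         * into this user buffer. */
        char buf[512];
        ssize_t n = read(fd, buf, sizeof buf);
        printf("read() returned %zd bytes\n", n);

        /* Path 2: mmap() - the file's pages are mapped into the address
         * space; with a unified page cache both paths end up using the
         * same cached pages. */
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        printf("first byte via mmap: 0x%02x\n", (unsigned char)p[0]);

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }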

  Benefits of the unified buffer cache: (Page 43)

     consider memory-mapped file access without a unified cache
       - reads and writes go through both caches
       - blocks move from the file system into the buffer cache
       - the virtual memory system does not interface with the buffer cache
       - hence the data must be copied again into the page cache (double caching)
       - possible inconsistencies between the two caches could cause file corruption
     consider standard read/write system calls without a unified cache
       - reads and writes go through the buffer cache only

     unified buffer cache (Page 45)
       - both memory mapping and system calls use the same page cache
       - double caching avoided
       - virtual memory system can manage file-system data

  Block replacement algorithm - Least Recently Used

  synchronous 
    - writes done in order they are given to file subsystem
    - no buffering of the writes
    - the calling routine must wait for the data to reach the disk
      drive before it can proceed - since changes are committed
      immediately, a system crash is less likely to leave files
      corrupted.

  asynchronous
    - the data are stored in the cache, and control returns to the
      caller immediately
    - the majority of writes are done this way
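
  At the system-call level the difference looks roughly like this, in C
  (standard POSIX open/write/fsync and the O_SYNC flag; the file names
  are illustrative):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char msg[] = "hello\n";

        /* Asynchronous (the default): write() returns once the data is
         * in the cache; it reaches the disk later. */
        int fd = open("async.log", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (write(fd, msg, strlen(msg)) < 0) perror("write");
        close(fd);

        /* Synchronous: O_SYNC makes each write wait for the device, or
         * an explicit fsync() flushes everything written so far. */
        fd = open("sync.log", O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (write(fd, msg, strlen(msg)) < 0) perror("write");
        fsync(fd);   /* redundant with O_SYNC; shown to illustrate both */
        close(fd);
        return 0;
    }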

  Solaris - shows complexities of performance optimizing and caching
    - early versions made no distinction between allocating pages to 
      a process and allocating them to the page cache
    - result: a system performing many I/O operations used most of the
      available memory for caching pages; when free memory ran low, the
      page scanner reclaimed pages from processes rather than from the
      page cache
    - later versions optionally implemented priority paging where page 
      scanner gives priority to process pages over the page cache
    - later still a fixed limit is applied to process pages and the 
      file-system page cache, preventing either from forcing
      the other out of memory
    - later still the algorithms to maximize memory use and minimize 
      thrashing changed again.

  The page cache, the file system, and the disk drivers have some
  interesting interactions. 

   - when data are written to a disk file, pages are buffered in the cache
     and the disk driver sorts its output queue according to disk addresses
   - this allows the disk driver to minimize disk-head seeks and to
     write data at times optimized for disk rotation
   - unless synchronous writes are required, a process writing to disk 
     simply writes into the cache, and the system asynchronously
     writes the data to disk when convenient
   - the user process sees very fast writes

   - when data are read from a disk file, the block I/O system does
     some read-ahead

   - but writes are far more asynchronous than reads, so for large
     transfers, output to the disk through the file system is often
     faster than input, counter to intuition.

Recovery
--------
  Files (and file-system metadata) live partly in main memory and
  partly on disk - recovery algorithms are needed to prevent corruption
  and inconsistency when the system crashes.

  Consistency checking:
    The consistency checker - 
      - a systems program that compares the data in the directory structure 
        with the data blocks on disk and tries to fix any inconsistencies
      - fsck in UNIX or chkdsk in MS-DOS 
      - can be run at boot time.

      - the file allocation method determines how the checker operates -
        * e.g. with linked allocation, a file can be reconstructed by
          traversing its blocks (size info comes from the directory entry).

    Loss of an inode could be a disaster.  
    Hence UNIX caches directory entries for reads, 
    but any data write that results in space allocation, or other
    metadata changes, is done synchronously, before the corresponding 
    data blocks are written.

  Backup:
    Use rsync - it only backs up files that have changed since the last
    backup
 
    Save a full backup "forever" from time to time because it may take a while
    for someone to realize they have lost a file

  Journal:
    apply log-based recovery techniques to file-system metadata updates.

    - all metadata changes are written sequentially to a log
    - each set of operations for performing a specific task is a
      transaction
    - when changes are written to this log, they are considered committed,
      and the system call returns to the user process, which can continue
    - meanwhile, these log entries are replayed across the actual filesystem
      structures
    - as the changes are made, a pointer is updated to indicate which 
      actions have completed and which are still incomplete.  
    - When an entire committed transaction is completed, it is removed
      from the log file.
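
    A very rough sketch of this bookkeeping in C (record layout,
    transaction handling and names are invented; real journaling
    implementations are far more involved):

      #include <stdio.h>

      enum rec_type { REC_METADATA, REC_COMMIT };

      struct log_record {
          enum rec_type type;
          unsigned      txn_id;
          unsigned      block;    /* metadata block this change touches */
      };

      #define LOG_SIZE 64
      static struct log_record logbuf[LOG_SIZE];
      static unsigned log_head;     /* next free log slot                */
      static unsigned replay_ptr;   /* everything before this is on disk */

      static void log_append(struct log_record r)
      {
          if (log_head < LOG_SIZE)
              logbuf[log_head++] = r;
      }

      /* Commit: once the commit record is in the log, the system call
       * that made these changes may return to the caller. */
      static void commit(unsigned txn)
      {
          log_append((struct log_record){ REC_COMMIT, txn, 0 });
      }

      /* Replay: apply logged changes to the real file-system structures
       * (elided here) and advance the pointer past completed work. */
      static void replay(void)
      {
          while (replay_ptr < log_head) {
              /* ... write logbuf[replay_ptr] to its home location ... */
              replay_ptr++;
          }
      }

      int main(void)
      {
          log_append((struct log_record){ REC_METADATA, 1, 42 });
          log_append((struct log_record){ REC_METADATA, 1, 43 });
          commit(1);    /* transaction 1 is now durable in the log    */
          replay();     /* later, the changes reach their real blocks */
          printf("replayed %u log records\n", replay_ptr);
          return 0;
      }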

    It is more efficient, but more complex, to keep the logging under
    separate read-write heads from the rest of the file system, thereby
    decreasing head contention and seek times.

    If the system crashes
    - the log file will contain zero or more transactions
    - any transactions it contains were not completed to the file
      system even though they were committed by the operating system
      so they must now be completed.  
    - the transactions can be executed from the pointer until the work 
      is complete so that the file-system structures remain
      consistent.
 
    The only problem occurs when a transaction was aborted - that is,
    was not committed before the system crashed. Any changes from such
    a transaction that were applied to the file system must be undone,
    again preserving the consistency of the file system.  This
    recovery is all that is needed after a crash, eliminating any
    problems with consistency checking.

Network File Systems
--------------------  
    NFS views a set of interconnected workstations as a set of
    independent machines with independent file systems and independent
    operating systems. The goal is to allow some degree of sharing 
    among these file systems (on explicit request) in a transparent manner.

    A mount operation:
        - takes the name of the remote directory to be mounted and the
          name of the server storing it
        - mount request is mapped to a Remote Procedure Call (RPC) and 
          forwarded to the remote mount server
        - server maintains an export list that specifies local file
          systems that it exports for mounting, and names of machines
          that are permitted to mount them. 
        - specification can also include access rights, such as read
          only
        - server returns to the client a file handle that serves as
          the key for further accesses to files within the mounted fs.
        - The file handle contains all the information that the
          server needs to distinguish an individual file it stores
          (file-system identifier, an inode number)
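
        The role of the file handle can be pictured like this in C
        (fields follow the description above; the struct is purely
        illustrative, not the real NFS wire format):

          #include <stdio.h>

          /* opaque to the client; meaningful only to the server */
          struct nfs_fhandle {
              unsigned fsid;    /* which exported file system         */
              unsigned inode;   /* which file within that file system */
          };

          int main(void)
          {
              /* what a server might return from a successful mount */
              struct nfs_fhandle root = { .fsid = 3, .inode = 2 };
              printf("handle: fsid=%u inode=%u\n", root.fsid, root.inode);
              return 0;
          }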

    NFS protocol:
        RPCs to support the following:
         - Searching for a file within a directory
         - Reading a set of directory entries
         - Manipulating links and directories
         - Accessing file attributes
         - Reading and writing files

       can be invoked only after a file handle for the remotely
       mounted directory has been established

       Originally stateless - servers do not maintain information
       about clients from one access to another - for robustness

       no file structures exist on the server side. 
       each request has to provide a full set of arguments, 
         including a unique file identifier and an absolute offset 
         inside the file for the appropriate operations.
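
       A small illustration in C of what statelessness means for a read
       request (the structures are invented, not the actual protocol
       definitions):

         #include <stdio.h>

         struct nfs_fhandle { unsigned fsid, inode; };

         /* every request is self-contained: which file, where, how much */
         struct nfs_read_args {
             struct nfs_fhandle fh;
             unsigned long long offset;   /* absolute offset in the file */
             unsigned           count;    /* bytes to read               */
         };

         int main(void)
         {
             /* two independent requests - neither relies on any
              * open-file state kept on the server */
             struct nfs_read_args r1 = { { 3, 1234 },    0, 8192 };
             struct nfs_read_args r2 = { { 3, 1234 }, 8192, 8192 };
             printf("offsets: %llu and %llu\n", r1.offset, r2.offset);
             return 0;
         }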

       no special measures need be taken to recover a server after a 
         crash. 

       every NFS request has a sequence number, allowing the server to 
         determine if a request is duplicated, or if any are missing
         (this is stateless?)

       modified data must be committed to the server's disk before 
       results are returned to the client 

       a client can cache write blocks, but when it flushes them to
       the server, it assumes that they have reached the server's
       disks.

       The server must write all NFS data synchronously - a server
       crash and recovery will be invisible to a client and all blocks 
       that the server is managing for the client will be intact. 

       The consequent performance penalty can be large, because
       the advantages of caching are lost

       A single NFS write procedure call is guaranteed to be atomic
       and is not intermixed with other write calls to the same file.

       Sharing - if two users write to the same file, locking must be
       used, because the requests will span several packets.  But locks
       are stateful, so this service is provided outside of NFS and the
       users are on their own in arranging it.

       In practice, buffering and caching techniques are employed for
       the sake of performance. No direct correspondence exists
       between a remote operation and an RPC. Instead, file blocks and
       file attributes are fetched by the RPCs and are cached
       locally. Future remote operations use the cached data, subject
       to consistency constraints.

       There are two caches: the file-attribute (inode-information)
       cache and the file-blocks cache. 

       When a file is opened, the kernel checks with the remote server
       to determine whether to fetch or re-validate the cached
       attributes. 

       The cached file blocks are used only if the corresponding
       cached attributes are up to date. The attribute cache is
       updated whenever new attributes arrive from the server. Cached
       attributes are, by default, discarded after 60 seconds. Both
       read-ahead and delayed-write techniques are used between the
       server and the client. Clients do not free delayed-write blocks
       until the server confirms that the data have been written to
       disk. 

       New files created on a machine may not be visible elsewhere for
       30 seconds. Furthermore, writes to a file at one site may or
       may not be visible at other sites that have this file open for
       reading.
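
       A sketch in C of the freshness test implied above, assuming the
       60-second default (names and structures are illustrative):

         #include <stdbool.h>
         #include <stdio.h>
         #include <time.h>

         #define ATTR_TIMEOUT 60  /* seconds - the default mentioned above */

         struct cached_attrs {
             time_t    fetched_at;   /* when these attributes arrived */
             long long size;
         };

         /* cached file blocks may be used only while attributes are fresh */
         static bool attrs_fresh(const struct cached_attrs *a, time_t now)
         {
             return (now - a->fetched_at) < ATTR_TIMEOUT;
         }

         int main(void)
         {
             time_t now = time(NULL);
             struct cached_attrs a = { .fetched_at = now - 90, .size = 4096 };

             if (attrs_fresh(&a, now))
                 printf("use cached attributes and cached blocks\n");
             else
                 printf("re-validate with the server first\n");
             return 0;
         }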

WAFL - Write Anywhere File Layout
   - distributed file system
   - provides files to clients via http, ftp, NFS, CIFS

   WAFL is used on file servers that include an NVRAM cache for
   writes.  The WAFL designers took advantage of running on a specific
   architecture to optimize the file system for random I/O, with a
   stable-storage cache in front.

   
   It is block-based and uses inodes to describe files. Each inode
   contains 16 pointers to blocks (or indirect blocks) belonging to
   the file described by the inode. Each file system has a root
   inode. All of the metadata lives in files: all inodes are in one
   file, the free-block map in another, and the free-inode map in a
   third

   Thus, a WAFL file system is a tree of blocks rooted by the root
   inode.

   An important feature of WAFL is the Snapshot: taking a snapshot
   duplicates the inodes first; subsequent changes to a block cause that
   block to be duplicated individually (copy-on-write), as sketched
   below.
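
   A sketch of the snapshot / copy-on-write idea in C (structures and
   names are invented for illustration; the real WAFL mechanism is far
   more involved):

     #include <stdio.h>
     #include <stdlib.h>
     #include <string.h>

     #define NPTR 16                  /* a WAFL inode holds 16 pointers */

     struct block { char data[16]; int refs; };
     struct inode { struct block *ptr[NPTR]; };

     /* Snapshot: duplicate only the inode; shared blocks gain a reference. */
     static struct inode snapshot(struct inode *live)
     {
         struct inode snap = *live;
         for (int i = 0; i < NPTR; i++)
             if (snap.ptr[i])
                 snap.ptr[i]->refs++;
         return snap;
     }

     /* Write: a block shared with a snapshot is copied before modification. */
     static void write_block(struct inode *live, int i, const char *data)
     {
         if (live->ptr[i]->refs > 1) {          /* shared -> copy on write */
             struct block *copy = malloc(sizeof *copy);
             *copy = *live->ptr[i];
             copy->refs = 1;
             live->ptr[i]->refs--;
             live->ptr[i] = copy;
         }
         strncpy(live->ptr[i]->data, data, sizeof live->ptr[i]->data - 1);
     }

     int main(void)
     {
         struct block b    = { "original", 1 };
         struct inode live = { { &b } };

         struct inode snap = snapshot(&live);   /* inode duplicated      */
         write_block(&live, 0, "changed");      /* block copied on write */

         printf("live: %s   snapshot: %s\n",
                live.ptr[0]->data, snap.ptr[0]->data);
         return 0;
     }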