The file system is composed of many different levels.  Each level in
the design uses the features of lower levels to create new features
for use by higher levels.

application programs

logical file system
  Manages metadata information. Metadata includes all of the
  file-system structure except the actual data (or contents of the
  files). The logical file system manages the directory structure to
  provide the file-organization module with the information the latter
  needs, given a symbolic file name. It maintains file structure via
  file-control blocks. A file-control block (FCB) contains information
  about the file, including ownership, permissions, and location of
  the file contents. The logical file system is also responsible for
  protection and security.

file-organization module
  Knows about files and their logical blocks, as well as physical
  blocks. By knowing the type of file allocation used and the location
  of the file, the file-organization module can translate logical
  block addresses to physical block addresses for the basic file
  system to transfer.  Each file's logical blocks are numbered from 0
  (or 1) through N. Since the physical blocks containing the data
  usually do not match the logical numbers, a translation is needed to
  locate each block. The file-organization module also includes the
  free-space manager, which tracks unallocated blocks and provides
  these blocks to the file-organization module when requested.
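
  A minimal sketch of a free-space manager, assuming a bitmap with one
  bit per block (the notes do not say which representation is used).
  NBLOCKS, free_map, alloc_block and free_block are hypothetical names
  chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NBLOCKS 64                      /* hypothetical tiny volume */

static uint8_t free_map[NBLOCKS / 8];   /* bit = 1 means block is in use */

/* Return the number of a free block and mark it used, or -1 if none. */
int alloc_block(void)
{
    for (int b = 0; b < NBLOCKS; b++) {
        if (!(free_map[b / 8] & (1u << (b % 8)))) {
            free_map[b / 8] |= (uint8_t)(1u << (b % 8));
            return b;
        }
    }
    return -1;
}

/* Return block b to the free pool. */
void free_block(int b)
{
    assert(b >= 0 && b < NBLOCKS);
    free_map[b / 8] &= (uint8_t)~(1u << (b % 8));
}
```

  The file-organization module would call alloc_block() when a file
  grows and free_block() when a file is deleted or truncated.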

basic file system
  Needs only to issue generic commands to the appropriate device
  driver to read and write physical blocks on the disk. Each physical
  block is identified by its numeric disk address (for example, drive
  1, cylinder 73, track 2, sector 10).
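
  A sketch of how such an address could be flattened into a single
  block number within a drive. The geometry values (tracks per
  cylinder, sectors per track) are assumptions supplied by the caller,
  and sectors are assumed to be numbered from 1.

```c
#include <assert.h>

struct disk_addr { unsigned drive, cylinder, track, sector; };
struct geometry  { unsigned tracks_per_cyl, sectors_per_track; };

/* Linear block number within the drive; sectors are numbered from 1. */
unsigned long to_linear(struct disk_addr a, struct geometry g)
{
    assert(a.track < g.tracks_per_cyl);
    assert(a.sector >= 1 && a.sector <= g.sectors_per_track);
    return ((unsigned long)a.cylinder * g.tracks_per_cyl + a.track)
           * g.sectors_per_track + (a.sector - 1);
}
```

  With a made-up geometry of 16 tracks per cylinder and 63 sectors per
  track, the address in the example above maps to block
  (73 * 16 + 2) * 63 + 9 = 73719 on drive 1.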

I/O control
  Consists of device drivers and interrupt handlers to transfer
  information between the main memory and the disk system. A device
  driver can be thought of as a translator. Its input consists of
  high-level commands such as "retrieve block 123." Its output
  consists of low-level, hardware-specific instructions that are used
  by the hardware controller, which interfaces the I/O device to the
  rest of the system. The device driver usually writes specific bit
  patterns to special locations in the I/O controller's memory to tell
  the controller which device location to act on and what actions to
  take.
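
  The translator idea can be sketched as below. The register layout
  and command bits are invented for illustration; a real driver would
  use the addresses and bit patterns defined by its controller, and
  the struct here merely stands in for memory-mapped registers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical controller register block (not a real device). */
struct disk_ctrl {
    volatile uint32_t block;     /* which device location to act on */
    volatile uint32_t command;   /* what action to take */
    volatile uint32_t status;    /* unused in this sketch */
};

enum { CMD_READ = 0x01, CMD_WRITE = 0x02, CMD_GO = 0x80 };

/* Translate the high-level request "retrieve block n" into the
 * bit patterns written to the controller's registers. */
void drv_read_block(struct disk_ctrl *c, uint32_t n)
{
    assert(c != 0);
    c->block = n;
    c->command = CMD_READ | CMD_GO;
}
```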

devices


File Structures
---------------
When a layered structure is used for file-system implementation,
duplication of code is minimized. The I/O control and sometimes the
basic file-system code can be used by multiple file systems. Each file
system can then have its own logical file system and file-organization
modules.

Structures, On disk:
  The file system may contain information about how to boot an
  operating system stored there, the total number of blocks, the
  number and location of free blocks, the directory structure, and 
  individual files. Examples:

  Boot control block: information needed by the system to boot from
  that volume. 

  Volume control block: volume or partition details, such as the
  number of blocks in the partition, size of the blocks, free-block
  count and free-block pointers, and free FCB count and FCB pointers.
  In UFS, this is called a superblock. On a Linux ext2/3/4 volume,
  dumpe2fs -h /dev/sda5 prints the superblock (substitute your own
  device).

  Directory structure: organizes the files. In UFS, this includes file
  names and associated inode numbers.

  FCB (per file) contains many details about the file, including file 
  permissions, ownership, size, and location of the data blocks. In
  UFS, this is called the inode.  Page 9

Structures, In memory:
  For file-system management and performance improvement via caching. 
  The data are loaded at mount time and discarded at dismount. The 
  structures may include the ones described below:

  Mount table contains information about each mounted volume.

  Directory-structure cache holds the directory information of
  recently accessed directories. 

  Open-file table (system-wide) contains a copy of the FCB of each
  open file, as well as other information.

  Open-file table (per process) contains a pointer to the appropriate 
  entry in the system-wide open-file table, as well as other information.

  Page 11

  To create a new file, an application program calls the logical file
  system.  The logical file system knows the format of the directory
  structures. To create a new file, it allocates a new FCB.
  (Alternatively, if the file-system implementation creates all FCBs
  at file-system creation time, an FCB is allocated from the set of
  free FCBs.) The system then reads the appropriate directory into
  memory, updates it with the new file name and FCB, and writes it
  back to the disk.
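
  A minimal in-memory sketch of the creation steps, assuming the
  variant in which all FCBs exist from file-system creation time. The
  tables fcbs and dir and the function fs_create are hypothetical, and
  the disk read and write-back of the directory are left out.

```c
#include <assert.h>
#include <string.h>

#define NFCB    8
#define NAMELEN 16

struct fcb    { int in_use; unsigned size; };
struct dirent { int used; char name[NAMELEN]; int fcb; };

static struct fcb    fcbs[NFCB];   /* the set of pre-created FCBs */
static struct dirent dir[NFCB];    /* one directory, already "read in" */

/* Create an empty file; returns its FCB index, or -1 on failure. */
int fs_create(const char *name)
{
    assert(name != NULL);
    if (strlen(name) >= NAMELEN)
        return -1;

    int f = -1, d = -1;
    for (int i = 0; i < NFCB; i++) {
        if (dir[i].used && strcmp(dir[i].name, name) == 0)
            return -1;                     /* name already exists */
        if (!dir[i].used && d < 0)
            d = i;                         /* first free directory slot */
        if (!fcbs[i].in_use && f < 0)
            f = i;                         /* first free FCB */
    }
    if (f < 0 || d < 0)
        return -1;

    fcbs[f].in_use = 1;                    /* allocate the FCB */
    fcbs[f].size = 0;
    strcpy(dir[d].name, name);             /* update the directory */
    dir[d].fcb = f;
    dir[d].used = 1;
    return f;
}
```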

  The logical file system can call the file-organization module to map
  the directory I/O into disk-block numbers, which are passed on to
  the basic file system and I/O control system.

  Now that a file has been created, it can be used for I/O. First,
  though, it must be opened. The open() call passes a file name to the
  file system. The open() system call first searches the system-wide
  open-file table to see if the file is already in use by another
  process. If it is, a per-process open-file table entry is created
  pointing to the existing system-wide open-file table entry. This
  shortcut can save substantial overhead. If the file is not already
  open, the directory structure is searched for the given file name.
  Parts of the directory structure are usually cached in memory to
  speed directory operations. Once the file is found, the FCB is
  copied into the system-wide open-file table in memory. This table
  not only stores the FCB but also tracks the number of processes
  that have the file open.

  Next, an entry is made in the per-process open-file table, with a
  pointer to the entry in the system-wide open-file table and some
  other fields. These other fields can include a pointer to the
  current location in the file (for the next read() or write()
  operation) and the access mode in which the file is open.  The
  open() call returns a pointer to the appropriate entry in the
  per-process open-file table. All file operations are then
  performed via this pointer. The file name may not be part of the
  open-file table, as the system has no use for it once the
  appropriate FCB is located on disk. It could be cached, though, to
  save time on subsequent opens of the same file. The name given to
  the entry varies. UNIX systems refer to it as a file descriptor;
  Windows refers to it as a file handle. Consequently, as long as the
  file is not closed, all file operations are done on the open-file
  table.

  (in the Linux source, look in include/linux/fs.h for struct file {..})

  When a process closes the file, the per-process table entry is
  removed, and the system-wide entry's open count is decremented. When
  all users that have opened the file close it, any updated metadata
  is copied back to the disk-based directory structure, and the
  system-wide open-file table entry is removed.
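
  The two tables and the open count can be sketched as follows. This
  is a simplification under stated assumptions: the directory search
  is skipped (the caller passes an FCB number directly), the "copy of
  the FCB" is reduced to that number, and my_open, my_close and
  sys_open_count are hypothetical names.

```c
#include <assert.h>

#define NSYS  8     /* system-wide table size (arbitrary) */
#define NPROC 4     /* per-process table size (arbitrary) */

struct sys_entry  { int fcb; int open_count; };   /* open_count 0: free */
struct proc_entry { int used; int sys; long offset; };

static struct sys_entry systab[NSYS];

/* Open the file with FCB number fcb; returns a descriptor or -1. */
int my_open(struct proc_entry *pt, int fcb)
{
    int s = -1, fd = -1;
    for (int i = 0; i < NSYS; i++) {
        if (systab[i].open_count > 0 && systab[i].fcb == fcb) {
            s = i;                       /* already open: share entry */
            break;
        }
        if (systab[i].open_count == 0 && s < 0)
            s = i;                       /* remember first free slot */
    }
    for (int i = 0; i < NPROC; i++)
        if (!pt[i].used) { fd = i; break; }
    if (s < 0 || fd < 0)
        return -1;

    if (systab[s].open_count == 0)
        systab[s].fcb = fcb;             /* stand-in for copying the FCB */
    systab[s].open_count++;
    pt[fd].used = 1;
    pt[fd].sys = s;
    pt[fd].offset = 0;
    return fd;
}

/* Close: drop the per-process entry, decrement the shared count. */
void my_close(struct proc_entry *pt, int fd)
{
    assert(fd >= 0 && fd < NPROC && pt[fd].used);
    systab[pt[fd].sys].open_count--;     /* at 0 the slot is free again */
    pt[fd].used = 0;
}

int sys_open_count(const struct proc_entry *pt, int fd)
{
    return systab[pt[fd].sys].open_count;
}
```

  Two processes opening the same file end up with separate
  per-process entries that point at one shared system-wide entry
  whose count is 2.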

Partitions
  Raw: no file system (e.g. a swap partition)
  Cooked: a file system
  Mounted: listed in the mount table (see /etc/mtab)

  Unix: mounting is implemented by setting a flag in the in-memory
  copy of the inode for the directory on which a partition is
  mounted (the flag indicates that the directory is a mount point). 

  A field then points to an entry in the mount table, indicating which
  device is mounted there.  The mount table entry contains a pointer
  to the superblock of the file system on that device. This scheme
  enables the operating system to traverse its directory structure,
  switching among file systems of varying types, seamlessly.
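
  A sketch of the mount-point check made during path traversal. The
  flag, the mount-table pointer, and the superblock pointer follow the
  description above; storing the mounted file system's root inode in
  the superblock is an assumption of this sketch, and all struct and
  function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct inode;
struct super_block { const char *fstype; struct inode *root; };
struct mount_entry { const char *device; struct super_block *sb; };

struct inode {
    int is_mount_point;          /* flag in the in-memory inode */
    struct mount_entry *mnt;     /* valid only when the flag is set */
};

/* If dir is a mount point, continue from the root of the file system
 * mounted on it; otherwise stay on dir. */
struct inode *cross_mount(struct inode *dir)
{
    assert(dir != NULL);
    while (dir->is_mount_point)
        dir = dir->mnt->sb->root;
    return dir;
}
```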

Virtual File System
  Uniform approach to supporting many different file systems

  In Linux, the equivalent of a vnode is the inode
  
  typedef struct vnode {  /* Sun OS 5 */
    kmutex_t        v_lock;                 /* protects vnode fields */
    u_short         v_flag;                 /* vnode flags (see below) */
    u_long          v_count;                /* reference count */
    struct vfs      *v_vfsmountedhere;      /* ptr to vfs mounted here */
    struct vnodeops *v_op;                  /* vnode operations */
    struct vfs      *v_vfsp;                /* ptr to containing VFS */
    struct stdata   *v_stream;              /* associated stream */
    struct page     *v_pages;               /* vnode pages list */
    enum vtype      v_type;                 /* vnode type */
    dev_t           v_rdev;                 /* device (VCHR, VBLK) */
    caddr_t         v_data;                 /* private data for fs */
    struct filock   *v_filocks;             /* ptr to filock list */
    kcondvar_t      v_cv;                   /* synchronize locking */
  } vnode_t;

  
  struct inode {   /* Linux */
    umode_t	    i_mode;
    unsigned short  i_opflags;
    kuid_t	    i_uid;
    kgid_t	    i_gid;
    unsigned int    i_flags;
    const struct inode_operations  *i_op;
    struct super_block             *i_sb;
    struct address_space           *i_mapping;
    /* Stat data, not accessed from path walking */
    unsigned long   i_ino;
    /*
     * Filesystems may only read i_nlink directly.  They shall use the
     * following functions for modification:
     *
     *    (set|clear|inc|drop)_nlink
     *    inode_(inc|dec)_link_count
     */
    union {
      const unsigned int i_nlink;
      unsigned int __i_nlink;
    };
    dev_t               i_rdev;
    loff_t              i_size;
    struct timespec	i_atime;
    struct timespec	i_mtime;
    struct timespec	i_ctime;
    spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
    unsigned short      i_bytes;
    unsigned int	i_blkbits;
    blkcnt_t		i_blocks;
    /* Misc */
    unsigned long	i_state;
    struct mutex	i_mutex;

    unsigned long	dirtied_when;	/* jiffies of first dirtying */

    struct hlist_node	i_hash;
    struct list_head	i_wb_list;	/* backing dev IO list */
    struct list_head	i_lru;		/* inode LRU list */
    struct list_head	i_sb_list;
    union {
       struct hlist_head	i_dentry;
       struct rcu_head		i_rcu;
    };
    u64			i_version;
    atomic_t		i_count;
    atomic_t		i_dio_count;
    atomic_t		i_writecount;
    const struct file_operations   *i_fop; /* former ->i_op->default_file_ops */
    struct file_lock	*i_flock;
    struct address_space   i_data;
    struct list_head	i_devices;
    union {
	struct pipe_inode_info	*i_pipe;
	struct block_device	*i_bdev;
	struct cdev		*i_cdev;
    };
    __u32			i_generation;
    void	*i_private; /* fs or device private pointer */
  };


  Page 14

  The VFS activates file-system-specific operations to handle local
  requests according to their file-system types and even calls the NFS
  protocol procedures for remote requests. File handles are
  constructed from the relevant vnodes and are passed as arguments to
  these procedures. The layer implementing the file system type or the
  remote-file-system protocol is the third layer of the architecture.

  VFS architecture in Linux. The four main object types are:
   * The inode object, which represents an individual file
   * The file object, which represents an open file
   * The superblock object, which represents an entire file system
   * The dentry object, which represents an individual directory entry

  For each of these four object types, the VFS defines a set of
  operations that must be implemented. Every object of one of these
  types contains a pointer to a function table. The function table
  lists the addresses of the actual functions that implement the
  defined operations for that particular object (for example,
  struct file_operations for the file object).


  Thus, the VFS software layer can perform an operation on one of
  these objects by calling the appropriate function from the object's
  function table, without having to know in advance exactly what kind
  of object it is dealing with. The VFS does not know, or care,
  whether an inode represents a disk file, a directory file, or a
  remote file.
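
  The function-table idea can be sketched in a few lines. This is not
  the Linux definition of struct file_operations; vfile, vfile_ops and
  the two toy "file systems" are hypothetical and only illustrate the
  dispatch through a table of function pointers.

```c
#include <assert.h>
#include <stddef.h>

struct vfile;
struct vfile_ops {                        /* the function table */
    long (*read)(struct vfile *f, char *buf, long n);
};
struct vfile {
    const struct vfile_ops *ops;          /* pointer to the table */
    const char *data;
    long pos, len;
};

/* One concrete implementation: reads from an in-memory string. */
static long mem_read(struct vfile *f, char *buf, long n)
{
    long i = 0;
    while (i < n && f->pos < f->len)
        buf[i++] = f->data[f->pos++];
    return i;
}

/* Another: a source that always yields zero bytes. */
static long zero_read(struct vfile *f, char *buf, long n)
{
    (void)f;
    for (long i = 0; i < n; i++)
        buf[i] = 0;
    return n;
}

const struct vfile_ops mem_ops  = { mem_read };
const struct vfile_ops zero_ops = { zero_read };

/* Generic layer: dispatches without knowing the concrete type. */
long vfs_read(struct vfile *f, char *buf, long n)
{
    assert(f != NULL && f->ops != NULL && f->ops->read != NULL);
    return f->ops->read(f, buf, n);
}
```

  vfs_read() never inspects which implementation it is talking to;
  adding a third kind of file only means supplying another table.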

Directory Implementation
  The selection of directory-allocation and directory-management
  algorithms significantly affects the efficiency, performance, and
  reliability of the file system.

  Linear list: list of names with pointers to data blocks
    - Simple to program - expensive to execute
    - To create a new file 
        - search the directory to be sure no existing file has the same name. 
        - add new entry at the end of the directory. 
    - To delete a file
        - search the directory for the named file
        - release the space allocated to it
    - To reuse the directory entry
        - mark the entry as unused (give it a special name) or
        - attach it to a list of free directory entries or 
        - copy the last entry in the directory into the freed location
    Use a linked list to decrease the time required to delete a file
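
    The create and delete steps above can be sketched with a
    fixed-size array, using the "copy the last entry into the freed
    location" option for reuse. The sizes and the names dir_find,
    dir_add and dir_remove are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAXENT  16
#define NAMELEN 16

struct entry     { char name[NAMELEN]; int first_block; };
struct directory { struct entry e[MAXENT]; int n; };

/* Linear search; returns the index of name, or -1. */
int dir_find(const struct directory *d, const char *name)
{
    for (int i = 0; i < d->n; i++)
        if (strcmp(d->e[i].name, name) == 0)
            return i;
    return -1;
}

/* Create: reject duplicates, then append at the end. */
int dir_add(struct directory *d, const char *name, int first_block)
{
    assert(d != NULL && name != NULL);
    if (d->n == MAXENT || strlen(name) >= NAMELEN)
        return -1;
    if (dir_find(d, name) >= 0)
        return -1;
    strcpy(d->e[d->n].name, name);
    d->e[d->n].first_block = first_block;
    return d->n++;
}

/* Delete: find the entry, then copy the last entry into its slot. */
int dir_remove(struct directory *d, const char *name)
{
    int i = dir_find(d, name);
    if (i < 0)
        return -1;
    d->e[i] = d->e[d->n - 1];
    d->n--;
    return 0;
}
```

    Every operation starts with dir_find(), which is where the linear
    cost comes from.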

    Problem: finding a file requires a linear search. 
      Directory information is used frequently, and users will notice
      if access to it is slow. 
      Can be mitigated with a software cache to store the most
      recently used directory information. 

      A sorted list allows a binary search and decreases the average 
      search time. However, the requirement that the list be kept
      sorted may complicate creating and deleting files

      So use a B-tree or red-black tree to keep the entries sorted:
      insertion and deletion then cost O(log n), instead of the O(n)
      shifting that a sorted array needs

      linux/btree.h
      linux/rbtree.h
      linux/fs.h -> uses rbtree

  Hash Table: 
   Problems: the table is generally of fixed size, and the hash
     function depends on that size.

   Chained-overflow hash table: hash entry can be on a linked list
   starting at some point in the table
   
   Lookups may be somewhat slowed by the search through the linked
   list on collisions - still faster than a linear search through the
   entire directory
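
   A minimal sketch of a chained-overflow hash directory. The bucket
   count, the string hash, and the names hdir_insert and hdir_lookup
   are assumptions made for illustration; entries are never freed in
   this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKET 8     /* fixed table size; the hash depends on it */

struct hnode { char *name; int fcb; struct hnode *next; };
static struct hnode *bucket[NBUCKET];

static unsigned hash(const char *s)
{
    unsigned h = 5381;                    /* simple multiplicative hash */
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKET;
}

/* Returns the FCB number stored for name, or -1 if absent. */
int hdir_lookup(const char *name)
{
    for (struct hnode *n = bucket[hash(name)]; n != NULL; n = n->next)
        if (strcmp(n->name, name) == 0)
            return n->fcb;
    return -1;
}

/* Returns 0 on success, -1 on duplicate name or out of memory. */
int hdir_insert(const char *name, int fcb)
{
    assert(name != NULL);
    if (hdir_lookup(name) >= 0)
        return -1;
    struct hnode *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->name = malloc(strlen(name) + 1);
    if (n->name == NULL) {
        free(n);
        return -1;
    }
    strcpy(n->name, name);
    n->fcb = fcb;
    unsigned h = hash(name);
    n->next = bucket[h];                  /* chain onto the bucket's list */
    bucket[h] = n;
    return 0;
}
```

   With more names than buckets, some chains necessarily hold several
   entries, yet a lookup still scans only one chain.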

Allocation Methods:
  How to allocate space to these files so that disk space is utilized
  effectively and files can be accessed quickly. 

  Three major allocation methods: contiguous, linked, and indexed

  Page 19

  * Contiguous: file is completely contiguous on the media - the head
    reads a whole track, then moves over one track - latency is minimal
    File location = disk address and length
    Fits nicely with sequential access
    Problem: external fragmentation - note: the same algorithms may be
    used as discussed earlier under segments

    To defrag: If disk is not large - copy all files to some other
    location then bring them back in one at a time.  Can be done offline

    More serious problem: do not know ahead of time how big a file is
    going to be when it is created.  So the initial chunk given to it
    may fill, and a new chunk may need to be obtained and copied into.

    A file may grow very slowly and be around for months

    Possible solution is extents - but then may have internal
    fragmentation if extent size is too large  (see result of defrag.pl)
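
    A sketch of the two pieces of contiguous allocation: the address
    mapping (start + offset) and a first-fit search for a hole. First
    fit is one of the placement algorithms alluded to above; the
    byte-per-block "used" array and the function names are
    assumptions of this sketch.

```c
#include <assert.h>

struct cfile { unsigned long start; unsigned long length; };  /* in blocks */

/* Physical block of logical block i, or -1 if i is past the end. */
long contig_map(const struct cfile *f, unsigned long i)
{
    assert(f != 0);
    if (i >= f->length)
        return -1;
    return (long)(f->start + i);
}

/* First fit: start of the first run of `want` free blocks in
 * used[0..n), or -1 if no hole is large enough. */
long first_fit(const unsigned char *used, unsigned long n,
               unsigned long want)
{
    assert(used != 0 && want > 0);
    unsigned long run = 0;
    for (unsigned long b = 0; b < n; b++) {
        run = used[b] ? 0 : run + 1;
        if (run == want)
            return (long)(b + 1 - want);
    }
    return -1;
}
```

    External fragmentation shows up directly: a volume can have more
    than `want` free blocks in total and first_fit() still fails
    because no single hole is long enough.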

  Page 22-25

  * Linked Allocation
   
    Solves all of above problems.  Linked list of blocks that may be
    scattered all over.

    Link info may be in the blocks - so a 512 byte block may have 508
    bytes available to a user

    Size of file need not be declared when created
    No external fragmentation

    Works fine for sequential access - 
    Problem: what about random access?
    Problem: space needed for the pointers (minor)
    Partial solution: collect blocks into clusters and use a linked
      list of clusters
    Solution: FAT - keep all the links in one table, indexed by
      cluster number, and cache a portion of it; following a chain
      then reads table entries instead of the data blocks themselves.
      If there are 4 link bytes per cluster and each cluster is 1K,
      then 1GB -> about 4,000,000 bytes (4MB) for a FAT.
    Problem: if a pointer is lost the file is corrupted.
    Partial solution: use extra space to make the link information
      redundant
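
    A sketch of random access through a FAT, plus the table-size
    arithmetic from above. The end-of-chain marker value and the
    function names are hypothetical; real FAT variants define their
    own markers and entry widths.

```c
#include <assert.h>
#include <stdint.h>

#define FAT_EOF 0xFFFFFFFFu   /* hypothetical end-of-chain marker */

/* Cluster holding logical cluster i of a file that starts at `first`,
 * found by following i links in the table; FAT_EOF if the file is
 * shorter than that. */
uint32_t fat_nth_cluster(const uint32_t *fat, uint32_t first, uint32_t i)
{
    assert(fat != 0);
    uint32_t c = first;
    while (i-- > 0 && c != FAT_EOF)
        c = fat[c];
    return c;
}

/* Bytes of table needed: one link per cluster on the volume. */
uint64_t fat_bytes(uint64_t volume_bytes, uint32_t cluster_bytes,
                   uint32_t link_bytes)
{
    assert(cluster_bytes > 0);
    return volume_bytes / cluster_bytes * link_bytes;
}
```

    Reaching logical cluster i still costs i steps, but the steps are
    lookups in a (possibly cached) table, not disk reads of the data
    clusters. For a 2^30-byte volume with 1K clusters and 4-byte
    links, fat_bytes() gives 4,194,304 bytes, the "about 4MB" figure.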

  Page 26-32

  * Indexed Allocation:
    
    Put all the "cluster" links in contiguous space - called the index
    block

    Supports direct access, without external fragmentation, because
    any free block on the disk can satisfy a request for more space. 

    Problem: may be wasted space
      Whole FAT uses say 4MB, redundancy makes it 8MB.
      For small files, index block may require 128 bytes -
      for 1000000 small files this is 128MB

    Solution: index block links to other index blocks
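
    A sketch of a two-level index lookup. PTRS is kept tiny so the
    arithmetic is easy to follow, block number 0 is assumed to mean
    "not allocated", and the array passed as `disk` stands in for
    reading an index block from the volume; all names are
    hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PTRS    4     /* pointers per index block (tiny, for illustration) */
#define NOBLOCK 0u    /* assumed marker for "no block allocated" */

struct index_block { uint32_t ptr[PTRS]; };

/* Single level: direct access to logical block i. */
uint32_t idx_lookup(const struct index_block *ib, uint32_t i)
{
    assert(ib != 0);
    return i < PTRS ? ib->ptr[i] : NOBLOCK;
}

/* Two levels: the outer block names inner index blocks, and
 * disk[k] is the index block stored in block k. */
uint32_t idx2_lookup(const struct index_block *outer,
                     const struct index_block *disk, uint32_t i)
{
    assert(outer != 0 && disk != 0);
    if (i >= PTRS * PTRS)
        return NOBLOCK;
    uint32_t inner = outer->ptr[i / PTRS];
    if (inner == NOBLOCK)
        return NOBLOCK;
    return disk[inner].ptr[i % PTRS];
}
```

    With two levels a file can address PTRS * PTRS blocks, while a
    small file pays only for the outer block and one inner block.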