20-CS-122-001 Computer Science II Spring 2012
Gene Location

Virtual functions, classes, inheritance, lists, queues, stacks, applications

Setup: DNA, the building block of all forms of life, is a double helix molecule consisting of a sequence of base-pairs or rungs. Each rung is formed from either an A-T or G-C pairing where A,T,G, and C correspond to the four DNA nucleotides of adenine, thymine, cytosine, and guanine, respectively. Because of this you only need to specify one side of the sequence to know the entire double helix. Thus, DNA is modeled in a computer program as a sequence of symbols, each taking one of four values: characters 'A', 'T', 'C', and 'G'. For example, the following represents a possible DNA subsequence:
A DNA sequence such as the one modeled above is read from left to right.

A gene is a subsequence of a DNA molecule and is one of 20 nucleic acids upon which all proteins in the body are built. A DNA molecule is partitioned into sections, called codons, containing three base-pairs each and represented or typed as three labels corresponding to the nucleotides of those base-pairs (for example, CGA). A gene subsequence begins at a codon and ends at a codon. Therefore, the number of base-pairs in a gene is always a multiple of three. There is only one codon type that can begin a gene: TAC. There are three codon types that can end a gene: ACT, ATT, and ATC. For example, the DNA sequence shown above contains the gene:

which starts at position 19 (7 th codon) in the sequence, from the left, and has 36 base-pairs (12 codons). Genes do not overlap: this implies the end codon of a gene is the first one that is encountered after the start codon in a sequence. In humans, the average gene is about 20,000 base-pairs long.

Problem: Read a file containing a DNA sequence as a single string of characters composed from the set A,T,C,G, and identify the number of genes, their starting position relative to the beginning of the file, and their length.

    Open the input file and determine the length len of the string
    Get space for the string (char *genome = new char[len];)
    Repeat the following for all characters in genome, from left to right:
       If a start codon (TAC) begins at the current genome character:
          Record the current position as temp
          Repeat the following for all characters after temp:
             If an end codon is found starting at the current_position in genome:
                 Output temp and current_position - temp + 3
                 Continue the outside loop from current_position + 3.