|20-CS-122-001||Computer Science II||Spring 2012|
CGTGACAGTCCTCTCCTTTACCGAAAGGGAAGAATAAAAGTGGCGTGATGCATTACGCA
A DNA sequence such as the one modeled above is read from left to right.
A gene is a subsequence of a DNA molecule that encodes a protein; all proteins in the body are built from 20 amino acids. A DNA molecule is partitioned into sections called codons, each containing three base-pairs and written as three letters corresponding to the nucleotides of those base-pairs (for example, CGA). A gene begins at a codon and ends at a codon, so the number of base-pairs in a gene is always a multiple of three. Only one codon type can begin a gene: TAC. Three codon types can end a gene: ACT, ATT, and ATC. For example, the DNA sequence shown above contains the gene:
TACCGAAAGGGAAGAATAAAAGTGGCGTGATGCATT
which starts at position 19 (the 7th codon) in the sequence, counting from the left, and has 36 base-pairs (12 codons). Genes do not overlap, which implies that the end codon of a gene is the first one encountered after its start codon. In humans, the average gene is about 20,000 base-pairs long.
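The codon arithmetic in this example can be checked with a few lines of C++. The sketch below is only an illustration: the helper name splitIntoCodons is hypothetical and not part of the assignment, and it assumes codons are counted from the first character of the sequence.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the assignment): split a DNA string into
// consecutive three-letter codons, reading left to right. A trailing
// fragment shorter than three characters is ignored.
std::vector<std::string> splitIntoCodons(const std::string& dna) {
    std::vector<std::string> codons;
    for (std::size_t i = 0; i + 3 <= dna.size(); i += 3) {
        codons.push_back(dna.substr(i, 3));
    }
    return codons;
}
```

Applied to the example sequence, element 6 of the result (the 7th codon, which begins at position 19) is TAC and element 17 (the 18th codon) is ATT. That is a span of 12 codons, or 36 base-pairs.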
Problem: Read a file containing a DNA sequence as a single string of characters drawn from the set {A, T, C, G}, and report the number of genes together with each gene's starting position relative to the beginning of the file and its length.
Open the input file and determine the length len of the string
Get space for the string (char *genome = new char[len];) and read the file into it
Repeat the following for all characters in genome, from left to right:
    If a start codon (TAC) begins at the current genome character:
        Record the current position as temp
        Repeat the following for each codon after temp, advancing three characters at a time:
            If an end codon (ACT, ATT, or ATC) is found starting at current_position in genome:
                Output temp and current_position - temp + 3
                Continue the outside loop from current_position + 3
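The steps above might be sketched in C++ roughly as follows. This is one possible reading, not a reference solution, and it makes several assumptions:

- It uses std::string in place of the manually allocated char array.
- The names Gene, codonIs, isEndCodon, findGenes, and readGenome are illustrative.
- The search for an end codon advances three characters at a time, so that gene lengths are multiples of three.
- A start codon with no end codon after it is skipped.
- Positions are reported counting from 1, to match the example.
- A trailing newline in the file is dropped.
- C++11 or later is assumed.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Illustrative record for one gene (not part of the assignment text).
struct Gene {
    std::size_t start;   // position of the start codon, counting from 1
    std::size_t length;  // number of base-pairs, end codon included
};

// True if the three characters beginning at index i spell the given codon.
static bool codonIs(const std::string& dna, std::size_t i, const char* codon) {
    return i + 3 <= dna.size() && dna.compare(i, 3, codon) == 0;
}

// True if one of the three end codons begins at index i.
static bool isEndCodon(const std::string& dna, std::size_t i) {
    return codonIs(dna, i, "ACT") || codonIs(dna, i, "ATT") ||
           codonIs(dna, i, "ATC");
}

// Scan left to right for TAC, then step codon by codon to the first end codon.
std::vector<Gene> findGenes(const std::string& dna) {
    std::vector<Gene> genes;
    std::size_t i = 0;
    while (i + 3 <= dna.size()) {
        if (!codonIs(dna, i, "TAC")) {
            ++i;
            continue;
        }
        std::size_t temp = i;        // index of the start codon
        std::size_t cur = temp + 3;  // first codon after the start codon
        while (cur + 3 <= dna.size() && !isEndCodon(dna, cur)) {
            cur += 3;
        }
        if (cur + 3 > dna.size()) {
            ++i;  // assumption: no end codon found, so skip this start codon
            continue;
        }
        genes.push_back({temp + 1, cur - temp + 3});
        i = cur + 3;  // genes do not overlap: resume after the end codon
    }
    return genes;
}

// Read the whole file into a string, dropping trailing line breaks.
// Returns an empty string if the file cannot be opened.
std::string readGenome(const std::string& path) {
    std::ifstream in(path);
    std::string dna((std::istreambuf_iterator<char>(in)),
                    std::istreambuf_iterator<char>());
    while (!dna.empty() && (dna.back() == '\n' || dna.back() == '\r')) {
        dna.pop_back();
    }
    return dna;
}
```

Calling findGenes on the example sequence from the top of the page yields a single gene with start 19 and length 36. The number of genes is the size of the returned vector.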