|20-CS-110-001||Introduction to Computer Science||Fall 2010|
DNA Gene Identification
Due: November 9, 2010
Submit one matlab program as two separate files via blackboard. See submit page for instructions.
|The objectives of this assignment are to a) understand rudimentary pattern matching algorithms; b) understand how important it is to design code that runs efficiently in the presence of very large data files.|
The Human Genome Project
The Human Genome Project was an international effort to discover all of the approximate 30,000-100,000 human genes (the human genome) and to determine the complete sequence of the 3 billion DNA subunits (called bases). The results of the project have been made publicly available in an effort to lower the barriers to effective biomedical research. The project was formally begun in October 1990 and was planned to last 15 years. However, rapid technological advances resulted in the completion of the project by 2003, two years ahead of schedule.
As part of the project, parallel studies were carried out on selected model organisms, such as the bacterium E. coli and the Drosophila, to help develop the technology to quickly sequence genes and interpret human gene function. The U.S. Department of Energy's Human Genome Program (operated by Lawrence Berkley Laboratory) and the National Institutes of Health's National Human Genome Research Institute (NHGRI) together make up the U.S. Human Genome Project.
Private companies are also racing to be the first to completely map the genome. Companies like Celera, Affymetrix, and Gene Logic, are hoping to patent their information before the government can discover it and release it freely. If you are more interested in the HGP then check out the website at http://www.ornl.gov/hgmis or USAToday.com's set of stories on genomics.
Computer Modeling of DNA
TACCGAAAGGGAAGAATAAAAGTGGCGTGATGCATTwhich starts at position 19 (7th codon) in the sequence, from the left, and has 36 base-pairs (12 codons). Genes do not overlap: this implies the end codon of a gene is the first one that is encountered after the start codon in a sequence. In humans, the average gene is about 20,000 base-pairs long.
Important: all codons begin at positions in the DNA sequence which are multiples of 3, plus 1. That is, if p mod 3 == 1 then p is a possible codon start position. In other words, a codon may begin at any of the positions 1,4,7,10,13,16,...
Scientists usually send off a chunk of DNA to be sequenced by an automatic sequencing machine. The results come back as shown in the figure below.
where the colors indicate the probability that one of the four nucleotides is at a particular spot. This type of figure is converted into a sequence of letters, like ACGT.... It is up to the scientists to analyze the string of letters to determine what the sequence actually does.
|Your assignment is to read a file containing a genetic sequence
and identify the number of genes, their starting position relative to
the beginning of the file, and their length. Your program will output
a report to the screen and to a file indicating who generated the
report and the information listed above. To do this, you will have to
write a function that automatically searches for and finds the start
and end codons. The input file format is just a string of 'A',
'C', 'T', and 'G' characters, all on one
Write the program to solve this problem as a function called pattern which takes two arguments: the first, named filename, specifies the name of the input file and the second, named identity, specifies who is using the program. When run, the user specifies the name of the file to be read. For example,
> pattern('short.dat','John Franco');is how to execute your code from Matlab assumming that a file named short.dat exists in the current directory and the user is the instructor.
Analysis and Coding:
|Read the given file into a variable genome as follows:
fid = fopen(filename,'rt'); % Open the input file if fid < 0 % Quit if file is not opened error(['File ' filename ' not found']); end genome = fgetl(fid); % Read the file into genome fclose(fid); % Close the input filewhere filename was obtained as an argument to the pattern function.
Open an output file named outfile.txt for writing like this:
fout = fopen('outfil.txt','wt'); % Open the output file if fout < 0 % Quit if file is not opened error('File outfile.txt cannot be opened for writing'); end
Write to the output file using fprintf like this:
fprintf(fout,'DNA file %s.dat output produced for %s on %s\n\n', ... filename, identity, date());which is a reasonable output file header.
How it works, in pictures:
Looking for T in TAC with the red marker.
. . .
How it works, in Matlab:
Establish a variable, say strt_codon, as an index into genome that points to the current start codon for a gene. Initially, strt_codon is 1. Establish a variable, say end_codon, as an index into genome that points to a codon that is to be checked to see if it is one of three end codon patterns. This variable is not set until a start codon has been located. When that happens, end_codon takes the value of strt_codon plus 3. The end_codon cursor advances and tests are made until an end codon is found. At that point, statistics are recorded about the gene that has been discovered and the strt_codon variable is set to end_codon plus 3. The strt_codon cursor then advances and tests are made to locate a start codon. When one is found, the above process repeats.