Hints for Lab 5

Gene Discovery in DNA:

Read the given file into a variable genome as follows:

   fid = fopen(filename,'rt');         % Open the input file
   if fid < 0                          % Quit if file is not opened
      error(['File ' filename ' not found']);
   end
   genome = fgetl(fid);                % Read the file into genome
   fclose(fid);                        % Close the input file
where filename was obtained as an argument to the pattern function.

The genome variable is a string of characters. That means it is easy to access each individual character: for example, the tenth character is obtained using genome(10).

Open an output file named outfile.txt for writing like this:

   fout = fopen('outfil.txt','wt');    % Open the output file
   if fout < 0                         % Quit if file is not opened
      error('File outfile.txt cannot be opened for writing');
   end

Write to the output file using fprintf like this:

   fprintf(fout,'DNA file %s.dat output produced for %s on %s\n\n', ...
           filename, identity, date()); 
which is a reasonable output file header.

Use an outer loop to locate the beginning of a gene, then use an inner loop to locate the end. Suppose variable i is used in the outside loop to obtain gene-starting characters in the genome string and j is used in the inside loop to obtain gene-ending characters. The value of i is fixed while looking for a gene-ending sequence in the inner loop. When a gene-ending sequence is discovered in the inner loop the value of i must be updated to be the current value of j plus 3. Thus, the outer loop must be a while loop (a for i=... loop does not allow changing i in its body).

Gene start and end sequences must be on three pair boundaries. That is, the test for a start codon can occur only for i=1,4,7,10,13,16,... and the test for an end codon can occur only for j=4,7,10,13,16,... This means the inner for loop should look like this:

      for j=i+3:3:length(genome)-2
         ...
      end
The test for the start codon looks like this:
    genome(i) == 'T' && genome(i+1) == 'A' && genome(i+2) == 'C'
The test for an end codon looks like this:
    genome(j) == 'A' && genome(j+1) == 'T' && genome(j+2) == 'T' ||...
    genome(j) == 'A' && genome(j+1) == 'T' && genome(j+2) == 'C' ||...
    genome(j) == 'A' && genome(j+1) == 'C' && genome(j+2) == 'T'