Biology 125:
Introduction to the DNA database.
The purpose of this exercise is to acquaint you with the basic features of the DNA database and to provide you with an introductory experience in the use of the database.
Searches of the DNA database have several uses. Some of them are:
--To identify an unknown gene that has been cloned based on a criterion that does not depend on the DNA sequence, such as the fact that the message is present in a specific tissue or at a specific developmental stage, or that the clone complements a mutation.
--To determine what the similarities are between a gene of known or suspected function and genes with the same function from other organisms.
Part I: Familiarize yourself with the website and procedures for database searching.
You would often begin with the DNA sequence of a clone you have obtained. Since we have not done a cloning experiment, we will begin with the sequence of a known gene. Each submission to the database is assigned an Accession Number, which is published in the paper describing the research and can be used to retrieve the sequence from the data base.
1. Go to the WEB page: www.ncbi.nlm.nih.gov/ This takes you to Entrez, the search engine page.
2. From the Search menu at the top left, choose Nucleotide. This is the database you will search. Other data bases include PubMed, which you would use to look for a publication, and Protein, which you would search for a protein sequence. Every DNA sequence in the database has been assigned an accession number. In the box at the top of the page, enter the Accession Number: AB182481. Click on GO.
3. The data base will find one match. To get to the entry, click on the blue 1. Now you see the name of the entry and it tells you that this entry describes the mRNA of a Dicer-related gene from Tetrahymena. Click on the accession number.
You should now have the data base entry on the screen. It has lots of information, such as the name and phylogenetic position of the organism from which the DNA was cloned, the type of clone, and the reference to the published paper. Scroll down to see both the DNA sequence and the deduced amino acid sequence encoded by the given DNA (each letter corresponds to a specific amino acid).
Annotation is provided on the left side under Features to help orient you on the clone. For example, CDS stands for coding sequence and indicates the nucleotides that encode the protein. If there are introns in the sequence, the CDS will consist of several noncontiguous regions. In this example no introns are expected because the query is the sequence of the mRNA, from which any introns have already been removed.
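The same entry can also be requested without the web page, through NCBI's E-utilities service. The sketch below only builds the address; the helper name efetch_url is our own choice for illustration and is not part of any NCBI tool. Pasting the printed address into a browser should return the record as plain text.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities "efetch" service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="gb"):
    """Build a URL that asks efetch for one Nucleotide (nuccore) record.

    rettype="gb" requests the full GenBank entry; rettype="fasta"
    requests only the sequence, in FASTA format.
    (efetch_url is a hypothetical helper written for this handout.)
    """
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


print(efetch_url("AB182481", rettype="fasta"))
```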
4. Copy the DNA sequence into the memory of the computer.
5. Go to the WEB page www.ncbi.nlm.nih.gov/BLAST
This page offers you several ways to analyze your nucleotide sequence, listed in the section under “Choose a BLAST program to run”. There is a short description of what each program does. Nucleotide BLAST (sometimes called BLASTn) compares your nucleotide sequence to all of the nucleotide sequences in the database, on both strands, and looks for matches.
6. Click on nucleotide BLAST at the left.
7. Assign a name to your sequence and type it in the first line of the box that says “Enter Query Search”, preceded by a >. For example:
>practice gene
This is called the fasta format.
8. “Return” to the next line and copy the nucleotide sequence into the box. Do not be concerned that there are some numbers interspersed with the nucleotide sequence. Leave the box that says “query subrange” at the right blank. You would use this, for example, if you had the sequence for a long genomic clone and wanted to search the database for homologies to one open reading frame.
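BLAST ignores the interspersed numbers, but if you would like a clean copy of the sequence for your own records, the tidying can be sketched as follows. The function name to_fasta and the short pasted fragment are made up for illustration; the fragment is not the real AB182481 sequence.

```python
import re


def to_fasta(name, raw_sequence, width=60):
    """Return a FASTA record: a '>' header line, then the sequence.

    Digits and whitespace (as found in a sequence copied from a
    database entry) are removed before the sequence is wrapped
    into lines of at most `width` letters.
    """
    sequence = re.sub(r"[\d\s]", "", raw_sequence).upper()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines)


# A made-up fragment laid out the way a database entry displays it.
pasted = """
        1 atgagcaata ttgttgttat
       21 ggagatattc
"""
print(to_fasta("practice gene", pasted))
```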
9. Choose a database. Look at the possible choices in the menu. There are various EST databases. EST stands for expressed sequence tag; these databases contain sequences from cDNA clones or partial cDNA clones. Choose nucleotide collection (nr/nt). nr stands for nonredundant, and the data base will screen out lots of multiple copies of sequences from various genome projects.
10. Click on BLAST at the bottom left.
The next page will give you a Request ID. Write it down:_________________________
It will take a few seconds to a few minutes to process your request, depending on how busy the database is, i.e. the number of requests it is currently processing. If it takes too long and you are getting impatient, you can go and do other things and then use the request ID to retrieve the information after it is done. Don’t do this now, but to use the request ID, you would scroll down to the box at the bottom on the right side of the BLAST page (two pages back) and click on “Retrieve results”. When the screen comes up, fill in your ID number.
11. Soon a list of all the matches or “hits” will come up. The color bars indicate the degree of similarity between your query sequence and the matching sequences or “hits” found in the database. Examine the key at the top. Red is an excellent match, pink is good, green is OK, etc. The bars below symbolize the matches, with the best match first, and other matches in order of decreasing similarity.
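Scroll down to see a list of two sequences or hits from your search. The expect value or E value in the last column is the number of matches with an equal or better score that would be expected by chance in a database of this size, so it indicates how significant the match is. The smaller the value, the better the match. A value of e-40 is highly significant. A value of e-05 is worth investigating further. The E value for both of these hits is 0.0, which means that a match this good is essentially never expected by chance; here both hits are perfect matches to the search sequence.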
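To get a feel for how the E value depends on the score and on the size of the search, here is a rough sketch based on the standard relation E = m x n x 2^(-S'), where S' is the bit score, m is the query length and n is the total length of the database. BLAST itself uses corrected ("effective") lengths, so treat the numbers as approximate; the function name and the example lengths are assumptions made for illustration.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E value from a bit score: E = m * n * 2**(-S').

    This ignores the length corrections that BLAST applies, so it is
    only a rough guide to the values printed in the results table.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: a 500-letter query against 1,000,000,000 letters.
for score in (40, 60, 80):
    print(score, expect_value(score, 500, 1_000_000_000))
```

Each additional bit of score halves the E value, which is why high-scoring matches have E values very close to zero.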
Scroll down below the list of matches to see the alignment of your query sequence with each of the matching sequences. You see perfect matches for each of two hits, because they are the same sequence you took out of the database to use as your query. The first sequence was submitted by the authors of the paper that described the gene. The second hit is the same gene from the Tetrahymena genome project.
You did not find any additional matches to your gene, so try a different kind of search.
14. Go back to the BLAST page www.ncbi.nlm.nih.gov/BLAST, and this time choose the program BLASTx. This program takes your nucleotide sequence, translates it into protein sequences in all six reading frames, and searches the protein database for protein sequences that match.
15. Copy the name and nucleotide sequence into the search box. You are asking the program to translate your nucleotide sequence, so you must tell it which genetic code to use. From the menu that says “genetic codes”, choose “ciliate nuclear” because the query gene is from a ciliated protozoan, Tetrahymena. In Tetrahymena, the codons TAA and TAG, which are stop codons in the canonical genetic code, encode glutamine. What will be the consequences if you do not choose the ciliate genetic code?
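The translation step that BLASTx performs can be sketched in a few lines. This is only an illustration of six-frame translation with two genetic codes, not the code BLAST runs; all function and variable names are our own. Try it on a short sequence with each code and compare the results.

```python
BASES = "TCAG"

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
STANDARD_AA = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Ciliate nuclear code: TAA and TAG encode glutamine (Q); TGA is still a stop.
CILIATE_AA = STANDARD_AA.replace("YY**", "YYQQ")


def codon_table(amino_acids):
    """Map each of the 64 codons to a one-letter amino acid (or '*')."""
    codons = [a + b + c for a in BASES for b in BASES for c in BASES]
    return dict(zip(codons, amino_acids))


STANDARD = codon_table(STANDARD_AA)
CILIATE = codon_table(CILIATE_AA)


def reverse_complement(dna):
    """Return the sequence of the opposite strand, written 5' to 3'."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, table):
    """Translate complete codons from the first base; unknown codons give X."""
    dna = dna.upper()
    return "".join(table.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna, table):
    """Return translations for frames +1, +2, +3, -1, -2 and -3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[strand + str(offset + 1)] = translate(seq[offset:], table)
    return frames


print(translate("ATGTAAGGA", STANDARD))
print(translate("ATGTAAGGA", CILIATE))
print(six_frames("ATGAAA", CILIATE))
```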
16. Click on BLAST.
Now there are lots of hits!!! The colored lines in the box at the top indicate the individual hits, in decreasing order of similarity to the query sequence. The alignments for each individual hit are shown below. They match your query sequence with sequences in the database, using the single letter code to indicate each amino acid.
Again, you have a perfect match to the two genes in the data base that you are using as your query sequence. Which end of the protein encoded by the Tetrahymena gene is conserved in the other hits? Look at the third alignment, below the list of matches. This is a similar gene from zebrafish (Danio rerio are the genus and species names). The top line in the alignment indicates the translation of your query sequence, which was a nucleotide sequence. The bottom row, or subject, is the protein sequence that matches the translation of your query. The letters in the middle row show amino acids that are identical between the two sequences. + indicates amino acids that are similar, such as two nonpolar amino acids (I and L), two basic amino acids (K and R), etc.
The numbers at the beginning and end of each line indicate the nucleotide or amino acid in that position of the respective sequences. Note that the range for the query sequence is three times as big as the range for the subject sequence. Why is it that the numbers at the ends of a line for the query and the matching sequence do not coincide 1:1? Hint: think about the kind of search that was done.
Why did you get so many more matches with this program than you did with BLASTn?
Scroll down the list of alignments and note that most of them say “dicer”. This is a good sign, because if your query sequence matches proteins with the same function from several different organisms, it is likely that your query sequence has the same function.
You can find out more about the sequences in the various hits by clicking on the Accession numbers, shown in blue at the left of the list. This will produce the data base entry for the subject sequence and a description of the data similar to that you saw with your first query sequence. In many cases a reference to a paper is provided, so that you can read more about the work on that gene or protein.
17. Go back to the data base entry for AB182481 and this time copy the protein sequence into the computer memory.
Go to the BLAST site and this time select protein BLAST from the box as the program. This program compares an amino acid query sequence to a data base of protein sequences. Enter your sequence in the fasta format and click on BLAST.
Now you see a notice that putative conserved domains have been detected. A domain is a region of a protein that has a particular function, such as a transmembrane domain (a part of the protein that crosses a membrane), an enzymatic activity, a ligand binding domain, etc. The display on your computer shows red bars under a line that represents your protein. It indicates that the conserved domains are in the C-terminal half of the protein. Click on one of the red bars labelled RIBOc. This brings up a description of the domain activity.
Go back to the BLAST page and click on FORMAT. Again you see a list of gene hits and below it are the alignments of the amino acids for each hit with your query sequence.
What is the likely function of this gene?
Part II: Analyzing a partial cDNA clone.
In reality, the first sequence you obtain from a cloning project will not usually be a complete sequence of a gene. Now try a more realistic situation where you start with a partial cDNA clone.
1. Begin with the following sequence, which is posted on the D2L page for our course:
>Tetrahymena cDNA
TAGCAGAACCAGCAACTTTAAATTGAATTAGCTATCAAAAAGAAGGTTTTC
AGGTTTTAAATCACGATGAGCAATATTGTTGTTATGGAGATATTCCAAAGC
ATCAATTAATTGATGGAAGTAAGCTCTAGCGACTTCTTCAGAGAAACGACC
ACTATTGGCAACATATTCGAAGAGTTCACCACCAGCGCAGTATTCTAAAAT
AATACCTAATGCTTCATAAGTTTAACCACTCTTCTTGGTGTAAGTTCCGTT
AGGAATAACTTCAACAAGGTTGACCAAGTTAGGATGATTAAGTTTTTGCAT
AACGCCAATTTCGTGCTTCAAAGTTTTCATGTTAGAGGCTAAGCTATGAGT
GCTCTTGAAAATCTTAGCAGCAACTTATTAGCCATTGAATGAAGCTAATTT
AACCTTGACATGATAACCAGCACCAAGTGTTTTGCCTAAAATATAGTTATT
TAAAGTGGCGTGCTTTTTGTTCAT
This sequence was obtained serendipitously by cloning an RT-PCR artifact. What does that suggest about the cloned sequence?
There are two goals. First, identify the function of the gene you are analyzing. Second, obtain the complete sequence of the gene and annotate it with respect to the location of the translational start codon, the end of the open reading frame, and the location of any introns.
2. Do a BLAST search. Based on your experience in the first part of this exercise, which program will you use?
You may want to copy the first few matches of the BLAST search into a word document for future reference. When making documents of sequences, you should use Courier, Courier New or Monaco fonts. For those fonts, all letters are the same width and therefore alignments will be maintained.
What is the expect value for the best match that is not a Tetrahymena gene and has a putative function assigned to it? __________
What is the likely function of the protein encoded by the Tetrahymena gene?
3. Now you should figure out which strand contains the open reading frame of the Tetrahymena gene. Look at the match between the query sequence and the subject in your first hit. The last line of information above the alignment tells you the reading frame in the query sequence that produced the match. If it is +1, +2 or +3, that is the frame of the query sequence that, when translated, gives you the amino acid sequence for the query in the match. If it is -1, -2 or -3, that means that the amino acid sequence in the query was obtained by translating that frame of the complementary sequence. If that is the case, you will want to generate the sequence on the RNA-like strand. You can do this by going to the web site http://www.bioinformatics.org/JaMBW/2/1/index.html
Look at the directions at the top under “Aim” and look at the different manipulations the program will do. Which one do you want? Paste your DNA sequence in the box for input sequence. Click on “Convert”. To check that you have done this correctly and chosen the correct form of the sequence, do the BLASTx search with this sequence. You should get the same hits as you did before, but now in one of the + reading frames.
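If you are unsure how the manipulations offered by such a tool differ, the sketch below applies three of them to a short test sequence so you can compare the outputs. The function names are our own, and the choice of which manipulation you need is left to you.

```python
# Translation table that swaps each base for its pairing partner.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(dna):
    """Same strand, read backwards."""
    return dna[::-1]


def complement(dna):
    """Opposite strand, written in the same left-to-right order."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Opposite strand, written 5' to 3'."""
    return complement(dna)[::-1]


example = "ATGC"
print("input:             ", example)
print("reverse:           ", reverse(example))
print("complement:        ", complement(example))
print("reverse complement:", reverse_complement(example))
```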
4. The next goal is to figure out what part of the gene is in your cDNA clone. To get an idea, compare the region of the match between the query sequence and the subject sequence of at least two of the hits. Compare the numbers at the ends of the protein alignment. Do you think you have the 5’ or the 3’ end of the gene? You can get the entire protein sequence for the hits by double clicking on the accession number at the top of the alignment. What part of the subject proteins aligns with your query sequence and what part of the gene encoded by the partial cDNA clone is likely to be missing?
5. In order to determine the sequence of the entire gene, you will need additional DNA sequence. Luckily, the Tetrahymena genome has been sequenced so you will not be required to do several weeks or months of genomic restriction mapping and cloning. Go to http://seq.ciliate.org/cgi-bin/blast-tgd.pl to do a search. Copy your mRNA-like sequence into the Query box in fasta format. You are looking for additional DNA sequence, so what program will you use? Next, choose a database to search. Use T. thermophila mac genome (DNA only). Click on “Start search”.
The first match is perfect over a long stretch of about 400 bp of the cDNA clone with a scaffold. Notice that the match begins with bp 71 of the query sequence. Note that below the first match you have another alignment that shows a close match to the first 70 bp of the clone. Compare the coordinates in the subject sequence near nucleotide 70 of the query sequence for the two alignments. How do you explain this difference? Why are the matches to the query sequence in two parts? (Hint: Remember that the source of the original clone that is the query sequence was an RT-PCR product.)
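The coordinate comparison can be sketched as simple arithmetic. The numbers below are hypothetical and are NOT the ones from your search; the sketch also assumes that both alignments lie on the same strand of the same scaffold and are listed in order of increasing coordinates. Substitute the coordinates from your own results.

```python
def gaps_between(first, second):
    """Return (query_gap, subject_gap) between two alignments.

    Each alignment is (query_start, query_end, subject_start, subject_end),
    with 1-based, inclusive coordinates. A gap of 0 means the second
    alignment begins at the base right after the end of the first.
    """
    query_gap = second[0] - first[1] - 1
    subject_gap = second[2] - first[3] - 1
    return query_gap, subject_gap


# Hypothetical coordinates for illustration only.
upstream = (1, 70, 5001, 5070)
downstream = (71, 470, 5171, 5570)
print(gaps_between(upstream, downstream))
```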
To check your hypothesis:
Note that above the alignments there is a box with TTHERM_00220990. This is the number given to the gene prediction in the Tetrahymena genome database. Click on that number to get more information on the database entry.
This page gives you some information on cDNA clones that have been isolated for this gene. Click on TTHERM_00220990 again. This page gives you numbers and links to homologous genes from various organisms.
Now go to the Sequence ID (at the left) and double click on the pink TTHERM_00220990. This page gives you information about the gene as predicted by the computer program. There is a picture of the gene with broad red lines (exons) and thin black lines (introns). This is the structure of the gene as predicted from the sequence by a computer program. Did the computer predict an intron at the position in the gene that your cDNA clone indicates? How does the size of the predicted intron compare with your data?
Now go to the top of the page and click on Download Sequence. This gives you the genomic sequence for the predicted gene. Copy it into the computer.
Now we are going to align the genomic sequence you just copied with the sequence of the cDNA clone.
Go to http://bioinfo.genopole-toulouse.prd.fr/multalin/. (What country is this site in?)
Paste your genomic sequence and your cDNA sequence into the Sequence data box, both in the fasta format. Give them completely different names, or the computer program might not run. Call them >genomic sequence and >cDNA clone.
Go down to the Optional Parameters area. For Symbol comparison table, choose DNA-5-0 from the menu. This tells the program that you are aligning nucleotide sequences, as opposed to amino acid sequences.
Scroll down and put 100 in the box that says Maximum line length.
Scroll down to the bottom and click on Start MultAlin.
Look at the alignment. Computers are very fast and efficient, but they don’t think! From what you know about the splicing signals of introns, how do you think this alignment should be corrected?
Name: ____________________________
Major: ____________________________
How much time did you spend on this exercise? (Include 50 min. of discussion time).
_____________
Study questions:
Part I:
14. Why are there six reading frames, rather than three, in the query of a BLASTx search?
15. What will happen if you do not use the ciliate codon usage?
16. Why is the range in numbers of the query sequence three times as large as the range of numbers in the subject sequences for each hit in this search?
Why did you get so many more hits with the BLASTx search than you did with the BLASTn search of the same sequence?
17. The conserved domains are in the C-terminal half of the protein. What is the function of this conserved domain?
From your inspection of the conserved domains, what do you think a consensus sequence is?
Part II
1. The Tetr. cDNA #1 sequence was obtained as an RT-PCR artifact. What does that suggest the sequence is? That is, is it a clone of genomic DNA or cDNA? Is the sequence expected to contain introns or not?
2. Which program and data base did you choose to determine the identity of the unknown cDNA clone?
If you did not do BLASTx, do that now. If you did not choose BLASTx at first, what is the difference between the result you got and the one with BLASTx?
What is the likely function of the Tetrahymena gene for which you have the partial cDNA clone?
What is the expect value for the best match? __________
4. What part of the subject proteins aligns with your query sequence?
What part of the gene encoded by the partial cDNA clone is likely to be missing?
5. Which BLAST program did you use to obtain additional DNA sequence for your gene?
How do you explain the fact that the query sequence matches two different parts of the scaffold?
6. Computers are fast and powerful, but they don’t think. How would you revise the alignment of the cDNA clone with the genomic clone?