This is a small article which attempts to clarify for me (and perhaps for others) the size and the information content of the human genome. Many thanks are due to Larry Moran, John Wilkins, and Paul Myers for informative correspondence and articles. All errors are the product of yours truly.
The human genome consists of approximately 3200 million base pairs of nucleotides. This is the haploid figure, spread over 23 chromosomes; most body cells carry two copies, for 46 chromosomes in all. These numbers are fairly straightforward to establish by physical measurement.
The genome is divided into unique-sequence DNA and repetitive DNA. Approximately 1/3 of the genome is unique-sequence DNA; the remainder is repetitive DNA. All of the genes are located in the unique-sequence component, and genes take up a little more than 30% of this fraction. This works out to about 10% of the genome, or roughly 320 million base pairs, being devoted to genes.
A gene is not a single sequence of coding nucleotides. It has introns (sections of non-coding nucleotides) within it. Approximately half of the nucleotides within a gene are within introns. Thus the percentage of the genome that is actually coding DNA is about 5% (it may be as low as 3%). This gives the following breakdown:
| Component | Approximate size |
|---|---|
| Genome DNA | 3200 million base pairs |
| Unique sequence DNA | 1000 million base pairs |
| DNA comprising genes | 320 million base pairs |
| Coding DNA | 160 million base pairs |
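The arithmetic behind this breakdown can be sketched in a few lines of Python. This is only an illustration under the article's rounded fractions (one third unique sequence, 30% of that in genes, half of each gene coding); the function and parameter names are invented for the example.

```python
# Rough genome breakdown in millions of base pairs (Mbp).
# All fractions are the article's approximations, not measured values.

def genome_breakdown(genome_mbp=3200,
                     unique_fraction=1 / 3,
                     gene_fraction_of_unique=0.30,
                     coding_fraction_of_gene=0.5):
    """Return the size of each component in millions of base pairs."""
    unique = genome_mbp * unique_fraction       # unique sequence DNA
    genes = unique * gene_fraction_of_unique    # DNA comprising genes
    coding = genes * coding_fraction_of_gene    # genes minus introns
    return {"genome": genome_mbp, "unique": unique,
            "genes": genes, "coding": coding}


if __name__ == "__main__":
    for name, size in genome_breakdown().items():
        print(f"{name:>7}: {size:7.0f} Mbp")
```

Note that one third of 3200 is closer to 1067 than to 1000; the unique-sequence figure quoted above is rounded down.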
The number of genes in the genome is not known. Early estimates ranged from 10,000 to 170,000. Currently the number of genes is estimated to be about 100,000. However, a number of genes are duplicated; the number of distinct genes may be as low as 50,000.
The first step in expressing a gene is to transcribe its DNA into RNA. Each DNA nucleotide specifies exactly one nucleotide in the corresponding RNA. Most of these RNA sequences are then translated into protein using the genetic code. A small percentage (approximately 5% to 10%) of the genes encode functional RNA instead. There are four types of ribosomal RNA (rRNA), about 100 types of transfer RNA (tRNA), and about 50 other types of functional RNA. There are multiple copies of most of these genes, as many as 1000 copies in the case of the rRNA genes.
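The two steps just described (DNA to RNA, then RNA to protein) can be illustrated with a minimal Python sketch. It uses the standard genetic code; the function names and the example sequence are made up for illustration. It also assumes the input is the coding (sense) strand, so that transcription reduces to replacing T with U. In the cell, the RNA is actually built against the complementary template strand.

```python
from itertools import product

# Standard genetic code, one letter per codon, with codons enumerated
# in the order UUU, UUC, UUA, UUG, UCU, ... ("*" marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(coding_dna: str) -> str:
    """Assumes the coding strand is given: the RNA is the same with U for T."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the RNA three bases at a time until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    rna = transcribe("ATGGCCTAA")   # hypothetical 9-base example
    print(rna)                      # AUGGCCUAA
    print(translate(rna))           # MA (methionine, alanine, then stop)
```

The table has 64 codons but only 20 distinct amino acids plus the stop signal, which is the redundancy of the genetic code.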
The genes that code for protein use three nucleotides for each amino acid in the protein, following the genetic code. It is estimated that the average protein is about 500 amino acids long. This corresponds to 1500 nucleotides in the coding part of the gene. If there are 100,000 genes with 1500 coding nucleotides per gene, then the number of coding base pairs would be 150 million, which is consistent with the estimate of about 160 million reached above.
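This cross-check is simple enough to write down directly. The sketch below assumes the article's round figures (100,000 genes, 500 amino acids per protein); the function name is invented for the example.

```python
def coding_base_pairs(gene_count, avg_protein_length, nucleotides_per_codon=3):
    """Coding base pairs implied by a gene count and an average protein length."""
    return gene_count * avg_protein_length * nucleotides_per_codon


if __name__ == "__main__":
    total = coding_base_pairs(gene_count=100_000, avg_protein_length=500)
    print(f"{total / 1e6:.0f} million coding base pairs")  # 150 million
```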
It is instructive to translate these numbers into the jargon of computer science. A base pair can have any of four possible values (A, G, C, or T), so each base pair represents 2 bits of information. Because of the redundancy in the genetic code (three nucleotides, which could spell 64 different codons, encode only 20 amino acids), a coding base represents about 1.5 bits of information. Thus the roughly 160 million coding nucleotides represent about 240 million bits, or about 30 megabytes of information.
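The conversion can be checked with a short Python sketch. It takes a megabyte to be one million bytes and treats every amino acid as equally likely, both of which are simplifying assumptions; the function names are invented for the example.

```python
import math


def bits_per_coding_base(amino_acids=20, codon_length=3):
    """Information per base if a codon only has to pick one of 20 amino acids."""
    return math.log2(amino_acids) / codon_length


def megabytes(base_pairs, bits_per_base):
    """Convert a base-pair count to megabytes (1 MB taken as 10**6 bytes)."""
    return base_pairs * bits_per_base / 8 / 1e6


if __name__ == "__main__":
    print(math.log2(4))                        # 2.0 bits for an unconstrained base
    print(round(bits_per_coding_base(), 2))    # about 1.44, rounded to 1.5 in the text
    print(megabytes(160e6, 1.5))               # 30.0
```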
If we think of the genome as the equivalent of a computer program, this is a surprisingly small number; many computer programs today are much larger. If we think of the genome as a data description of the human body, it is my professional opinion that 30 megabytes is far too small for the job. Neither model (computer program or data description file) is appropriate for the genome, however. A better analogy would be to think of the genome as a dictionary of agents, not unlike the Java applets currently so fashionable. The bulk of the information is carried dynamically in the cell itself.
This page was last updated June 13, 1997.