Most genes in human cells have quite a similar structure. There’s a region at the beginning called the promoter, which binds the protein complexes that copy the DNA to form mRNA. The protein complexes move along through what’s known as the body of the gene, making a long mRNA strand, until they finally fall off at the end of the gene.
Imagine a gene body that is 3,000 base-pairs long, a perfectly sensible length for a gene. The mRNA, which is single-stranded, will also be 3,000 bases long. Each amino acid is encoded by a codon composed of three bases, so we would predict that this mRNA will encode a protein that is 1,000 amino acids long. But, perhaps unexpectedly, what we find is that the protein is usually considerably shorter than this.
If the sequence of a gene is typed out it looks like a long string of combinations of the letters A, C, G and T. But if we analyse this with the right software, we find that we can divide that long string into two types of sequences. The first type is called an exon (for expressed region), and it can code for a run of amino acids. The second type is called an intron (for intragenic region). This doesn’t code for a run of amino acids. Instead it contains lots of the ‘stop’ codons that signal that the protein should come to an end.
When the mRNA is first copied from the DNA it contains the whole run of exons and introns. Once this long RNA molecule has been created, another multi-sub-unit protein complex comes along. It removes all the intron sequences and then joins up the exons to create an mRNA that codes for a continuous run of amino acids. This editing process is called splicing.
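To make the splicing step concrete, here is a minimal Python sketch. The gene is a made-up toy (real exons and introns are far longer), and the segment list, the sequences and the function names `transcribe` and `splice` are illustrative assumptions rather than anything from a real genome or software package. It shows why a prediction based on the full length of the gene overshoots the real size of the protein.

```python
# Hypothetical toy gene, for illustration only.
# Exons are written in uppercase, introns in lowercase.
segments = [
    ("exon", "AUGGCU"),
    ("intron", "guaaguuagcag"),
    ("exon", "GAUCCA"),
    ("intron", "guuagauaacag"),
    ("exon", "UUCUAA"),
]


def transcribe(segments):
    """Primary transcript: every exon and intron, in their original order."""
    return "".join(seq for _, seq in segments)


def splice(segments):
    """Mature mRNA: introns removed, exons joined in their original order."""
    return "".join(seq for kind, seq in segments if kind == "exon")


primary = transcribe(segments)
mature = splice(segments)

# Naive estimate: divide the whole transcript length by three.
naive_codons = len(primary) // 3
# After splicing, only the exon bases are left to be read as codons.
spliced_codons = len(mature) // 3

print(len(primary), naive_codons)   # length and codon count before splicing
print(len(mature), spliced_codons)  # length and codon count after splicing
```

In this toy case the spliced message holds fewer than half the codons that the naive estimate suggests, and the last of them is a stop codon rather than an amino acid.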
This again seems extremely complicated, but there’s a very good reason why evolution has favoured such a complex mechanism: it enables a cell to use a relatively small number of genes to create a much bigger number of proteins. The way this works is shown in Figure 3.3.
Figure 3.3 The DNA molecule is shown at the very top of this diagram. The exons, which code for stretches of amino acids, are shown in the dark boxes. The introns, which don’t code for amino acid sequences, are represented by the white boxes. When the DNA is first copied into RNA, indicated by the first arrow, the RNA contains both the exons and the introns. The cellular machinery then removes some or all of the introns (the process known as splicing). The final messenger RNA molecules can thereby code for a variety of proteins from the same gene, as represented by the various words shown in the diagram. For simplicity, all the introns and exons have been drawn as the same size, but in reality they can vary widely.
The initial mRNA contains all the exons and all the introns. Then it’s spliced to remove the introns. But during this splicing some of the exons may also be removed. Some exons will be retained in the final mRNA, others will be skipped over. The various proteins that this creates may have quite similar functions, or they may differ dramatically. The cell can express different proteins depending on what that cell has to do at a particular time, or because of different signals that it receives. If we define a gene as something that encodes a protein, this mechanism means that just 20,000 or so genes can code for far more than just 20,000 proteins.
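The counting behind this claim can be sketched in a few lines of Python. This is a deliberately simplified model, not a description of what any real gene does: it assumes that the first and last exons are always kept, that any internal exon can be skipped independently, and that exon order never changes. The exon labels and the function name `isoforms` are hypothetical. In a real cell only some of these combinations are ever made.

```python
from itertools import combinations


def isoforms(exons):
    """Yield each exon combination that keeps the first and last exon.

    Internal exons may be retained or skipped; order is always preserved.
    """
    first, *middle, last = exons
    for k in range(len(middle) + 1):
        for kept in combinations(middle, k):
            yield [first, *kept, last]


# Hypothetical gene with four exons.
exons = ["E1", "E2", "E3", "E4"]
variants = list(isoforms(exons))

for variant in variants:
    print("-".join(variant))

print(len(variants), "possible messages from one gene")
```

Under these assumptions, each extra internal exon doubles the number of possible messages, which is how a modest number of genes could in principle give rise to a much larger number of proteins.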
Whenever we describe the genome we talk about it in very two-dimensional terms, almost like a railway track. Peter Fraser’s laboratory at the Babraham Institute outside Cambridge has published some extraordinary work showing it’s probably nothing like this at all. He works on the genes that code for the proteins required to make haemoglobin, the pigment in red blood cells that carries oxygen all around the body. There are a number of different proteins needed to create the final pigment, and they lie on different chromosomes. Doctor Fraser has shown that in cells that produce large amounts of haemoglobin, these chromosome regions become floppy and loop out like tentacles sticking out of the body of an octopus. These floppy regions mingle together in a small area of the cell nucleus, waving about until they can find each other. By doing this, there is an increased chance that all the proteins needed to create the functional haemoglobin pigment will be expressed together at the same time[18].
Each cell in our body contains 6,000,000,000 base-pairs. About 120,000,000 of these code for proteins. One hundred and twenty million sounds like a lot, but it’s actually only 2 per cent of the total amount. So although we think of proteins as being the most important things our cells produce, about 98 per cent of our genome doesn’t code for protein.
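The arithmetic is easy to check. The short Python sketch below simply uses the two figures quoted above; the variable names are chosen here for illustration.

```python
# Figures quoted in the text above.
total_base_pairs = 6_000_000_000
coding_base_pairs = 120_000_000

coding_fraction = coding_base_pairs / total_base_pairs
non_coding_fraction = 1 - coding_fraction

print(f"Protein-coding: {coding_fraction:.0%}")
print(f"Non-coding:     {non_coding_fraction:.0%}")
```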
Until recently, the reason that we have so much DNA when so little of it leads to a protein was a complete mystery. In the last ten years we’ve finally started to get a grip on this, and once again it’s connected with regulating gene expression through epigenetic mechanisms. It’s now time to move on to the molecular biology of epigenetics.
Chapter 4. Life As We Know It Now
The important thing in science is not so much to obtain new facts as to discover new ways of thinking about them.
So far this book has focused mainly on outcomes, the things that we can observe that tell us that epigenetic events happen. But every biological phenomenon has a physical basis and that’s what this chapter is about. The epigenetic outcomes we’ve described are all a result of variations in expression of genes. The cells of the retina express a different set of genes from the cells in the bladder, for example. But how do the different cell types switch different sets of genes on or off?
The specialised cell types in the retina and in the bladder are each at the bottom of one of the troughs in Waddington’s epigenetic landscape. The work of both John Gurdon and Shinya Yamanaka showed us that whatever mechanism cells use for staying in these troughs, it’s not anything to do with changing the DNA blueprint of the cell. That remains intact and unchanged. Therefore keeping specific sets of genes turned on or off must happen through some other mechanism, one that can be maintained for a really long time. We know this must be the case because some cells, like the neurons in our brains, are remarkably long-lived. The neurons in the brain of an 85-year-old person, for example, are about 85 years of age. They formed when the individual was very young, and then stayed the same for the rest of their life.
But other cells are different. The top layer of skin cells, the epidermis, is replaced about every five weeks, from constantly dividing stem cells in the deeper layers of that tissue. These stem cells always produce new skin cells, and not, for example, muscle cells. Therefore the system that keeps certain sets of genes switched on or off must also be a mechanism that can be passed on from parent cell to daughter cell every time there is a cell division.
This creates a paradox. Researchers have known since the work of Oswald Avery and colleagues in the mid-1940s that DNA is the material in cells that carries our genetic information. If the DNA stays the same in different cell types in one individual, how can the incredibly precise patterns of gene expression be transmitted down through the generations of cell division?
Our analogy of actors reading a script is again useful. Baz Luhrmann hands Leonardo DiCaprio Shakespeare’s script for Romeo and Juliet, on which the director has written or typed various notes – directions, camera placements and lots of additional technical information. Whenever Leo’s copy of the script is photocopied, Baz Luhrmann’s additional information is copied along with it. Claire Danes also has the script for Romeo and Juliet. The notes on her copy are different from those on her co-star’s, but will also survive photocopying. That’s how epigenetic regulation of gene expression occurs – different cells have the same DNA blueprint (the original author’s script) but carry varied molecular modifications (the shooting script), which can be transmitted from mother cell to daughter cell during cell division.