The Scripps Research Institute

Interpreting the Human Genome Through Computational Biology

By Jason Socrates Bardi

“He went through a brand new planet. Paris rebuilt. Ten thousand incomprehensible avenues.”

From the film La Jetée by Chris Marker, 1963.

It has been called nothing short of the largest basic science undertaking ever, an erudite feat with a guaranteed place at the table of history, no less impressive than building the atomic bomb or landing Neil Armstrong and company on the moon: the sequencing of the human genome. Solved to over 93 percent completion and 99 percent accuracy last year by public and private efforts, the genome was published in the two major scientific journals last week by both groups in watershed papers.

Gaps and errors notwithstanding, the sequence shows the correct order and location of all 3.12 billion plus base pairs of DNA along the 23 chromosome pairs which comprise the totality of human genetic information.

Yet at the moment the genome remains deeply shrouded.

Sequencing the genome is just the beginning of the story. We must now tackle the no less impressive undertaking of figuring out what the genome means. Where are the genes and what are they doing? How is the expression of these genes regulated? How do our genes compare with similar genes in other organisms? How do different gene products interact with each other? How do slight changes in them lead to heritable traits and diseases? Can we make pharmaceuticals to target the genes or their products, turning them on or off to treat illnesses?

These questions, says Professor Ruben Abagyan of The Scripps Research Institute (TSRI) Department of Molecular Biology, are going to launch a thousand inquiries in a thousand labs across the country. He compares the coming flurry to the 1849 California gold rush.

For this reason, Abagyan and his colleagues at TSRI are among those who are preparing to take on the important task of finding ways to annotate the genome and find hidden treasures in it, i.e. to “mine the genome.”

Annotation is everything from identifying the genes within the sequence to finding their function, functional families, structures, interacting proteins, and ligands. Annotation is the interpretation of the information, the meaning of the words, the knowledge rather than the data.

Finding new ways of organizing this information is necessary because as scientists discover more and more about all the parts of the genome, the amount of information explodes—and fragments.

Scientists submit their data to many separate databases, each with its own specialty: genomic information from different species; genes, proteins, and protein families; expression levels and tissue distributions; individual sequence differences (SNPs) and associated phenotypes; small biologically active molecules; and, finally, three-dimensional structures of biological polymers and their complexes.

A Whole Less than the Sum of its Parts

How can computational biologists create the systems to access this information in a meaningful way?

"The first step," says Abagyan, "is creating an environment where you can actually browse all the information—query, extract, and analyze as you wish."

This is no easy task. Connecting each gene to other databases via ordinary hypertext links, as one might imagine doing at first, would not be a realistic way to organize the information. An annotated gene would point to several different items in several different databases, each with its own arbitrary format, naming conventions, and constructions.

For instance, a single human Ig domain in a single gene would point to hundreds of other Ig domains in other human genes and thousands in other organisms. The domain may also be linked to similar domains in a domain database, and may have links to thousands of protein domains in a structure database. The gene itself may be linked to countless other genes through similarities in sequence, structure, function, family, chromosome, or organism.

Perhaps one could surf the genomic web in this form, but who would want to? Each gene would be endlessly linked to a plurality of self-referential sites, and each new organism would add another order of complexity to the tangled mess. "After three genes, you’ll be exhausted," says Abagyan.

One could also simply look within one database or another for genomic information. But this would deny the promise of discovery that is the human genome sequence. "If you just take a piece of the genome in isolation, it’s not interesting, basically," says Abagyan. "These [databases] have all the little bits and pieces that we have to put together to make the genome alive. Otherwise, it’s just a sequence."

Abagyan and his TSRI colleagues are creating an environment in which all the information can be sorted and the redundancies removed. The individual databases must be able to be consumed, combined, digested, and displayed in a standardized, relational form. "Right now it’s a complete mess," he says. "The entire deck is shuffled."
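To make the idea of combining databases and removing redundancies concrete, here is a minimal sketch in Python. It is an illustration only, not the TSRI group's software: the three small "databases," their field names, the gene symbols, and the `normalize_symbol` and `merge_databases` helpers are all hypothetical, invented for this example. Each source names its gene field differently and spells gene symbols inconsistently; the sketch maps them onto one naming convention and collapses duplicate entries.

```python
from collections import defaultdict

# Hypothetical records from three imaginary databases. Field names, gene
# symbols, and values are invented; they do not reflect any real schema.
SEQUENCE_DB = [
    {"gene_name": "abc1", "organism": "Homo sapiens", "length_bp": 1200},
    {"gene_name": "XYZ2", "organism": "Homo sapiens", "length_bp": 950},
]
STRUCTURE_DB = [
    {"Gene": "ABC1 ", "fold": "Ig-like"},
    {"Gene": "ABC1", "fold": "Ig-like"},  # redundant entry
]
EXPRESSION_DB = [
    {"symbol": "xyz2", "tissue": "liver"},
    {"symbol": "ABC1", "tissue": "brain"},
]


def normalize_symbol(raw):
    """Map inconsistent spellings of a gene symbol onto one convention."""
    return raw.strip().upper()


def merge_databases(sources):
    """Combine (records, key_field) pairs into one catalog keyed by gene.

    Values are collected in sets, so identical annotations coming from
    redundant entries are stored only once.
    """
    merged = defaultdict(lambda: defaultdict(set))
    for records, key_field in sources:
        for record in records:
            symbol = normalize_symbol(record[key_field])
            for field, value in record.items():
                if field != key_field:
                    merged[symbol][field.lower()].add(value)
    # Convert the sets to sorted lists for stable, readable output.
    return {
        symbol: {field: sorted(values, key=str) for field, values in fields.items()}
        for symbol, fields in merged.items()
    }


catalog = merge_databases([
    (SEQUENCE_DB, "gene_name"),
    (STRUCTURE_DB, "Gene"),
    (EXPRESSION_DB, "symbol"),
])

for symbol in sorted(catalog):
    print(symbol, catalog[symbol])
```

In this toy version, six scattered records collapse into two gene entries, each carrying its sequence, structure, and expression annotations in one place. A real system would face far harder problems, such as deciding when two differently named entries refer to the same gene at all.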

The ultimate goal will be to produce a functional map of the human genome, where all the genes are identified and understood: a protein catalog of clustered genes represented by a hierarchical set of folders, using a standardized set of annotations and conventions within. Such a catalog would contain 10-fold less information than all its constituents combined.

Abagyan believes that such a map is not that far off, a few years, perhaps. "It feels like it’s within reach," he says.

Like many others in bioinformatics, the field that stands between molecular biology and computer science, Abagyan has his roots in traditional computational biology—homology modeling, molecular modeling, and docking. He wants to extend these techniques to work on the human genome.