Interpreting the Human Genome Through Computational Biology
By Jason Socrates Bardi
He went through a brand new planet. Paris rebuilt. Ten thousand
incomprehensible avenues.
From the film La Jetée by Chris Marker, 1963.
It has been called nothing short of the largest basic science
undertaking ever, an erudite feat with a guaranteed place
at the table of history, no less impressive than building
the atomic bomb or landing Neil Armstrong and company on the
moon: the sequencing of the human genome. Solved to over 93
percent completion and 99 percent accuracy last year by public
and private efforts, the genome was published in the two major
scientific journals last week by both groups in watershed
papers.
Gaps and errors notwithstanding, the sequence shows the
correct order and location of all 3.12 billion plus base pairs
of DNA along the 23 chromosome pairs which comprise the totality
of human genetic information.
Yet at the moment the genome remains deeply shrouded.
Sequencing the genome is just the beginning of the story.
We must now tackle the no less impressive undertaking of figuring
out what the genome means. Where are the genes and what are
they doing? How is the expression of these genes regulated?
How do our genes compare with similar genes in other organisms?
How do different gene products interact with each other? How
do slight changes in them lead to heritable traits and diseases?
Can we make pharmaceuticals to target the genes or their products,
turning them on or off to treat illnesses?
These questions, says Professor Ruben Abagyan of The Scripps
Research Institute (TSRI) Department of Molecular Biology,
are going to launch a thousand inquiries in a thousand labs
across the country. He compares the coming flurry to the 1849
California gold rush.
For this reason, Abagyan and his colleagues at TSRI are
among those who are preparing to take on the important task
of finding ways to annotate the genome and find hidden treasures
in it, i.e. to mine the genome.
Annotation is everything from identifying the genes within
the sequence to finding their function, functional families,
structures, interacting proteins, and ligands. Annotation
is the interpretation of the information, the meaning of the
words, the knowledge rather than the data.
Finding new ways of organizing this information is necessary
because as scientists discover more and more about all the
parts of the genome, the amount of information explodes and
fragments.
Scientists submit their data to many separate databases,
each with its own specialty: genomic information from different
species, genes, proteins and protein families, expression
levels and tissue distributions, individual sequence differences
(SNPs) and associated phenotypes, small biologically active
molecules, and, finally, three-dimensional structures of biological
polymers and their complexes.
A Whole Less than the Sum of its Parts
How can computational biologists create the systems to access
this information in a meaningful way?
"The first step," says Abagyan, "is creating an environment
where you can actually browse all the information: query,
extract, and analyze as you wish."
This is no easy task. Connecting each gene to other databases
via ordinary hypertext links, as one might imagine doing at
first, would not be a realistic way to organize the information.
An annotated gene would point to several different items in
several different databases, each with its own arbitrary
format, naming conventions, and constructions.
For instance, a single human Ig domain in a single gene
would point to hundreds of other Ig domains in other human
genes and thousands in other organisms. The domain may also
be linked to similar domains in a domain database, and may have
links to thousands of protein domains in a structure database.
The gene itself may be linked to countless other genes through
similarities in sequence, structure, function, family, chromosome,
or organism.
Perhaps one could surf the genomic web in this form, but
who would want to? Each gene would be endlessly linked to
a plurality of self-referential sites and each new organism
would add another order of complexity to the tangled mess.
"After three genes, you'll be exhausted," says Abagyan.
One could also simply look within one database or another
for genomic information. But this would deny the promise of
discovery that is the human genome sequence. "If you just
take a piece of the genome in isolation, it's not interesting,
basically," says Abagyan. "These [databases] have all the
little bits and pieces that we have to put together to make
the genome alive. Otherwise, it's just a sequence."
Abagyan and his TSRI colleagues are creating an environment
in which all the information can be sorted and the redundancies
removed. The individual databases must be able to be consumed,
combined, digested, and displayed in a standardized, relational
form. "Right now it's a complete mess," he says. "The
entire deck is shuffled."
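To make the idea of a standardized, relational form concrete, here is a minimal sketch in Python. It is not the system Abagyan's group is building. The two sources, their field names, and the gene and structure identifiers are all hypothetical, invented only to show records with different naming conventions being mapped onto one schema, merged per gene, and stripped of duplicates.

```python
from collections import defaultdict

# Hypothetical records from two databases that describe genes with
# different field names and capitalization (all values are made up).
SOURCE_A = [
    {"gene": "GENE1", "family": "Ig", "organism": "human"},
    {"gene": "GENE2", "family": "Ig", "organism": "human"},
]
SOURCE_B = [
    {"GeneName": "gene1", "Structure": "struct-001"},
    {"GeneName": "gene1", "Structure": "struct-001"},  # redundant entry
    {"GeneName": "gene3", "Structure": "struct-002"},
]

# Assumed mapping from each source's field names to one shared schema.
FIELD_MAPS = {
    "source_a": {"gene": "gene_id", "family": "family", "organism": "organism"},
    "source_b": {"GeneName": "gene_id", "Structure": "structure"},
}


def normalize(record, field_map):
    """Rename a record's fields to the shared schema and upper-case the gene id."""
    row = {field_map[key]: value for key, value in record.items() if key in field_map}
    row["gene_id"] = row["gene_id"].upper()
    return row


def merge_sources(sources):
    """Combine all sources into one entry per gene, dropping duplicate values."""
    merged = defaultdict(lambda: defaultdict(set))
    for name, records in sources.items():
        for record in records:
            row = normalize(record, FIELD_MAPS[name])
            gene_id = row.pop("gene_id")
            for field, value in row.items():
                merged[gene_id][field].add(value)  # sets discard repeats
    return {
        gene_id: {field: sorted(values) for field, values in fields.items()}
        for gene_id, fields in merged.items()
    }


catalog = merge_sources({"source_a": SOURCE_A, "source_b": SOURCE_B})
for gene_id in sorted(catalog):
    print(gene_id, catalog[gene_id])
```

In this toy case, five source records collapse into three gene entries. The real problem adds many more databases and far messier conventions, but the principle is the same.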
The ultimate goal will be to produce a functional map of
the human genome, where all the genes are identified and understood.
The map would take the form of a protein catalog of clustered
genes, represented by a hierarchical set of folders that use a
standardized set of annotations and conventions, and it would
contain tenfold less information than all its constituents.
Abagyan believes that such a map is not that far off, a
few years, perhaps. "It feels like it's within reach,"
he says.
Like many others in bioinformatics,
the field that stands between molecular biology and computer
science, Abagyan has his roots in traditional computational
biology: homology modeling, molecular modeling, and docking.
He wants to extend these techniques to work on the human genome.