Terms Commonly Used in Genomics Research


accession number: an alphanumerical code which identifies a DNA sequence in a database.

algorithm: a procedure embedded in a computer program.

alignment: the process of comparing two or more DNA sequences to assess their degree of identity.

alternative splicing: the mechanism by which different combinations of exons are joined, and different introns (intervening sequences found within a gene) are removed, when the primary RNA transcript is processed. This results in the formation of variant mRNAs from a single gene.

amino acid: a simple class of organic compounds, 20 of which are used as the building blocks of proteins. Each amino acid bears both a carboxy (COOH) and an amino (NH2) group. The codon sequence determines the sequence of amino acids in the gene product. Four of the 20 biologically significant amino acids are alanine, glycine, arginine and leucine.


base/base pair: the four nitrogenous bases found in the nucleotides of DNA: adenine (abbreviated as A), guanine (G), cytosine (C), and thymine (T). In the DNA molecule, the bases are arranged along two long chains, and each base on one chain pairs with a complementary base on the other. This double-stranded chain is itself twisted into a double helix. The complementarity between the strands is brought about by the interaction between A and T, and between G and C. Since the identity of a base on one strand can be used to infer the identity of the corresponding base on the other strand, the terms “base” and “base pair” are often used interchangeably. The number of bases (or base pairs) is used as a measurement of the size of a genome. For example, the length of the human genome is approximately 3 billion base pairs (abbreviated bp).
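
The pairing rule above (A with T, G with C) can be sketched in a few lines of Python. This is a minimal illustration, not part of the original glossary; the function names are made up for the example, and the reversal step relies on the convention that the two strands are read in opposite directions.

```python
# Watson-Crick pairing as described in the entry: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(seq: str) -> str:
    """Return the base that pairs with each base of seq, position by position."""
    return "".join(COMPLEMENT[base] for base in seq.upper())


def reverse_complement(seq: str) -> str:
    """Return the opposite strand written in its own reading direction."""
    return complement(seq)[::-1]


if __name__ == "__main__":
    print(complement("GCATATTGCT"))
    print(reverse_complement("GCATATTGCT"))
```

Because one strand fully determines the other, only one of the two needs to be stored, which is why sequence lengths in bases and in base pairs are quoted interchangeably.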

BAC: an abbreviation for “bacterial artificial chromosome”. These are vectors designed to carry large pieces of inserted DNA. They can be propagated in E. coli, and so are used for cloning and other molecular biology purposes.

bioinformatics: a research discipline combining computer science, biology, and information technology, targeting the storage, management and analysis of large amounts of biological data.

BLAST: an abbreviation for “basic local alignment search tool”. This is a sequence comparison algorithm much used for DNA alignment. It is available online through NCBI.


cDNA: an abbreviation for “complementary DNA”. This is the product of in vitro reverse transcription of mRNA. cDNA molecules usually lack intron sequences.

chromosome: the structure in the eukaryotic nucleus and in the prokaryotic cell which carries most of the DNA. Prokaryotes have a single chromosome, but in eukaryotes, the diploid number varies from two pairs to hundreds. The variation is particularly notable in the plant kingdom.

codon: a set of 3 nucleotides in a DNA sequence, which encodes a specific amino acid.
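
As a rough sketch of how codons map to amino acids, the following Python builds the standard genetic code as a lookup table and translates a DNA sequence three bases at a time. The function name is illustrative, stop codons are shown as "*", and the code assumes the sequence starts in the correct reading frame; organisms and organelles with variant codes are not handled.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
# Each letter is a one-letter amino acid symbol; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA sequence codon by codon, ignoring a trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
```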

comparative genomics: an approach which sets out to compare the sequences of two or more related organisms. It is frequently used as a means of identifying gene functions and for forming evolutionary hypotheses.

computational biology: the analysis and interpretation of biological data.

C0T analysis: a method to distinguish between highly repetitive and low copy DNA sequences, which uses the principle of DNA renaturation kinetics, in which the rate at which a particular single-stranded sequence returns to the double-stranded state depends on the number of times it is found in the genome. In particular, the method is used to enrich a preparation of genomic DNA for low copy sequences (more likely to be genes).


database: a collection of data. See also relational database.

DNA: an abbreviation for “deoxyribonucleic acid”, the carrier molecule of genetic information. The chain of nucleotides is held together on a polymer backbone formed by a sugar (deoxyribose) and a phosphate group (see also base).

DNA chip: see microarray.

DNA fingerprinting: the creation of a unique genetic profile of an individual based on its DNA.

DNA sequence: the sequence of bases forming the DNA molecule. It is written as a string of four letters, each of which represents one of the four bases, for example GCATATTGCT.


EST: an abbreviation for “expressed sequence tag”. These represent fragments of gene sequences, and are obtained by single-pass sequencing of cDNA. They have been heavily used for gene discovery, particularly in organisms that have not yet been sequenced, and also as a source of sequence to design genic molecular markers.

exon: the part of a DNA sequence which encodes a protein (usually in conjunction with other exons).


FastA: the first widely used search algorithm for database similarity searching; now sometimes used simply to denote the file format in which sequences are commonly expressed.
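
For readers unfamiliar with the file format, a FASTA record is a header line beginning with ">" followed by one or more lines of sequence. The sketch below is a minimal reader for that layout, written for illustration only; the function name is made up, only the first word of each header is kept as the record name, and real files may need more careful handling.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA-formatted text into a {record name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first word after ">" as the record name.
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = []
        elif name is not None:
            # Sequence line: a sequence may be wrapped over several lines.
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


if __name__ == "__main__":
    example = ">seq1 example record\nGCAT\nATTG\n>seq2\nCT\n"
    print(parse_fasta(example))
```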

functional genomics: the study of the structure, organization and function of a genome during developmental and other life processes of an organism.


gap: a space introduced into a DNA alignment to compensate for insertions and deletions in one sequence relative to another.
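
To make the role of gaps concrete, here is a small sketch that scores two already-aligned sequences, with "-" standing for a gap. It is an illustration only: the function name is made up, and the choice to count gap columns in the denominator is an assumption, since tools differ on how percent identity is defined.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of alignment columns in which both sequences carry the same base."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = len(aligned_a)
    if columns == 0:
        return 0.0
    # A column counts as a match only if the symbols agree and are not gaps.
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * matches / columns


if __name__ == "__main__":
    # The second sequence has one deletion (the gap) and one substitution.
    print(percent_identity("GCATATTGCT", "GCAT-TTGCA"))
```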

GenBank: the most frequently accessed public domain database for DNA sequence data and related information. Managed by NCBI, supported by the National Library of Medicine and NIH, available at http://www.ncbi.nlm.nih.gov .

gene: the unit of heredity, transmitted from generation to generation during reproduction. Each gene consists of a sequence of nucleotides, occupying a specific position along a chromosome. Most genes encode a specific functional product.

gene expression: the process in which a gene is actively transcribed or "turned on".

gene family: a group of closely related sequences which probably encode functionally similar products.

genetic engineering: the technique of cloning a gene from one organism, and then adding it to another. Also refers to methods for altering gene expression, without necessarily introducing genes from another species. The rationale is most commonly to introduce or enhance a trait, to the benefit of the recipient, the producer, the environment or the consumer.

genome: the entire genetic content of an organism. Genome size varies widely among organisms.

genotype: the genetic constitution of an organism. See also phenotype.

GMO: an abbreviation for “genetically modified organism”. Although technically this could refer to genetic modification through conventional breeding and selection, typically the term specifically is applied to organisms modified by genetic engineering. Also called transgenics.


haplotype: the specific allelic constitution within a sequence which is always inherited as a unit. For example, within a 1,000bp sequence, there may be four bases which vary in a population (the other 996 being identical for every member of the population). The haplotype of each individual is defined by the combination of the four variable bases present in the target sequence.
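
The example in the haplotype entry can be sketched in Python: find the positions that vary across a set of equal-length sequences, then describe each individual by the bases it carries at those positions. The sequences and function names below are invented for illustration, and the sketch assumes the sequences are already aligned and free of gaps.

```python
def variable_positions(sequences: list) -> list:
    """Return the 0-based positions at which the sequences are not all identical."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    return [
        i for i in range(length)
        if len({seq[i] for seq in sequences}) > 1
    ]


def haplotypes(sequences: list) -> list:
    """Summarize each sequence by its bases at the variable positions only."""
    sites = variable_positions(sequences)
    return ["".join(seq[i] for i in sites) for seq in sequences]


if __name__ == "__main__":
    individuals = ["GCATATTGCT", "GCTTATTGCA", "GCATACTGCT"]
    print(variable_positions(individuals))
    print(haplotypes(individuals))
```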

heuristic: a procedure which derives an approximate solution in a more economical or faster way than can the more mathematically "strict" algorithm. In computer science, heuristics are applied when an exact solution is computationally impractical.

homology: the degree of identity between two DNA or amino acid sequences. Originally homology referred to the degree of identity between two individuals, which followed from their having a common evolutionary origin.


imprinting: the phenomenon whereby a gene is expressed differently in an offspring depending on whether it was inherited from its father or its mother.

intron: a DNA sequence within a gene which interrupts the exons. Introns are transcribed, but are removed from the RNA by splicing and so are not usually translated.


junk DNA: describes non-coding DNA, although much of it probably has a function, such as to stabilize the structure of the genome or to control gene expression.


library: a set of DNA sequences or clones.


mapping: the process of identifying the location of a gene or DNA segment along a chromosome. In genetic mapping, this is done by analyzing patterns of inheritance in segregating populations (measured in recombinational units, commonly centiMorgans). In physical mapping, this describes the actual location of a sequence in a particular genomic region (measured in bp).
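
As a worked illustration of recombinational units, genetic distance between two loci is commonly estimated from the fraction of recombinant offspring, with 1 centiMorgan corresponding to 1% recombination. The sketch below uses that direct conversion, which is a reasonable approximation only over short distances; the function name and the counts are hypothetical.

```python
def map_distance_cm(recombinants: int, total_offspring: int) -> float:
    """Estimate genetic distance in centiMorgans from offspring counts.

    Uses the simple rule 1 cM = 1% recombinant offspring, which
    underestimates longer distances because double crossovers go unseen.
    """
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    if not 0 <= recombinants <= total_offspring:
        raise ValueError("recombinants must lie between 0 and total_offspring")
    return 100.0 * recombinants / total_offspring


if __name__ == "__main__":
    # Hypothetical cross: 12 recombinant individuals among 200 scored.
    print(map_distance_cm(12, 200))
```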

metabolomics: the study of the global small molecule metabolite output of a specific cellular process or set of processes.

microarray (or DNA chip, gene chip): a device in which a minute amount of each of many thousands of genic and/or other DNA sequences is immobilized on a glass or plastic support. When hybridized with a preparation of labeled cDNA, they are used to simultaneously measure the expression levels of all the sequences present on the chip.

minimal tiling path: the smallest number of overlapping clones (usually BACs) needed to generate a larger sequence. Overlaps are defined by the ability of two clones to hybridize successfully with one another.
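
One way to picture a minimal tiling path is as a greedy selection over clone positions. The sketch below assumes, purely for illustration, that each clone already has known start and end coordinates on the target region, whereas in practice overlaps are inferred experimentally as described in the entry. The clone names, coordinates and function name are hypothetical.

```python
def minimal_tiling_path(clones: dict, region_start: int, region_end: int) -> list:
    """Pick the fewest clones that together cover [region_start, region_end).

    clones maps a clone name to its (start, end) coordinates. At each step the
    clone that starts within the covered part and reaches furthest is chosen.
    """
    ordered = sorted(clones.items(), key=lambda item: item[1][0])
    path = []
    covered = region_start
    i = 0
    while covered < region_end:
        best_name, best_end = None, covered
        # Consider every clone that begins at or before the covered position.
        while i < len(ordered) and ordered[i][1][0] <= covered:
            name, (_, end) = ordered[i]
            if end > best_end:
                best_name, best_end = name, end
            i += 1
        if best_name is None:
            raise ValueError(f"no clone extends coverage past position {covered}")
        path.append(best_name)
        covered = best_end
    return path


if __name__ == "__main__":
    clones = {
        "A": (0, 100), "B": (50, 120), "C": (90, 200),
        "D": (150, 260), "E": (190, 300),
    }
    print(minimal_tiling_path(clones, 0, 300))
```

This version accepts clones that merely abut; requiring a minimum overlap between neighbours would be a small change to the comparison in the inner loop.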

molecular marker: a gene or DNA fragment with a known location on a chromosome. (For a good tutorial on the uses of markers, see the downloadable training materials available from the International Plant Genetic Resources Institute, http://www.ipgri.cgiar.org/ .)

mutation: an abrupt change in the genotype of an organism which is not the result of recombination.


NCBI: abbreviation for “National Center for Biotechnology Information”, the organization which manages GenBank, PubMed (a database of publications), and other databases (available at http://www.ncbi.nlm.nih.gov ).

nucleic acid: see base/base pair and DNA.

nucleotide: the unit of DNA, consisting of one base, one phosphate molecule, and the sugar deoxyribose. See also base/base pair.


ortholog: a copy of a gene present in more than one related species. Orthologs are assumed to have derived from a common ancestral gene at the time of the last common ancestor.


paralog: a copy of a gene present in the same species. Paralogs arose from gene duplication.

PCR: abbreviation for “polymerase chain reaction”, the process by which a defined fragment of DNA is replicated in vitro in a so-called thermocycler or PCR machine. These devices are designed to control the temperature and the time over which a particular temperature is held.

phenotype: the visible appearance of an individual (with respect to one or several traits). The phenotype reflects the combined action of the genotype and the environment where the individual exists.

phylogenetics: the field of biology which attempts to identify and understand relationships between the various life forms.

phylogenomics: a method of assigning a function to a gene based on its evolutionary history in a phylogenetic tree; phylogenomics uses information related to the evolution of a gene to improve the prediction of gene function.

polyploidy: a state in which multiple copies of a complete genome are present. Polyploidy is rare in animals, but common in plants. In animals (and also plants) some tissues within a diploid organism can be polyploid. The polyploid series is haploid (1 copy), diploid (2 copies), triploid (3 copies), tetraploid (4 copies), pentaploid (5 copies), hexaploid (6 copies) etc.

promoter: the part of a gene which is used to control the gene's expression.

protein: a large molecule composed of amino acids. Proteins are involved in many cellular structures, and are key to the catalysis of most reactions within the living cell.

proteome: the set of all proteins in a cell. Unlike the relatively static genome, the dynamic proteome changes from minute to minute in response to many intra- and extracellular environmental signals.

proteomics: the large-scale analysis of an organism's proteins to reveal expression and functions.


recombination: the formation among the offspring of a mating of genetic combinations not present in either parent, achieved via the physical exchange of genetic material during meiosis.

regulatory DNA: DNA which controls the activity of genes. These DNA sequences tend to be short and are usually located close to the genes they control.

relational database: a database which cross-references the different types of data it contains, and allows queries of any type (a sequence, the sequence name, etc.) to retrieve data.

RNA: an abbreviation for “ribonucleic acid”, the molecule responsible for translating DNA into proteins. Made up of a single chain of nucleotides (the same bases as in DNA, except that uracil replaces thymine). There are three main types of RNA: messenger RNA, transfer RNA, and ribosomal RNA.

RNA interference (RNAi): a natural process used by the cell to “turn off,” or silence, a particular gene or gene family. Scientists can now use a transgenic approach which mimics this process, and therefore can manipulate gene expression. In research it is currently being heavily used to identify the function of various genes, by studying the phenotypic effect of turning these genes off.


sequencing: determining the order (sequence) of bases in DNA, or amino acids in a protein.

SNP: an abbreviation for “single nucleotide polymorphism”, pronounced "snip". A SNP which distinguishes two sequences can be used as a genetic marker.

structural genomics: an approach to identifying the 3-D structure of proteins, which will help identify their functions and provide targets for drug design.

synteny: the occurrence of two or more orthologs on the same chromosome in different species, without regard to gene order. Increasingly used to include conservation of gene order as well, although this is better described by the term “collinearity”.


transgenic: an organism containing genetic material from another organism transferred by genetic engineering. See also GMO.

transcription: the process in which RNA is formed from DNA.
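
At the level of sequence text, transcription can be illustrated very simply: the RNA carries the same bases as the DNA except that uracil (U) replaces thymine (T). The sketch below assumes the input is the coding (sense) strand, so that the RNA reads the same as the DNA apart from that substitution; the function name is illustrative.

```python
def transcribe(coding_strand: str) -> str:
    """Return the RNA sequence corresponding to a DNA coding strand."""
    # RNA uses uracil (U) wherever the DNA coding strand has thymine (T).
    return coding_strand.upper().replace("T", "U")


if __name__ == "__main__":
    print(transcribe("GCATATTGCT"))
```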

transcriptome: the parts of the genome which are transcribed.

transcriptomics: a means of depicting the expression level of many genes, typically based on microarray technology.

transformation: the process of adding a gene from one organism into another.

transposon: a genetic element which is able to move from one position to another within the genome.


unigene: a representation of a gene family, used to avoid the appearance of highly redundant sequences in EST libraries.

universal primers: a PCR primer pair which can amplify a set of orthologs.

UTR: an abbreviation for “untranslated region”, that part of a gene sequence which is not translated into a protein.

Main sources and other glossaries

Chemis Interactive Molecular Library: nucleic acids http://www.geneticengineering.org/chemis/Chemis-NucleicAcid/DNA.htm , 2000, Dr Didier Collomb 2/13/02

Friend, S.H. and Stoughton, R.B. (2002, February). The magic of microarrays. Scientific American, pp. 44-53

Hartwell, L.H., Hood, L., Goldberg, M., Reynolds, A.E., Silver, L.M., & Veres, R.C. (2000). Genetics: from genes to genomes. New York: McGraw-Hill Companies, Inc.

Interagency Working Group on Plant Genomes (2000). National Plant Genome Initiative. Washington, D.C.: National Science and Technology Council

Genomics Initiative, a supplement to the Cornell Chronicle. (1999, January). Cornell University

Glossary of Biotechnology for Food and Agriculture. FAO Research and Technology Paper #9.

Human Genome Management Information System (HGMIS) (2001). Genomics and its impact on medicine and society: a primer, [pdf]. HGMIS at Oak Ridge National Laboratory, Oak Ridge, TN, for the U.S. Department of Energy Human Genome Program. Available at http://www.ornl.gov/hgmis

National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/

National Institutes of Health, National Institute of General Medical Sciences (2001) Genetics Basics. NIH Publication No. 01-662. Also available at: http://publications.nigms.nih.gov/genetics/

Genome News Network glossary http://www.genomenewsnetwork.org/

Wikipedia, the free encyclopedia http://en.wikipedia.org/

For reviews of some online glossaries in genomics and biotechnology, see http://www.sciencegenomics.org