Sequence Info Section

Introduction

The Sequence section provides information about the protein sequence on which the Molecule Page is based. It displays some properties of the sequence, along with information from the Entrez Gene and Ensembl gene databases and a list of protein database records (e.g. UniProt, RefSeq, GenBank) related to the Molecule Page. The Sequence section combines the former Protein Records and Gene Info sections with the sequence properties that were previously listed on the Protein Overview page.

Protein Sequence Properties

Accession

The database identifier for the protein sequence record assigned to the Molecule Page. A hyperlink to the record at NCBI is provided, as well as a hyperlink to the sequence in FASTA text format.

Sequence Length

Number of nucleotide base pairs (or amino acid residues) in the sequence record.

Molecular Weight

The computed molecular weight of the amino acid sequence, in kilodaltons (kDa).

Isoelectric point

The computed isoelectric point of the amino acid sequence. The isoelectric point is the pH at which the net charge on the molecule is zero.

Extinction Coefficient

The computed extinction coefficient of the amino acid sequence at 280 nm. The extinction coefficient is a measure of the amount of light absorbed by a one-molar solution in a cuvette with a 1 cm path length, in units of M^-1 cm^-1 (per molar per centimeter). The calculation and assumptions are described at the ExPASy website.
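As a rough illustration of this kind of calculation, the sketch below follows the formula published for the ExPASy ProtParam tool, which sums fixed contributions at 280 nm from tyrosine (1490), tryptophan (5500), and cystine (125). That the Molecule Pages use exactly these values is an assumption, and the function name and the cystines_formed flag are hypothetical.

```python
def extinction_coefficient_280(sequence: str, cystines_formed: bool = True) -> int:
    """Estimate the molar extinction coefficient at 280 nm, in M^-1 cm^-1.

    Assumes the ProtParam-style formula; not necessarily the exact
    method used to compute the value shown on the Molecule Page.
    """
    seq = sequence.upper()
    n_tyr = seq.count("Y")
    n_trp = seq.count("W")
    # Each cystine (disulfide bond) is formed from two cysteine residues.
    n_cystine = seq.count("C") // 2 if cystines_formed else 0
    return n_tyr * 1490 + n_trp * 5500 + n_cystine * 125
```

Setting cystines_formed to False gives the estimate for a fully reduced protein, in which no cysteines are paired.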

Absorption Coefficient

The computed absorption coefficient of the amino acid sequence. The absorption coefficient, which is also known as optical density, is a measure of the amount of radiant energy, incident normal to a planar surface, that is absorbed per unit distance and unit mass of a substance. The absorption coefficient is given by the ratio of the extinction coefficient to the molecular weight.
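The ratio described above can be sketched as follows. The sketch assumes the extinction coefficient is expressed in M^-1 cm^-1 and the molecular weight in kilodaltons, in which case the result corresponds to the absorbance of a 1 g/L solution over a 1 cm path length. The function and parameter names are hypothetical.

```python
def absorption_coefficient(extinction_coeff: float, molecular_weight_kda: float) -> float:
    """Ratio of extinction coefficient to molecular weight.

    extinction_coeff is assumed to be in M^-1 cm^-1 and
    molecular_weight_kda in kilodaltons (both are assumptions
    about the units shown on the page).
    """
    if molecular_weight_kda <= 0:
        raise ValueError("molecular weight must be positive")
    # Convert kilodaltons to daltons (g/mol) before dividing.
    molecular_weight_da = molecular_weight_kda * 1000.0
    return extinction_coeff / molecular_weight_da
```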

Aliphatic Index

The computed aliphatic index of the amino acid sequence. The aliphatic index is defined as the relative volume of a protein occupied by aliphatic side chains and is dimensionless. The calculation and assumptions are described at the ExPASy website.
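For reference, the formula documented for ExPASy ProtParam weights the mole percentages of alanine, valine, isoleucine, and leucine. The sketch below implements that formula; that the Molecule Pages use the same coefficients (2.9 and 3.9) is an assumption, and the function name is hypothetical.

```python
def aliphatic_index(sequence: str) -> float:
    """Aliphatic index from the mole percent of Ala, Val, Ile and Leu.

    Uses the ProtParam-style formula:
    X(Ala) + 2.9 * X(Val) + 3.9 * (X(Ile) + X(Leu))
    """
    seq = sequence.upper()
    length = len(seq)
    if length == 0:
        raise ValueError("sequence must not be empty")

    def mole_percent(residue: str) -> float:
        # Percentage of residues in the sequence that match this one-letter code.
        return 100.0 * seq.count(residue) / length

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))
```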

Entrez Gene Information

Official Name

The primary or official name assigned to the Entrez Gene record, usually assigned by Mouse Genome Informatics (MGI).

Official Symbol

The primary or official symbol assigned to the Entrez Gene record, usually assigned by Mouse Genome Informatics (MGI).

Entrez Gene ID

The Entrez Gene ID is the unique identifier of the NCBI Entrez Gene record. This used to be the LocusLink ID.

Gene Type

The type of Entrez Gene record, as defined by NCBI. This describes the category that the locus belongs to, for example a gene that encodes a protein product of known function.

Chromosome

The chromosome(s) to which the particular Entrez Gene record has been mapped.

Location

The genetic position of the Entrez Gene mapping.

Alternate Names

Other names used in the literature to describe the gene.

Alternate Symbols

Other symbols used in the literature to describe the gene.

Primary Source

The source database record for the Entrez Gene record. The source databases include MGI (Mouse Genome Informatics) and HGNC (HUGO Gene Nomenclature Committee, for human genes).

Related Gene Ontology

Entrez Gene references to GO (Gene Ontology) categories are displayed here.

Related Database Records

Gene-related database records that are linked to the Entrez Gene record, such as UniGene, Ensembl, OMIM, the Human Protein Reference Database (HPRD), and the Rat Genome Database (RGD).

Ensembl Gene Information

Gene Report

This is the Ensembl Gene ID, which is hyperlinked to the Gene Report Page at Ensembl.

Gene View

This is the genomic location of the Ensembl gene, which is hyperlinked to the Ensembl Contig Viewer.

Chromosome

The chromosome(s) to which the particular Ensembl record has been mapped.

References Molecule Page Protein

If the protein sequence database record assigned to this Molecule Page ID, or any of its identical sequence records or known variants, is referenced by this Ensembl record, the value of this field is "yes". If this Ensembl record was found by sequence identity or gene reference only, the value of this field is "no".

Protein Database Records

Type

A type of "Reference" indicates that the database record was chosen by the Editorial Staff as the canonical sequence for this particular Molecule Page ID, meaning that it defines this Molecule Page. A type of "Identical" indicates that the sequence in this database record is identical to that of the Reference sequence (including being of the same species). A type of "Variant" indicates a variant of the Reference sequence (this includes splice variants, SNPs affecting coding, sequencing conflicts, and fragments). A type of "Unknown" indicates a sequence that is related to the Reference sequence (usually by identity), but has no database reference as verification.

Seq Id

A unique identifier of the sequence, for any given Molecule Page. Sequences with the same Seq ID are identical.
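The relationship between sequences and Seq IDs can be pictured with the small sketch below, which gives each distinct sequence on a page its own number so that identical sequences share an ID. The numbering scheme shown (integers in order of first appearance) is an assumption for illustration only, and the function name is hypothetical.

```python
def assign_seq_ids(sequences):
    """Return one ID per input sequence; identical sequences share an ID.

    IDs are assigned as 1, 2, 3, ... in order of first appearance,
    which is an illustrative choice rather than the documented scheme.
    """
    ids_by_sequence = {}
    result = []
    for seq in sequences:
        key = seq.upper()  # compare sequences case-insensitively
        if key not in ids_by_sequence:
            ids_by_sequence[key] = len(ids_by_sequence) + 1
        result.append(ids_by_sequence[key])
    return result
```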

Length

The sequence length in number of amino acids.

Database

The name of the database that is the source of the protein record. Most of these are self-explanatory. Records labeled as UniProt (SwissProt and TrEMBL) variants are variant sequences (splice variants, variants, or conflicts) annotated in a UniProt record, relative to the primary sequence in that record.

Accession

This is an identifier of the database record. When database records have multiple accessions, this field is the primary accession (i.e. the first one in the accession list). For UniProt sequences, the accession is hyperlinked to PIR's UniProt server. For Ensembl records, the accession is hyperlinked to the Protein Report page.

Entry

For UniProt records, the Entry field is the unique identifier for the record, though it is not as stable an identifier as the accession. The Entry field is hyperlinked to EBI's UniProt server. For Ensembl records, the Entry field contains the Ensembl Gene ID related to this record, and is hyperlinked to the Ensembl Gene Report page.

UniProt variants use an alternate form of the entry name, which enumerates the splice variant, variant, and conflicts given in the database record. A value of 0 means it is the standard form. For example, KAPA_MOUSE-1-0-0 refers to a splice variant, KAPA_MOUSE-0-0-1 refers to a sequence conflict, and KAPA_MOUSE-1-0-1 is a combination of the splice variant and the conflict.
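The alternate entry name can be split into its parts as in the sketch below. It assumes the three trailing numbers always appear in the order splice variant, variant, conflict, as the examples above suggest; the class and function names are hypothetical.

```python
from typing import NamedTuple


class VariantEntry(NamedTuple):
    entry: str            # base UniProt entry name, e.g. KAPA_MOUSE
    splice_variant: int   # 0 means the standard form
    variant: int          # 0 means the standard form
    conflict: int         # 0 means the standard form


def parse_variant_entry(name: str) -> VariantEntry:
    """Split a name such as 'KAPA_MOUSE-1-0-1' into its four parts.

    Raises ValueError if the name does not end in three
    hyphen-separated integers.
    """
    base, splice, variant, conflict = name.rsplit("-", 3)
    return VariantEntry(base, int(splice), int(variant), int(conflict))
```

For example, parse_variant_entry("KAPA_MOUSE-1-0-1") yields the base entry KAPA_MOUSE with splice_variant 1, variant 0, and conflict 1.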

GI

The GI number of the protein database record at NCBI. SwissProt records have GI numbers (as they are part of the NCBI Entrez system), but TrEMBL records do not, nor do the UniProt variants. The GI numbers are hyperlinked to NCBI's Entrez browser.

Name

For sequences with a GI number, this is the title used in the FASTA header in NCBI's non-redundant protein FASTA file. For UniProt sequences, it is a combination of the description (DE) and organism (OS) lines.

How Protein Records Are Obtained

Molecule Pages are defined by a unique protein sequence, generally one that is associated with a unique database record to which NCBI has assigned a GI number. A combination of Entrez Gene (and MGD for mouse sequences), UniProt references, sequence, and taxonomy information is used to find related protein records.

The first step in finding related protein database records is to relate the defining protein database accession or GI number to an Entrez Gene record, if possible. The Entrez Gene record is used to generate a list of known related records. UniProt references to EMBL translations are used as an additional mechanism to link the UniProt record to Entrez Gene.

Taxonomy grouping is somewhat more complex than matching a specific species (i.e. a specific NCBI Taxonomy ID). Because Entrez Gene refers to a few species variants in some of its mouse and rat records (for example, Mus pahari and Rattus rattus), we treat everything in the Mus genus as part of the mouse sequence record group, and everything in the Rattus genus as part of the rat sequence record group.
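The genus-level grouping can be sketched as below. This simplified version reads the genus from the first word of the organism's scientific name; a real implementation would more likely use NCBI Taxonomy IDs and lineage, so the name-based lookup, the mapping table, and the function name are all assumptions.

```python
from typing import Optional

# Hypothetical mapping from genus to sequence record group.
GENUS_TO_GROUP = {"Mus": "mouse", "Rattus": "rat"}


def sequence_record_group(organism: str) -> Optional[str]:
    """Return 'mouse' or 'rat' for organisms in the Mus or Rattus genus.

    Returns None for any other organism. The genus is taken to be
    the first word of the scientific name.
    """
    parts = organism.split()
    genus = parts[0].capitalize() if parts else ""
    return GENUS_TO_GROUP.get(genus)
```

Matching on the whole genus name means that Mus pahari falls in the mouse group, while an unrelated genus that merely begins with the same letters, such as Mustela, does not.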

Since we also have UniProt variants in our non-redundant protein database, we include some of them in the list of protein database records. This adds to the information known about all the related protein records (for example, knowing that a UniProt variant has the same sequence as a RefSeq sequence).

After the list of related records has been compiled, a second pass of the analysis is made using sequence and species identity. All records that share both sequence and species with any of the related records are found; any that were not already in the list are considered possibly related. Those records are then compared to Entrez Gene and UniProt in the same fashion as before.
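The identity-based pass can be pictured with the sketch below, which looks for records that match a known related record in both sequence and species group but are not yet in the related list. The record structure, field names, and function name are hypothetical, and the follow-up comparison against Entrez Gene and UniProt is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProteinRecord:
    accession: str
    sequence: str
    species_group: str  # e.g. "mouse", "rat", "human"


def possibly_related(related, all_records):
    """Find records sharing sequence and species with any related record.

    Records already in the related list are excluded, so the result
    contains only newly discovered candidates.
    """
    known_accessions = {r.accession for r in related}
    identity_keys = {(r.sequence, r.species_group) for r in related}
    return [
        r for r in all_records
        if r.accession not in known_accessions
        and (r.sequence, r.species_group) in identity_keys
    ]
```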