
Protein Records Section

Introduction

The Protein Records section of a Molecule Page lists all the links to protein sequence database records that are related or may be related to this particular Molecule Page. The links are broken into distinct categories.

Protein Record Categories

Is Defined By

This database record was chosen by the Editorial Staff as the canonical sequence for this particular Molecule Page ID, meaning that it defines this Molecule Page.

This Molecule's Sequence Is Identical To

The sequence in this database record is identical to that of the record used to define this Molecule Page. For a sequence to be in the "is identical to" category, it must be in the same species group as that of the defining database record, must have the same amino acid sequence as that of the defining sequence (allowing for a leading methionine), and there must be references that relate the record to the defining database record.

This Molecule's Sequence Is A Variant Of

The sequence in this database record is a variant of the record used to define this Molecule Page. For a sequence to be in the "is variant of" category, it must be in the same species group as that of the defining database record, must not be "is identical to", and must have a reference relating it to the "is defined by" record.

This Molecule's Sequence Is Related To

If a database record is referenced by one of the database records in the "is defined by", "is identical to", or "is variant of" categories, but the reference is not entirely trusted, that record is put in the "is related to" category.

This Molecule's Sequence Is Possibly Related To

These records are related to any of the above record types by sequence identity, but not by any database reference. They may also be records that are related by database reference to another "is possibly related to" record.
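
The category rules above can be read as a small decision procedure. The following Python sketch is only an illustration of that reading, not the actual implementation: the `Evidence` class, its field names, and the `categorize` function are hypothetical, and the fallback for a trusted reference to a record outside the species group is an assumption, since the text does not cover that case. The secondary rule for records referenced by another "is possibly related to" record is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """Hypothetical summary of how a record relates to the defining record."""
    same_species_group: bool   # same species group as the defining record
    same_sequence: bool        # same amino acids, allowing a leading methionine
    referenced: bool           # linked to the record by a database reference
    reference_trusted: bool    # that reference is considered fully trusted
    sequence_match: bool       # matches a listed record by sequence identity


def categorize(e: Evidence) -> Optional[str]:
    """Return the category name for a record, or None if it is unrelated."""
    if e.referenced and e.reference_trusted and e.same_species_group:
        return "is identical to" if e.same_sequence else "is variant of"
    if e.referenced:
        # Assumption: any other referenced record falls back to "is related to".
        return "is related to"
    if e.sequence_match:
        # Linked by sequence identity only, with no database reference.
        return "is possibly related to"
    return None
```

For example, a record with a trusted reference, the same species group, and a different sequence would be reported as "is variant of" under these assumptions.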

Protein Record Fields

Database

The name of the database that is the source of the protein record. Most of these are self-explanatory. Entries labeled as UniProt (SwissProt and TrEMBL) variants are the variant sequences (splice variants, variants, or conflicts) noted in a UniProt record relative to the primary sequence in that record.

GI

The GI number of the protein database record at NCBI. SwissProt records have GI numbers (as they are part of the NCBI Entrez system), but TrEMBL records do not, nor do the UniProt variants. The GI numbers are hyperlinked to NCBI's Entrez browser.

NCBI often adds a methionine to the front of their version of a SwissProt sequence. When this occurs, the record is labeled, and both versions (leading methionine and no leading methionine) are listed. For the purposes of sequence identity, a sequence with a leading methionine is considered identical to a sequence lacking the leading methionine.
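
The leading-methionine rule can be expressed as a short comparison. This is a minimal sketch of the rule as described, with a hypothetical function name; it assumes sequences are plain one-letter amino acid strings and that at most one leading methionine may differ.

```python
def same_sequence(a: str, b: str) -> bool:
    """Compare two protein sequences, ignoring one extra leading methionine (M)."""
    a, b = a.upper(), b.upper()
    # Identical as-is, or identical once one sequence gains a leading "M".
    return a == b or a == "M" + b or b == "M" + a
```

With this check, `same_sequence("MKTAY", "KTAY")` is true, while two sequences that differ at any other position are not considered identical.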

Accession

This is an identifier of the database record. When database records have multiple accessions, this field is the primary accession (i.e. the first one in the accession list). For UniProt sequences, the accession is hyperlinked to PIR's UniProt server. For Ensembl records, the accession is hyperlinked to the Protein Report page.

Entry

For UniProt records, the entry field is the unique identifier for the record, though it is not as stable an identifier as the accession. The entry field is hyperlinked to EBI's UniProt server. For Ensembl records, the Entry field contains the Ensembl Gene ID related to this record, and is hyperlinked to the Ensembl Gene Report page.

UniProt variants use an alternate form of the entry name, which enumerates the splice variant, variant, and conflicts given in the database record. A value of 0 means it is the standard form. For example, KAPA_MOUSE-1-0-0 refers to a splice variant, KAPA_MOUSE-0-0-1 refers to a sequence conflict, and KAPA_MOUSE-1-0-1 is a combination of the splice variant and the conflict.
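
The variant entry names can be split into their parts with a few lines of Python. This sketch assumes the three trailing numbers always appear in the order splice variant, variant, conflict, as the examples above suggest; the `VariantEntry` type and `parse_variant_entry` function are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class VariantEntry(NamedTuple):
    entry: str      # standard entry name, e.g. KAPA_MOUSE
    splice: int     # splice variant number (0 = standard form)
    variant: int    # variant number (0 = standard form)
    conflict: int   # sequence conflict number (0 = standard form)


def parse_variant_entry(name: str) -> VariantEntry:
    """Split a name such as KAPA_MOUSE-1-0-1 into its four parts."""
    # Split from the right so hyphens inside the entry name are left alone.
    entry, splice, variant, conflict = name.rsplit("-", 3)
    return VariantEntry(entry, int(splice), int(variant), int(conflict))
```

A name that does not end in three hyphen-separated integers raises a `ValueError`, which seems a reasonable signal that the name is not in the variant form.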

Name

For sequences with a GI number, this is the title used in the FASTA header of NCBI's non-redundant protein FASTA file. For UniProt sequences, it is a combination of the description (DE) and organism (OS) lines.

Length

The sequence length in number of amino acids.

Update

The last date the sequence database record was updated by the author in the given database (GenBank, SwissProt, etc.).

Species

The unique species name as given in the NCBI Taxonomy database. This field is hyperlinked to the NCBI Taxonomy browser, where one can see the taxonomic hierarchy.

How Protein Records Are Obtained

Molecule Pages are defined by a unique protein sequence, generally one that is associated with a unique database record to which NCBI has assigned a GI number. A combination of Entrez Gene (and MGD for mouse sequences), UniProt references, sequence, and taxonomy information is used to find related protein records.

The first step in finding related protein database records is to relate the defining protein database accession or GI number to an Entrez Gene record, if possible. The Entrez Gene record is used to generate a list of known related records. UniProt references to EMBL translations are used as an additional mechanism to link the UniProt record to Entrez Gene. For mouse records, Mouse Genome Database (MGD) references in the UniProt records can also be used to link UniProt to Entrez Gene, because there is roughly a one-to-one correspondence between Entrez Gene mouse records and MGD records.

Taxonomy grouping is somewhat more complex than the use of a specific species (i.e. a specific NCBI Taxonomy ID). Because Entrez Gene refers to a few species variants in some of its mouse and rat records (for example, Mus pahari, Rattus rattus, etc.), we chose to treat everything in the Mus genus as part of the mouse sequence record group, and everything in the Rattus genus as part of the rat sequence record group.
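
The genus-based grouping can be sketched as a lookup on the first word of the species name. This is an illustrative sketch only: the function name and the group labels "mouse" and "rat" are hypothetical, the rat genus is taken to be Rattus as in NCBI Taxonomy, and the real grouping presumably works from NCBI Taxonomy IDs rather than from name strings.

```python
# Assumed mapping from genus name to species group label.
GENUS_GROUPS = {"Mus": "mouse", "Rattus": "rat"}


def species_group(species_name: str) -> str:
    """Map a species name to its group; other species form their own group."""
    parts = species_name.split()
    genus = parts[0] if parts else ""
    return GENUS_GROUPS.get(genus, species_name)
```

Under this sketch, Mus musculus and Mus pahari both fall into the mouse group, while a species such as Homo sapiens is grouped only with itself.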

Since we also have UniProt variants in our non-redundant protein database, we include all of them in the list of protein database records. This adds to the information known about all the related protein records (for example, knowing that a UniProt variant has the same sequence as a RefSeq sequence). UniProt references are used to find PIR records that are related to the gene.

After the list of related records has been compiled, a second pass of the analysis is made using sequence and species identity. All records with the same sequence and species as any of the related records are found, and any that did not previously show up as related are considered possibly related. All those records are then compared to Entrez Gene and UniProt in the same fashion as before.
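
The second pass can be illustrated with a small lookup keyed on sequence and species. This is a minimal sketch under stated assumptions: records are represented as dictionaries with hypothetical `id`, `sequence`, and `species` keys, sequences are compared as exact strings, and the follow-up comparison against Entrez Gene and UniProt is not shown.

```python
def find_possibly_related(records, related_ids):
    """Return IDs of records that share sequence and species with a related record."""
    # Collect the (sequence, species) pairs of every record already known to be related.
    related_keys = {
        (r["sequence"], r["species"]) for r in records if r["id"] in related_ids
    }
    # Any other record with a matching pair is only "possibly related".
    return [
        r["id"]
        for r in records
        if r["id"] not in related_ids
        and (r["sequence"], r["species"]) in related_keys
    ]


# Made-up sample data for illustration only.
records = [
    {"id": "A", "sequence": "MKTAY", "species": "Mus musculus"},
    {"id": "B", "sequence": "MKTAY", "species": "Mus musculus"},
    {"id": "C", "sequence": "MKTAY", "species": "Homo sapiens"},
    {"id": "D", "sequence": "MKTAV", "species": "Mus musculus"},
]
```

Here, if only record A is known to be related, record B is picked up in the second pass because it shares both sequence and species with A, while C (different species) and D (different sequence) are not.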