Locus Nomenclature (Modified from the original PGSB AGI-codes section)
Designation of unique locus identifiers is performed as part of the genome sequence annotation at TAIR. The following section describes the syntax of chromosome based locus nomenclature and how locus identifiers are assigned. In some cases locus identifiers have been made obsolete. If you have information about a sequenced locus that has not been given a locus identifier, please contact email@example.com.
Guidelines for use of unique gene IDs (modified from PGSB)
- Format of chromosomal based nomenclature
- AT (Arabidopsis thaliana)
- 1,2,3,4,5 (chromosome number) or M for mitochondrial or C for chloroplast.
- G (gene; other letters are possible for repeats, etc.)
- 12300 (five-digit code, numbered from top/north to bottom/south of chromosome)
- Chromosome based locus identifiers are assigned to:
- protein-coding genes
- RNA coding genes (sn, r, tRNAs)
- Chromosome based locus identifiers are not
- The first AGI locus identifier release used locus identifiers ending in zero (e.g. 10010, 10020, 10030, and so on) so that intervening numbers could be used for newly discovered genes.
- Where there are gaps in the sequence, the first release skipped at least 200 codes for each 100 kb of gap.
- In the first release, some genes were present as fragments because they lie across the boundary of two BACs. Each fragment got its own locus identifier if there was no way to represent the whole gene. These gene fragments were merged into a single locus in later releases, and one of the AGI locus identifiers became obsolete.
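The format described above can be checked mechanically. The following is a minimal sketch, assuming identifiers are written exactly as AT + chromosome + type letter + five digits; the names `AGI_PATTERN`, `AgiId` and `parse_agi` are hypothetical and not part of any TAIR tool.

```python
import re
from typing import NamedTuple, Optional

# Pattern assumed from the format above: "AT", then the chromosome
# (1-5, M or C), then a type letter (G for gene), then a five-digit code.
AGI_PATTERN = re.compile(r"AT([1-5MC])([A-Z])(\d{5})", re.IGNORECASE)


class AgiId(NamedTuple):
    chromosome: str  # "1".."5", "M" (mitochondrial) or "C" (chloroplast)
    feature: str     # "G" for gene; other letters are possible
    number: int      # the five-digit code as an integer


def parse_agi(identifier: str) -> Optional[AgiId]:
    """Return the parts of a locus identifier, or None if it does not match."""
    match = AGI_PATTERN.fullmatch(identifier.strip())
    if match is None:
        return None
    chromosome, feature, number = match.groups()
    return AgiId(chromosome.upper(), feature.upper(), int(number))
```

For example, `parse_agi("AT2G18190")` yields chromosome `"2"`, feature `"G"` and number `18190`, while a string with an unknown chromosome such as `"AT6G00010"` yields `None`.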
Adding, deleting, editing, merging and splitting
- Adding new genes
- If there are free ATxGxxxx0 locus identifiers, TAIR assigns those first as in the rules above. If not, TAIR uses the last digit, leaving space as appropriate, i.e. ...5 if the new gene is in the middle or ...8 if it is close to the neighbor with the higher identifier. If there are no free identifiers between the neighboring genes at all, TAIR uses the nearest free identifier. We will do our best not to disturb the sequential numbering of genes along the chromosome, but users should be aware that adjacent loci are often not in sequential order. This may be due to reorientation of BACs, or to genes being added in an interval in which no sequential identifiers remain.
- Deleting genes:
- Deleted genes are kept in the database so they can be retrieved by searching for the identifier, but are marked "obsolete" and do not appear in database displays. Identifiers from deleted genes are not used again.
- Editing genes:
- Consensus in the AGI was that identifiers should be kept constant as long as there are no major changes in the gene model. As long as modifications in the gene model do not lead to a completely new protein (e.g. through use of a different reading frame), the identifier will be kept, even if exon boundaries change or individual exons are added/removed.
- Merging and splitting genes:
- Splitting Genes: When it is determined that a locus identifier actually refers to more than one gene (e.g. two genes were mistakenly predicted to be one gene), one of the genes will retain the original identifier and the second will get a new identifier. The gene that retains the original identifier is the one that contains the majority of the sequence from the original locus.
- Merging Genes: Where experimental evidence indicates that two genes are actually a single locus (e.g. a full-length cDNA is obtained), the two locus entries will be merged into one, and the identifier that corresponds to the locus with the majority of the sequence will be retained. The second locus identifier will be made obsolete (but kept associated with the locus identifier of the merged gene).
- Notes about splits and merges will be kept, as well as the different versions of the locus sequence. Versions are identified by locus identifier, source, and date. For example, AT2G18190 was later split into two entries, AT2G18190 and AT2G18193, with a note indicating that the second entry resulted from a split of AT2G18190. You can search TAIR for the annotation Locus Histories and download lists of locus names that are obsolete or in use.
- What terms in history tracking refer to:
- delete means a gene model has been eliminated
- merge means a gene model has been merged with another gene but retained old name
- mergedelete means a gene model has been merged but its name has not been retained
- insert means a gene model has been inserted from scratch
- split means a gene model has been split but has retained its name
- splitinsert means a gene model has been split and has a new name
- new means a gene model has been generated
- obsoleted means a gene model has disappeared
- The terms new and obsoleted may describe PGSB data when it is unknown if an insert or delete was due to a splitinsert or mergedelete.
- Other Notes:
Generally, the idea is to be as conservative as possible. The identifiers should identify a specific chromosome locus, not a particular protein. Even if an identifier appears in an old publication, it should still direct a user to the current annotation for that locus, so that they can see that the protein sequence has changed in the meantime. This is preferable to assigning a new identifier after modifications, where the user would first have to look up the current annotation for this locus. Keeping backwards-compatible versions of all entries cannot be achieved, and identifiers should not be a way of "versioning" genes.
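As an illustration of how an identifier from an old publication could be followed to the locus currently in use, here is a minimal sketch based on the history-tracking terms listed above. The record layout (identifier, action, related identifier) and the function names are assumptions made for this example; they do not describe the actual format of the TAIR Locus Histories download.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (identifier, action, related identifier).
HistoryRecord = Tuple[str, str, str]


def build_merge_map(records: Iterable[HistoryRecord]) -> Dict[str, str]:
    """Map each identifier obsoleted by a merge to the identifier that was kept."""
    return {old: kept for old, action, kept in records if action == "mergedelete"}


def resolve_current(identifier: str, merged_into: Dict[str, str]) -> str:
    """Follow mergedelete links until reaching an identifier with no further link."""
    seen = set()
    current = identifier
    while current in merged_into:
        if current in seen:
            raise ValueError(f"cyclic merge history at {current}")
        seen.add(current)
        current = merged_into[current]
    return current


# Placeholder identifiers (chromosome "0" does not exist), not real loci.
records = [
    ("AT0G00020", "mergedelete", "AT0G00010"),
    ("AT0G00030", "splitinsert", "AT0G00025"),
]
merge_map = build_merge_map(records)
current = resolve_current("AT0G00020", merge_map)
```

Only `mergedelete` records redirect an identifier here. A `splitinsert` entry is a new identifier in its own right, so it resolves to itself.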
Important notes from PGSB: Most people assume that sorting the identifiers in ascending order yields a list of genes in their order along the chromosome. This was true originally, but no longer: some BACs needed to be flipped, i.e. their orientation reversed, as new data on overlaps was generated, so all genes on these BACs are now numbered the wrong way round. At PGSB, we decided it is more important to conserve the identifiers than the order, as the order can also be obtained by sorting on coordinates. Generally, the identifier still gives a good idea of the location on the chromosome; only local reversals are expected. If you need a list of identifiers in the order along the chromosome, contact us. Once the orientation of BACs seems stable, this may be corrected by assigning new identifiers to the affected genes, as this will be more intuitive for users. (This would be a breach of our "be conservative" rule, but the "be user-friendly" rule is more important.)
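The difference between identifier order and chromosome order can be shown with a small sketch. The `Locus` record and its `start` field are hypothetical, and the identifiers and coordinates below are made up to mimic a flipped BAC; they are not real annotation data.

```python
from typing import List, NamedTuple


class Locus(NamedTuple):
    identifier: str
    chromosome: str
    start: int  # start coordinate on the chromosome (made-up values below)


def in_chromosome_order(loci: List[Locus]) -> List[Locus]:
    """Order loci by position along the chromosome, not by identifier."""
    return sorted(loci, key=lambda locus: (locus.chromosome, locus.start))


# Three loci on a BAC whose orientation was reversed after numbering.
loci = [
    Locus("AT1G00010", "1", 300),
    Locus("AT1G00020", "1", 200),
    Locus("AT1G00030", "1", 100),
]
by_identifier = [locus.identifier for locus in sorted(loci)]
by_position = [locus.identifier for locus in in_chromosome_order(loci)]
```

In this made-up case `by_identifier` runs from `AT1G00010` to `AT1G00030`, while `by_position` runs in the opposite direction, which is the local reversal described above.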
Original document on creating AGI gene codes.
A uniform gene nomenclature system for Arabidopsis was discussed at an impromptu meeting at GSAC in Miami attended by Daphne Preuss, Chris Somerville, Claire Fraser, Xiaoying Lin and Mike Bevan on Sept. 18th.
It was decided that the following uniform system will be used in the forthcoming publication of the sequence of chr 2 and chr 4. A rapid decision was needed due to the time needed to implement the new names.
AT = organism; 1,2,3,4,5 = chromosome; G = gene; 00010 = gene id
The 'G' convention is useful as repeats (r) will soon be annotated, initially as markers. Pseudogenes will be numbered like functional genes. Genes are numbered in order from the top to the bottom of the chromosomes. In the case of chr 2 and 4 this boundary is known due to the presence of rDNA clusters. Gene AT4G00010 is the first gene south of the cluster. Gene order is defined in units of 10, i.e. 00010, 00020, 00030, etc., allowing 9000 genes per chromosome. If new genes are found between two annotated genes, either by experiment or improved gene finding programs, these will be numbered as: 00010, 00012,3,4,-9. This gives plenty of room for expansion. Different versions of a gene product, e.g. a differentially spliced gene, are denoted as 00010.1,2,3 etc. Where there are sequence gaps, often of uncertain size and content (e.g. CEN2 and CEN4), the sequence groups will leave a space the equivalent of 100 - 200 genes. Where the top arm telomeres have not yet been reached, a gap equivalent to about 50 genes should be left, i.e. numbering will start 05000, 05010, etc. The numbering of repeats will follow an independent system, where repeat ids are not interpolated between gene identities. Please don't worry that the BAC naming conventions will be lost or erased from the records. We realize these are presently the most commonly used names, therefore the databases will have a simple way of relating the two naming conventions. Note that a single "AT4G00650" gene can have two BAC names, due to overlaps, and this is one of the reasons for implementing the new nomenclature. You will be able to search for an individual gene with this new name. We believe this system conforms to that used in other organisms, and will be very useful to the community.
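The interpolation idea above, combined with the rule for adding new genes given earlier on this page (free codes ending in zero first, otherwise a last digit such as ...5 or ...8 depending on where the new gene lies), might be sketched as follows. This is one interpretation for illustration only: the function name, the `position` parameter and the tie-breaking are assumptions, not the procedure TAIR actually runs.

```python
from typing import Optional, Set


def pick_new_number(lower: int, upper: int, used: Set[int],
                    position: float = 0.5) -> Optional[int]:
    """Pick a free five-digit code strictly between two neighbouring codes.

    position is the new gene's relative location between its neighbours:
    0.0 means next to the lower code, 1.0 means next to the higher code.
    """
    free = [n for n in range(lower + 1, upper) if n not in used]
    if not free:
        # No room between the neighbours; the caller would have to fall
        # back to the nearest free identifier elsewhere.
        return None
    # Prefer codes ending in zero, as in the first release.
    round_codes = [n for n in free if n % 10 == 0]
    candidates = round_codes or free
    target = lower + position * (upper - lower)
    # Choose the candidate closest to where the gene actually lies.
    return min(candidates, key=lambda n: abs(n - target))
```

With neighbours 10010 and 10020 and no other codes in use, a gene in the middle (`position=0.5`) gets 10015 and a gene close to the higher neighbour (`position=0.8`) gets 10018, matching the ...5 and ...8 convention.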
More information and details on gene nomenclature can be found at TAIR.