A consensus prediction system for prokaryotic CDSs


contact: suskang@kribb.re.kr



About CONSORF

Introduction

CONSORF: a consensus prediction system for prokaryotic open reading frames

Although the number of sequenced prokaryotic genomes is increasing rapidly, their coding sequence (CDS) predictions are inconsistent from genome to genome. Moreover, it is difficult to update them systematically and fast enough to keep up with new knowledge from the expanding public databases. To help tackle these problems, we have developed CONSORF, an automatic identification system that provides comprehensive prokaryotic CDS information. It provides intuitive reliability scores, predicted frame-shifts, alternative start sites, and best pair-wise match information against other prokaryotes. CONSORF first predicts the CDSs supported by consensus alignments from multiple genome-to-proteome comparisons with other prokaryotes using the FASTX program. Then, it fills the empty genomic regions with the CDSs supported by consensus ab initio predictions. Currently, we provide comprehensive CDS information for 330 publicly available prokaryotes identified by CONSORF. Their accuracies, validated with the NCBI RefSeq CDSs, were comparable with those of other high-accuracy CDS prediction programs such as GeneLook and YACOP. In large-scale comparative analyses among prokaryotic genomes, we expect CONSORF’s homology-search feature to save a great deal of time and to provide consistently high-quality predictions. The regularly updated CDS predictions of prokaryotic genomes are freely accessible through our website.

Supplementary Information

For a more detailed description of the CONSORF system, refer to the supplementary information.


CONSORF Workflow


Figure 1. CONSORF workflow

From a prokaryotic genome sequence, the CONSORF system predicts CDSs using two complementary approaches: homology-based and algorithm-based. In the homology-based approach, pair-wise genome-to-proteome comparisons via the FASTX program are performed to generate both ‘homology CDSs’ and ‘alternative CDSs’, while in the algorithm-based approach multiple ab initio predictions are conducted to provide ‘ab initio CDSs’. ‘Homology CDSs’ are determined from the representative FASTX alignment with the highest sum of bit scores in consensus analyses regarding stop, start, and frame change positions, while ‘ab initio CDSs’ are determined from the consensus of the algorithm-based CDSs with the highest sum of CDS nucleotide lengths in consensus analyses regarding only stop and start positions. In contrast, ‘alternative CDSs’ are determined directly from the FASTX alignments with the highest individual bit score across all the pair-wise comparisons. The complementary ‘homology CDSs’ and ‘ab initio CDSs’ are then integrated, avoiding significant positional overlap on the genome, which yields high-accuracy ‘integrated CDSs’. Finally, to determine the most likely start site among the candidate starts, the ‘integrated CDSs’ aligned with N-terminal residues in the pair-wise FASTX comparisons are inspected to provide the final ‘representative CDSs’.
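The difference between a consensus-based choice and a best-single-hit choice can be illustrated with a small Python sketch. It is only an illustration under simplifying assumptions: the alignment tuples, the function names, and the toy data are hypothetical, and the real system also deals with genomic regions, overlaps, and start refinement that are not modelled here.

```python
from collections import defaultdict

def consensus_tables(alignments):
    """Sum bit scores and count alignments per consensus key.

    Each alignment is a hypothetical (stop, start, frame_changes, bit_score)
    tuple; frame_changes is a tuple of positions so that it can be hashed.
    """
    by_stop = defaultdict(lambda: [0.0, 0])
    by_stop_start = defaultdict(lambda: [0.0, 0])
    by_exact = defaultdict(lambda: [0.0, 0])
    for stop, start, frames, bits in alignments:
        for table, key in (
            (by_stop, stop),
            (by_stop_start, (stop, start)),
            (by_exact, (stop, start, frames)),
        ):
            table[key][0] += bits  # summed bit score
            table[key][1] += 1     # number of supporting alignments
    return by_stop, by_stop_start, by_exact

def homology_choice(alignments):
    """Key (stop, start, frame_changes) with the highest summed bit score."""
    by_exact = consensus_tables(alignments)[2]
    return max(by_exact, key=lambda key: by_exact[key][0])

def alternative_choice(alignments):
    """Single alignment with the highest individual bit score."""
    return max(alignments, key=lambda a: a[3])

# Toy data (invented for illustration only).
toy = [
    (1000, 100, (), 50.0),
    (1000, 100, (), 40.0),
    (1000, 130, (), 80.0),
    (1000, 100, (500,), 30.0),
]
print(homology_choice(toy))
print(alternative_choice(toy))
```

In the toy data the start at 100 wins the consensus because two alignments support it, although the single best-scoring alignment starts at 130.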


CONSORF CDS Prediction: homology-based CDSs

CONSORF homology CDS example

Figure 2. An example of homology-based CDS prediction


(A) A region on the fragment (from 2097001 to 2115000 base pairs) of the Bacillus subtilis (oi224308) genome (‘gi|50812173|ref|NC_000964.2|’) was best aligned, in terms of the sum of bit scores, with a Bacillus halodurans (oi272558) protein (‘gi|15614769|ref|NP_243072.1|’) via a FASTX homology search across all the available organisms. Two frame changes, one base insertion (slash) and one base deletion (backslash), were detected (dotted arrows). Some parts of the FASTX alignment were omitted (ellipses).

(B) The one-line header information of the predicted CDS in FASTA format. The header line starts with a right angle bracket (‘>’), and the fields, shown over multiple lines here for clarity only, are separated by two spaces. The CDS ID (‘1r2107988’) consists of a sequential number (‘1’) assigned to each genome of an organism from the longest to the shortest, a strand symbol (either ‘f’ for forward or ‘r’ for reverse), and the stop codon position (‘2107988’) of the CDS. The front part (‘2110811^2110796^2110784’) of the coordinate represents the positions of candidate starts. ‘2109969,2109969’ and ‘2109868,2109866’ in the middle represent frame-shift positions. ‘Strand’ is either forward (plus) or reverse (minus). ‘Amino acid length’ denotes the sequence lengths from both the shortest and the longest starts to the end position. Reliability information is provided for three types of consensus: stop only (type A: ‘54119.6/127’), stop and start only (type B: ‘23297/40’), and stop, start, and frame change (type C: ‘6462.4/8’). For each type of reliability information, the sum of bit scores and the number of occurrences are separated by a slash. The best bit score (‘1059.9’) of the representative FASTX alignment is given at the end. The prefix ‘ex:’ distinguishes extrinsically predicted ‘homology CDSs’ from intrinsically predicted ‘ab initio CDSs’ (‘in:’). The consensus-based best hit information from the representative FASTX alignment is composed of four fields: organism ID (‘oi272558’), gene ID (‘gi|15614769|ref|NP_243072.1|’), gene description (‘oxoglutarate dehydrogenase’), and organism name (‘Bacillus halodurans C-125’).

(C) The amino acid sequence of the predicted CDS. The sequences from the shortest start to the end enclosing the representative FASTX alignment are in capital letters. The candidate start sites are also represented in upper case, while those upstream of the start sites are in lower case. Most internal residues of the sequence were covered by the representative FASTX alignment (horizontal arrows), and three candidate starts were suggested accordingly.



CONSORF Browser

CONSORF browser example

Figure 3. Screenshot of the CONSORF browser


The browser displays the predicted Helicobacter pylori 26695 CDSs in the genomic region from 885,000 to 911,000 base pairs. Most of the predicted CDSs are consistent with public CDSs with some minor variations. Potential frameshifts and candidate start sites are represented by blue and red vertical bars, respectively.

(A) All the CDSs including ‘homology CDS’, ‘alternative CDS’, ‘ab initio CDS’, ‘integrated CDS’, ‘representative CDS’, and ‘public CDS’ were consistent. (B) The density of color represents CDS reliability based on homology-based and algorithm-based consensus. The CDSs in (B) had lower reliability scores than those in (A). (C) A one-base insertion near the start position extended the homology-based CDSs, which had reliability scores comparable to those in (A). Further manual inspection is needed to decide whether this is an authentic frame-shift or a sequencing error. Instead of the frame-shifted CDS, one additional short CDS was predicted in ‘ab initio CDS’ and ‘public CDS’. (D) Four candidate starts were found in ‘homology CDS’, ‘alternative CDS’, ‘ab initio CDS’, ‘integrated CDS’, and ‘representative CDS’. However, the shortest starts were consistent with the start of ‘public CDS’. (E) No homology-based CDS was found in this case. The longest start among the candidate starts of ‘ab initio CDS’, ‘integrated CDS’, and ‘representative CDS’ was consistent with the start of ‘public CDS’. (F) Detailed information on the clicked ‘representative CDS’ is displayed.



CONSORF Data and Formats

Organism ID

  • The organism ID of CONSORF is composed of an NCBI taxonomy ID and the prefix, 'oi'.
  • If multiple organisms share the same NCBI taxonomy ID, a dot (.) followed by a serial number is appended in alphabetical order of their organism names.
  • Example
  • ID: ORGANISM_DESCRIPTION
    oi83333: Escherichia coli K12
    oi224308: Bacillus subtilis subsp. subtilis str. 168
    oi1148: Synechocystis sp. PCC 6803
    oi279010: Bacillus licheniformis ATCC 14580
    oi279010.1: Bacillus licheniformis ATCC 14580 (DSM 13)
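
As a minimal sketch of the ID layout described above, the following Python helper splits an organism ID into its taxonomy ID and optional serial number. The function name is hypothetical, and the serial number is assumed to be a plain integer.

```python
import re

# Hypothetical helper: 'oi' + NCBI taxonomy ID + optional '.serial'.
_ORG_ID = re.compile(r"^oi(\d+)(?:\.(\d+))?$")

def parse_organism_id(org_id):
    """Return (taxonomy_id, serial); serial is None when no suffix is present."""
    match = _ORG_ID.match(org_id)
    if match is None:
        raise ValueError(f"not a CONSORF organism ID: {org_id!r}")
    taxonomy_id = int(match.group(1))
    serial = int(match.group(2)) if match.group(2) is not None else None
    return taxonomy_id, serial

print(parse_organism_id("oi279010.1"))
```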


Organism Name

  • The organism name of CONSORF is unique and the same as the directory name on the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria). The organism name of CONSORF contains no space character.
  • Example
  • ORGANISM_NAME: ORGANISM_DESCRIPTION
    Escherichia_coli_K12: Escherichia coli K12
    Bacillus_subtilis: Bacillus subtilis subsp. subtilis str. 168
    Synechocystis_PCC6803: Synechocystis sp. PCC 6803
    Bacillus_licheniformis_ATCC_14580: Bacillus licheniformis ATCC 14580
    Bacillus_licheniformis_DSM_13: Bacillus licheniformis ATCC 14580 (DSM 13)


Genome (Chromosome or Plasmid) ID

  • Genome ID is the unique ID of a chromosome or a plasmid.
  • It is usually composed of a 'gi|' sign, a GenBank ID, a '|ref|' sign, and a GenBank accession number including the version information, followed by a '|' sign.
  • Example: Genomes (chromosomes or plasmids) of Synechocystis sp. PCC 6803
  • gi|16329170|ref|NC_000911.1|
    gi|38505535|ref|NC_005229.1|
    gi|38505825|ref|NC_005232.1|


CDS ID

  • The CDS ID of CONSORF is composed of a sequential number assigned to each genome including chromosomes and plasmids from the longest genome to the shortest, a strand symbol ('f' for forward strand and 'r' for reverse strand), and the stop codon position of the CDS.
  • Example: CDSs of Synechocystis sp. PCC 6803
  • CDS_ID: CDS_FULL_NAME(=CDS_POSITION/GENOME_ID)
    1f2065521: 2062863^2062905-2065520/gi|16329170|ref|NC_000911.1|
    1r1916535: 1918608-1916536/gi|16329170|ref|NC_000911.1|
    1f3566647: 3564838-3566646/gi|16329170|ref|NC_000911.1|
    2f25953: 24777^24798-25952/gi|38505535|ref|NC_005229.1|
    2r77194: 78010^77995^77941-77195/gi|38505535|ref|NC_005229.1|
    3f42309: 39171-42308/gi|38505825|ref|NC_005232.1|
    3r57135: 57666-57136/gi|38505825|ref|NC_005232.1|
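
The three parts of a CDS ID can be separated with a short Python sketch such as the one below. The function name and the dictionary keys are hypothetical; only the layout (sequential genome number, 'f' or 'r', stop codon position) comes from the description above.

```python
import re

_CDS_ID = re.compile(r"^(\d+)([fr])(\d+)$")

def parse_cds_id(cds_id):
    """Split a CDS ID such as '2r77194' into its three components."""
    match = _CDS_ID.match(cds_id)
    if match is None:
        raise ValueError(f"not a CONSORF CDS ID: {cds_id!r}")
    return {
        "genome_number": int(match.group(1)),           # 1 = longest genome
        "strand": "+" if match.group(2) == "f" else "-",
        "stop_position": int(match.group(3)),
    }

print(parse_cds_id("2r77194"))
```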


CDS Full Name

  • The CDS full name of CONSORF contains the CDS position and genome ID separated by a slash sign ('/'). The CDS position is composed of one or more candidate start codon positions separated by a hat symbol ('^'), a dash symbol ('-'), and its nucleotide end position not including the stop codon. If a CDS is predicted to have one or more frame changes including insertions, deletions, or in-frame stop codons, the ranges of exactly translated nucleotide positions are denoted in series separated by a comma (',').
  • Example: CDSs of Synechocystis sp. PCC 6803
  • CDS_FULL_NAME(=CDS_POSITION/GENOME_ID): CDS_ID
    2062863^2062905-2065520/gi|16329170|ref|NC_000911.1|: 1f2065521
    1918608-1916536/gi|16329170|ref|NC_000911.1|: 1r1916535
    3564838-3566646/gi|16329170|ref|NC_000911.1|: 1f3566647
    24777^24798-25952/gi|38505535|ref|NC_005229.1|: 2f25953
    78010^77995^77941-77195/gi|38505535|ref|NC_005229.1|: 2r77194
    39171-42308/gi|38505825|ref|NC_005232.1|: 3f42309
    57666-57136/gi|38505825|ref|NC_005232.1|: 3r57135
    853281-852919,852919-852473/gi|16329170|ref|NC_000911.1|: 1r852472
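
A CDS full name can be taken apart along the separators listed above. The Python sketch below is one possible reading of the format; the function name and the returned keys are hypothetical, and it assumes that candidate starts appear only in the first range.

```python
def parse_cds_full_name(full_name):
    """Split 'CDS_POSITION/GENOME_ID' into candidate starts and ranges."""
    # The position part contains no '/', so split at the first slash only.
    position, genome_id = full_name.split("/", 1)
    candidate_starts = []
    ranges = []
    for index, part in enumerate(position.split(",")):
        starts_text, end_text = part.split("-")
        starts = [int(s) for s in starts_text.split("^")]
        if index == 0:
            candidate_starts = starts  # longest start first
        # Each translated range is stored as (start, end).
        ranges.append((starts[0], int(end_text)))
    return {
        "genome_id": genome_id,
        "candidate_starts": candidate_starts,
        "ranges": ranges,
    }

print(parse_cds_full_name("78010^77995^77941-77195/gi|38505535|ref|NC_005229.1|"))
```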


CDS Coordinate

  • The CDS coordinate of CONSORF is the genomic position information of a CDS. It includes 'strand', 'end', 'longest start', 'shortest start', 'candidate starts', 'longest amino-acid length', and 'shortest amino-acid length'.
  • 'Strand' describes the direction of translation: '+' for forward and '-' for reverse translation.
  • 'End' is the end position of a CDS, not including the stop codon.
  • 'Longest start' is the longest start position of a CDS (leftmost along the direction of translation).
  • 'Shortest start' is the shortest start position of a CDS (rightmost along the direction of translation) that at least positionally covers the representative FASTX alignment (i.e. the representative conserved regions). It is possible that the 'shortest start' is the same as the 'longest start' if only the 'longest start' covers the representative FASTX alignment.
  • 'Candidate starts' are all the start codons positioned between the longest and the shortest starts, inclusive. If the 'longest start' is the same as the 'shortest start', there is only one candidate start. If only the 'longest start' and the 'shortest start' exist and they are different, there are two candidate starts, and so on.
  • 'Longest amino-acid length' is the number of amino-acid residues from the longest start to the end position.
  • 'Shortest amino-acid length' is the number of amino-acid residues from the shortest start to the end position.
  • Only 'ATG', 'GTG', and 'TTG' codons are considered as possible start codons.
  • Stop codons are determined from the translation table. Usually, 'TAA', 'TAG', and 'TGA' codons are considered as possible stop codons.
  • Example: CDSs of Synechocystis sp. PCC 6803
  • CDS_ID (CDS_FULL_NAME): STRAND END LONGEST_START SHORTEST_START CANDIDATE_STARTS(separated by a hat sign)
    1f2065521 (2062863^2062905-2065520/gi|16329170|ref|NC_000911.1|): + 2065520 2062863 2062905 2062863^2062905
    1r1916535 (1918608-1916536/gi|16329170|ref|NC_000911.1|): - 1916536 1918608 1918608 1918608
    1r852472 (853281-852919,852919-852473/gi|16329170|ref|NC_000911.1|): - 852473 853281 853281 853281
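
The coordinate fields can be derived from a CDS position string alone, as the following Python sketch suggests. It rests on assumptions inferred from the examples on this page rather than on a stated rule: the strand is read from the order of start and end, the number in the CDS ID is taken to be one base past the end in the direction of translation, and the amino-acid length is the summed length of the translated ranges divided by three. The function name and keys are hypothetical.

```python
def summarize_position(position):
    """Derive coordinate fields from a CDS position such as '24777^24798-25952'."""
    parts = position.split(",")
    starts_text, first_end = parts[0].split("-")
    candidate_starts = [int(s) for s in starts_text.split("^")]
    # Translated ranges; the first one begins at the longest start.
    ranges = [(candidate_starts[0], int(first_end))]
    for part in parts[1:]:
        start, end = part.split("-")
        ranges.append((int(start), int(end)))
    forward = ranges[0][0] <= ranges[0][1]
    end = ranges[-1][1]

    def aa_length(first_start):
        # Sum the nucleotides of every translated range, then count codons.
        spans = [(first_start, ranges[0][1])] + ranges[1:]
        return sum(abs(e - s) + 1 for s, e in spans) // 3

    return {
        "strand": "+" if forward else "-",
        "end": end,
        "stop_position": end + 1 if forward else end - 1,
        "longest_start": candidate_starts[0],
        "shortest_start": candidate_starts[-1],
        "longest_aa_len": aa_length(candidate_starts[0]),
        "shortest_aa_len": aa_length(candidate_starts[-1]),
    }

print(summarize_position("2062863^2062905-2065520"))
```

For '2062863^2062905-2065520' this gives lengths of 886 and 872 residues and a stop position of 2065521, which agrees with the CDS ID 1f2065521.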


Frame Change of CONSORF

  • Frame change of CONSORF includes insertion, deletion, and in-frame stop events.
  • Insertion is a frame-shift event caused by one base insertion.
  • Deletion is a frame-shift event caused by one base deletion.
  • 'In-frame stop' is a stop codon read-through event caused by translation of a stop codon as an amino acid such as tryptophan (Trp, W).
  • 'Multi-frame change' of CONSORF describes more than one frame change event occurring at the same genomic location, as theoretically expected from a FASTX alignment.
  • Example: CDSs of Synechocystis sp. PCC 6803 expected to contain one or more frame changes
  • EXPECTED_FRAME_CHANGE: CDSID (CDS_FULL_NAME)
    One insertion: 1r852472 (853281-852919,852919-852473/gi|16329170|ref|NC_000911.1|)
    One deletion: 1f2205370 (2204157-2204573,2204575-2205369/gi|16329170|ref|NC_000911.1|)
    One in-frame stop: 1f1156501 (1154857-1156080,1156084-1156500/gi|16329170|ref|NC_000911.1|)
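
In the three examples above, the kind of frame change can be read from the junction between two consecutive translated ranges: the same base is repeated for an insertion, one base is skipped for a deletion, and three bases are skipped for an in-frame stop. The Python sketch below encodes that observation. It is an inference from these examples only, not a documented rule; multi-frame changes are not handled, and the function name is hypothetical.

```python
def classify_frame_changes(position):
    """Label each junction between consecutive ranges of a CDS position."""
    spans = []
    for part in position.split(","):
        starts_text, end_text = part.split("-")
        spans.append((int(starts_text.split("^")[0]), int(end_text)))
    forward = spans[0][0] <= spans[0][1]
    # Step from the end of one range to the start of the next, measured
    # along the direction of translation (1 would mean contiguous ranges).
    labels = {0: "insertion", 2: "deletion", 4: "in-frame stop"}
    events = []
    for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
        step = next_start - prev_end if forward else prev_end - next_start
        events.append(labels.get(step, "unclassified"))
    return events

print(classify_frame_changes("853281-852919,852919-852473"))
```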


Consensus (Reliability) of CONSORF

  • 'Consensus' or 'reliability' of CONSORF is the information on the degree of consensus (reliability) for a CDS. If the CDS was identified by a homology-based method, the consensus (reliability) describes how many high-scoring FASTX alignments from the comparisons with other public prokaryotes support the CDS. If the CDS was identified by an algorithm-based ab initio method, the consensus (reliability) describes how many ab initio gene prediction programs support the CDS.
  • 'Consensus' or 'reliability' of CONSORF includes 'method', 'stop score', 'stop count', 'start score', 'start count', 'exact score', 'exact count', and 'best score'.
  • 'Method' describes the identification method of a CDS. It is either the homology-based extrinsic ('ex') or the ab initio intrinsic ('in') method.
  • 'Stop score' describes the reliability (consensus score) of the stop codon prediction of a CDS. If the CDS is from the homology-based method, it represents the total sum of bit scores from the FASTX alignments supporting the stop codon of the CDS. If the CDS is from the algorithm-based ab initio method, it represents the total sum of nucleotide lengths of CDSs from the ab initio predictions supporting the stop codon of the CDS.
  • 'Stop count' also describes, more roughly, the reliability of the stop codon prediction of a CDS. If the CDS is from the homology-based method, it represents the total number of FASTX alignments supporting the stop codon of the CDS. If the CDS is from the algorithm-based ab initio method, it represents the total number of ab initio predictions supporting the stop codon of the CDS.
  • 'Start score' describes the reliability (consensus score) of the stop and start codon predictions of a CDS. If the CDS is from the homology-based method, it represents the total sum of bit scores from the FASTX alignments supporting the start and stop codons of the CDS. If the CDS is from the algorithm-based ab initio method, it is the same as 'exact score' and usually omitted.
  • 'Start count' also describes, more roughly, the reliability of the stop and start codon predictions of a CDS. If the CDS is from the homology-based method, it represents the total number of FASTX alignments supporting the start and stop codons of the CDS. If the CDS is from the algorithm-based ab initio method, it is the same as 'exact count' and usually omitted.
  • 'Exact score' describes the reliability (consensus score) of the stop, start, and frame change (exact) predictions of a CDS. If the CDS is from the homology-based method, it represents the total sum of bit scores from the FASTX alignments supporting the start, stop, and frame change of the CDS exactly. If the CDS is from the algorithm-based ab initio method, it represents the total sum of nucleotide lengths of CDSs from the ab initio predictions supporting the stop and start codons of the CDS exactly. Frame change is not considered in the ab initio predictions.
  • 'Exact count' also describes, more roughly, the reliability of the stop, start, and frame change (exact) predictions of a CDS. If the CDS is from the homology-based method, it represents the total number of FASTX alignments supporting the start, stop, and frame change of the CDS exactly. If the CDS is from the algorithm-based ab initio method, it represents the total number of ab initio predictions supporting the stop and start codons of the CDS exactly. Frame change is not considered in the ab initio predictions.
  • 'Best score' is the bit score of the representative FASTX alignment, that is, the alignment with the highest (best) bit score among the FASTX alignments supporting the same stop, start, and frame change of the CDS exactly. 'Best score' is assigned only to homology-based CDSs. It is omitted for the algorithm-based ab initio CDSs.
  • 'Old starts' describes the old candidate start codons of a CDS before the refinement of its shortest start. Only 'representative CDSs' that have undergone start refinement may have 'old starts'. If no better start codon is found during start refinement, the 'old starts' information is omitted.
  • Example
  • METHOD('ex'): STOP_SCORE/STOP_COUNT/START_SCORE/START_COUNT/EXACT_SCORE/EXACT_COUNT/BEST_SCORE
    or
    METHOD('in'): STOP_SCORE/STOP_COUNT/EXACT_SCORE/EXACT_COUNT
    ex:231688.2/292/216693.4/260/215633/258/1183.1
    ex:162691.1/319/162691.1/319/162691.1/319/668.9
    ex:2505.8/12/1981.2/6/1981.2/6/424.8
    in:10410/2/5163/1
    in:7227/3/7227/3
    in:1512/2/1512/2
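
The two layouts above (seven values for 'ex', four values for 'in') can be read with a small Python sketch like the following. The function name and the dictionary keys are hypothetical; the field order follows the format lines above.

```python
EX_FIELDS = ["stop_score", "stop_count", "start_score", "start_count",
             "exact_score", "exact_count", "best_score"]
IN_FIELDS = ["stop_score", "stop_count", "exact_score", "exact_count"]

def parse_consensus(text):
    """Parse a consensus string such as 'ex:2505.8/12/1981.2/6/1981.2/6/424.8'."""
    method, _, values_text = text.partition(":")
    values = values_text.split("/")
    if method == "ex" and len(values) == len(EX_FIELDS):
        names = EX_FIELDS
    elif method == "in" and len(values) == len(IN_FIELDS):
        names = IN_FIELDS
    else:
        raise ValueError(f"unexpected consensus string: {text!r}")
    result = {"method": method}
    for name, value in zip(names, values):
        # Counts are integers; scores are kept as floats.
        result[name] = int(value) if name.endswith("_count") else float(value)
    return result

print(parse_consensus("ex:2505.8/12/1981.2/6/1981.2/6/424.8"))
```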


Best match of CONSORF

  • 'Best match' of CONSORF is the best-match information from the representative FASTX alignment from FASTX homology search against other public proteomes of prokaryotes. It is assigned only to homology-based CDSs.
  • 'Best match' of CONSORF includes a best-match organism ID, a best-match organism description, a best-match protein ID, and a best-match protein description.
  • Example
  • BEST_MATCH_ORG_ID:BEST_MATCH_PROTEIN_ID: BEST_MATCH_PROTEIN_DESC (BEST_MATCH_ORG_DESC)
    oi240292:gi|75908552|ref|YP_322848.1|: ATPase (Anabaena variabilis ATCC 29413)
    oi251221:gi|37521338|ref|NP_924715.1|: carbamoyl-phosphate synthase large subunit (Gloeobacter violaceus PCC 7421)
    oi103690:gi|17230263|ref|NP_486811.1|: dihydroxy-acid dehydratase (Nostoc sp. PCC 7120)
    oi316279:gi|78183707|ref|YP_376141.1|: translocase (Synechococcus sp. CC9902)
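
A best-match line can be split into its four fields as sketched below in Python. The function name is hypothetical. The organism description is taken to be the last balanced pair of parentheses, which is an assumption made so that descriptions that themselves contain parentheses are still kept whole.

```python
def parse_best_match(text):
    """Parse 'ORG_ID:PROTEIN_ID: PROTEIN_DESC (ORG_DESC)'."""
    org_id, rest = text.split(":", 1)
    protein_id, rest = rest.split(": ", 1)
    if not rest.endswith(")"):
        raise ValueError(f"missing organism description: {text!r}")
    # Walk backwards to find the '(' that matches the final ')'.
    depth = 0
    for index in range(len(rest) - 1, -1, -1):
        if rest[index] == ")":
            depth += 1
        elif rest[index] == "(":
            depth -= 1
            if depth == 0:
                break
    else:
        raise ValueError(f"unbalanced parentheses: {text!r}")
    return {
        "org_id": org_id,
        "protein_id": protein_id,
        "protein_desc": rest[:index].rstrip(),
        "org_desc": rest[index + 1:-1],
    }

print(parse_best_match(
    "oi240292:gi|75908552|ref|YP_322848.1|: ATPase (Anabaena variabilis ATCC 29413)"
))
```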



CONSORF Files and Formats

[Genome] Genome Sequence File

  • A genome sequence file of an organism contains genomic nucleotide sequences of available chromosomes and plasmids in multi-FASTA format.
  • The file format is the same as the NCBI genome file with the file extension, '.fna'.

[Proteome] Proteome Sequence File

  • A proteome sequence file of an organism contains all the amino-acid sequences of predicted genes provided by NCBI RefSeq.
  • The file format is the same as the NCBI proteome file with the file extension, '.faa'.
  • Proteome sequence files were used as library (database) files of the FASTX program in pairwise homology search.

Public CDS Files

General information

  • Public CDS files contain the public coding sequence (CDS) information from NCBI RefSeq files.
  • Only the CDSs with a stop codon (one of 'TAA', 'TAG', and 'TGA') at the C-terminal end were extracted from the NCBI GenBank-format files with the extension '.gbk'.
  • Coordinate file: tab-delimited information of each public CDS containing related gene IDs, genomic coordinate information, and a predicted protein product.
  • Amino-acid sequence file: translated amino-acid sequences of the above public CDSs in the coordinate file.
  • Nucleotide sequence file: genomic nucleotide sequences of the above public CDSs in the coordinate file.
  • Coordinate files containing the information of public CDSs were referenced to evaluate CONSORF-predicted CDSs including 'homology CDSs', 'alternative CDSs', 'integrated CDSs', and 'representative CDSs'.

[Coord] Coordinate file

  • Format
  • OrfID/GenomeID <tab> GenomicCoordinate <tab> Strand <tab> ProteinID <tab> DbXrefs <tab> GeneSymbol <tab> Product
    • 'OrfID' is usually the locus tag of a CDS. If the locus tag is not available, the 'GenomicCoordinate' of the CDS is used instead.
    • 'GenomeID' is either a chromosome ID or a plasmid ID.
    • 'GenomicCoordinate' is usually composed of a start position, a dash, and an end position. Sometimes, it is composed of a few fragmented genomic coordinates (each with a start position, a dash, and an end position) separated by a comma.
    • 'GenomicCoordinate' of a public CDS includes the stop codon position, in contrast to CONSORF-predicted CDSs, whose coordinates exclude the stop codon.
    • 'Strand' is either a plus or a minus sign.
    • 'ProteinID' is the GenBank accession ID of a CDS.
    • 'DbXrefs' is one or more externally linked database references separated by a semicolon. Each database reference is composed of a database name, a colon, and a database ID.
    • 'GeneSymbol' is the gene symbol of a CDS. If not available, a dash is inserted instead.
    • 'Product' is the predicted gene product of a CDS.
  • Example: 'public CDSs' of Synechocystis sp. PCC 6803
  • slr0611/gi|16329170|ref|NC_000911.1| 3573271-3573470,1-772 + NP_439899.1 GI:16329171;GeneID:951850 sds solanesyl diphosphate synthase
    slr0612/gi|16329170|ref|NC_000911.1| 937-1494 + NP_439900.1 GI:16329172;GeneID:951851 - hypothetical protein
    sll1212/gi|16329170|ref|NC_000911.1| 6622-5534 - NP_439905.1 GI:16329177;GeneID:951882 rfbD GDP-D-mannose dehydratase
    ssl5001/gi|38505535|ref|NC_005229.1| 374-195 - NP_942157.1 GI:38505536;GeneID:2655889 - hypothetical protein
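
One line of the coordinate file can be read as sketched below in Python. The function name and keys are hypothetical. The sketch assumes that the seven fields are separated by literal tab characters, as stated in the format line; the tabs are not visible in the example lines shown on this page.

```python
def parse_coord_line(line):
    """Parse one tab-delimited line of a public CDS coordinate file."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 tab-separated fields, got {len(fields)}")
    orf_and_genome, coordinate, strand, protein_id, dbxrefs, symbol, product = fields
    orf_id, genome_id = orf_and_genome.split("/", 1)
    # 'DbXrefs' holds 'database:ID' pairs separated by semicolons.
    xrefs = dict(item.split(":", 1) for item in dbxrefs.split(";"))
    return {
        "orf_id": orf_id,
        "genome_id": genome_id,
        "coordinate": coordinate,
        "strand": strand,
        "protein_id": protein_id,
        "dbxrefs": xrefs,
        "gene_symbol": None if symbol == "-" else symbol,  # dash means absent
        "product": product,
    }

example = "\t".join([
    "sll1212/gi|16329170|ref|NC_000911.1|", "6622-5534", "-", "NP_439905.1",
    "GI:16329177;GeneID:951882", "rfbD", "GDP-D-mannose dehydratase",
])
print(parse_coord_line(example))
```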

[AaSeq] Amino-acid sequence file

  • Format
  • >OrfID/GenomeID <two_spaces> ProteinID <two_spaces> DbXrefs <two_spaces> GeneSymbol <two_spaces> Product
    Amino acid sequence
  • Example: 'public CDSs' of Synechocystis sp. PCC 6803
  • >slr0611/gi|16329170|ref|NC_000911.1| NP_439899.1 GI:16329171;GeneID:951850 sds solanesyl diphosphate synthase
    MISTTSLFAPVDQDLRLLTDNLKRLVGARHPILGAAAEHLFEAGGKRVRPAIVLLVSRATLLDQELTARHRRLAEITEMI
    HTASLVHDDVVDEADLRRNVPTVNSLFDNRVAVLAGDFLFAQSSWYLANLDNLEVVKLLSEVIRDFAEGEILQSINRFDT
    DTDLETYLEKSYFKTASLIANSAKAAGVLSDAPRDVCDHLYEYGKHLGLAFQIVDDILDFTSPTEVLGKPAGSDLISGNI
    TAPALFAMEKYPLLGKLIEREFAQAGDLEQALELVEQGDGIRRSRELAANQAQLARQHLSVLEMSAPRESLLELVDYVLG
    RLH
    >slr0612/gi|16329170|ref|NC_000911.1| NP_439900.1 GI:16329172;GeneID:951851 - hypothetical protein
    MGRLDQDSEGLLLLTSNGKLQHRLAHREFAHQRTYFAQVEGSPTDEDLEPLRRGITFADYPTRPAIAKIITEPDFPPRNP
    PIRYRASIPTSWLSITLTEGRNRQVRRMTAAVGFPTLRLVRVQIQVTGRSPQQGKGKSAATWCLTLEGLSPGQWRPLTPW
    EENFCQQLLTGNPNGPWQKKFGDRR

[NtSeq] Nucleotide sequence file

  • Format
  • >OrfID/GenomeID <two_spaces> GenomicCoordinate <two_spaces> Strand
    Nucleotide sequence
  • Example: 'public CDSs' of Synechocystis sp. PCC 6803
  • >slr0611/gi|16329170|ref|NC_000911.1| 3573271-3573470,1-772 +
    ATGATCTCCACTACCTCCCTGTTTGCCCCCGTTGACCAAGACCTCCGTTTATTAACGGATAATCTCAAGCGGCTTGTCGG
    TGCTCGGCATCCTATCCTGGGGGCGGCGGCGGAACATTTATTTGAGGCAGGGGGAAAGCGGGTGCGGCCGGCCATTGTGT
    TGTTAGTTTCCCGCGCAACCCTATTAGACCAAGAATTAACGGCGCGCCATCGCCGGCTGGCGGAAATTACCGAAATGATC
    CACACCGCTAGTTTGGTCCACGATGACGTGGTGGATGAGGCGGATCTGCGGCGGAATGTGCCCACGGTGAATAGTTTATT
    TGACAATCGGGTGGCAGTGTTAGCGGGGGATTTCCTCTTTGCCCAATCTTCTTGGTATTTGGCTAACTTAGATAATTTGG
    AGGTGGTGAAATTATTATCGGAGGTAATTCGGGACTTTGCGGAGGGGGAAATTTTACAGAGCATCAATCGTTTTGACACC
    GACACAGATTTAGAAACCTATTTGGAAAAAAGCTATTTTAAAACCGCCTCTCTCATTGCCAACAGTGCCAAGGCAGCGGG
    GGTTTTGAGCGATGCGCCCCGGGATGTGTGTGATCATCTTTACGAATATGGTAAACATTTGGGGTTAGCGTTCCAGATTG
    TGGACGATATTTTAGATTTCACTTCCCCCACGGAGGTTTTGGGGAAACCGGCCGGGTCAGATTTAATCAGCGGCAACATC
    ACCGCCCCAGCCCTATTTGCCATGGAAAAATATCCCCTACTTGGTAAATTAATTGAACGGGAATTTGCCCAGGCGGGGGA
    TTTGGAACAGGCCCTGGAATTGGTAGAACAGGGGGATGGTATCCGGCGATCAAGGGAATTGGCCGCGAACCAAGCGCAAC
    TGGCCCGGCAACATCTGAGTGTGCTGGAAATGTCCGCTCCGAGAGAATCTCTGTTGGAATTAGTTGATTATGTGCTTGGT
    CGTCTCCATTAG
    >sll0558/gi|16329170|ref|NC_000911.1| 2873-2172 -
    ATGTCTGATAATTTGACCGAACTCTCCCAACAACTCCATGATGCTTCAGAAAAAAAACAGTTGACGGCGATCGCCGCTTT
    GGCAGAAATGGGAGAAGGGGGCCAGGGAATATTACTCGATTATTTGGCCAAAAATGTCCCCCTAGAAAAGCCAGTGTTGG
    CGGTGGGTAACGTCTACCAAACCCTCCGGAATCTAGAACAGGAAACCATCACAACGCAACTCCAACGGAATTACCCCACA
    GGCATTTTCCCCTTACAATCGGCCCAGGGCATTGATTATCTGCCGCTCCAGGAAGCCCTAGGAAGCCAGGATTTTGAAAC
    AGCGGATGAAATAACCCGGGATAAATTGTGCGAACTGGCGGGGCCTGGGGCCAGTCAAAGACAATGGCTCTATTTCACAG
    AAGTAGAAAAATTTCCTGCCCTAGACCTGCACACCATTAATGCTTTGTGGTGGCTCCACTCCAACGGTAATTTTGGTTTT
    TCGGTGCAACGACGACTCTGGTTGGCGTCCGGAAAAGAATTTACCAAGCTTTGGCCGAAAATTGGCTGGAAAAGCGGCAA
    TGTTTGGACCCGTTGGCCCAAGGGCTTTACCTGGGATTTATCCGCACCCCAGGGTCATTTACCCCTGTTAAACCAATTGC
    GGGGGGTAAGGGTAGCAGAATCCCTTTACAGGCACCCAGTTTGGTCCCAATACGGTTGGTAA



[XML] CONSORF XML File

  • XML Schema of CONSORF : consorf.xsd
  • Simplified Display of CONSORF XML Format
  • <consorf>
      <orf>
        <orfId>string</orfId>
        <orfFullName genomeId="string" orfPosition="string">
        <coordinate
          strand="+ or -" end="positiveInteger"
          candidateStarts="list of positiveInteger"
          shortestStart="positiveInteger"
          longestStart="positiveInteger"
          shortestAaLen="positiveInteger"
          longestAaLen="positiveInteger" />
        <frameChange(optional)
          insertion="list of positiveInteger"
          deletion="list of positiveInteger"
          inFrameStop="string(list of coordinate ranges)"
          multiFrameChange="string(list of coordinate ranges)" />
        <consensus
          method="in or ex"
          stopScore="decimal" stopCount="positiveInteger"
          exactScore="decimal" exactCount="positiveInteger"
          startScore(optional)="decimal"
          startCount(optional)="positiveInteger"
          bestScore(optional)="decimal"
          oldStarts(optional)="list of positiveInteger" />
        <bestMatch(optional) bestOrgId="string" bestOrgDesc="string"
          bestProteinId="string" bestProteinDesc="string" />
        <startAaSeq(optional)>string</startAaSeq>
        <shortestAaSeq>string</shortestAaSeq>
        <startNtSeq(optional)>string</startNtSeq>
        <shortestNtSeq>string</shortestNtSeq>
      </orf>
      ...
    </consorf>
  • Elements and Attributes of CONSORF XML
    • 'orfId' is the unique ID of a CDS. Refer to 'CDS ID' for more information.
    • 'orfFullName' has two attributes, 'orfPosition' and 'genomeId'. Refer to 'CDS Full Name' for more information.
    • 'coordinate' has seven attributes, 'strand', 'end', 'candidateStarts', 'shortestStart', 'longestStart', 'shortestAaLen', and 'longestAaLen'. Refer to 'CDS Coordinate' for more information.
      • 'candidateStarts' is the list of candidate start positions of a CDS separated by a space.
      • <example>
        392 446 509 533 545 662 800 827 931 1144 7835

    • 'frameChange' has four attributes, 'insertion', 'deletion', 'inFrameStop', and 'multiFrameChange'. Refer to 'Frame Change of CONSORF' for more information.
      • 'insertion' and 'deletion' are lists of genomic nucleotide positions separated by a space.
      • 'inFrameStop' and 'multiFrameChange' are lists of genomic nucleotide position ranges separated by a space.
      • <example>
        inFrameStop: 10290-10288
        inFrameStop: 10541-10543 10607-10609
        multiFrameChange: 1470704-1470699
        multiFrameChange: 2756354-2756362

    • 'consensus' has nine attributes, 'method', 'stopScore', 'stopCount', 'startScore', 'startCount', 'exactScore', 'exactCount', 'bestScore', and 'oldStarts'. Refer to 'Consensus (Reliability) of CONSORF' for more information.
      • 'oldStarts' is the list of old candidate start positions before start refinement, separated by a space.
      • <example>
        931 1144 1150 19890 19971 19974 19986 20025 20040 20067 20070

    • 'bestMatch' has four attributes, 'bestOrgId', 'bestOrgDesc', 'bestProteinId', and 'bestProteinDesc'. Refer to 'Best match of CONSORF' for more information.
    • 'startAaSeq' is the amino-acid sequence from the longest (inclusive) to the shortest (exclusive) start positions. Only the start positions are displayed as capital letters.
    • <example>
      Mqpkqravp
      MpflqciMrssLyfrakspgyla

    • 'shortestAaSeq' is the amino-acid sequence from the shortest start (inclusive) to the end (inclusive) positions. All amino acids are capital letters.
    • 'startNtSeq' is the nucleotide sequence from the longest (inclusive) to the shortest (exclusive) start positions. Only the start codons are capital letters.
    • <example>
      GTGcaacccaagcagagggctgtcccc
      GTGccctttctccaatgcattATGcgatcgtccTTGtattttcgagctaaatctcccggttatcttgcc

    • 'shortestNtSeq' is the nucleotide sequence from the shortest start (inclusive) to the stop (inclusive) positions. All nucleotides are capital letters.
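
A file following the simplified display above could be read with Python's standard xml.etree.ElementTree module, as in the sketch below. The sample document is hypothetical: it was assembled by hand from values shown elsewhere on this page, its sequence is truncated, and it omits 'shortestNtSeq', so it is not expected to validate against consorf.xsd. The sketch also assumes that the elements carry no XML namespace.

```python
import xml.etree.ElementTree as ET

# Hand-made, abbreviated sample (not a real CONSORF file).
SAMPLE = """<consorf>
  <orf>
    <orfId>1r1916535</orfId>
    <orfFullName genomeId="gi|16329170|ref|NC_000911.1|"
      orfPosition="1918608-1916536" />
    <coordinate strand="-" end="1916536" candidateStarts="1918608"
      shortestStart="1918608" longestStart="1918608"
      shortestAaLen="691" longestAaLen="691" />
    <consensus method="ex" stopScore="219901" stopCount="318"
      exactScore="179591.8" exactCount="240"
      startScore="179591.8" startCount="240" bestScore="1078.4" />
    <shortestAaSeq>MARTVPLERIRNIGIAAHID</shortestAaSeq>
  </orf>
</consorf>"""

def read_orfs(xml_text):
    """Collect a few fields from every <orf> element."""
    root = ET.fromstring(xml_text)
    orfs = []
    for orf in root.findall("orf"):
        coord = orf.find("coordinate").attrib
        cons = orf.find("consensus").attrib
        orfs.append({
            "orf_id": orf.findtext("orfId"),
            "strand": coord["strand"],
            "end": int(coord["end"]),
            # Space-separated list of positions.
            "candidate_starts": [int(p) for p in coord["candidateStarts"].split()],
            "method": cons["method"],
            "stop_score": float(cons["stopScore"]),
            # Optional attribute and element.
            "best_score": float(cons["bestScore"]) if "bestScore" in cons else None,
            "has_frame_change": orf.find("frameChange") is not None,
        })
    return orfs

print(read_orfs(SAMPLE))
```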

[RankedAaSeq] Ranked Amino-acid Sequence File

  • Format
  • Case 1: homology-based CDSs
    >'orfId'
    <two_spaces> 'orfFullName'
    <two_spaces> 'strand'
    <two_spaces> 'shortestAaLen'~'longestAaLen'aa
    <two_spaces> 'method':'stopScore'/'stopCount'/'startScore'/'startCount'/'exactScore'/'exactCount'/'bestScore'(start:'oldStarts')
    <two_spaces> 'bestOrgId':'bestProteinId': 'bestProteinDesc' ('bestOrgDesc')
    'startAaSeq''shortestAaSeq'
    Case 2: algorithm-based ab initio CDSs
    >'orfId'
    <two_spaces> 'orfFullName'
    <two_spaces> 'strand'
    <two_spaces> 'shortestAaLen'~'longestAaLen'aa
    <two_spaces> 'method':'stopScore'/'stopCount'/'exactScore'/'exactCount'(start:'oldStarts')
    'startAaSeq''shortestAaSeq'
    • Refer to CONSORF Data and Formats for the description on each field.
    • Entries were sorted by 'stopScore' in descending order.
    • For 'ab initio CDSs', the attributes 'startScore', 'startCount', 'bestScore', 'bestOrgId', 'bestProteinId', 'bestProteinDesc', and 'bestOrgDesc' are all omitted.
    • If the shortest start position is adjusted during the start refinement process, and hence the candidate starts change, the old candidate starts are attached after the consensus score information, because the consensus scores relate to the old candidate starts rather than the new ones.
    • <example>
      ex:221560/244/87344.8/59/87344.8/59/1989.1
      ex:127930.9/242/85347.8/158/85347.8/158/966.6(start:599082^599067^599064)
      in:5358/2/2355/1
      in:4140/2/2061/1(start:600145^600148^600163)
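
The optional '(start:...)' suffix and the concatenated sequence line can be handled as in the following Python sketch. The function names are hypothetical. Splitting the sequence by the shortest amino-acid length from the header is a design choice made here, because the boundary between 'startAaSeq' and 'shortestAaSeq' cannot always be recognised from letter case alone.

```python
import re

_CONSENSUS = re.compile(r"^(ex|in):([0-9./]+)(?:\(start:([0-9^]+)\))?$")

def parse_ranked_consensus(field):
    """Parse the consensus field, including an optional '(start:...)' suffix."""
    match = _CONSENSUS.match(field)
    if match is None:
        raise ValueError(f"unexpected consensus field: {field!r}")
    old_starts = []
    if match.group(3):
        old_starts = [int(p) for p in match.group(3).split("^")]
    return {
        "method": match.group(1),
        "values": [float(v) for v in match.group(2).split("/")],
        "old_starts": old_starts,
    }

def split_sequence(sequence, shortest_aa_len):
    """Split 'startAaSeq' + 'shortestAaSeq' using the shortest length."""
    cut = len(sequence) - shortest_aa_len
    start_part, shortest_part = sequence[:cut], sequence[cut:]
    # Residue offsets of the upstream candidate starts (shown as capitals).
    offsets = [i for i, ch in enumerate(start_part) if ch.isupper()]
    return start_part, shortest_part, offsets

print(parse_ranked_consensus("in:4140/2/2061/1(start:600145^600148^600163)"))
print(split_sequence("MvvlthpianienfMQPTDPNK", 8))
```

In the second call the 14 upstream residues correspond to 42 nucleotides, which matches the distance between the candidate starts 2062863 and 2062905 of CDS 1f2065521.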

  • Example: CDSs of Synechocystis sp. PCC 6803
  • >1f2065521 2062863^2062905-2065520/gi|16329170|ref|NC_000911.1| + 872~886aa ex:231688.2/292/216693.4/260/215633/258/1183.1 oi240292:gi|75908552|ref|YP_322848.1|: ATPase (Anabaena variabilis ATCC 29413)
    MvvlthpianienfMQPTDPNKFTEKAWEAIAKTPEIAKQHRQQQIETEHLLSALLEQNGLATSIFNKAGASIPRVNDQV
    NSFIAQQPKLSNPSESIYLGRSLDKLLDNAEIAKSKYGDDYISIEHLMAAYGQDDRLGKNLYREIGLTENKLAEIIKQIR
    GTQKVTDQNPEGKYESLEKYGRDLTELAREGKLDPVIGRDEEVRRTIQILSRRTKNNPVLIGEPGVGKTAIAEGLAQRII
    NHDVPESLRDRKLISLDMGALIAGAKYRGEFEERLKAVLKEVTDSQGQIILFIDEIHTVVGAGATQGAMDAGNLLKPMLA
    RGALRCIGATTLDEYRKYIEKDAALERRFQEVLVDEPNVLDTISILRGLKERYEVHHGVKIADSALVAAAMLSNRYISDR
    FLPDKAIDLVDEAAAKLKMEITSKPEELDEVDRKILQLEMERLSLQRENDSASKERLEKLEKELADFKEEQSKLNGQWQS
    EKTVIDQIRTVKETIDQVNLEIQQAQRDYDYNKAAELQYGKLTDLQRQVEALETQLAEQQTSGKSLLREEVLESDIAEII
    SKWTGIPISKLVESEKEKLLHLEDELHSRVIGQDEAVTAVAEAIQRSRAGLSDPNRPTASFIFLGPTGVGKTELAKALAK
    NLFDTEEALVRIDMSEYMEKHAVSRLMGAPPGYVGYEEGGQLTEAIRRRPYSVILFDEIEKAHGDVFNVMLQILDDGRLT
    DAQGHVVDFKNTIIIMTSNLGSQYILDVAGDDSRYEEMRSRVMDVMRENFRPEFLNRVDETIIFHGLQKSELRSIVQIQI
    QSLATRLEEQKLTLKLTDKALDFLAAVGYDPVYGARPLKRAVQKYLETAIAKGILRGDYKPGETIVVDETDERLSFTSLR
    GDLVIV
    >1r1916535 1918608-1916536/gi|16329170|ref|NC_000911.1| - 691aa ex:219901/318/179591.8/240/179591.8/240/1078.4 oi59919:gi|33862065|ref|NP_893626.1|: elongation factor EF-2 (Prochlorococcus marinus subsp. pastoris str. CCMP1986)
    MARTVPLERIRNIGIAAHIDAGKTTTTERILFYSGVVHKIGEVHEGTAVTDWMAQERERGITITAAAISTDWLGHHINII
    DTPGHVDFTIEVERSMRVLDGVIAVFCSVGGVQPQSETVWRQAERYQVPRIAFVNKMDRTGANFFRVCQQIGDRLRANAV
    PVQIPIGSEAEFEGIVDLVRMKAYLYKNDLGTDIQEVPIPDSVKDKTEEYRLRLVESVAEADDALMEKYLEGEELTADEL
    VAGLRRGTIAGTMVPVLCGSAFKNKGVQLLLDAVVDYLPSPLEVPAIEGHLPDGEVATRPAEDKAPLSALAFKVMADPFG
    RLTFVRVYSGVLEKGSYVLNSTKEKKERISRLIILKADDRIEVDQLNAGDLGAVLGLKDTLTGDTLCDDQEPIILESLFV
    PQPVISVAVEPKTKQDMDKLSKALQSLSEEDPTFRVSVDPETNQTVIAGMGELHLEILVDRMLREFKVEANVGAPQVAYR
    ETIRKAVQAEGKFIRQSGGKGQYGHVVIEVEPTEPGTGFEFVSKIVGGVIPKEYIAPSEQGMKEACASGVLAGYPVIDLK
    ATLVDGSFHDVDSSEMAFKIAGSMAIREAVGQADPVLLEPVMKVEIEVPDDFMGNVIGDLNARRGHIEGQETEQGIAKVA
    ASVPLAEMFGYATDIRSKTQGRGIFSMEFSHYAEVPRNVAEAIVAKSRGYA
    >1f1156501 1154857-1156080,1156084-1156500/gi|16329170|ref|NC_000911.1| + 547aa ex:28344.6/179/3913.9/16/3913.9/16/631.0 oi103690:gi|17227685|ref|NP_484233.1|: hypothetical protein alr0189 (Nostoc sp. PCC 7120)
    MFALPQAGDRRGEIIKVLLSNGWDYMNGLLTLGKVGEPQIPTPEVLTKILVELGPFYIKLGQLLSTRPDLLPPRYINALT
    ALQSNVPPLPWSAIEDLLQREFPQPLGETFQEIESEPIAAGSIGQIHRAVLQSGETVAIKVKRPGIDVIVEQDSLLIKDV
    AELLALTEFGQNYDIVKLADEFTQTVKAELNFDTEAAYTNNLRTNLAKTTWFDPNQLVIPKVYWELTNQKFLVLEWLDGV
    PILTADLTQPPSDKDIAEKKKEITTLLFRAFFQQLYVDGFFHADPHPGNIFYLADGRLALIDCGMVGRLDPRTRQLLTEM
    LLAIVDLDAKRCAQLTVELSESVGRVNFQRLEVDYERMLRKYYDLSLSEFNFSEVVYEFLRIARVNKLKVPACLGLYAKC
    LANLEGAGQFNPELNLFTEINPLITDLFRRQLFGTNPLQTALRTVLDLKAVSLKTPRQMDVLLDRLTTETLQWNVRLEGL
    EPVRRTIDKSANRLSFSIVLGSLIMGAAILSTGNDQQLTLIANILFVAATVIGFWLVISILRSGRLK
    >1f373642 372796^372811-373164,373168-373641/gi|16329170|ref|NC_000911.1| + 276~281aa ex:2897.5/48/207.8/3/207.8/3/88.0 oi243090:gi|32471992|ref|NP_864986.1|: conserved hypothetical protein-putative methyltransferase (Rhodopirellula baltica SH 1)
    MykniMKTTINDYIGQFIKTTPEFKGKWRIIRYWMNQNKDHRTKYRILPGGEKILCDLSIPYEAMVYLKREEQKDLELLT
    QLLKPSDTFVDCGANIGIWSLVAASRVSYSGKVYAFEPNPSTFKLSDNVSLSRFKNDINLISQAVGNEQKTVFFECNTTH
    NISCIKDNATRDTQEVFLTTIDQVLDGAIVNGIKIDVEGFELECLQGSYKTLIRYQPWLCVEFNTLLAKVSKLSEWNVHN
    YLKKLGYRCRHFHNALDKSQETILSDNWETKGYCNLFYFIE
    >1f2302444 2300515-2302443/gi|16329170|ref|NC_000911.1| + 643aa in:3858/2/3858/2
    MTIQYTPLADRLLAYLAADRLNLSAKSSSLNTSILLSSDLFNQEGGIVTANYGFDGYMGIPGMDGTDAESQQIAFDNNVA
    WNNLGDLSTTTQRAYTSAISTDTVQSVYGVNLEKNDNIPIVFAWPIFPTTLNPTDFQVMLNTGEIVTPVIASLIPNSEYN
    ERQTVVITGNFGNRLTPGTEGAIYPVSVGTVLDSTPLEMVGPNGPVSAVGITIDSLNPYVAGNGPKIVAAKLDRFSDLGE
    GAPLWLATNQNNSGGDLYGDQAQFRLRIYTSAGFSPDGIASLLPTEFERYFQLQAEDITGRTVILTQTGVDYEIPGFGLV
    QVLGLADLAGVQDSYDLTYIEDHDNYYDIILKGDEAAVRQIKRVALPSEGDYSAVYNPGGPGNDPENGPPGPFTVSSSPQ
    VIKVTDTIGQPTKVSYVEVDGPVLRNPFSGTPIGQEVGLAVKDLATGHEIYQYTDPDGKVFYASFAAADDQATDLTTAIA
    NPTAIDLINARGFTAGSSVTVSGSYSREAFFDGSMGFYRLLDDNGAVLDPLTGGVINPGQVGYQEAALADSNRLQATGST
    LTAEDLETRAFSFNILGGELYAPFLTVNDSLSGINQTYFAFGSANPDGISHSTNLGPNVIGFEDFLGGGDLDFDDIIVRF
    TLT
    >1f63589 61834^61936^61942^61966^61969-63588/gi|16329170|ref|NC_000911.1| + 540~585aa in:3375/2/1620/1
    MinffsthidrlgdwypqlyrelksrftatkvrwLlLvsvifqgVMVFFRTGEIPVLYPLNPAGEQFSRYCLGTPPDWEY
    SRGIFVCTQDLLGQLQINWRLWWLDGFAFLSLAGLALLLVAGVYLLVADLQKECQRGTLNFIRLSPQGEGNFIWGKMLGV
    PSLLYGFLLTLLPLHIVAAGGAGISLLLLAGYYAVVLAGATFFFHIALWIGLSSNAKSYSLSKSAAIAGLCGVGTLIATT
    LIMQDNDWEPFFLSWLSLFYPGKALIYLVRSTFLPITTVGYLGPNELDQLRWYGWDLFRSAPLGMGFMVANFAVGTYWIA
    QVLRRRFRRPLSTAWSKVQSVGVTLSLVAIANGFLLQSYVKGDYLDSLLLNLASWQLTLCCFFLGLTLALCPQINYLRDW
    SRYRHEAPRQYRTWSWQNLVADHSPPQGAIAINLCCTALLTLPMVLLLPWLAPAPAGFPIPLGGIVVALTMGLLWNFTFA
    TLVQWSLLRMRFPRLLVLILSVVVMVVLPLAIAIGAGIKESTVMWFSPLPSIALVEGISFQTPLFFLTILTQTVVIAAST
    WQFNRYVQRLGRSESQQYLAPVQPE


[AaSeq] Amino-acid Sequence File


[NtSeq] Nucleotide Sequence File

  • Format
  • >'orfId' <two_spaces> 'orfFullName' <two_spaces> 'strand' <two_spaces> 'shortestNtLen'~'longestNtLen'nt
    'startNtSeq''shortestNtSeq'
    • Refer to CONSORF Data and Formats for the description of each field.
    • 'shortestNtLen' is the number of nucleotides from the shortest start to the end position.
    • 'longestNtLen' is the number of nucleotides from the longest start to the end position.
    • CDSs were ordered by genomic coordinates.
    • Nucleotide residues are non-redundant even where frame change positions exist.
    • The stop codon sequence is appended to the end of each nucleotide sequence. However, the coordinates of the CDS do not include the stop codon.
  • Example: CDSs of Synechocystis sp. PCC 6803
  • >1f1492 802^811-1491/gi|16329170|ref|NC_000911.1| + 684~693nt
    TTGattgcgTTGAATAAAACTCCCCAAACCATTGTTTTTTACAAACCCTACGGAGTTCTGTGTCAATTTACCGATAATTC
    TGCCCATCCCCGGCCGACGTTGAAGGATTATATTAATTTGCCAGATTTATATCCCGTGGGGCGTTTGGATCAAGATAGCG
    AAGGACTATTGCTGCTCACCAGCAACGGTAAACTTCAGCATCGTTTGGCCCACCGGGAGTTTGCCCACCAACGTACTTAT
    TTTGCCCAAGTAGAAGGCTCTCCAACGGACGAAGACCTAGAACCCCTGCGGCGGGGCATAACTTTCGCGGATTACCCTAC
    CAGACCGGCGATCGCCAAAATTATCACTGAACCAGATTTTCCCCCCAGAAATCCTCCCATTCGTTATCGAGCCTCCATTC
    CCACCAGTTGGTTAAGCATTACCCTAACGGAGGGGCGCAATCGTCAGGTACGTCGAATGACAGCGGCAGTGGGCTTCCCT
    ACCCTACGATTGGTGCGGGTGCAAATACAGGTTACTGGTCGCTCTCCCCAACAGGGCAAAGGTAAGTCAGCAGCAACTTG
    GTGCTTAACCCTAGAAGGTTTGAGTCCGGGGCAATGGCGACCCCTGACCCCTTGGGAAGAAAATTTTTGCCAGCAACTCT
    TAACGGGAAATCCCAATGGTCCCTGGCAGAAAAAATTTGGCGATCGCCGTTGA
    >1f2096 1577-2095/gi|16329170|ref|NC_000911.1| + 522nt
    ATGTCCTATCTAATCGCTGTGGTAGCCAACCGCATTGCCGCCGAAGAAGCTTATACAACCTTGGAACAGGCAGGATTTGC
    CCAAAAGAATTTGACTATCATTGGCACAGGTTATAAAACCGCTGACGAATTTGGCTTGGTGGACCCGAAAAAACAAGCTA
    TCAAAAGGGCAAAGCTCATGGCCATCTGGTTAGTACCCTTTGGTTTCGCTGCCGGTTATTGCTTTAACCTCATCACTGGC
    TTGAGCACCTTAGATTGGGCTGGAGACCCCGGTAACCACATTGTGGGCGGCCTCCTAGGGGCGATCGGTGGAACCATGGG
    GAGTTTCTTTGTCGGTGGGGGCGTGGGCTTAAGCTTTGGCAGTGGGGACAGTTTGCCCTATCGAAACCTTTTGCAAGCGG
    GGAAATATTTGGTAGTGGTGGCCGGTGGTGAACTGCAAAAACAACGGGCAACCAATTTACTCCGGCCCCTCAATCCTGAA
    TATCTCCAGGGTTATACCGCCCCCGATGAAGCTTTTGTTTGA
    >1r120255 120537-120301,120297-120256/gi|16329170|ref|NC_000911.1| - 285nt
    ATGGCAACCATAACTGATTCCGATCTACAGGAACTGAAAGACTTAATCAATGGGCTTGATAAAAAAATCGACGTTAATCA
    GGCTCGGATCGATGAAAGGTTAAATGCAATAGAATCCAACCTATCAGACCTCAAAAAACAGGCTGATAAACAGGACAACC
    GCTTATGGGTTCTCATTTCGGGGATGTTTATTGCACTTCTGGGGATTTTGACAAAGTTTGCATTTTTCCCCAACCCTTAG
    CCTTCTAAGAAGATATTCCCTCCTAAATCGCCACTACGGGAATGA
    >1f190050 189813^189822-189848,189852-189875,189879-190049/gi|16329170|ref|NC_000911.1| + 231~240nt
    ATGgtcaccATGGGAAAAGCCTCCCCTGGAAGCTTATAATGCTTACAGAGAAGGCTTTTCAGCTGAATCGGAGCGGCGGG
    ATTTGAACCCACGACCCCCACTACCCCAAAGTGGTGCGCTACCAAGCTGCGCTACGCCCCGAATTTCACAGACCCTAATC
    TTAGTCCTCCCCTGTGGCCCTTGGCAAGTTTTTTGGCAAATATTTTCGAGTGTTTATTTGATGAAATTTATTGGCATTGA
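
The lengths in the example headers above can be reproduced from the coordinates in 'orfFullName'. The following is a minimal sketch, not the CONSORF implementation: the names `nt_lengths` and `format_nt_len` are hypothetical, and the rule used here (genomic span from a candidate start to the end position, plus three nucleotides for the appended stop codon) is an assumption inferred from the four example entries, including the two with frame changes.

```python
# ASSUMPTION inferred from the examples: reported lengths include the
# 3-nt stop codon that is appended to each sequence.
STOP_CODON_NT = 3

def nt_lengths(orf_full_name):
    """Return (shortestNtLen, longestNtLen) for an 'orfFullName' string."""
    # Everything before the first '/' is the location, e.g. '802^811-1491'.
    location = orf_full_name.split("/", 1)[0]
    segments = location.split(",")
    # Candidate starts are '^'-separated in the first segment.
    starts_text = segments[0].split("-")[0]
    candidate_starts = [int(s) for s in starts_text.split("^")]
    # The end position closes the last segment.
    end = int(segments[-1].split("-")[1])
    # abs() handles both strands: on the reverse strand, start > end.
    lengths = [abs(end - start) + 1 + STOP_CODON_NT
               for start in candidate_starts]
    return min(lengths), max(lengths)

def format_nt_len(orf_full_name):
    """Render the length as it appears in the header, e.g. '684~693nt'."""
    shortest, longest = nt_lengths(orf_full_name)
    if shortest == longest:
        return f"{shortest}nt"
    return f"{shortest}~{longest}nt"
```

Under these assumptions, the first example entry (`802^811-1491/...`) gives `684~693nt`, matching its header.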


[Coord] Coordinate File

  • Format
  • 'orfId' <tab> 'orfFullName' <tab> MinimalPosition <tab> MaximalPosition <tab> 'strand'
    • Refer to CONSORF Data and Formats for the description of each field.
    • MinimalPosition is the smaller of the 'longestStart' and 'end' positions.
    • MaximalPosition is the larger of the 'longestStart' and 'end' positions.
  • Example: CDSs of Synechocystis sp. PCC 6803
  • 1f1492 802^811-1491/gi|16329170|ref|NC_000911.1| 802 1491 +
    1r2174 2873-2175/gi|16329170|ref|NC_000911.1| 2175 2873 -
    1r120255 120537-120301,120297-120256/gi|16329170|ref|NC_000911.1| 120256 120537 -
    1f190050 189813^189822-189848,189852-189875,189879-190049/gi|16329170|ref|NC_000911.1| 189813 190049 +
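
A coordinate line can be derived from an 'orfId', an 'orfFullName', and a strand. The sketch below is illustrative only: `coord_row` is a hypothetical helper, and it assumes that the longest start is the candidate start farthest from the end position.

```python
def coord_row(orf_id, orf_full_name, strand):
    """Build one tab-separated line of the coordinate file."""
    # Everything before the first '/' is the location part.
    location = orf_full_name.split("/", 1)[0]
    segments = location.split(",")
    candidate_starts = [int(s)
                        for s in segments[0].split("-")[0].split("^")]
    end = int(segments[-1].split("-")[1])
    # ASSUMPTION: the longest start is the candidate farthest from the end.
    longest_start = max(candidate_starts, key=lambda s: abs(end - s))
    minimal = min(longest_start, end)
    maximal = max(longest_start, end)
    return "\t".join(
        [orf_id, orf_full_name, str(minimal), str(maximal), strand]
    )
```

For the reverse-strand example `1r2174`, the start (2873) is larger than the end (2175), so the two positions are swapped in the output line.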


[Eval] Evaluation File

Table 1. Four refinement levels of criteria for the 'found' 'public CDSs', utilized in the evaluation of predicted CDSs.

Level  Label for recognition  Features of predicted CDSs compared with 'public CDSs'
1      stop only              Stop position
2      candidate start        Stop position, frame change, and any position of candidate starts
3      start coverage         Stop position, frame change, and the length coverage (>90%) of the shortest start
4      exact start            Stop position, frame change, and the exact position of the shortest start
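  • Level 1: a 'public CDS' is considered 'found' if its stop position is the same as that of a predicted CDS.
  • Level 2: a 'public CDS' is considered 'found' if its stop position and frame change are the same as those of a predicted CDS, and its start position matches one of the candidate start positions of the predicted CDS.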
  • Level 3: a 'public CDS' is considered 'found' if its stop position and frame change are the same as those of a predicted CDS, and the length coverage from the shortest start to the end position of the predicted CDS exceeds 90% with respect to both the 'public CDS' and the predicted CDS.
  • Level 4: a 'public CDS' is considered 'found' if its stop position and frame change are the same as those of a predicted CDS, and its start position is exactly the same as the shortest start position of the predicted CDS.
  • Accuracy(%) is either sensitivity(%) or specificity(%).
  • Sensitivity = TP/(TP+FN) = (number of correctly-predicted CDSs) / (number of all the public CDSs)
  • Specificity = TP/(TP+FP) = (number of correctly-predicted CDSs) / (number of all the predicted CDSs)
  • Coordinate files of 'public CDSs' were referenced to evaluate CONSORF-predicted CDSs.
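
The four levels and the two accuracy measures can be expressed compactly in code. This is a hedged sketch rather than the evaluation program used by CONSORF: the classes `PublicCds` and `PredictedCds`, the functions `is_found` and `accuracy`, and the representation of frame changes as a tuple of positions are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PublicCds:
    start: int
    stop: int
    frame_changes: tuple = ()   # hypothetical representation

@dataclass(frozen=True)
class PredictedCds:
    starts: tuple               # all candidate start positions
    shortest_start: int
    stop: int
    frame_changes: tuple = ()   # hypothetical representation

def is_found(public, predicted, level):
    """Apply the Table 1 criterion for the given level (1 to 4)."""
    if level not in (1, 2, 3, 4):
        raise ValueError(f"unknown level: {level}")
    if public.stop != predicted.stop:
        return False
    if level == 1:
        return True
    if public.frame_changes != predicted.frame_changes:
        return False
    if level == 2:
        return public.start in predicted.starts
    if level == 3:
        pub_len = abs(public.stop - public.start) + 1
        pred_len = abs(predicted.stop - predicted.shortest_start) + 1
        # With a shared stop, the overlap is the shorter of the two CDSs.
        shared = min(pub_len, pred_len)
        return shared / pub_len > 0.9 and shared / pred_len > 0.9
    return public.start == predicted.shortest_start  # level 4

def accuracy(public_cdss, predicted_cdss, level):
    """Return (sensitivity %, specificity %) as defined above."""
    tp = sum(
        any(is_found(pub, pred, level) for pred in predicted_cdss)
        for pub in public_cdss
    )
    sensitivity = 100.0 * tp / len(public_cdss)
    specificity = 100.0 * tp / len(predicted_cdss)
    return sensitivity, specificity
```

Because each level adds a condition on top of the shared stop position, a 'public CDS' found at level 4 is also found at levels 1 to 3, so the accuracy figures cannot increase as the level becomes stricter.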


CONSORF Search

CONSORF Organism Search

  • Search for organisms with an organism ID, an organism name, or a taxonomic description.
  • Partial IDs, names, or descriptions are also allowed.
  • Example
  • Organism IDs: oi1148, oi83333, oi224308, ...
    Organism names: Synechocystis sp. PCC 6803, Escherichia coli K12, Bacillus subtilis subsp. subtilis str. 168, ...
    Taxonomic descriptions: Bacteria, Cyanobacteria, Chroococcales, Synechocystis, ...


CONSORF CDS Search

  • Search for CDSs with a CDS ID, a protein description, a locus tag, or included external DB IDs.
  • Partial IDs, names, or descriptions are also allowed.
  • Example: Synechocystis sp. PCC 6803 (oi1148)
  • CDS IDs: 1f2065521, 1r1916535, 1f3566647, ...
    Protein descriptions: ATPase, elongation factor EF-2, GTP-binding protein LepA, ...
    Locus tags: slr1641, sll1098, slr0604, ...
    External DB (NCBI GI) IDs: 16331048, 16330914, 16332331, ...


