A consensus prediction system for prokaryotic CDSs
contact: suskang@kribb.re.kr
CONSORF: a consensus prediction system for prokaryotic open reading frames
Although the number of known prokaryotic whole genomes is increasing rapidly, their coding sequence (CDS) predictions are inconsistent from genome to genome. Moreover, it is difficult to update them systematically and quickly enough to keep up with new knowledge from the expanding public databases. To help tackle these problems, we have developed CONSORF, an automatic identification system that provides comprehensive prokaryotic CDS information, including intuitive reliability scores, predicted frame-shifts, alternative start sites, and best pair-wise match information against other prokaryotes. CONSORF first predicts the CDSs supported by consensus alignments from multiple genome-to-proteome comparisons with other prokaryotes using the FASTX program. It then fills the remaining empty genomic regions with CDSs supported by consensus ab initio predictions. Currently, we provide comprehensive CDS information identified by CONSORF for 330 publicly available prokaryotes. Their accuracies, validated against the NCBI RefSeq CDSs, were comparable with those of other high-accuracy CDS prediction programs such as GeneLook and YACOP. In large-scale comparative analyses, we expect that CONSORF’s homology searches among prokaryotic genomes will save a great deal of time and provide consistently high-quality prediction results. The regularly updated CDS predictions of prokaryotic genomes are freely accessible through our website.
For a more detailed description of the CONSORF system, refer to the supplementary information.
From a prokaryotic genome sequence, the CONSORF system predicts CDSs using two complementary approaches: homology-based and algorithm-based. In the homology-based approach, pair-wise genome-to-proteome comparisons via the FASTX program are performed to generate both ‘homology CDSs’ and ‘alternative CDSs’, while in the algorithm-based approach multiple ab initio predictions are conducted to provide ‘ab initio CDSs’. ‘Homology CDSs’ are determined from the representative FASTX alignment with the highest sum of bit scores in consensus analyses regarding stop, start, and frame change positions, while ‘ab initio CDSs’ are determined from the consensus of the algorithm-based CDSs with the highest sum of CDS nucleotide lengths in consensus analyses regarding only stop and start positions. In contrast, ‘alternative CDSs’ are determined directly from the FASTX alignments with the highest individual bit score across all the pair-wise comparisons. By integrating the complementary ‘homology CDSs’ and ‘ab initio CDSs’ while avoiding significant positional overlap on the genome, the ‘integrated CDSs’ were predicted with high accuracy. To determine the more likely start site among the candidate starts, the ‘integrated CDSs’ aligned with N-terminal residues in the pair-wise FASTX comparisons were inspected to provide the final ‘representative CDSs’.
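The consensus step for ‘homology CDSs’ can be illustrated with a minimal sketch. It is not the CONSORF implementation: the `Alignment` record, the function names, and the choice of the highest-scoring member as the representative alignment are assumptions made for illustration. Only the grouping by stop, start, and frame change positions and the ranking by summed bit score come from the description above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Tuple


@dataclass(frozen=True)
class Alignment:
    """Hypothetical summary of one FASTX alignment against another proteome."""
    stop: int
    start: int
    frame_changes: Tuple[int, ...]
    bit_score: float


def consensus(alignments: List[Alignment],
              key: Callable[[Alignment], Hashable]) -> Dict[Hashable, Tuple[float, int]]:
    """Group alignments by `key`; return (sum of bit scores, count) per group."""
    totals = defaultdict(lambda: [0.0, 0])
    for aln in alignments:
        entry = totals[key(aln)]
        entry[0] += aln.bit_score
        entry[1] += 1
    return {k: (score, count) for k, (score, count) in totals.items()}


def exact_key(aln: Alignment) -> Hashable:
    """Consensus regarding stop, start, and frame change positions."""
    return (aln.stop, aln.start, aln.frame_changes)


def pick_homology_cds(alignments: List[Alignment]):
    """Pick the group with the highest summed bit score and one representative."""
    groups = consensus(alignments, exact_key)
    best_key = max(groups, key=lambda k: groups[k][0])
    members = [a for a in alignments if exact_key(a) == best_key]
    # Assumption: the representative is the member with the best bit score.
    representative = max(members, key=lambda a: a.bit_score)
    return best_key, groups[best_key], representative
```

Passing a different key, such as `lambda a: a.stop`, to `consensus` yields the stop-only score and count in the same way.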
(A) A region on the fragment (from 2097001 to 2115000 base pairs) of the Bacillus subtilis (oi224308) genome (‘gi|50812173|ref|NC_000964.2|’) was best aligned in terms of the sum of bit scores with a Bacillus halodurans (oi272558) protein (‘gi|15614769|ref|NP_243072.1|’) via a FASTX homology search across all the available organisms. Two frame changes, consisting of a one-base insertion (slash) and a one-base deletion (backslash), were detected (dotted arrows). Some parts of the FASTX alignment were omitted (ellipses).
(B) The one-line header information of the predicted CDS in FASTA format. The header line starts with a greater-than sign (‘>’), and the fields, shown here over multiple lines for clarity only, are separated by two spaces. The CDS ID (‘1r2107988’) consists of a sequential number (‘1’) assigned to each genome of an organism from the longest to the shortest, a strand symbol (either ‘f’ for forward or ‘r’ for reverse), and the stop codon position (‘2107988’) of the CDS. The front part (‘2110811^2110796^2110784’) of the coordinate lists the positions of the candidate starts. ‘2109969,2109969’ and ‘2109868,2109866’ in the middle represent frame-shift positions. ‘Strand’ is either forward (plus) or reverse (minus). ‘Amino acid length’ gives the sequence lengths from both the shortest and the longest starts to the end position. Reliability information is provided for three types of consensus: stop only (type A: ‘54119.6/127’); stop and start only (type B: ‘23297/40’); and stop, start, and frame change (type C: ‘6462.4/8’). For each type, the sum of bit scores and the number of occurrences are separated by a slash. The best bit score (‘1059.9’) of the representative FASTX alignment is given at the end. The prefix ‘ex:’ distinguishes extrinsically predicted ‘homology CDSs’ from intrinsically predicted ‘ab initio CDSs’, which carry the prefix ‘in:’. The consensus-based best hit information from the representative FASTX alignment is composed of four fields: organism ID (‘oi272558’), gene ID (‘gi|15614769|ref|NP_243072.1|’), gene description (‘oxoglutarate dehydrogenase’), and organism name (‘Bacillus halodurans C-125’).
(C) The amino acid sequence of the predicted CDS. The sequence from the shortest start to the end, which encloses the representative FASTX alignment, is in capital letters. The candidate start sites are also shown in upper case, while the other residues upstream of the shortest start are in lower case. Most internal residues of the sequence were covered by the representative FASTX alignment (horizontal arrows), and three candidate starts were suggested accordingly.
It displays the predicted Helicobacter pylori 26695 CDSs in the genomic region from 885,000 to 911,000 base pairs. Most of the predicted CDSs are consistent with the public CDSs, with some minor variations. Potential frameshifts and candidate start sites are represented by vertical bars in blue and red, respectively.
(A) All the CDSs, including ‘homology CDS’, ‘alternative CDS’, ‘ab initio CDS’, ‘integrated CDS’, ‘representative CDS’, and ‘public CDS’, were consistent. (B) The density of color represents CDS reliability based on homology-based and algorithm-based consensus. The CDSs in (B) had lower reliability scores than those in (A). (C) A one-base insertion near the start position extended the homology-based CDSs, which had reliability scores comparable to those in (A). Further manual inspection is needed to determine whether this is an authentic frame-shift or a sequencing error. Instead of the frame-shifted CDS, one additional short CDS was predicted in ‘ab initio CDS’ and ‘public CDS’. (D) Four candidate starts were found in ‘homology CDS’, ‘alternative CDS’, ‘ab initio CDS’, ‘integrated CDS’, and ‘representative CDS’. However, the shortest starts were consistent with the start of ‘public CDS’. (E) No homology-based CDS was found in this case. The longest start among the candidate starts of ‘ab initio CDS’, ‘integrated CDS’, and ‘representative CDS’ was consistent with the start of ‘public CDS’. (F) The detailed information on the clicked ‘representative CDS’ was displayed.
ID: ORGANISM_DESCRIPTION
oi83333: Escherichia coli K12
oi224308: Bacillus subtilis subsp. subtilis str. 168
oi1148: Synechocystis sp. PCC 6803
oi279010: Bacillus licheniformis ATCC 14580
oi279010.1: Bacillus licheniformis ATCC 14580 (DSM 13)
ORGANISM_NAME: ORGANISM_DESCRIPTION
Escherichia_coli_K12: Escherichia coli K12
Bacillus_subtilis: Bacillus subtilis subsp. subtilis str. 168
Synechocystis_PCC6803: Synechocystis sp. PCC 6803
Bacillus_licheniformis_ATCC_14580: Bacillus licheniformis ATCC 14580
Bacillus_licheniformis_DSM_13: Bacillus licheniformis ATCC 14580 (DSM 13)
gi|16329170|ref|NC_000911.1|
gi|38505535|ref|NC_005229.1|
gi|38505825|ref|NC_005232.1|
CDS_ID: CDS_FULL_NAME(=CDS_POSITION/GENOME_ID)
1f2065521: 2062863^2062905-2065520/gi|16329170|ref|NC_000911.1|
1r1916535: 1918608-1916536/gi|16329170|ref|NC_000911.1|
1f3566647: 3564838-3566646/gi|16329170|ref|NC_000911.1|
2f25953: 24777^24798-25952/gi|38505535|ref|NC_005229.1|
2r77194: 78010^77995^77941-77195/gi|38505535|ref|NC_005229.1|
3f42309: 39171-42308/gi|38505825|ref|NC_005232.1|
3r57135: 57666-57136/gi|38505825|ref|NC_005232.1|
CDS_FULL_NAME(=CDS_POSITION/GENOME_ID): CDS_ID
2062863^2062905-2065520/gi|16329170|ref|NC_000911.1|: 1f2065521
1918608-1916536/gi|16329170|ref|NC_000911.1|: 1r1916535
3564838-3566646/gi|16329170|ref|NC_000911.1|: 1f3566647
24777^24798-25952/gi|38505535|ref|NC_005229.1|: 2f25953
78010^77995^77941-77195/gi|38505535|ref|NC_005229.1|: 2r77194
39171-42308/gi|38505825|ref|NC_005232.1|: 3f42309
57666-57136/gi|38505825|ref|NC_005232.1|: 3r57135
853281-852919,852919-852473/gi|16329170|ref|NC_000911.1|: 1r852472
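The mapping between a CDS ID and its full name can be checked with a short sketch. The function names are hypothetical. The rule that the number in the ID is one base beyond the end coordinate (end + 1 on the forward strand, end - 1 on the reverse strand) is an observation from the examples listed above, not a documented rule.

```python
import re
from typing import NamedTuple


class CdsId(NamedTuple):
    genome_index: int   # rank of the genome within the organism (1 = longest)
    strand: str         # "f" for forward, "r" for reverse
    stop_position: int  # position encoded at the end of the ID


_CDS_ID = re.compile(r"^(\d+)([fr])(\d+)$")


def parse_cds_id(cds_id: str) -> CdsId:
    """Split an ID such as '2r77194' into genome index, strand, and position."""
    match = _CDS_ID.match(cds_id)
    if match is None:
        raise ValueError(f"not a CONSORF-style CDS ID: {cds_id!r}")
    genome, strand, stop = match.groups()
    return CdsId(int(genome), strand, int(stop))


def id_matches_full_name(cds_id: str, cds_full_name: str) -> bool:
    """Check the ID against CDS_POSITION/GENOME_ID using the observed +/-1 rule."""
    parsed = parse_cds_id(cds_id)
    position = cds_full_name.split("/", 1)[0]
    end = int(position.rsplit("-", 1)[1])  # end coordinate of the last segment
    expected = end + 1 if parsed.strand == "f" else end - 1
    return parsed.stop_position == expected
```

For instance, `parse_cds_id("2r77194")` gives genome index 2, strand `r`, and position 77194, which agrees with the end coordinate 77195 of its full name.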
CDS_ID (CDS_FULL_NAME): STRAND END LONGEST_START SHORTEST_START CANDIDATE_STARTS(separated by a hat sign)
1f2065521 (2062863^2062905-2065520/gi|16329170|ref|NC_000911.1|): + 2065520 2062863 2062905 2062863^2062905
1r1916535 (1918608-1916536/gi|16329170|ref|NC_000911.1|): - 1916536 1918608 1918608 1918608
1r852472 (853281-852919,852919-852473/gi|16329170|ref|NC_000911.1|): - 852473 853281 853281 853281
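The coordinate fields above can be derived from the CDS full name alone, as the following sketch suggests. The `Coordinate` type and `parse_full_name` are hypothetical names. The strand is inferred by comparing the first start with the end coordinate, which assumes the CDS does not span the origin of a circular genome.

```python
from typing import List, NamedTuple


class Coordinate(NamedTuple):
    strand: str
    end: int
    longest_start: int
    shortest_start: int
    candidate_starts: List[int]


def parse_full_name(cds_full_name: str) -> Coordinate:
    """Derive STRAND, END, and the start fields from CDS_POSITION/GENOME_ID."""
    position, _genome_id = cds_full_name.split("/", 1)
    segments = position.split(",")            # frame changes split the CDS
    starts_field = segments[0].split("-")[0]  # e.g. '2062863^2062905'
    end = int(segments[-1].split("-")[1])
    starts = [int(s) for s in starts_field.split("^")]
    # Assumption: a start smaller than the end means the forward strand.
    strand = "+" if starts[0] < end else "-"
    if strand == "+":
        longest, shortest = min(starts), max(starts)
    else:
        longest, shortest = max(starts), min(starts)
    return Coordinate(strand, end, longest, shortest, starts)
```

The longest start is the candidate farthest from the end position, so it is the smallest coordinate on the forward strand and the largest on the reverse strand.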
EXPECTED_FRAME_CHANGE: CDSID (CDS_FULL_NAME)
One insertion: 1r852472 (853281-852919,852919-852473/gi|16329170|ref|NC_000911.1|)
One deletion: 1f2205370 (2204157-2204573,2204575-2205369/gi|16329170|ref|NC_000911.1|)
One in-frame stop: 1f1156501 (1154857-1156080,1156084-1156500/gi|16329170|ref|NC_000911.1|)
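The three frame change labels above appear to follow from the gap between consecutive coordinate segments: an overlap of one base for an insertion, one skipped base for a deletion, and three skipped bases for an in-frame stop. The sketch below encodes that reading. It is inferred from these examples only, the function name is hypothetical, and multiple frame changes (`multiFrameChange`) are not handled.

```python
from typing import List


def classify_junctions(cds_position: str) -> List[str]:
    """Label each junction between comma-separated segments of a CDS position."""
    segments = []
    for segment in cds_position.split(","):
        # Drop alternative starts ('a^b-c' -> 'b-c'); only junctions matter here.
        start, end = segment.split("^")[-1].split("-")
        segments.append((int(start), int(end)))
    forward = segments[0][0] < segments[-1][1]
    labels = []
    for (_, prev_end), (next_start, _) in zip(segments, segments[1:]):
        step = next_start - prev_end if forward else prev_end - next_start
        gap = step - 1  # number of genomic bases skipped at the junction
        if gap == -1:
            labels.append("one insertion")
        elif gap == 1:
            labels.append("one deletion")
        elif gap == 3:
            labels.append("one in-frame stop")
        else:
            labels.append(f"unclassified (gap of {gap})")
    return labels
```

A CDS without frame changes has a single segment, so the function returns an empty list.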
METHOD('ex'): STOP_SCORE/STOP_COUNT/START_SCORE/START_COUNT/EXACT_SCORE/EXACT_COUNT/BEST_SCORE
or
METHOD('in'): STOP_SCORE/STOP_COUNT/EXACT_SCORE/EXACT_COUNT
ex:231688.2/292/216693.4/260/215633/258/1183.1
ex:162691.1/319/162691.1/319/162691.1/319/668.9
ex:2505.8/12/1981.2/6/1981.2/6/424.8
in:10410/2/5163/1
in:7227/3/7227/3
in:1512/2/1512/2
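A reliability string in either layout can be split into named fields with a few lines of code. The sketch below is one possible reading of the two formats above; the field names are lower-case versions of the labels in the format lines, and `parse_reliability` is a hypothetical helper.

```python
from typing import Dict, Union

EX_FIELDS = ("stop_score", "stop_count", "start_score", "start_count",
             "exact_score", "exact_count", "best_score")
IN_FIELDS = ("stop_score", "stop_count", "exact_score", "exact_count")


def parse_reliability(text: str) -> Dict[str, Union[str, int, float]]:
    """Parse 'ex:...' (seven values) or 'in:...' (four values)."""
    method, _, values = text.partition(":")
    names = {"ex": EX_FIELDS, "in": IN_FIELDS}.get(method)
    if names is None:
        raise ValueError(f"unknown method prefix: {method!r}")
    parts = values.split("/")
    if len(parts) != len(names):
        raise ValueError(f"expected {len(names)} values, got {len(parts)}")
    record: Dict[str, Union[str, int, float]] = {"method": method}
    for name, raw in zip(names, parts):
        # Counts are integers; scores may carry a decimal part.
        record[name] = int(raw) if name.endswith("_count") else float(raw)
    return record
```

For the ‘ex’ method the scores are sums of bit scores, whereas for the ‘in’ method they are sums of CDS nucleotide lengths, so the two kinds of record are not directly comparable.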
BEST_MATCH_ORG_ID:BEST_MATCH_PROTEIN_ID: BEST_MATCH_PROTEIN_DESC (BEST_MATCH_ORG_DESC)
oi240292:gi|75908552|ref|YP_322848.1|: ATPase (Anabaena variabilis ATCC 29413)
oi251221:gi|37521338|ref|NP_924715.1|: carbamoyl-phosphate synthase large subunit (Gloeobacter violaceus PCC 7421)
oi103690:gi|17230263|ref|NP_486811.1|: dihydroxy-acid dehydratase (Nostoc sp. PCC 7120)
oi316279:gi|78183707|ref|YP_376141.1|: translocase (Synechococcus sp. CC9902)
| OrfID/GenomeID <tab> GenomicCoordinate <tab> Strand <tab> ProteinID <tab> DbXrefs <tab> GeneSymbol <tab> Product |
slr0611/gi|16329170|ref|NC_000911.1| 3573271-3573470,1-772 + NP_439899.1 GI:16329171;GeneID:951850 sds solanesyl diphosphate synthase
slr0612/gi|16329170|ref|NC_000911.1| 937-1494 + NP_439900.1 GI:16329172;GeneID:951851 - hypothetical protein
sll1212/gi|16329170|ref|NC_000911.1| 6622-5534 - NP_439905.1 GI:16329177;GeneID:951882 rfbD GDP-D-mannose dehydratase
ssl5001/gi|38505535|ref|NC_005229.1| 374-195 - NP_942157.1 GI:38505536;GeneID:2655889 - hypothetical protein
|
>OrfID/GenomeID <two_spaces> ProteinID <two_spaces> DbXrefs <two_spaces> GeneSymbol <two_spaces> Product Amino acid sequence |
>slr0611/gi|16329170|ref|NC_000911.1| NP_439899.1 GI:16329171;GeneID:951850 sds solanesyl diphosphate synthase
MISTTSLFAPVDQDLRLLTDNLKRLVGARHPILGAAAEHLFEAGGKRVRPAIVLLVSRATLLDQELTARHRRLAEITEMI
HTASLVHDDVVDEADLRRNVPTVNSLFDNRVAVLAGDFLFAQSSWYLANLDNLEVVKLLSEVIRDFAEGEILQSINRFDT
DTDLETYLEKSYFKTASLIANSAKAAGVLSDAPRDVCDHLYEYGKHLGLAFQIVDDILDFTSPTEVLGKPAGSDLISGNI
TAPALFAMEKYPLLGKLIEREFAQAGDLEQALELVEQGDGIRRSRELAANQAQLARQHLSVLEMSAPRESLLELVDYVLG
RLH
>slr0612/gi|16329170|ref|NC_000911.1| NP_439900.1 GI:16329172;GeneID:951851 - hypothetical protein
MGRLDQDSEGLLLLTSNGKLQHRLAHREFAHQRTYFAQVEGSPTDEDLEPLRRGITFADYPTRPAIAKIITEPDFPPRNP
PIRYRASIPTSWLSITLTEGRNRQVRRMTAAVGFPTLRLVRVQIQVTGRSPQQGKGKSAATWCLTLEGLSPGQWRPLTPW
EENFCQQLLTGNPNGPWQKKFGDRR
|
>OrfID/GenomeID <two_spaces> GenomicCoordinate <two_spaces> Strand Nucleotide sequence |
>slr0611/gi|16329170|ref|NC_000911.1| 3573271-3573470,1-772 +
ATGATCTCCACTACCTCCCTGTTTGCCCCCGTTGACCAAGACCTCCGTTTATTAACGGATAATCTCAAGCGGCTTGTCGG
TGCTCGGCATCCTATCCTGGGGGCGGCGGCGGAACATTTATTTGAGGCAGGGGGAAAGCGGGTGCGGCCGGCCATTGTGT
TGTTAGTTTCCCGCGCAACCCTATTAGACCAAGAATTAACGGCGCGCCATCGCCGGCTGGCGGAAATTACCGAAATGATC
CACACCGCTAGTTTGGTCCACGATGACGTGGTGGATGAGGCGGATCTGCGGCGGAATGTGCCCACGGTGAATAGTTTATT
TGACAATCGGGTGGCAGTGTTAGCGGGGGATTTCCTCTTTGCCCAATCTTCTTGGTATTTGGCTAACTTAGATAATTTGG
AGGTGGTGAAATTATTATCGGAGGTAATTCGGGACTTTGCGGAGGGGGAAATTTTACAGAGCATCAATCGTTTTGACACC
GACACAGATTTAGAAACCTATTTGGAAAAAAGCTATTTTAAAACCGCCTCTCTCATTGCCAACAGTGCCAAGGCAGCGGG
GGTTTTGAGCGATGCGCCCCGGGATGTGTGTGATCATCTTTACGAATATGGTAAACATTTGGGGTTAGCGTTCCAGATTG
TGGACGATATTTTAGATTTCACTTCCCCCACGGAGGTTTTGGGGAAACCGGCCGGGTCAGATTTAATCAGCGGCAACATC
ACCGCCCCAGCCCTATTTGCCATGGAAAAATATCCCCTACTTGGTAAATTAATTGAACGGGAATTTGCCCAGGCGGGGGA
TTTGGAACAGGCCCTGGAATTGGTAGAACAGGGGGATGGTATCCGGCGATCAAGGGAATTGGCCGCGAACCAAGCGCAAC
TGGCCCGGCAACATCTGAGTGTGCTGGAAATGTCCGCTCCGAGAGAATCTCTGTTGGAATTAGTTGATTATGTGCTTGGT
CGTCTCCATTAG
>sll0558/gi|16329170|ref|NC_000911.1| 2873-2172 -
ATGTCTGATAATTTGACCGAACTCTCCCAACAACTCCATGATGCTTCAGAAAAAAAACAGTTGACGGCGATCGCCGCTTT
GGCAGAAATGGGAGAAGGGGGCCAGGGAATATTACTCGATTATTTGGCCAAAAATGTCCCCCTAGAAAAGCCAGTGTTGG
CGGTGGGTAACGTCTACCAAACCCTCCGGAATCTAGAACAGGAAACCATCACAACGCAACTCCAACGGAATTACCCCACA
GGCATTTTCCCCTTACAATCGGCCCAGGGCATTGATTATCTGCCGCTCCAGGAAGCCCTAGGAAGCCAGGATTTTGAAAC
AGCGGATGAAATAACCCGGGATAAATTGTGCGAACTGGCGGGGCCTGGGGCCAGTCAAAGACAATGGCTCTATTTCACAG
AAGTAGAAAAATTTCCTGCCCTAGACCTGCACACCATTAATGCTTTGTGGTGGCTCCACTCCAACGGTAATTTTGGTTTT
TCGGTGCAACGACGACTCTGGTTGGCGTCCGGAAAAGAATTTACCAAGCTTTGGCCGAAAATTGGCTGGAAAAGCGGCAA
TGTTTGGACCCGTTGGCCCAAGGGCTTTACCTGGGATTTATCCGCACCCCAGGGTCATTTACCCCTGTTAAACCAATTGC
GGGGGGTAAGGGTAGCAGAATCCCTTTACAGGCACCCAGTTTGGTCCCAATACGGTTGGTAA
|
<consorf>
  <orf>
    <orfId>string</orfId>
    <orfFullName genomeId="string" orfPosition="string">
    <coordinate strand="+ or -" end="positiveInteger" candidateStarts="list of positiveInteger" shortestStart="positiveInteger" longestStart="positiveInteger" shortestAaLen="positiveInteger" longestAaLen="positiveInteger" />
    <frameChange(optional) insertion="list of positiveInteger" deletion="list of positiveInteger" inFrameStop="string(list of coordinate ranges)" multiFrameChange="string(list of coordinate ranges)" />
    <consensus method="in or ex" stopScore="decimal" stopCount="positiveInteger" exactScore="decimal" exactCount="positiveInteger" startScore(optional)="decimal" startCount(optional)="positiveInteger" bestScore(optional)="decimal" oldStarts(optional)="list of positiveInteger" />
    <bestMatch(optional) bestOrgId="string" bestOrgDesc="string" bestProteinId="string" bestProteinDesc="string" />
    <startAaSeq(optional)>string</startAaSeq>
    <shortestAaSeq>string</shortestAaSeq>
    <startNtSeq(optional)>string</startNtSeq>
    <shortestNtSeq>string</shortestNtSeq>
  </orf>
  ...
</consorf> |
<example>
392 446 509 533 545 662 800 827
931 1144
7835
<example>
inFrameStop: 10290-10288
inFrameStop: 10541-10543 10607-10609
multiFrameChange: 1470704-1470699
multiFrameChange: 2756354-2756362
<example>
931 1144 1150
19890 19971 19974 19986 20025 20040 20067 20070
<example>
Mqpkqravp
MpflqciMrssLyfrakspgyla
<example>
GTGcaacccaagcagagggctgtcccc
GTGccctttctccaatgcattATGcgatcgtccTTGtattttcgagctaaatctcccggttatcttgcc
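In the start sequences above, upper-case residues (or codons) mark candidate starts, and the remaining upstream residues are in lower case. Under that reading, the candidate start coordinates can be recovered from `startAaSeq` and the longest start, as in the following sketch. The function name is hypothetical, and the sketch assumes that `startAaSeq` begins at the longest start, ends immediately before the shortest start, and contains no frame change.

```python
from typing import List


def candidate_starts(start_aa_seq: str, longest_start: int, strand: str) -> List[int]:
    """Return candidate start coordinates, from the longest to the shortest start."""
    step = 3 if strand == "+" else -3  # one residue corresponds to three bases
    offsets = [i for i, residue in enumerate(start_aa_seq) if residue.isupper()]
    # The shortest start is the first residue after the upstream sequence.
    offsets.append(len(start_aa_seq))
    return [longest_start + step * offset for offset in offsets]
```

With the upstream sequence ‘Mykni’ and a longest start of 372796 on the forward strand, the sketch returns 372796 and 372811, which matches the coordinate ‘372796^372811’ of CDS 1f373642 in the examples of this document.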
| Case 1: homology-based CDSs |
|---|
|
>'orfId'
<two_spaces> 'orfFullName' <two_spaces> 'strand' <two_spaces> 'shortestAaLen'~'longestAaLen'aa <two_spaces> 'method':'stopScore'/'stopCount'/'startScore'/'startCount'/'exactScore'/'exactCount'/'bestScore'(start:'oldStarts') <two_spaces> 'bestOrgId':'bestProteinId': 'bestProteinDesc' ('bestOrgDesc') 'startAaSeq''shortestAaSeq' |
| Case 2: algorithm-based ab initio CDSs |
|
>'orfId'
<two_spaces> 'orfFullName' <two_spaces> 'strand' <two_spaces> 'shortestAaLen'~'longestAaLen'aa <two_spaces> 'method':'stopScore'/'stopCount'/'exactScore'/'exactCount'(start:'oldStarts') 'startAaSeq''shortestAaSeq' |
<example>
ex:221560/244/87344.8/59/87344.8/59/1989.1
ex:127930.9/242/85347.8/158/85347.8/158/966.6(start:599082^599067^599064)
in:5358/2/2355/1
in:4140/2/2061/1(start:600145^600148^600163)
>1f2065521 2062863^2062905-2065520/gi|16329170|ref|NC_000911.1| + 872~886aa ex:231688.2/292/216693.4/260/215633/258/1183.1 oi240292:gi|75908552|ref|YP_322848.1|: ATPase (Anabaena variabilis ATCC 29413)
MvvlthpianienfMQPTDPNKFTEKAWEAIAKTPEIAKQHRQQQIETEHLLSALLEQNGLATSIFNKAGASIPRVNDQV
NSFIAQQPKLSNPSESIYLGRSLDKLLDNAEIAKSKYGDDYISIEHLMAAYGQDDRLGKNLYREIGLTENKLAEIIKQIR
GTQKVTDQNPEGKYESLEKYGRDLTELAREGKLDPVIGRDEEVRRTIQILSRRTKNNPVLIGEPGVGKTAIAEGLAQRII
NHDVPESLRDRKLISLDMGALIAGAKYRGEFEERLKAVLKEVTDSQGQIILFIDEIHTVVGAGATQGAMDAGNLLKPMLA
RGALRCIGATTLDEYRKYIEKDAALERRFQEVLVDEPNVLDTISILRGLKERYEVHHGVKIADSALVAAAMLSNRYISDR
FLPDKAIDLVDEAAAKLKMEITSKPEELDEVDRKILQLEMERLSLQRENDSASKERLEKLEKELADFKEEQSKLNGQWQS
EKTVIDQIRTVKETIDQVNLEIQQAQRDYDYNKAAELQYGKLTDLQRQVEALETQLAEQQTSGKSLLREEVLESDIAEII
SKWTGIPISKLVESEKEKLLHLEDELHSRVIGQDEAVTAVAEAIQRSRAGLSDPNRPTASFIFLGPTGVGKTELAKALAK
NLFDTEEALVRIDMSEYMEKHAVSRLMGAPPGYVGYEEGGQLTEAIRRRPYSVILFDEIEKAHGDVFNVMLQILDDGRLT
DAQGHVVDFKNTIIIMTSNLGSQYILDVAGDDSRYEEMRSRVMDVMRENFRPEFLNRVDETIIFHGLQKSELRSIVQIQI
QSLATRLEEQKLTLKLTDKALDFLAAVGYDPVYGARPLKRAVQKYLETAIAKGILRGDYKPGETIVVDETDERLSFTSLR
GDLVIV
>1r1916535 1918608-1916536/gi|16329170|ref|NC_000911.1| - 691aa ex:219901/318/179591.8/240/179591.8/240/1078.4 oi59919:gi|33862065|ref|NP_893626.1|: elongation factor EF-2 (Prochlorococcus marinus subsp. pastoris str. CCMP1986)
MARTVPLERIRNIGIAAHIDAGKTTTTERILFYSGVVHKIGEVHEGTAVTDWMAQERERGITITAAAISTDWLGHHINII
DTPGHVDFTIEVERSMRVLDGVIAVFCSVGGVQPQSETVWRQAERYQVPRIAFVNKMDRTGANFFRVCQQIGDRLRANAV
PVQIPIGSEAEFEGIVDLVRMKAYLYKNDLGTDIQEVPIPDSVKDKTEEYRLRLVESVAEADDALMEKYLEGEELTADEL
VAGLRRGTIAGTMVPVLCGSAFKNKGVQLLLDAVVDYLPSPLEVPAIEGHLPDGEVATRPAEDKAPLSALAFKVMADPFG
RLTFVRVYSGVLEKGSYVLNSTKEKKERISRLIILKADDRIEVDQLNAGDLGAVLGLKDTLTGDTLCDDQEPIILESLFV
PQPVISVAVEPKTKQDMDKLSKALQSLSEEDPTFRVSVDPETNQTVIAGMGELHLEILVDRMLREFKVEANVGAPQVAYR
ETIRKAVQAEGKFIRQSGGKGQYGHVVIEVEPTEPGTGFEFVSKIVGGVIPKEYIAPSEQGMKEACASGVLAGYPVIDLK
ATLVDGSFHDVDSSEMAFKIAGSMAIREAVGQADPVLLEPVMKVEIEVPDDFMGNVIGDLNARRGHIEGQETEQGIAKVA
ASVPLAEMFGYATDIRSKTQGRGIFSMEFSHYAEVPRNVAEAIVAKSRGYA
>1f1156501 1154857-1156080,1156084-1156500/gi|16329170|ref|NC_000911.1| + 547aa ex:28344.6/179/3913.9/16/3913.9/16/631.0 oi103690:gi|17227685|ref|NP_484233.1|: hypothetical protein alr0189 (Nostoc sp. PCC 7120)
MFALPQAGDRRGEIIKVLLSNGWDYMNGLLTLGKVGEPQIPTPEVLTKILVELGPFYIKLGQLLSTRPDLLPPRYINALT
ALQSNVPPLPWSAIEDLLQREFPQPLGETFQEIESEPIAAGSIGQIHRAVLQSGETVAIKVKRPGIDVIVEQDSLLIKDV
AELLALTEFGQNYDIVKLADEFTQTVKAELNFDTEAAYTNNLRTNLAKTTWFDPNQLVIPKVYWELTNQKFLVLEWLDGV
PILTADLTQPPSDKDIAEKKKEITTLLFRAFFQQLYVDGFFHADPHPGNIFYLADGRLALIDCGMVGRLDPRTRQLLTEM
LLAIVDLDAKRCAQLTVELSESVGRVNFQRLEVDYERMLRKYYDLSLSEFNFSEVVYEFLRIARVNKLKVPACLGLYAKC
LANLEGAGQFNPELNLFTEINPLITDLFRRQLFGTNPLQTALRTVLDLKAVSLKTPRQMDVLLDRLTTETLQWNVRLEGL
EPVRRTIDKSANRLSFSIVLGSLIMGAAILSTGNDQQLTLIANILFVAATVIGFWLVISILRSGRLK
>1f373642 372796^372811-373164,373168-373641/gi|16329170|ref|NC_000911.1| + 276~281aa ex:2897.5/48/207.8/3/207.8/3/88.0 oi243090:gi|32471992|ref|NP_864986.1|: conserved hypothetical protein-putative methyltransferase (Rhodopirellula baltica SH 1)
MykniMKTTINDYIGQFIKTTPEFKGKWRIIRYWMNQNKDHRTKYRILPGGEKILCDLSIPYEAMVYLKREEQKDLELLT
QLLKPSDTFVDCGANIGIWSLVAASRVSYSGKVYAFEPNPSTFKLSDNVSLSRFKNDINLISQAVGNEQKTVFFECNTTH
NISCIKDNATRDTQEVFLTTIDQVLDGAIVNGIKIDVEGFELECLQGSYKTLIRYQPWLCVEFNTLLAKVSKLSEWNVHN
YLKKLGYRCRHFHNALDKSQETILSDNWETKGYCNLFYFIE
>1f2302444 2300515-2302443/gi|16329170|ref|NC_000911.1| + 643aa in:3858/2/3858/2
MTIQYTPLADRLLAYLAADRLNLSAKSSSLNTSILLSSDLFNQEGGIVTANYGFDGYMGIPGMDGTDAESQQIAFDNNVA
WNNLGDLSTTTQRAYTSAISTDTVQSVYGVNLEKNDNIPIVFAWPIFPTTLNPTDFQVMLNTGEIVTPVIASLIPNSEYN
ERQTVVITGNFGNRLTPGTEGAIYPVSVGTVLDSTPLEMVGPNGPVSAVGITIDSLNPYVAGNGPKIVAAKLDRFSDLGE
GAPLWLATNQNNSGGDLYGDQAQFRLRIYTSAGFSPDGIASLLPTEFERYFQLQAEDITGRTVILTQTGVDYEIPGFGLV
QVLGLADLAGVQDSYDLTYIEDHDNYYDIILKGDEAAVRQIKRVALPSEGDYSAVYNPGGPGNDPENGPPGPFTVSSSPQ
VIKVTDTIGQPTKVSYVEVDGPVLRNPFSGTPIGQEVGLAVKDLATGHEIYQYTDPDGKVFYASFAAADDQATDLTTAIA
NPTAIDLINARGFTAGSSVTVSGSYSREAFFDGSMGFYRLLDDNGAVLDPLTGGVINPGQVGYQEAALADSNRLQATGST
LTAEDLETRAFSFNILGGELYAPFLTVNDSLSGINQTYFAFGSANPDGISHSTNLGPNVIGFEDFLGGGDLDFDDIIVRF
TLT
>1f63589 61834^61936^61942^61966^61969-63588/gi|16329170|ref|NC_000911.1| + 540~585aa in:3375/2/1620/1
MinffsthidrlgdwypqlyrelksrftatkvrwLlLvsvifqgVMVFFRTGEIPVLYPLNPAGEQFSRYCLGTPPDWEY
SRGIFVCTQDLLGQLQINWRLWWLDGFAFLSLAGLALLLVAGVYLLVADLQKECQRGTLNFIRLSPQGEGNFIWGKMLGV
PSLLYGFLLTLLPLHIVAAGGAGISLLLLAGYYAVVLAGATFFFHIALWIGLSSNAKSYSLSKSAAIAGLCGVGTLIATT
LIMQDNDWEPFFLSWLSLFYPGKALIYLVRSTFLPITTVGYLGPNELDQLRWYGWDLFRSAPLGMGFMVANFAVGTYWIA
QVLRRRFRRPLSTAWSKVQSVGVTLSLVAIANGFLLQSYVKGDYLDSLLLNLASWQLTLCCFFLGLTLALCPQINYLRDW
SRYRHEAPRQYRTWSWQNLVADHSPPQGAIAINLCCTALLTLPMVLLLPWLAPAPAGFPIPLGGIVVALTMGLLWNFTFA
TLVQWSLLRMRFPRLLVLILSVVVMVVLPLAIAIGAGIKESTVMWFSPLPSIALVEGISFQTPLFFLTILTQTVVIAAST
WQFNRYVQRLGRSESQQYLAPVQPE
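The header lines of both cases can be parsed with one routine, sketched below. It assumes the fields are separated by exactly two spaces, as stated in the format description; if the examples are rendered with single spaces, they would need their original separators restored before parsing. `parse_header` and the dictionary keys are hypothetical names.

```python
import re
from typing import Dict

_LENGTH = re.compile(r"(?:(\d+)~)?(\d+)aa")


def parse_header(line: str) -> Dict[str, object]:
    """Parse a CONSORF-style amino acid FASTA header (two-space separated)."""
    if not line.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    fields = line[1:].rstrip("\n").split("  ", 5)
    if len(fields) < 5:
        raise ValueError("expected at least five two-space separated fields")
    orf_id, full_name, strand, length, reliability = fields[:5]
    match = _LENGTH.fullmatch(length)
    if match is None:
        raise ValueError(f"unexpected length field: {length!r}")
    longest = int(match.group(2))
    shortest = int(match.group(1)) if match.group(1) else longest
    # An optional '(start:a^b^c)' suffix carries the old candidate starts.
    scores, _, old = reliability.partition("(start:")
    old_starts = [int(p) for p in old.rstrip(")").split("^")] if old else []
    return {
        "orf_id": orf_id,
        "orf_full_name": full_name,
        "strand": strand,
        "shortest_aa_len": shortest,
        "longest_aa_len": longest,
        "reliability": scores,
        "old_starts": old_starts,
        # Only homology-based ('ex') headers carry best match information.
        "best_match": fields[5] if len(fields) > 5 else None,
    }
```

The best match field is kept as a single string here, because organism descriptions may themselves contain parentheses.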
|
>'orfId' <two_spaces> 'orfFullName' <two_spaces> 'strand' <two_spaces> 'shortestNtLen'~'longestNtLen'nt 'startNtSeq''shortestNtSeq' |
>1f1492 802^811-1491/gi|16329170|ref|NC_000911.1| + 684~693nt
TTGattgcgTTGAATAAAACTCCCCAAACCATTGTTTTTTACAAACCCTACGGAGTTCTGTGTCAATTTACCGATAATTC
TGCCCATCCCCGGCCGACGTTGAAGGATTATATTAATTTGCCAGATTTATATCCCGTGGGGCGTTTGGATCAAGATAGCG
AAGGACTATTGCTGCTCACCAGCAACGGTAAACTTCAGCATCGTTTGGCCCACCGGGAGTTTGCCCACCAACGTACTTAT
TTTGCCCAAGTAGAAGGCTCTCCAACGGACGAAGACCTAGAACCCCTGCGGCGGGGCATAACTTTCGCGGATTACCCTAC
CAGACCGGCGATCGCCAAAATTATCACTGAACCAGATTTTCCCCCCAGAAATCCTCCCATTCGTTATCGAGCCTCCATTC
CCACCAGTTGGTTAAGCATTACCCTAACGGAGGGGCGCAATCGTCAGGTACGTCGAATGACAGCGGCAGTGGGCTTCCCT
ACCCTACGATTGGTGCGGGTGCAAATACAGGTTACTGGTCGCTCTCCCCAACAGGGCAAAGGTAAGTCAGCAGCAACTTG
GTGCTTAACCCTAGAAGGTTTGAGTCCGGGGCAATGGCGACCCCTGACCCCTTGGGAAGAAAATTTTTGCCAGCAACTCT
TAACGGGAAATCCCAATGGTCCCTGGCAGAAAAAATTTGGCGATCGCCGTTGA
>1f2096 1577-2095/gi|16329170|ref|NC_000911.1| + 522nt
ATGTCCTATCTAATCGCTGTGGTAGCCAACCGCATTGCCGCCGAAGAAGCTTATACAACCTTGGAACAGGCAGGATTTGC
CCAAAAGAATTTGACTATCATTGGCACAGGTTATAAAACCGCTGACGAATTTGGCTTGGTGGACCCGAAAAAACAAGCTA
TCAAAAGGGCAAAGCTCATGGCCATCTGGTTAGTACCCTTTGGTTTCGCTGCCGGTTATTGCTTTAACCTCATCACTGGC
TTGAGCACCTTAGATTGGGCTGGAGACCCCGGTAACCACATTGTGGGCGGCCTCCTAGGGGCGATCGGTGGAACCATGGG
GAGTTTCTTTGTCGGTGGGGGCGTGGGCTTAAGCTTTGGCAGTGGGGACAGTTTGCCCTATCGAAACCTTTTGCAAGCGG
GGAAATATTTGGTAGTGGTGGCCGGTGGTGAACTGCAAAAACAACGGGCAACCAATTTACTCCGGCCCCTCAATCCTGAA
TATCTCCAGGGTTATACCGCCCCCGATGAAGCTTTTGTTTGA
>1r120255 120537-120301,120297-120256/gi|16329170|ref|NC_000911.1| - 285nt
ATGGCAACCATAACTGATTCCGATCTACAGGAACTGAAAGACTTAATCAATGGGCTTGATAAAAAAATCGACGTTAATCA
GGCTCGGATCGATGAAAGGTTAAATGCAATAGAATCCAACCTATCAGACCTCAAAAAACAGGCTGATAAACAGGACAACC
GCTTATGGGTTCTCATTTCGGGGATGTTTATTGCACTTCTGGGGATTTTGACAAAGTTTGCATTTTTCCCCAACCCTTAG
CCTTCTAAGAAGATATTCCCTCCTAAATCGCCACTACGGGAATGA
>1f190050 189813^189822-189848,189852-189875,189879-190049/gi|16329170|ref|NC_000911.1| + 231~240nt
ATGgtcaccATGGGAAAAGCCTCCCCTGGAAGCTTATAATGCTTACAGAGAAGGCTTTTCAGCTGAATCGGAGCGGCGGG
ATTTGAACCCACGACCCCCACTACCCCAAAGTGGTGCGCTACCAAGCTGCGCTACGCCCCGAATTTCACAGACCCTAATC
TTAGTCCTCCCCTGTGGCCCTTGGCAAGTTTTTTGGCAAATATTTTCGAGTGTTTATTTGATGAAATTTATTGGCATTGA
|
'orfId' <tab> 'orfFullName' <tab> MinimalPosition <tab> MaximalPosition <tab> 'strand' |
1f1492 802^811-1491/gi|16329170|ref|NC_000911.1| 802 1491 +
1r2174 2873-2175/gi|16329170|ref|NC_000911.1| 2175 2873 -
1r120255 120537-120301,120297-120256/gi|16329170|ref|NC_000911.1| 120256 120537 -
1f190050 189813^189822-189848,189852-189875,189879-190049/gi|16329170|ref|NC_000911.1| 189813 190049 +
| Level | Label for recognition | Features of predicted CDSs compared with 'public CDSs' |
|---|---|---|
| 1 | stop only | Stop position |
| 2 | candidate start | Stop position, frame change, and any position of candidate starts |
| 3 | start coverage | Stop position, frame change, and the length coverage (>90%) of the shortest start |
| 4 | exact start | Stop position, frame change, and the exact position of the shortest start |
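The four recognition levels can be expressed as a small comparison routine. The sketch below is an interpretation, not the evaluation code used for CONSORF: the function and its parameters are hypothetical, and the length coverage of level 3 is assumed to be the ratio of the shorter to the longer of the two CDS lengths measured from start to stop.

```python
from typing import Sequence, Set, Tuple


def recognition_levels(pred_stop: int,
                       pred_starts: Sequence[int],
                       pred_shortest_start: int,
                       pred_frame_changes: Tuple,
                       pub_stop: int,
                       pub_start: int,
                       pub_frame_changes: Tuple) -> Set[int]:
    """Return the levels (1 to 4) at which a predicted CDS matches a public CDS."""
    levels: Set[int] = set()
    if pred_stop != pub_stop:
        return levels
    levels.add(1)  # level 1: stop position only
    if pred_frame_changes != pub_frame_changes:
        return levels  # levels 2 to 4 also require the same frame changes
    if pub_start in pred_starts:
        levels.add(2)  # level 2: any candidate start matches
    pred_len = abs(pred_stop - pred_shortest_start)
    pub_len = abs(pub_stop - pub_start)
    # Assumption: coverage = shorter length / longer length, required > 90%.
    if min(pred_len, pub_len) > 0.9 * max(pred_len, pub_len):
        levels.add(3)  # level 3: start coverage
    if pred_shortest_start == pub_start:
        levels.add(4)  # level 4: exact shortest start
    return levels
```

Returning a set rather than a single number keeps the levels independent, since a prediction can satisfy level 3 without satisfying level 2 when no candidate start coincides with the public start.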
Organism IDs: oi1148, oi83333, oi224308, ...
Organism names: Synechocystis sp. PCC 6803, Escherichia coli K12, Bacillus subtilis subsp. subtilis str. 168, ...
Taxonomic descriptions: Bacteria, Cyanobacteria, Chroococcales, Synechocystis, ...
CDS IDs: 1f2065521, 1r1916535, 1f3566647, ...
Protein descriptions: ATPase, elongation factor EF-2, GTP-binding protein LepA, ...
Locus tags: slr1641, sll1098, slr0604, ...
External DB (NCBI GI) IDs: 16331048, 16330914, 16332331, ...