Published at 2023-03-07 08:26
Author:zhixy
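Whole-genome annotation is the process of identifying features of interest in a set of genomic DNA sequences and labeling them with useful information. Prokka is a software tool that rapidly annotates bacterial, archaeal, and viral genomes and produces standards-compliant output files.
Recommended installation method: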
(base) [user@server ~]# conda install -c bioconda prokka
After installation, run prokka without arguments to print its help text:
(base) [user@user ~]# prokka
Name:
Prokka 1.12 by Torsten Seemann <torsten.seemann@gmail.com>
Synopsis:
rapid bacterial genome annotation
Usage:
prokka [options] <contigs.fasta>
General:
--help This help
--version Print version and exit
--docs Show full manual/documentation
--citation Print citation for referencing Prokka
--quiet No screen output (default OFF)
--debug Debug mode: keep all temporary files (default OFF)
Setup:
--listdb List all configured databases
--setupdb Index all installed databases
--cleandb Remove all database indices
--depends List all software dependencies
Outputs:
--outdir [X] Output folder [auto] (default '')
--force Force overwriting existing output folder (default OFF)
--prefix [X] Filename output prefix [auto] (default '')
--addgenes Add 'gene' features for each 'CDS' feature (default OFF)
--addmrna Add 'mRNA' features for each 'CDS' feature (default OFF)
--locustag [X] Locus tag prefix [auto] (default '')
--increment [N] Locus tag counter increment (default '1')
--gffver [N] GFF version (default '3')
--compliant Force Genbank/ENA/DDJB compliance: --addgenes --mincontiglen 200 --centre XXX (default OFF)
--centre [X] Sequencing centre ID. (default '')
--accver [N] Version to put in Genbank file (default '1')
Organism details:
--genus [X] Genus name (default 'Genus')
--species [X] Species name (default 'species')
--strain [X] Strain name (default 'strain')
--plasmid [X] Plasmid name or identifier (default '')
Annotations:
--kingdom [X] Annotation mode: Archaea|Bacteria|Mitochondria|Viruses (default 'Bacteria')
--gcode [N] Genetic code / Translation table (set if --kingdom is set) (default '0')
--gram [X] Gram: -/neg /pos (default '')
--usegenus Use genus-specific BLAST databases (needs --genus) (default OFF)
--proteins [X] FASTA or GBK file to use as 1st priority (default '')
--hmms [X] Trusted HMM to first annotate from (default '')
--metagenome Improve gene predictions for highly fragmented genomes (default OFF)
--rawproduct Do not clean up /product annotation (default OFF)
--cdsrnaolap Allow [tr]RNA to overlap CDS (default OFF)
Computation:
--cpus [N] Number of CPUs to use [0=all] (default '8')
--fast Fast mode - only use basic BLASTP databases (default OFF)
--noanno For CDS just set /product="unannotated protein" (default OFF)
--mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
--evalue [n.n] Similarity e-value cut-off (default '1e-06')
--rfam Enable searching for ncRNAs with InfernalRfam (SLOW!) (default '0')
--norrna Don't run rRNA search (default OFF)
--notrna Don't run tRNA search (default OFF)
--rnammer Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)
Example run on a single genome assembly (with --noanno, every CDS simply gets /product="unannotated protein"):
(base) [user@server ~]# prokka --outdir GCA_002902505.1 --prefix GCA_002902505.1 --noanno --cpus 8 --locustag 'GCA_002902505.1|ORF' GCA_002902505.1.fna
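Setting the locus tag as --locustag 'GCA_002902505.1|ORF' embeds the genome ID (GCA_002902505.1) in every gene ID, which makes it easy to tell sequences from different genomes apart when annotation results from multiple genomes are merged.
The annotation output is as follows: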
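When several genomes are annotated with the same settings, the command above can be generated once per genome. The following is a minimal sketch, not part of Prokka: it assumes each input file is named <genome_id>.fna, and prokka_cmd is a hypothetical helper that only prints the command (a dry run) so it can be reviewed before anything is executed.

```shell
# Build (but do not run) the Prokka command for one genome FASTA file.
# Assumption: the file is named <genome_id>.fna, as in the example above.
prokka_cmd() {
    fasta=$1
    id=$(basename "$fasta" .fna)   # e.g. GCA_002902505.1.fna -> GCA_002902505.1
    printf "prokka --outdir %s --prefix %s --noanno --cpus 8 --locustag '%s|ORF' %s\n" \
        "$id" "$id" "$id" "$fasta"
}

# Dry run: print one command per .fna file in the current directory.
for f in *.fna; do
    [ -e "$f" ] || continue        # skip when no .fna file matches
    prokka_cmd "$f"
done
```

Once the printed commands look right, they can be executed, for example by piping the loop's output to sh.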
(base) [user@server ~]# ls -l GCA_002902505.1
total 27M
-rw-rw-r-- 1 user user 84 Apr 14 15:37 errorsummary.val
-rw-rw-r-- 1 user user 950K Apr 14 15:37 GCA_002902505.1.err
-rw-rw-r-- 1 user user 847K Apr 14 15:37 GCA_002902505.1.faa # predicted protein sequences
-rw-rw-r-- 1 user user 2.3M Apr 14 15:37 GCA_002902505.1.ffn # predicted nucleotide sequences, including CDS, rRNA, tRNA, tmRNA, misc_RNA
-rw-rw-r-- 1 user user 2.6M Apr 14 15:37 GCA_002902505.1.fna # original genome sequence
-rw-rw-r-- 1 user user 2.6M Apr 14 15:37 GCA_002902505.1.fsa # same as above (contig/scaffold IDs differ)
-rw-rw-r-- 1 user user 5.1M Apr 14 15:37 GCA_002902505.1.gbf # annotation in GenBank format
-rw-rw-r-- 1 user user 3.0M Apr 14 15:37 GCA_002902505.1.gff # annotation in GFF format
-rw-rw-r-- 1 user user 7.4K Apr 14 15:37 GCA_002902505.1.log
-rw-rw-r-- 1 user user 8.1M Apr 14 15:37 GCA_002902505.1.sqn # sqn format, can be submitted to NCBI GenBank
-rw-rw-r-- 1 user user 332K Apr 14 15:37 GCA_002902505.1.tbl # Feature Table file; can be converted to sqn with "tbl2asn"
-rw-rw-r-- 1 user user 124K Apr 14 15:37 GCA_002902505.1.tsv
-rw-rw-r-- 1 user user 96 Apr 14 15:37 GCA_002902505.1.txt
-rw-rw-r-- 1 user user 643K Apr 14 15:37 GCA_002902505.1.val
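Because the genome ID is part of every locus tag, a merged file can still be broken down by genome. The following is a minimal sketch under two assumptions: the .faa headers start with the locus tag (for example >GCA_002902505.1|ORF_00001), and all.faa is a hypothetical file made by concatenating several .faa outputs. count_per_genome is an illustrative helper, not a Prokka command.

```shell
# Count sequences per genome in a merged protein FASTA file.
# Assumption: headers look like ">GCA_002902505.1|ORF_00001 product",
# so the genome ID is the text between ">" and the first "|".
count_per_genome() {
    awk -F'|' '/^>/ { n[substr($1, 2)]++ }
               END  { for (g in n) print g "\t" n[g] }' "$1" | sort
}

# Example (hypothetical file names):
#   cat */*.faa > all.faa
#   count_per_genome all.faa
```

Each output line is a genome ID and its sequence count, separated by a tab. A count that differs from the number of records in that genome's own .faa file would suggest a problem in the merge.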
Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014 Jul 15;30(14):2068-9. DOI:10.1093/bioinformatics/btu153