Gap closing after genome assembly: GapFiller

Published at 2020-05-08 13:54

Author:zhixy


Introduction

Like GapCloser, GapFiller is a tool for closing gaps in the scaffolds produced by genome assembly.

To obtain the program, submit a request to the authors (https://www.baseclear.com/services/bioinformatics/basetools/gapfiller/). Once you have received the source code, build and install it.

Parameters

(base) [user@server ~]# perl /usr/bio/GapFiller/GapFiller.pl
ERROR: Parameter -l is required. Please insert a library file
ERROR: Parameter -s is required. Please insert a scaffold fastA file

Usage: /usr/bio/GapFiller/GapFiller.pl [GapFiller_v1-10]

============ General Parameters ============
-l  Library file containing two paired-read files with insert size, error and orientation indication. # the library configuration file
-s  Fasta file containing scaffold sequences used for extension. # the assembled scaffolds
============ Extension Parameters ============
-m  Minimum number of overlapping bases with the edge of the gap (default -m 29) 
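# Minimum number of bases that must overlap the edge of the gap. Set this slightly below the read length; for 150 bp reads, for example, use a value between 140 and 149.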
-o  Minimum number of reads needed to call a base during an extension (default -o 2)
# Minimum number of reads required to extend by one base during gap filling.
-r  Percentage of reads that should have a single nucleotide extension in order to close a gap in a scaffold (Default: 0.7)
# During gap filling, a position is extended only if at least this fraction of the reads agree on the base.
-d  Maximum difference between the gapsize and the number of gapclosed nucleotides. Extension is stopped if it matches this parameter + gap size (default -d 50, optional).
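# Maximum allowed difference between the filled sequence and the gap size. Extension stops once (number of filled bases - gap size) exceeds this threshold; if the difference is below the threshold, the flanking sequences are not merged.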
-n  Minimum overlap required between contigs to merge adjacent sequences in a scaffold (default -n 10, optional)
# Minimum number of overlapping bases required to merge two adjacent contigs within a scaffold.
-t  Number of reads to trim off the start and begin of the sequence (usually missambled/low-coverage reads) (default -t 10, optional)
# Bases at the gap edges are mostly low quality, so this many bases are trimmed from each gap edge and treated as N before filling.
-i  Number of iterations to fill the gaps (default -i 10, optional)
# Maximum number of iterations.
============ Bowtie Parameters ============
-g  Maximum number of allowed gaps during mapping with Bowtie. Corresponds to the -v option in Bowtie. (default -g 1, optional)
============ Additional Parameters ============
-T  Number of threads to run (default -T 1)
# Number of CPU cores/threads to use.
-S  Skip reading of the input files again
-b Base name for your output files (optional)
# Name of the output folder.
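
How -o and -r work together can be sketched in a few lines of Python. This is an illustration only, not GapFiller's actual code: the function name `call_next_base` is hypothetical, and the sketch assumes that -o counts the reads supporting the winning base and that -r is the fraction of all candidate reads agreeing on that base.

```python
from collections import Counter


def call_next_base(candidates, min_reads=2, min_ratio=0.7):
    """Pick the base for one extension step, or return None to stop extending.

    candidates: the next base proposed by each read overlapping the gap edge.
    min_reads:  plays the role of -o (assumed: reads supporting the winner).
    min_ratio:  plays the role of -r (fraction of reads agreeing on the base).
    """
    counts = Counter(candidates)
    if not counts:
        return None  # no read reaches this position
    base, support = counts.most_common(1)[0]
    if support < min_reads:
        return None  # too few reads support the most common base
    if support / len(candidates) < min_ratio:
        return None  # the reads disagree too much
    return base
```

With the defaults, four reads proposing A, A, A, T extend the sequence by A (3 of 4 reads, a ratio of 0.75), while A, A, T, T stops the extension because no base reaches a ratio of 0.7.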

The library configuration file

The library file passed to -l must be prepared in advance. It has 7 columns separated by spaces, for example:

Lib1 bwa file1.1.fastq file1.2.fastq 400 0.25 FR
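  • Column 1: the library name;
  • Column 2: the aligner to use. For reads shorter than 50 bp use bowtie; for reads longer than 50 bp and shorter than 150 bp use bwa; for very long reads, such as 454 reads, use bwasw. BWA and BWA-sw are run in their default modes;
  • Columns 3 and 4: the paired-end read files, in fastq or fasta format;
  • Column 5: the insert size;
  • Column 6: the accepted deviation of the insert size, given as a fraction. In the example above the insert size is 400 bp, so a read pair is accepted only if its fragment length falls within [400 - 400×0.25, 400 + 400×0.25], that is, 300 to 500 bp;
  • Column 7: the orientation of the paired reads, one of FF, FR, RF or RR.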
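
The column rules above can be checked with a small Python helper before GapFiller is run. This is a minimal sketch: the function names are hypothetical, and the handling of reads of exactly 50 bp or 150 bp is an assumption, because the rule above only covers lengths strictly below or above those values.

```python
VALID_ORIENTATIONS = {"FF", "FR", "RF", "RR"}


def choose_aligner(read_length):
    """Suggest the column 2 value for a given read length (in bp)."""
    if read_length < 50:
        return "bowtie"
    if read_length <= 150:  # boundary handling is an assumption
        return "bwa"
    return "bwasw"


def insert_size_range(insert_size, error):
    """Accepted fragment length range implied by columns 5 and 6."""
    delta = insert_size * error
    return insert_size - delta, insert_size + delta


def library_line(name, aligner, reads1, reads2, insert_size, error, orientation):
    """Format one space-separated, 7-column line of the library file."""
    if orientation not in VALID_ORIENTATIONS:
        raise ValueError(f"unknown orientation: {orientation}")
    fields = [name, aligner, reads1, reads2, insert_size, error, orientation]
    return " ".join(str(field) for field in fields)
```

For the example library, `insert_size_range(400, 0.25)` gives the 300 to 500 bp window described above, and `library_line("Lib1", "bwa", "file1.1.fastq", "file1.2.fastq", 400, 0.25, "FR")` reproduces the example line.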

Running GapFiller

(base) [user@server ~]# perl /usr/bio/GapFiller/GapFiller.pl -l libraries.txt -s scaffolds.fa -m 140 -T 100 -b scaffolds_gapfilled.fa
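
When GapFiller is one step in a larger pipeline, the same command can be assembled from Python. The sketch below only builds the argument list from the flags shown in the usage text (-l, -s, -m, -T, -b). The function name `build_gapfiller_command` is hypothetical, the script path is the one used in this post, and capping the thread count at the number of available cores is a suggestion of this sketch, not a GapFiller requirement.

```python
import os


def build_gapfiller_command(library, scaffolds, base_name,
                            min_overlap=29, threads=1,
                            script="/usr/bio/GapFiller/GapFiller.pl"):
    """Return the GapFiller command as an argument list for subprocess.run."""
    # Do not request more threads than the machine has cores (sketch's choice).
    threads = max(1, min(threads, os.cpu_count() or 1))
    return [
        "perl", script,
        "-l", library,           # library configuration file
        "-s", scaffolds,         # scaffolds to be gap-filled
        "-m", str(min_overlap),  # minimum overlap with the gap edge
        "-T", str(threads),      # number of threads
        "-b", base_name,         # base name of the output
    ]
```

To run it, pass the list to `subprocess.run(cmd, check=True)`; with `check=True` a non-zero exit status from GapFiller raises an exception instead of going unnoticed.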

References

Boetzer M, Pirovano W. Toward almost closed genomes with GapFiller. Genome Biol. 2012 Jun 25;13(6):R56. DOI:10.1186/gb-2012-13-6-r56