Published at 2022-04-14 10:24
Author:zhanpc
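Mash is a tool for rapidly estimating distances between genomes and between metagenomes. It extends the MinHash dimensionality-reduction technique with a pairwise mutation distance and a P-value significance test, which enables efficient clustering and searching of very large sequence collections. Mash reduces large sequences and sequence sets to small, representative sketches (which can be thought of here as a kind of database), from which global mutation distances can be estimated quickly. Its main uses are described below: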
genome1.fna and genome2.fna represent two genomes. The distance between them can be estimated with the following command:
(mash) [user@server ~]# mash dist genome1.fna genome2.fna
The output has five fields: Reference-ID, Query-ID, Mash-distance, P-value, and Matching-hashes:
genome1.fna genome2.fna 0.0222766 0 456/1000
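The Mash-distance column can be reproduced from the Matching-hashes column. The following is a minimal sketch, assuming the distance formula given in the Mash paper, D = (1/k) · ln((1 + j) / (2j)), where j is the fraction of shared hashes and k is the k-mer size (assumed here to be 21, Mash's default). The function name mash_distance is hypothetical and exists only for illustration.

```shell
# Hypothetical helper: estimate the Mash distance from shared hashes,
# sketch size, and k-mer size.
# Assumes D = (1/k) * ln((1 + j) / (2j)), with j = shared / total.
mash_distance() {
  awk -v shared="$1" -v total="$2" -v k="$3" 'BEGIN {
    j = shared / total
    if (j == 0) { print 1; exit }   # no shared hashes: treated as the maximum distance
    printf "%.6g\n", log((1 + j) / (2 * j)) / k
  }'
}

mash_distance 456 1000 21    # the 456/1000 row shown above
mash_distance 1000 1000 21   # identical sketches
```

With 456 of 1000 hashes shared, this gives 0.0222766, the value in the row above. Identical sketches give 0.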
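When genome1.fna serves as the reference for distance estimates against other genomes, building a Mash sketch (database) for genome1.fna first can significantly shorten the computation time. Commands: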
(mash) [user@server ~]# mash sketch genome1.fna -o genome1
(mash) [user@server ~]# mash dist genome1.msh genome2.fna
The mash sketch command builds a Mash sketch for genome1.fna and writes it to genome1.msh. The mash dist command then estimates the distance between the two genomes.
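The mash sketch command can also build one sketch file from multiple genomes, which saves time and allows a query to be compared against several references at once.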
Here genome1.fna and genome2.fna are sketched together into reference.msh, which is then compared with genome3.fna.
Commands:
(mash) [user@server ~]# mash sketch genome1.fna genome2.fna -o reference
(mash) [user@server ~]# mash dist reference.msh genome3.fna
The output is as follows:
genome1.fna genome3.fna 0 0 1000/1000
genome2.fna genome3.fna 0.0222766 0 456/1000
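When a query is compared against several references, the closest reference is the row with the smallest Mash-distance. A minimal sketch, assuming the mash dist output is tab-separated with the distance in the third column. The variable dist_rows is hypothetical and simply holds the two rows shown above, so the example runs without Mash installed.

```shell
# Hypothetical sample: the two result rows shown above, tab-separated.
# In practice these rows would come from: mash dist reference.msh genome3.fna
dist_rows=$(printf '%s\t%s\t%s\t%s\t%s\n' \
  genome1.fna genome3.fna 0 0 1000/1000 \
  genome2.fna genome3.fna 0.0222766 0 456/1000)

# Sort by Mash-distance (column 3, general numeric so that values such as
# 1e-05 also sort correctly) and keep the reference ID of the first row.
closest=$(printf '%s\n' "$dist_rows" | sort -k3,3g | head -n 1 | cut -f 1)
echo "$closest"
```

For the rows above, this prints genome1.fna, the reference at distance 0 from genome3.fna.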
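Take the raw reads of Clostridium strain B17 as an example. The NCBI RefSeq reference genome sketch database, refseq.genomes.k21.s1000.msh, can be downloaded from the Mash website.
Commands: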
(mash) [user@server ~]# cat B17_R1.fq B17_R2.fq > B17.fq
(mash) [user@server ~]# mash sketch B17.fq -o B17
(mash) [user@server ~]# mash dist refseq.genomes.k21.s1000.msh B17.msh
Partial output:
B17.fq GCF_000424245.1_ASM42424v1_genomic.fna.gz 0.037311 0 296/1000
B17.fq GCF_001465175.1_ASM146517v1_genomic.fna.gz 0.0383209 0 288/1000
B17.fq GCF_000355785.1_CloButy1.0_genomic.fna.gz 0.0384495 0 287/1000
B17.fq GCF_000878275.1_ASM87827v1_genomic.fna.gz 0.0384495 0 287/1000
B17.fq GCF_001456065.2_ASM145606v2_genomic.fna.gz 0.0387085 0 285/1000
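A long hit list like this is easier to read once it is filtered by distance and reduced to assembly accessions. The sketch below uses the five rows exactly as shown above (reference file name in the second column, distance in the third). The 0.05 cutoff is an illustrative assumption, based on the Mash paper's observation that a distance of about 0.05 roughly corresponds to 95% ANI. It is not a value from this tutorial, and the variable names are hypothetical.

```shell
# Hypothetical sample: the five hit rows shown above, tab-separated.
hits=$(printf '%s\t%s\t%s\t%s\t%s\n' \
  B17.fq GCF_000424245.1_ASM42424v1_genomic.fna.gz 0.037311 0 296/1000 \
  B17.fq GCF_001465175.1_ASM146517v1_genomic.fna.gz 0.0383209 0 288/1000 \
  B17.fq GCF_000355785.1_CloButy1.0_genomic.fna.gz 0.0384495 0 287/1000 \
  B17.fq GCF_000878275.1_ASM87827v1_genomic.fna.gz 0.0384495 0 287/1000 \
  B17.fq GCF_001456065.2_ASM145606v2_genomic.fna.gz 0.0387085 0 285/1000)

# Keep rows with distance <= max, shorten the file name to its accession
# (the first two "_"-separated parts), and sort by distance.
ranked=$(printf '%s\n' "$hits" |
  awk -F '\t' -v max=0.05 '$3 + 0 <= max + 0 {
    split($2, part, "_")
    print part[1] "_" part[2] "\t" $3
  }' | sort -k2,2g)

# Best hit: accession and distance of the first ranked row.
best=$(printf '%s\n' "$ranked" | head -n 1)
echo "$best"
```

All five rows pass the assumed cutoff, and the best hit is GCF_000424245.1 at a distance of 0.037311.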
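A read set may contain sequences from multiple genomes. A Mash sketch database can be used to "screen" the reads against the references (a new feature in Mash v2.0), which estimates whether each reference genome is contained in the read set.
Taking the metagenomic sample DJ3.fastq as an example, screened against refseq.genomes.k21.s1000.msh: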
(mash) [user@server ~]# mash screen -p 100 refseq.genomes.k21.s1000.msh DJ3.fastq > DJ3.tab
The output has six fields: identity, shared-hashes, median-multiplicity, p-value, query-ID, and query-comment:
0.999522 990/1000 101 0 GCF_900086185.1_12082_4_85_genomic.fna.gz [51 seqs] NZ_FLIP01000001.1 Klebsiella pneumoniae strain k1037, whole genome shotgun sequence [...]
0.999329 986/1000 24 0 GCF_002055205.1_ASM205520v1_genomic.fna.gz [72 seqs] NZ_MYOO01000010.1 Salmonella enterica strain BCW_4904 NODE_10_length_177558_cov_3.07217, whole genome shotgun sequence [...]
0.999329 986/1000 24 0 GCF_002054075.1_ASM205407v1_genomic.fna.gz [88 seqs] NZ_MYNK01000010.1 Salmonella enterica strain BCW_4936 NODE_10_length_177385_cov_3.78874, whole genome shotgun sequence [...]
0.999329 986/1000 24 0 GCF_000474475.1_CFSAN001184_01.0_genomic.fna.gz [45 seqs] NZ_AUQM01000001.1 Salmonella enterica subsp. enterica serovar Typhimurium str. CDC_2009K1158 isolate 2009K-1158 SEET1158_1, whole genome shotgun sequence [...]
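The identity column follows from the shared-hashes column. A minimal sketch, assuming the relationship described for Mash Screen, in which the containment c = shared / total is converted to an identity estimate as c^(1/k), with k = 21 as indicated by the database name. The function name screen_identity is hypothetical.

```shell
# Hypothetical helper: estimate identity from a "shared/total" value and k.
# Assumes identity = (shared / total) ^ (1 / k).
screen_identity() {
  awk -v frac="$1" -v k="$2" 'BEGIN {
    split(frac, h, "/")                       # h[1] = shared, h[2] = total
    printf "%.6g\n", (h[1] / h[2]) ^ (1 / k)
  }'
}

screen_identity 990/1000 21   # Klebsiella pneumoniae row above
screen_identity 986/1000 21   # Salmonella enterica rows above
```

These calls give 0.999522 and 0.999329, the identities listed above. To rank a full result file by identity, a general numeric reverse sort such as `sort -gr DJ3.tab | head` can be used.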
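(mash) [user@server ~]# mash info all_clostridium_refer.msh # using a Mash sketch database of 99 Clostridium strains as an example
The information reported for a Mash sketch database includes the number of sketches, the length of each genome sequence, and the number of contigs in each genome.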
Header:
Hash function (seed): MurmurHash3_x64_128 (42)
K-mer size: 21 (64-bit hashes)
Alphabet: ACGT (canonical)
Target min-hashes per sketch: 1000
Sketches: 99
Sketches:
[Hashes] [Length] [ID] [Comment]
1000 4132880 endophytes_genomes/GCA_000008765.1.fna [2 seqs] AE001437.1 Clostridium acetobutylicum ATCC 824, complete genome [...]
1000 1835704 endophytes_genomes/GCM10015198.fna [32 seqs] FRBG01000001.1 [Clostridium] paradoxum JW-YL-7 = DSM 7308 genome assembly, contig: Ga0056075_scaffold00001.1, whole genome shotgun sequence [...]
......
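The sketch table can be summarized with a short awk script. The sketch below assumes that each row is whitespace-separated, with the hash count, sequence length, and ID as the first three columns, followed by a "[N seqs]" comment, as in the listing above. The variable info_rows is hypothetical and holds only the two rows shown, not all 99.

```shell
# Hypothetical sample: the two sketch rows shown above.
info_rows='1000 4132880 endophytes_genomes/GCA_000008765.1.fna [2 seqs] AE001437.1 Clostridium acetobutylicum ATCC 824, complete genome [...]
1000 1835704 endophytes_genomes/GCM10015198.fna [32 seqs] FRBG01000001.1 [Clostridium] paradoxum JW-YL-7 = DSM 7308 genome assembly, contig: Ga0056075_scaffold00001.1, whole genome shotgun sequence [...]'

# Count sketches, sum genome lengths (column 2), and sum the sequence
# counts taken from the "[N seqs]" comment (column 4 starts with "[").
summary=$(printf '%s\n' "$info_rows" | awk '
  $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
    n++
    total += $2
    if ($4 ~ /^\[[0-9]+$/) seqs += substr($4, 2)
  }
  END { printf "%d sketches, %d bp, %d sequences\n", n, total, seqs }')
echo "$summary"
```

For the two rows above, this prints "2 sketches, 5968584 bp, 34 sequences".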
- Ondov BD, Treangen TJ, Melsted P, et al. Mash: fast genome and metagenome distance estimation using minhash. Genome Biol. 2016;17(1):132. doi:10.1186/s13059-016-0997-x.
- Ondov BD, Starrett GJ, Sappington A, et al. Mash Screen: high-throughput sequence containment estimation for genome discovery. Genome Biol. 2019;20(1). doi:10.1186/s13059-019-1841-x.