基于CAT模型（针对组分异质性）的贝叶斯树构建——PhyloBayes MPI

Published at 2020-06-24 17:15

Author：zhixy

CAT模型

RAxML也用CAT模型来描述在不同位点上的替代速率异质性，但PhyloBayes的CAT不同于RAxML的CAT。因为，PhyloBayes的CAT考虑了同一状态（碱基/氨基酸）在不同对齐位点上可以有不同的概率变化为其它状态（碱基/氨基酸）。举例来说，在某一位置上的精氨酸有很高的概率可以变为组氨酸，而在另一个位置上却有很高的概率变为赖氨酸。

因此，PhyloBayes的CAT模型实际上是考虑了在不同位点上的组分异质性的，而RAxML的CAT模型并没有，仍然是均质性的模型。

PhyloBayes MPI

PhyloBayes实现基于CAT模型的MCMC(Markov Chain Monte Carlo，马尔科夫链蒙特卡洛方法)算法，计算时间非常高。因此其作者多年后推出了MPI并行版本。但即使是并行版本，对于大体量数据，如：源自基因组的多基因数据集合，也相当耗时。建议在运行该软件前，进行Taxon Sampling，在保证物种多样性的前提下，尽可能的缩小数据量。

安装

PhyloBayes的官网（www.phylobayes.org）现不能访问。但在Github上可下载源码（git clone）。

进入sources目录执行make编译(系统需要有支持MPI的环境mpicc)后，在data目录中会生成PhyloBayes MPI的可执行程序，包括：

pb_mpi，mcmc采样器。每次运行将产生一条mcmc链。
bpcomp，评估两个或多个独立mcmc链之间的二分频率的差异，并通过汇集所有正在比较的mcmc链的树来计算一致树。
tracecomp，根据跟踪文件中提供的摘要变量评估两个或多个独立mcmc链之间的差异。
readpb_mpi，后分析程序，用于计算各种后验平均值。

可将以上程序拷贝至可执行程序目录(如/usr/local/bin)，方便调用。

用法

MPI版本的程序，需要在MPI环境下执行，形式上即通过mpirun来运行相应程序，如：

(base) [user@server ~]＄ mpirun -np 10 pb_mpi

mpirun -np <np> pb_mpi -d <datafile> [options] <chainname>
        creates a new chain, sampling from the posterior distribution, conditional on specified data

mpirun -np <np> pb_mpi <chainname>
        starts an already existing chain

        mpirun -np <np>     : number of parallel processes (should be at least 2)

        -cat -dp            : infinite mixture (Dirichlet process) of equilibirium frequency profiles
        -ncat <ncat>        : finite mixture of equilibirium frequency profiles
        -catfix <pr>        : specifying a fixed pre-defined mixture of profiles

        -lg                 : Le and Gascuel 2008
        -wag                : Whelan and Goldman 2001
        -jtt                : Jones, Taylor, Thornton 1992
        -gtr                : general time reversible
        -poisson            : Poisson matrix, all relative exchangeabilities equal to 1 (Felsenstein 1981)

        -dgam <ncat>        : discrete gamma. ncat = number of categories (4 by default, 1 = uniform rates model)

        -dc                 : excludes constant columns
        -t <treefile>       : starts from specified tree
        -T <treefile>       : chain run under fixed, specified tree

        -x <every> <until>  : saving frequency, and chain length (until = -1 : forever)
        -f                  : forcing checks
        -s/-S               : -s : save all / -S : save only the trees


see manual for details

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[62923,1],3]
  Exit code:    1
--------------------------------------------------------------------------

mpirun的主要参数 -np 设定了运行程序时调用的cpu数量。

在以上显示信息中，可以看到pb_mpi有两个必要参数：

-d <datafile>，通过-d指定序列比对文件（phylip格式）
<chainname>，指定mcmc链的名字

此外，关于模型的参数都是可选的[options]。下面通过软件包例的示例序列文件brpo.ali(为节省时间，仅保留了前10条序列)来演示bayes树的构建。

(base) [user@server ~]＄ mpirun -np 10 pb_mpi -d brpo.ali run1 & #调用10 cpus，运行mcmc方法，命名为run1，并至于后台
(base) [user@server ~]＄ mpirun -np 10 pb_mpi -d brpo.ali run2 & #调用10 cpus，运行mcmc方法，命名为run2，并至于后台

启动两个mcmc链，程序会一致运行下去。终止程序可通过单线程版PhyloBayes中的stoppb程序来终止，或将后台运行的程序移至（ctl f）前台后ctl c终止。

什么时候可以终止mcmc链的运行？，这是贝叶斯方法构建进化树的一个关键问题。

mcmc方法的本质是，从一个参数的先验分布开始，通过评价对观测数据（在系统发育分析中，即比对序列）的拟合，将先验分布逐渐调整为对观测数据拟合程度的最高的后验分布。

因此，当后验分布不能再更好的拟合观测数据时，也就是mcmc链已经收敛（Convergence）。PhyloBayes MPI检验收敛的方法是通过bpcomp程序完成的。

(base) [user@server ~]＄ bpcomp -x 100 10 run1 run2
run1.treelist : 3071 trees
run2.treelist : 3055 trees

maxdiff     : 0.0462656 
meandiff    : 0.00696971

bipartition list in : bpcomp.bplist
consensus in        : bpcomp.con.tre

-x 100 10: 去除前100个数据点（burn-in），此后每10个点采集一次数据。该参数的设定可根据mcmc链的长度来调整。

其中的关键指标maxdiff:

maxdiff < 0.1: 收敛的结果;
maxdiff < 0.3: 可接受的结果;
0.3 < maxdiff < 1: 未收敛的结果；
maxdiff = 1: 即使在10000个数据点之后，这表明至少有一个mcmc链卡在局部最大值中，即无法达到全局最大值。

如上例所示，两个mcmc链收敛，可终止；bpcomp产生的结果bpcomp.con.tre，即我们所需要的贝叶斯一致树。

模型的设定

在pd_mpi程序中，可通过以下参数来设定不同进化模型。

组分频率向量：

-cat -dp : infinite mixture (Dirichlet process) of equilibirium frequency profiles
-ncat <ncat> : finite mixture of equilibirium frequency profiles
-catfix <pr>/<filename> : specifying a fixed pre-defined mixture of profiles

The Dirichlet process usually has a much better mixing behavior. -cat优于-ncat。

替代速率矩阵：

-lg : Le and Gascuel 2008 # LG
-wag : Whelan and Goldman 2001 # WAG
-jtt : Jones, Taylor, Thornton 1992 # JTT
-gtr : general time reversible # GTR
-poisson : Poisson matrix, all relative exchangeabilities equal to 1 (Felsenstein 1981)

位点间速率异质性:

-dgam <n> : discrete gamma. ncat = number of categories (4 by default, 1 = uniform rates model)

所以用CATGTRG4模型启动mcmc链的命令参数如下：

(base) [user@server ~]＄ mpirun -np 10 pb_mpi -d brpo.ali -cat -gtr -dgam 4 run3

后验预测多样性检验

后验预测多样性检验 (posterior predictive diversity test)，可用于评价建树模型对观测数据（真实比对序列）的拟合程度。

(base) [user@server ~]＄  mpirun -np 10 readpb_mpi -div -x 100 10 run2
read data from file : brpo.ali
number of taxa  : 10
number of sites : 599
number of states: 20
burnin: 100
every 10 points until 3819
....................................................................................................................................................................................
....................................................................................................................................................................................
............

result of diversity test in run2.div

(base) [user@server ~]＄ cat run2.div 
diversity test
obs div : 2.55092
mean div: 2.55794 /- 0.044931
z-score : 0.156314
pp      : 0.432796 # p值

当p值小于0.05时，根据模型模拟的数据分布于真是观测数据有显著差异，即模型不能解释真实数据(model fit test 失败)。

参考文献

Lartillot N., Philippe H. A Bayesian Mixture Model for Across-Site Heterogeneities in the Amino-Acid Replacement Process. Mol Biol Evol. 2004 21(6):1095-1109. DOI:10.1093/molbev/msh112.

Lartillot N, Rodrigue N, Stubbs D, Richer J. PhyloBayes MPI: Phylogenetic Reconstruction With Infinite Mixtures of Profiles in a Parallel Environment. Syst Biol. 2013 Jul;62(4):611-5. doi:10.1093/sysbio/syt022.

Columns

Python ICNP Others R Linux Phylogenetics Phylogenomics Genomics Reference Evolution Bioinformatics Protocols Metagenomics Statistics MinIO—分布式对象存储服务器

A Lab of Microbial Systematics and Evolution