利用MrBayes构建Bayes树

Published at 2020-06-29 19:44

Author：zhixy

MrBayes简介

MrBayes是最早的也是最优的利用贝叶斯法构建进化树的软件之一。

推荐用conda安装，但MrBayes的MPI版本需要下载源码编译安装。MrBayes采用交互模式，在执行其可执行命令mb或mb-mpi后，即可进入MrBayes软件环境，然后我们需要通过输入一些命令来完成贝叶斯分析。

(base) [user@server ~]$ mb 


                            MrBayes 3.2.7a x86_64

                      (Bayesian Analysis of Phylogeny)

              Distributed under the GNU General Public License


               Type "help" or "help <command>" for information
                     on the commands that are available.

                   Type "about" for authorship and general
                       information about the program.


MrBayes >

mb-mpi 的执行需用 mpirun -n 8 mb-mpi，需要注意的是并行版本的mrbayes，并不能随意设置核心数，需要根据mcmc链的数量设置。

读取输入文件

MrBayes只能读取Nexus格式的序列比对文件。

其他可是的比对文件可通过trimAL软件包中的readAL程序进行格式转换。

MrBayes > execute example.nex # execute 读取命令

   Executing file "example.nex"
   UNIX line termination
   Longest line length = 919
   Parsing file
   Expecting NEXUS formatted file
   Reading data block
      Allocated taxon set
      Allocated matrix
      Defining new matrix with 12 taxa and 898 characters
      Data is Dna
      Data matrix is not interleaved
      Gaps coded as -
      Taxon  1 -> Tarsius_syrichta
      Taxon  2 -> Lemur_catta
      Taxon  3 -> Homo_sapiens
      Taxon  4 -> Pan
      Taxon  5 -> Gorilla
      Taxon  6 -> Pongo
      Taxon  7 -> Hylobates
      Taxon  8 -> Macaca_fuscata
      Taxon  9 -> M_mulatta
      Taxon 10 -> M_fascicularis
      Taxon 11 -> M_sylvanus
      Taxon 12 -> Saimiri_sciureus
      Successfully read matrix
      Setting default partition (does not divide up characters)
      Setting model defaults
      Seed (for generating default start values) = 1593432009
      Setting output file names to "example.nex.run<i>.<p|t>"
   Exiting data block
   Reached end of file

MrBayes >

模型设置

Mrbayes的模型设置主要通过lset和prset两个命令完成，此外执行showmodel可以显示当前模型设置的情况。

MrBayes > showmodel

   Model settings:

      Data not partitioned --
         Datatype  = DNA
         Nucmodel  = 4by4
         Nst       = 1
         Covarion  = No
         # States  = 4
                     State frequencies have a Dirichlet prior
                     (1.00,1.00,1.00,1.00)
         Rates     = Equal

   Active parameters: 

      Parameters
      ---------------------
      Statefreq           1
      Ratemultiplier      2
      Topology            3
      Brlens              4
      ---------------------

      1 --  Parameter  = Pi
            Type       = Stationary state frequencies
            Prior      = Dirichlet

      2 --  Parameter  = Ratemultiplier
            Type       = Partition-specific rate multiplier
            Prior      = Fixed(1.0)

      3 --  Parameter  = Tau
            Type       = Topology
            Prior      = All topologies equally probable a priori
            Subparam.  = V

      4 --  Parameter  = V
            Type       = Branch lengths
            Prior      = Unconstrained:GammaDir(1.0,0.1000,1.0,1.0)


MrBayes >

`Nucmodel` 核苷酸数据模式

4by4 针对一般的DNA序列
Doublet 针对核糖体RNA基因配对茎区
Codon 针对以密码子方式看待的DNA序列
Protein 针对编码蛋白质的DNA序列

`Nst` 替代模型的基本接哦股设定

1 替代矩阵中仅有一个单一的替代速率参数（JC模型）
2 替代矩阵中有两个替代速率参数，分别针对颠换和转换（K2P模型）
6 替代矩阵中有六个替代速率参数（GTR模型）
Mixed

`Rates` 位点间速率异质性

Equal 等速率，即无位点间替代速率差异
Gamma 位点替代速率差异用Gamma分布描述（G）
LNorm 位点替代速率差异用对数正态分布描述
Propinv 不变位点比例参数（I）
Invgamma 位点替代速率差异用Gamma分布描述，并加以不变位点比例参数（I G）
Adgamma
Kmixture 位点替代速率差异用K组的Dirichlet分布描述

`Ngammacat` 离散Gamma分布的类别数（一般为4组）

以上模型参数的设定需要lset命令

MrBayes > lset Nst=6 Rates=Invgamma Ngammacat=4

   Setting Nst to 6
   Setting Rates to Invgamma
   Setting Ngammacat to 4
   Successfully set likelihood model parameters

MrBayes >

在进化模型中，除了替代速率矩阵外，还有一项重要的内容，即组分频率向量。

`statefreqpr` 组分频率先验分布

Dirichlet Dirichlet先验分布
Fixed 固定频率（JC，SYM模型）

区别于替代速率矩阵相关参数的设定，组分频率的设定需要通过prset命令完成。事实上prset是针对所有相关参数的先验分布设定的。

MCMC模拟

模型设定完成后，接下来就可以启动MCMC模拟了。MrBayes启动MCMC需要命令mcmc，而与MCMC有关的参数需要命令mcmcp来设定。

Ngen mcmc模拟的代数
Nruns 默认值2，独立运行run数量
Nchains 默认值4，每个run的马尔可夫链数量
Samplefreq 默认值500，数据采样频率
Printfreq 默认值1000，屏幕输出频率
Stoprule 默认值No，当设为Yes时，mcmc会根据监测收否已到达收敛参数而自动停止运行，否则会一致运行到Ngen所设置的代数
Stopval 当Stoprule=yes时，收敛参数达到该参数设定阈值时，mcmc即停止。

MrBayes > mcmcp Ngen=1500000 Nruns=2 Nchains=4 Samplefreq=1000 Stoprule=yes Stopval=0.01

   Setting number of generations to 1500000
   Setting number of runs to 2
   Setting number of chains to 4
   Setting sample frequency to 1000
   Using stopping rule.
   Setting burnin fraction to 0.25
   Setting chain output file names to "example.nex.run<i>.<p/t>"
   Successfully set chain parameters

MrBayes > mcmc

结果评价与输出

Mrbayes推荐以Average standard deviation of split frequencies < 0.01来判断MCMC的收敛程度，即结果优良程度。所以当模拟代数已经完成，而Average standard deviation of split frequencies已经小于0.01时，即可终止MCMC。

`sump` 汇总数据样本中的模型参数

MrBayes > sump

   Summarizing parameters in files example.nex.run1.p and example.nex.run2.p
   Writing summary statistics to file example.nex.pstat
   Using relative burnin ('relburnin=yes'), discarding the first 25 % of samples

   Below are rough plots of the generation (x-axis) versus the log   
   probability of observing the data (y-axis). You can use these     
   graphs to determine what the burn in for your analysis should be. 
   When the log probability starts to plateau you may be at station- 
   arity. Sample trees and parameters after the log probability      
   plateaus. Of course, this is not a guarantee that you are at sta- 
   tionarity. Also examine the convergence diagnostics provided by   
   the 'sump' and 'sumt' commands for all the parameters in your     
   model. Remember that the burn in is the number of samples to dis- 
   card. There are a total of ngen / samplefreq samples taken during 
   a MCMC analysis.                                                  

   Overlay plot for both runs:
   (1 = Run number 1; 2 = Run number 2; * = Both runs)

   ------------------------------------------------------------ -5719.47
   |                            1                               |
   |                                                            |
   |                                                          1 |
   |       2                                   1                |
   |                      2         2        1  11     1        |
   |    1    1   122   1            12    222      1 1    1     |
   |**1     2   1      22    1        2 *    21   2       2  1 1|
   |      1     2  1221 121 *2  2    1   1    22  12  *  *     2|
   |           2  1  12       22 1    1  2  1                   |
   |   12    22  2  1      2       2       1     2  2  2   1  2 |
   |      2    1         1       2 1            2   1   *       |
   |  2  *  1                  1  2    2  1                     |
   |       1               1  1   1                  2     22   |
   |          1                        1                    12  |
   |   2                                                        |
   --------------------------------------------------- -5728.29
   ^                                                            ^
   125000                                                       500000


   Estimated marginal likelihoods for runs sampled in files
      "example.nex.run1.p" and "example.nex.run2.p":
      (Use the harmonic mean for Bayes factor comparisons of models)

      (Values are saved to the file example.nex.lstat)

   Run   Arithmetic mean   Harmonic mean
   --------------------------------------
     1      -5719.39         -5734.56
     2      -5719.30         -5735.78
   --------------------------------------
   TOTAL    -5719.35         -5735.34
   --------------------------------------


   Model parameter summaries over the runs sampled in files
      "example.nex.run1.p" and "example.nex.run2.p":
      Summaries are based on a total of 752 samples from 2 runs.
      Each run produced 501 samples of which 376 samples were included.
      Parameter summaries saved to file "example.nex.pstat".

                                         95% HPD Interval
                                       --------------------
   Parameter      Mean      Variance     Lower       Upper       Median    min ESS*  avg ESS    PSRF 
   --------------------------------------------------------------------------------------------------
   TL          3.184384    0.106319    2.607986    3.855185    3.165946    206.80    249.29    0.999
   r(A<->C)    0.041903    0.000071    0.025691    0.057371    0.041786    195.65    218.26    0.999
   r(A<->G)    0.492460    0.002216    0.403166    0.583745    0.494164    136.48    157.84    1.000
   r(A<->T)    0.035249    0.000064    0.020791    0.051518    0.034684    198.30    241.48    1.003
   r(C<->G)    0.029333    0.000161    0.008188    0.054821    0.028833    242.20    279.11    1.001
   r(C<->T)    0.381829    0.001809    0.297725    0.459089    0.381034     47.86    115.95    1.001
   r(G<->T)    0.019225    0.000167    0.000066    0.042855    0.016916    211.99    222.03    1.000
   pi(A)       0.354387    0.000158    0.331452    0.379060    0.354105    270.03    282.43    0.999
   pi(C)       0.321965    0.000114    0.302611    0.342663    0.321753    205.51    256.60    0.999
   pi(G)       0.079765    0.000044    0.067832    0.093384    0.079435    151.46    192.30    1.000
   pi(T)       0.243882    0.000115    0.222965    0.263219    0.243739    272.79    286.54    1.000
   alpha       0.599274    0.033792    0.341705    0.945335    0.561036    169.28    180.25    0.999
   pinvar      0.145666    0.006536    0.000059    0.270895    0.146287    155.67    170.06    0.999
   --------------------------------------------------------------------------------------------------
   * Convergence diagnostic (ESS = Estimated Sample Size); min and avg values
     correspond to minimal and average ESS among runs. 
     ESS value below 100 may indicate that the parameter is undersampled. 
    Convergence diagnostic (PSRF = Potential Scale Reduction Factor; Gelman
     and Rubin, 1992) should approach 1.0 as runs converge.
MrBayes >

`sumt` 汇总样本树，并生成一致树

MrBayes > sumt

   Summarizing trees in files "example.nex.run1.t" and "example.nex.run2.t"
   Using relative burnin ('relburnin=yes'), discarding the first 25 % of sampled trees
   Writing statistics to files example.nex.<parts|tstat|vstat|trprobs|con>
   Examining first file ...
   Found one tree block in file "example.nex.run1.t" with 501 trees in last block
   Expecting the same number of trees in the last tree block of all files

   Tree reading status:

   0      10      20      30      40      50      60      70      80      90     100
   v-------v-------v-------v-------v-------v-------v-------v-------v-------v-------v
   *********************************************************************************

   Read a total of 1002 trees in 2 files (sampling 752 of them)
      (Each file contained 501 trees of which 376 were sampled)

   General explanation:                                                          

   In an unrooted tree, a taxon bipartition (split) is specified by removing a   
   branch, thereby dividing the species into those to the left and those to the  
   right of the branch. Here, taxa to one side of the removed branch are denoted 
   '.' and those to the other side are denoted '*'. Specifically, the '.' symbol 
   is used for the taxa on the same side as the outgroup.                        

   In a rooted or clock tree, the tree is rooted using the model and not by      
   reference to an outgroup. Each bipartition therefore corresponds to a clade,  
   that is, a group that includes all the descendants of a particular branch in  
   the tree.  Taxa that are included in each clade are denoted using '*', and    
   taxa that are not included are denoted using the '.' symbol.                  

   The output first includes a key to all the bipartitions with frequency larger 
   or equual to (Minpartfreq) in at least one run. Minpartfreq is a parameter to 
   sumt command and currently it is set to 0.10.  This is followed by a table  
   with statistics for the informative bipartitions (those including at least    
   two taxa), sorted from highest to lowest probability. For each bipartition,   
   the table gives the number of times the partition or split was observed in all
   runs (#obs) and the posterior probability of the bipartition (Probab.), which 
   is the same as the split frequency. If several runs are summarized, this is   
   followed by the minimum split frequency (Min(s)), the maximum frequency       
   (Max(s)), and the standard deviation of frequencies (Stddev(s)) across runs.  
   The latter value should approach 0 for all bipartitions as MCMC runs converge.

   This is followed by a table summarizing branch lengths, node heights (if a    
   clock model was used) and relaxed clock parameters (if a relaxed clock model  
   was used). The mean, variance, and 95 % credible interval are given for each 
   of these parameters. If several runs are summarized, the potential scale      
   reduction factor (PSRF) is also given; it should approach 1 as runs converge. 
   Node heights will take calibration points into account, if such points were   
   used in the analysis.                                                         

   Note that Stddev may be unreliable if the partition is not present in all     
   runs (the last column indicates the number of runs that sampled the partition 
   if more than one run is summarized). The PSRF is not calculated at all if     
   the partition is not present in all runs.The PSRF is also sensitive to small  
   sample sizes and it should only be considered a rough guide to convergence    
   since some of the assumptions allowing one to interpret it as a true potential
   scale reduction factor are violated in MrBayes.                               

   List of taxa in bipartitions:                                                 

      1 -- Tarsius_syrichta
      2 -- Lemur_catta
      3 -- Homo_sapiens
      4 -- Pan
      5 -- Gorilla
      6 -- Pongo
      7 -- Hylobates
      8 -- Macaca_fuscata
      9 -- M_mulatta
     10 -- M_fascicularis
     11 -- M_sylvanus
     12 -- Saimiri_sciureus

   Key to taxon bipartitions (saved to file "example.nex.parts"):

   ID -- Partition
   ------------------
    1 -- .***********
    2 -- .*..........
    3 -- ..*.........
    4 -- ...*........
    5 -- ....*.......
    6 -- .....*......
    7 -- ......*.....
    8 -- .......*....
    9 -- ........*...
   10 -- .........*..
   11 -- ..........*.
   12 -- ...........*
   13 -- ..***.......
   14 -- ..*********.
   15 -- ..**********
   16 -- .......***..
   17 -- .......****.
   18 -- .......**...
   19 -- ..****......
   20 -- ..*****.....
   21 -- ..**........
   ------------------

   Summary statistics for informative taxon bipartitions
      (saved to file "example.nex.tstat"):

   ID   #obs    Probab.     Sd(s)      Min(s)      Max(s)   Nruns 
   ----------------------------------------------------------------
   13   752    1.000000    0.000000    1.000000    1.000000    2
   14   752    1.000000    0.000000    1.000000    1.000000    2
   15   752    1.000000    0.000000    1.000000    1.000000    2
   16   752    1.000000    0.000000    1.000000    1.000000    2
   17   752    1.000000    0.000000    1.000000    1.000000    2
   18   752    1.000000    0.000000    1.000000    1.000000    2
   19   752    1.000000    0.000000    1.000000    1.000000    2
   20   752    1.000000    0.000000    1.000000    1.000000    2
   21   751    0.998670    0.001881    0.997340    1.000000    2
   ----------------------------------------------------------------
    Convergence diagnostic (standard deviation of split frequencies)
     should approach 0.0 as runs converge.


   Summary statistics for branch and node parameters
      (saved to file "example.nex.vstat"):

                                           95% HPD Interval
                                         --------------------
   Parameter      Mean       Variance     Lower       Upper       Median     PSRF  Nruns
   --------------------------------------------------------------------------------------
   length[1]     0.541456    0.009519    0.353274    0.717554    0.537658    0.999    2
   length[2]     0.372879    0.005509    0.238197    0.516832    0.364400    1.001    2
   length[3]     0.051215    0.000146    0.028626    0.075192    0.050526    0.999    2
   length[4]     0.064156    0.000159    0.042227    0.089955    0.063130    0.999    2
   length[5]     0.061693    0.000241    0.034584    0.092876    0.059972    0.999    2
   length[6]     0.153199    0.000843    0.102179    0.209389    0.151336    1.000    2
   length[7]     0.180160    0.001205    0.116956    0.247668    0.178352    1.002    2
   length[8]     0.017412    0.000035    0.007366    0.029919    0.016864    1.001    2
   length[9]     0.023680    0.000046    0.011245    0.037230    0.022899    1.000    2
   length[10]    0.059184    0.000154    0.037673    0.085980    0.058277    0.999    2
   length[11]    0.075561    0.000418    0.038818    0.117051    0.074517    1.013    2
   length[12]    0.476301    0.006387    0.326916    0.635882    0.464266    1.000    2
   length[13]    0.089409    0.000513    0.049164    0.133348    0.087979    0.999    2
   length[14]    0.130533    0.002517    0.034572    0.221818    0.126364    0.999    2
   length[15]    0.301198    0.005515    0.159970    0.446587    0.296882    1.001    2
   length[16]    0.048544    0.000358    0.015916    0.086822    0.046766    1.008    2
   length[17]    0.274635    0.002832    0.179287    0.388344    0.270745    0.999    2
   length[18]    0.035885    0.000108    0.017321    0.057875    0.035077    0.999    2
   length[19]    0.059240    0.000540    0.019800    0.104376    0.057129    1.001    2
   length[20]    0.137187    0.001602    0.057767    0.212151    0.132745    1.000    2
   length[21]    0.030895    0.000147    0.009996    0.055880    0.029637    1.004    2
   --------------------------------------------------------------------------------------
    Convergence diagnostic (PSRF = Potential Scale Reduction Factor; Gelman
     and Rubin, 1992) should approach 1.0 as runs converge. NA is reported when
     deviation of parameter values within all runs is 0 or when a parameter
     value (a branch length, for instance) is not sampled in all runs.


   Summary statistics for partitions with frequency >= 0.10 in at least one run:
       Average standard deviation of split frequencies = 0.000209
       Maximum standard deviation of split frequencies = 0.001881
       Average PSRF for parameter values (excluding NA and >10.0) = 1.001
       Maximum PSRF for parameter values = 1.013


   Clade credibility values:

   /---------------------------------------------------------- Tarsius_syrichta (1)
   |                                                                               
   |---------------------------------------------------------- Lemur_catta (2)
   |                                                                               
   |                                                 /-------- Homo_sapiens (3)
   |                                        /---100--                             
   |                                        |        \-------- Pan (4)
   |                                /--100--                                      
   |                                |       \----------------- Gorilla (5)
   |                        /--100--                                              
                           |       \------------------------- Pongo (6)
   |                /--100--                                                      
   |                |       \--------------------------------- Hylobates (7)
   |                |                                                              
   |                |                                /-------- Macaca_fuscata (8)
   |       /---100--                       /---100--                             
   |       |        |                       |        \-------- M_mulatta (9)
   |       |        |               /--100--                                      
   |       |        |               |       \----------------- M_fascicularis (10)
   \--100--        \------100------                                              
           |                        \------------------------- M_sylvanus (11)
           |                                                                       
           \-------------------------------------------------- Saimiri_sciure~ (12)


   Phylogram (based on average branch lengths):

   /---------------------------------------- Tarsius_syrichta (1)
   |                                                                               
   |--------------------------- Lemur_catta (2)
   |                                                                               
   |                                                     /---- Homo_sapiens (3)
   |                                                   /-                         
   |                                                   | \----- Pan (4)
   |                                            /------                           
   |                                            |      \---- Gorilla (5)
   |                                        /---                                  
                                           |   \----------- Pongo (6)
   |                              /---------                                      
   |                              |         \------------- Hylobates (7)
   |                              |                                                
   |                              |                         /-- Macaca_fuscata (8)
   |                     /--------                       /-                      
   |                     |        |                       | \-- M_mulatta (9)
   |                     |        |                   /---                        
   |                     |        |                   |   \---- M_fascicularis (10)
   \---------------------        \-------------------                            
                         |                            \------ M_sylvanus (11)
                         |                                                         
                         \---------------------------------- Saimiri_sciure~ (12)

   |-------------| 0.200 expected changes per site

   Calculating tree probabilities...

   Credible sets of trees (2 trees sampled):
      99 % credible set contains 1 tree

MrBayes >

example.nex.con.tre 最终所得Bayes树

参考文献

Huelsenbeck, J. P. and F. Ronquist. 2001. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17:754-755. DOI: 10.1093/bioinformatics/17.8.754

Ronquist, F. and J. P. Huelsenbeck. 2003. Mrbayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:1572-1574. DOI: 10.1093/bioinformatics/btg180

Ronquist, F., M. Teslenko, P. van der Mark, D. Ayres, A. Darling, S. Höhna, B. Larget, L. Liu, M. A. Suchard, and J. P. Huelsenbeck. 2012b. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Systematic Biology 61:539-542. DOI: https://doi.org/10.1093/sysbio/sys029

Columns

Python ICNP Others R Linux Phylogenetics Phylogenomics Genomics Reference Evolution Bioinformatics Protocols Metagenomics Statistics MinIO—分布式对象存储服务器

A Lab of Microbial Systematics and Evolution

利用MrBayes构建Bayes树

MrBayes简介

读取输入文件

模型设置

`Nucmodel` 核苷酸数据模式

`Nst` 替代模型的基本接哦股设定

`Rates` 位点间速率异质性

`Ngammacat` 离散Gamma分布的类别数（一般为4组）

`statefreqpr` 组分频率先验分布

MCMC模拟

结果评价与输出

`sump` 汇总数据样本中的模型参数

`sumt` 汇总样本树，并生成一致树

参考文献

Columns

Search Document

A Lab of Microbial Systematics and Evolution

利用MrBayes构建Bayes树

MrBayes简介

读取输入文件

模型设置

Nucmodel 核苷酸数据模式

Nst 替代模型的基本接哦股设定

Rates 位点间速率异质性

Ngammacat 离散Gamma分布的类别数 （一般为4组）

statefreqpr 组分频率先验分布

MCMC模拟

结果评价与输出

sump 汇总数据样本中的模型参数

sumt 汇总样本树，并生成一致树

参考文献

Columns

Search Document

`Nucmodel` 核苷酸数据模式

`Nst` 替代模型的基本接哦股设定

`Rates` 位点间速率异质性

`Ngammacat` 离散Gamma分布的类别数（一般为4组）

`statefreqpr` 组分频率先验分布

`sump` 汇总数据样本中的模型参数

`sumt` 汇总样本树，并生成一致树