Published at 2020-06-29 19:44
Author:zhixy
View:1363
MrBayes是最早的也是最优的利用贝叶斯法构建进化树的软件之一。
推荐用conda
安装,但MrBayes的MPI版本需要下载源码编译安装。MrBayes采用交互模式,在执行其可执行命令mb
或mb-mpi
后,即可进入MrBayes软件环境,然后我们需要通过输入一些命令来完成贝叶斯分析。
(base) [user@server ~]$ mb
MrBayes 3.2.7a x86_64
(Bayesian Analysis of Phylogeny)
Distributed under the GNU General Public License
Type "help" or "help <command>" for information
on the commands that are available.
Type "about" for authorship and general
information about the program.
MrBayes >
mb-mpi
的执行需用mpirun -n 8 mb-mpi
,需要注意的是并行版本的mrbayes,并不能随意设置核心数,需要根据mcmc链的数量设置。
MrBayes只能读取Nexus格式的序列比对文件。
其他可是的比对文件可通过trimAL软件包中的readAL程序进行格式转换。
MrBayes > execute example.nex # execute 读取命令
Executing file "example.nex"
UNIX line termination
Longest line length = 919
Parsing file
Expecting NEXUS formatted file
Reading data block
Allocated taxon set
Allocated matrix
Defining new matrix with 12 taxa and 898 characters
Data is Dna
Data matrix is not interleaved
Gaps coded as -
Taxon 1 -> Tarsius_syrichta
Taxon 2 -> Lemur_catta
Taxon 3 -> Homo_sapiens
Taxon 4 -> Pan
Taxon 5 -> Gorilla
Taxon 6 -> Pongo
Taxon 7 -> Hylobates
Taxon 8 -> Macaca_fuscata
Taxon 9 -> M_mulatta
Taxon 10 -> M_fascicularis
Taxon 11 -> M_sylvanus
Taxon 12 -> Saimiri_sciureus
Successfully read matrix
Setting default partition (does not divide up characters)
Setting model defaults
Seed (for generating default start values) = 1593432009
Setting output file names to "example.nex.run<i>.<p|t>"
Exiting data block
Reached end of file
MrBayes >
Mrbayes的模型设置主要通过lset
和prset
两个命令完成,此外执行showmodel
可以显示当前模型设置的情况。
MrBayes > showmodel
Model settings:
Data not partitioned --
Datatype = DNA
Nucmodel = 4by4
Nst = 1
Covarion = No
# States = 4
State frequencies have a Dirichlet prior
(1.00,1.00,1.00,1.00)
Rates = Equal
Active parameters:
Parameters
---------------------
Statefreq 1
Ratemultiplier 2
Topology 3
Brlens 4
---------------------
1 -- Parameter = Pi
Type = Stationary state frequencies
Prior = Dirichlet
2 -- Parameter = Ratemultiplier
Type = Partition-specific rate multiplier
Prior = Fixed(1.0)
3 -- Parameter = Tau
Type = Topology
Prior = All topologies equally probable a priori
Subparam. = V
4 -- Parameter = V
Type = Branch lengths
Prior = Unconstrained:GammaDir(1.0,0.1000,1.0,1.0)
MrBayes >
Nucmodel
核苷酸数据模式Nst
替代模型的基本接哦股设定Rates
位点间速率异质性Ngammacat
离散Gamma分布的类别数 (一般为4组)以上模型参数的设定需要
lset
命令
MrBayes > lset Nst=6 Rates=Invgamma Ngammacat=4
Setting Nst to 6
Setting Rates to Invgamma
Setting Ngammacat to 4
Successfully set likelihood model parameters
MrBayes >
在进化模型中,除了替代速率矩阵外,还有一项重要的内容,即组分频率向量。
statefreqpr
组分频率先验分布区别于替代速率矩阵相关参数的设定,组分频率的设定需要通过prset
命令完成。事实上prset
是针对所有相关参数的先验分布设定的。
模型设定完成后,接下来就可以启动MCMC模拟了。MrBayes启动MCMC需要命令mcmc
,而与MCMC有关的参数需要命令mcmcp
来设定。
Ngen
mcmc模拟的代数Nruns
默认值2,独立运行run数量Nchains
默认值4,每个run的马尔可夫链数量Samplefreq
默认值500,数据采样频率Printfreq
默认值1000,屏幕输出频率Stoprule
默认值No,当设为Yes时,mcmc会根据监测收否已到达收敛参数而自动停止运行,否则会一致运行到Ngen
所设置的代数Stopval
当Stoprule=yes
时,收敛参数达到该参数设定阈值时,mcmc即停止。MrBayes > mcmcp Ngen=1500000 Nruns=2 Nchains=4 Samplefreq=1000 Stoprule=yes Stopval=0.01
Setting number of generations to 1500000
Setting number of runs to 2
Setting number of chains to 4
Setting sample frequency to 1000
Using stopping rule.
Setting burnin fraction to 0.25
Setting chain output file names to "example.nex.run<i>.<p/t>"
Successfully set chain parameters
MrBayes > mcmc
Mrbayes推荐以Average standard deviation of split frequencies < 0.01来判断MCMC的收敛程度,即结果优良程度。所以当模拟代数已经完成,而Average standard deviation of split frequencies已经小于0.01时,即可终止MCMC。
sump
汇总数据样本中的模型参数MrBayes > sump
Summarizing parameters in files example.nex.run1.p and example.nex.run2.p
Writing summary statistics to file example.nex.pstat
Using relative burnin ('relburnin=yes'), discarding the first 25 % of samples
Below are rough plots of the generation (x-axis) versus the log
probability of observing the data (y-axis). You can use these
graphs to determine what the burn in for your analysis should be.
When the log probability starts to plateau you may be at station-
arity. Sample trees and parameters after the log probability
plateaus. Of course, this is not a guarantee that you are at sta-
tionarity. Also examine the convergence diagnostics provided by
the 'sump' and 'sumt' commands for all the parameters in your
model. Remember that the burn in is the number of samples to dis-
card. There are a total of ngen / samplefreq samples taken during
a MCMC analysis.
Overlay plot for both runs:
(1 = Run number 1; 2 = Run number 2; * = Both runs)
------------------------------------------------------------ -5719.47
| 1 |
| |
| 1 |
| 2 1 |
| 2 2 1 11 1 |
| 1 1 122 1 12 222 1 1 1 |
|**1 2 1 22 1 2 * 21 2 2 1 1|
| 1 2 1221 121 *2 2 1 1 22 12 * * 2|
| 2 1 12 22 1 1 2 1 |
| 12 22 2 1 2 2 1 2 2 2 1 2 |
| 2 1 1 2 1 2 1 * |
| 2 * 1 1 2 2 1 |
| 1 1 1 1 2 22 |
| 1 1 12 |
| 2 |
--------------------------------------------------- -5728.29
^ ^
125000 500000
Estimated marginal likelihoods for runs sampled in files
"example.nex.run1.p" and "example.nex.run2.p":
(Use the harmonic mean for Bayes factor comparisons of models)
(Values are saved to the file example.nex.lstat)
Run Arithmetic mean Harmonic mean
--------------------------------------
1 -5719.39 -5734.56
2 -5719.30 -5735.78
--------------------------------------
TOTAL -5719.35 -5735.34
--------------------------------------
Model parameter summaries over the runs sampled in files
"example.nex.run1.p" and "example.nex.run2.p":
Summaries are based on a total of 752 samples from 2 runs.
Each run produced 501 samples of which 376 samples were included.
Parameter summaries saved to file "example.nex.pstat".
95% HPD Interval
--------------------
Parameter Mean Variance Lower Upper Median min ESS* avg ESS PSRF
--------------------------------------------------------------------------------------------------
TL 3.184384 0.106319 2.607986 3.855185 3.165946 206.80 249.29 0.999
r(A<->C) 0.041903 0.000071 0.025691 0.057371 0.041786 195.65 218.26 0.999
r(A<->G) 0.492460 0.002216 0.403166 0.583745 0.494164 136.48 157.84 1.000
r(A<->T) 0.035249 0.000064 0.020791 0.051518 0.034684 198.30 241.48 1.003
r(C<->G) 0.029333 0.000161 0.008188 0.054821 0.028833 242.20 279.11 1.001
r(C<->T) 0.381829 0.001809 0.297725 0.459089 0.381034 47.86 115.95 1.001
r(G<->T) 0.019225 0.000167 0.000066 0.042855 0.016916 211.99 222.03 1.000
pi(A) 0.354387 0.000158 0.331452 0.379060 0.354105 270.03 282.43 0.999
pi(C) 0.321965 0.000114 0.302611 0.342663 0.321753 205.51 256.60 0.999
pi(G) 0.079765 0.000044 0.067832 0.093384 0.079435 151.46 192.30 1.000
pi(T) 0.243882 0.000115 0.222965 0.263219 0.243739 272.79 286.54 1.000
alpha 0.599274 0.033792 0.341705 0.945335 0.561036 169.28 180.25 0.999
pinvar 0.145666 0.006536 0.000059 0.270895 0.146287 155.67 170.06 0.999
--------------------------------------------------------------------------------------------------
* Convergence diagnostic (ESS = Estimated Sample Size); min and avg values
correspond to minimal and average ESS among runs.
ESS value below 100 may indicate that the parameter is undersampled.
Convergence diagnostic (PSRF = Potential Scale Reduction Factor; Gelman
and Rubin, 1992) should approach 1.0 as runs converge.
MrBayes >
sumt
汇总样本树,并生成一致树MrBayes > sumt
Summarizing trees in files "example.nex.run1.t" and "example.nex.run2.t"
Using relative burnin ('relburnin=yes'), discarding the first 25 % of sampled trees
Writing statistics to files example.nex.<parts|tstat|vstat|trprobs|con>
Examining first file ...
Found one tree block in file "example.nex.run1.t" with 501 trees in last block
Expecting the same number of trees in the last tree block of all files
Tree reading status:
0 10 20 30 40 50 60 70 80 90 100
v-------v-------v-------v-------v-------v-------v-------v-------v-------v-------v
*********************************************************************************
Read a total of 1002 trees in 2 files (sampling 752 of them)
(Each file contained 501 trees of which 376 were sampled)
General explanation:
In an unrooted tree, a taxon bipartition (split) is specified by removing a
branch, thereby dividing the species into those to the left and those to the
right of the branch. Here, taxa to one side of the removed branch are denoted
'.' and those to the other side are denoted '*'. Specifically, the '.' symbol
is used for the taxa on the same side as the outgroup.
In a rooted or clock tree, the tree is rooted using the model and not by
reference to an outgroup. Each bipartition therefore corresponds to a clade,
that is, a group that includes all the descendants of a particular branch in
the tree. Taxa that are included in each clade are denoted using '*', and
taxa that are not included are denoted using the '.' symbol.
The output first includes a key to all the bipartitions with frequency larger
or equual to (Minpartfreq) in at least one run. Minpartfreq is a parameter to
sumt command and currently it is set to 0.10. This is followed by a table
with statistics for the informative bipartitions (those including at least
two taxa), sorted from highest to lowest probability. For each bipartition,
the table gives the number of times the partition or split was observed in all
runs (#obs) and the posterior probability of the bipartition (Probab.), which
is the same as the split frequency. If several runs are summarized, this is
followed by the minimum split frequency (Min(s)), the maximum frequency
(Max(s)), and the standard deviation of frequencies (Stddev(s)) across runs.
The latter value should approach 0 for all bipartitions as MCMC runs converge.
This is followed by a table summarizing branch lengths, node heights (if a
clock model was used) and relaxed clock parameters (if a relaxed clock model
was used). The mean, variance, and 95 % credible interval are given for each
of these parameters. If several runs are summarized, the potential scale
reduction factor (PSRF) is also given; it should approach 1 as runs converge.
Node heights will take calibration points into account, if such points were
used in the analysis.
Note that Stddev may be unreliable if the partition is not present in all
runs (the last column indicates the number of runs that sampled the partition
if more than one run is summarized). The PSRF is not calculated at all if
the partition is not present in all runs.The PSRF is also sensitive to small
sample sizes and it should only be considered a rough guide to convergence
since some of the assumptions allowing one to interpret it as a true potential
scale reduction factor are violated in MrBayes.
List of taxa in bipartitions:
1 -- Tarsius_syrichta
2 -- Lemur_catta
3 -- Homo_sapiens
4 -- Pan
5 -- Gorilla
6 -- Pongo
7 -- Hylobates
8 -- Macaca_fuscata
9 -- M_mulatta
10 -- M_fascicularis
11 -- M_sylvanus
12 -- Saimiri_sciureus
Key to taxon bipartitions (saved to file "example.nex.parts"):
ID -- Partition
------------------
1 -- .***********
2 -- .*..........
3 -- ..*.........
4 -- ...*........
5 -- ....*.......
6 -- .....*......
7 -- ......*.....
8 -- .......*....
9 -- ........*...
10 -- .........*..
11 -- ..........*.
12 -- ...........*
13 -- ..***.......
14 -- ..*********.
15 -- ..**********
16 -- .......***..
17 -- .......****.
18 -- .......**...
19 -- ..****......
20 -- ..*****.....
21 -- ..**........
------------------
Summary statistics for informative taxon bipartitions
(saved to file "example.nex.tstat"):
ID #obs Probab. Sd(s) Min(s) Max(s) Nruns
----------------------------------------------------------------
13 752 1.000000 0.000000 1.000000 1.000000 2
14 752 1.000000 0.000000 1.000000 1.000000 2
15 752 1.000000 0.000000 1.000000 1.000000 2
16 752 1.000000 0.000000 1.000000 1.000000 2
17 752 1.000000 0.000000 1.000000 1.000000 2
18 752 1.000000 0.000000 1.000000 1.000000 2
19 752 1.000000 0.000000 1.000000 1.000000 2
20 752 1.000000 0.000000 1.000000 1.000000 2
21 751 0.998670 0.001881 0.997340 1.000000 2
----------------------------------------------------------------
Convergence diagnostic (standard deviation of split frequencies)
should approach 0.0 as runs converge.
Summary statistics for branch and node parameters
(saved to file "example.nex.vstat"):
95% HPD Interval
--------------------
Parameter Mean Variance Lower Upper Median PSRF Nruns
--------------------------------------------------------------------------------------
length[1] 0.541456 0.009519 0.353274 0.717554 0.537658 0.999 2
length[2] 0.372879 0.005509 0.238197 0.516832 0.364400 1.001 2
length[3] 0.051215 0.000146 0.028626 0.075192 0.050526 0.999 2
length[4] 0.064156 0.000159 0.042227 0.089955 0.063130 0.999 2
length[5] 0.061693 0.000241 0.034584 0.092876 0.059972 0.999 2
length[6] 0.153199 0.000843 0.102179 0.209389 0.151336 1.000 2
length[7] 0.180160 0.001205 0.116956 0.247668 0.178352 1.002 2
length[8] 0.017412 0.000035 0.007366 0.029919 0.016864 1.001 2
length[9] 0.023680 0.000046 0.011245 0.037230 0.022899 1.000 2
length[10] 0.059184 0.000154 0.037673 0.085980 0.058277 0.999 2
length[11] 0.075561 0.000418 0.038818 0.117051 0.074517 1.013 2
length[12] 0.476301 0.006387 0.326916 0.635882 0.464266 1.000 2
length[13] 0.089409 0.000513 0.049164 0.133348 0.087979 0.999 2
length[14] 0.130533 0.002517 0.034572 0.221818 0.126364 0.999 2
length[15] 0.301198 0.005515 0.159970 0.446587 0.296882 1.001 2
length[16] 0.048544 0.000358 0.015916 0.086822 0.046766 1.008 2
length[17] 0.274635 0.002832 0.179287 0.388344 0.270745 0.999 2
length[18] 0.035885 0.000108 0.017321 0.057875 0.035077 0.999 2
length[19] 0.059240 0.000540 0.019800 0.104376 0.057129 1.001 2
length[20] 0.137187 0.001602 0.057767 0.212151 0.132745 1.000 2
length[21] 0.030895 0.000147 0.009996 0.055880 0.029637 1.004 2
--------------------------------------------------------------------------------------
Convergence diagnostic (PSRF = Potential Scale Reduction Factor; Gelman
and Rubin, 1992) should approach 1.0 as runs converge. NA is reported when
deviation of parameter values within all runs is 0 or when a parameter
value (a branch length, for instance) is not sampled in all runs.
Summary statistics for partitions with frequency >= 0.10 in at least one run:
Average standard deviation of split frequencies = 0.000209
Maximum standard deviation of split frequencies = 0.001881
Average PSRF for parameter values (excluding NA and >10.0) = 1.001
Maximum PSRF for parameter values = 1.013
Clade credibility values:
/---------------------------------------------------------- Tarsius_syrichta (1)
|
|---------------------------------------------------------- Lemur_catta (2)
|
| /-------- Homo_sapiens (3)
| /---100--
| | \-------- Pan (4)
| /--100--
| | \----------------- Gorilla (5)
| /--100--
| \------------------------- Pongo (6)
| /--100--
| | \--------------------------------- Hylobates (7)
| |
| | /-------- Macaca_fuscata (8)
| /---100-- /---100--
| | | | \-------- M_mulatta (9)
| | | /--100--
| | | | \----------------- M_fascicularis (10)
\--100-- \------100------
| \------------------------- M_sylvanus (11)
|
\-------------------------------------------------- Saimiri_sciure~ (12)
Phylogram (based on average branch lengths):
/---------------------------------------- Tarsius_syrichta (1)
|
|--------------------------- Lemur_catta (2)
|
| /---- Homo_sapiens (3)
| /-
| | \----- Pan (4)
| /------
| | \---- Gorilla (5)
| /---
| \----------- Pongo (6)
| /---------
| | \------------- Hylobates (7)
| |
| | /-- Macaca_fuscata (8)
| /-------- /-
| | | | \-- M_mulatta (9)
| | | /---
| | | | \---- M_fascicularis (10)
\--------------------- \-------------------
| \------ M_sylvanus (11)
|
\---------------------------------- Saimiri_sciure~ (12)
|-------------| 0.200 expected changes per site
Calculating tree probabilities...
Credible sets of trees (2 trees sampled):
99 % credible set contains 1 tree
MrBayes >
example.nex.con.tre 最终所得Bayes树
Huelsenbeck, J. P. and F. Ronquist. 2001. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17:754-755. DOI: 10.1093/bioinformatics/17.8.754
Ronquist, F. and J. P. Huelsenbeck. 2003. Mrbayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:1572-1574. DOI: 10.1093/bioinformatics/btg180
Ronquist, F., M. Teslenko, P. van der Mark, D. Ayres, A. Darling, S. Höhna, B. Larget, L. Liu, M. A. Suchard, and J. P. Huelsenbeck. 2012b. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Systematic Biology 61:539-542. DOI: https://doi.org/10.1093/sysbio/sys029