Alignment Sequence Recoding

Published at 2020-06-26 10:05

Author: zhixy

Purpose and Methods of Alignment Sequence Recoding

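Compositional heterogeneity among sequences can mislead phylogenetic reconstruction. Besides using compositionally heterogeneous models, its effect can be partly removed by recoding the sequences.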

For DNA sequences, a common recoding scheme is RY coding:

A,G -> R (purines)
C,T -> Y (pyrimidines)

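RY coding can remove compositional heterogeneity from the data, but it also discards transitions (A <-> G and C <-> T), which are often phylogenetically informative. Only transversions remain visible after recoding.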

For protein sequences, a common scheme is Dayhoff6, which groups amino acids by their physicochemical properties:

C -> 1
A, G, P, S, T -> 2
D, E, N, Q -> 3
H, K, R -> 4
I, L, M, V -> 5
F, Y, W -> 6
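As a rough sketch, the six groups can be inverted into a residue-to-code lookup table. The names `DAYHOFF6_GROUPS` and `recode_dayhoff6` are hypothetical, and two choices here are assumptions rather than part of the scheme: gaps are kept, and unrecognised symbols such as X become `?`.

```python
# The six Dayhoff groups, keyed by the code they are recoded to.
DAYHOFF6_GROUPS = {
    "1": "C",
    "2": "AGPST",
    "3": "DENQ",
    "4": "HKR",
    "5": "ILMV",
    "6": "FYW",
}

# Invert the groups into a residue -> code lookup (20 entries).
DAYHOFF6 = {aa: code for code, group in DAYHOFF6_GROUPS.items() for aa in group}

def recode_dayhoff6(seq: str) -> str:
    """Recode a protein sequence; '-' is kept, unknown symbols become '?'."""
    return "".join(c if c == "-" else DAYHOFF6.get(c, "?") for c in seq.upper())

print(recode_dayhoff6("MKV-CAX"))
```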

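Building on this, F, Y, W, I, L, M and V can be merged into a single group and C treated as missing data. That leaves four states, so the DNA alphabet can be reused: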

C -> ?
A, G, P, S, T -> A
D, E, N, Q -> G
H, K, R -> C
F, Y, W, I, L, M, V -> T
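A minimal sketch of this four-state mapping, using `str.translate`. The names `FOUR_STATE_GROUPS` and `recode_4state` are hypothetical; gaps and any symbol outside the 20 standard amino acids pass through unchanged, which is an assumption of this example.

```python
# Nucleotide-style code -> amino acids assigned to it; cysteine becomes missing data.
FOUR_STATE_GROUPS = {
    "A": "AGPST",
    "G": "DENQ",
    "C": "HKR",
    "T": "FYWILMV",
    "?": "C",
}

# Build a translation table from residue to code.
FOUR_STATE_TABLE = str.maketrans(
    {aa: code for code, residues in FOUR_STATE_GROUPS.items() for aa in residues}
)

def recode_4state(seq: str) -> str:
    """Recode a protein sequence into the A/C/G/T alphabet, with C -> '?'."""
    return seq.upper().translate(FOUR_STATE_TABLE)

print(recode_4state("CHKDEFA"))
```

Because the output uses only A, C, G, T and `?`, the recoded alignment can be treated as nucleotide data by programs that expect a DNA alphabet.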

Amino acids can also be recoded into just two groups by hydrophobicity/polarity (hp coding):

A,C,F,G,I,L,M,V,W -> h
D,E,H,K,N,P,Q,R,S,T,Y -> p
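The two-state hp coding can be sketched as follows. The helper name `recode_hp` is hypothetical, and leaving gaps and ambiguity codes untouched is an assumption of this example.

```python
HYDROPHOBIC = set("ACFGILMVW")
POLAR = set("DEHKNPQRSTY")

def recode_hp(seq: str) -> str:
    """Recode a protein sequence into 'h' (hydrophobic) and 'p' (polar)."""
    out = []
    for c in seq.upper():
        if c in HYDROPHOBIC:
            out.append("h")
        elif c in POLAR:
            out.append("p")
        else:
            out.append(c)  # gaps and ambiguity codes are left as they are
    return "".join(out)

# Two peptides with no residue in common give the same hp string.
print(recode_hp("MKVLDE"), recode_hp("IRAFNQ"))
```

The two toy peptides share no amino acid, yet they recode to the same string. That is the trade-off of such a coarse alphabet: differences in composition are flattened, but so is much of the signal.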

Software

Sequence recoding is available in PhyloBayes (the serial, single-core version, via the -recode option) and in P4 (see the example below).

(base) [user@server ~]# p4

p4 v 1.3.0 [2018-07-28], 28 July, 2018

usage:
    p4
 or
    p4 [-i] [-x] [-d] [yourScriptOrDataFile] [anotherScriptOrDataFile ...]
 or
    p4 --help

p4 is a Python package for phylogenetics.
p4 is also the name of a Python script that loads the p4 package.

There is documentation at http://p4.nhm.ac.uk 

Using the p4 script, after reading in the (optional) files on the
command line, p4 goes interactive unless one of the files on the
command line is a Python script.  Use the -i option if you want to go
interactive even if you are running a script.  Use the -x option to
force exit, even if there was no Python script read.  If you use the
-d option, then p4 draws any trees that are read in on the command
line, and then exits.

Peter Foster
The Natural History Museum, London
p.foster@nhm.ac.uk

(Control-d to quit.)

p4> read("example.phylip")
p4> aln = var.alignments[0]
p4> aln.recodeDayhoff() # Recode with the Dayhoff6 scheme; aln can now be used in downstream phylogenetic analyses.
p4> aln.writePhylip('example_recoded.phy')

(base) [user@server ~]# head -n 20 example_recoded.phy 
 10  599
Aeropyrum0 25432462253223412512324331224624535242555246236564
Arabidopsi 25232452253445413215235231224626535242564526542552
Archaeoglo 25532452553225414212242231224624535222554526245524
Candida_al 25332452253436413212335231224624535242564526524544
Chytridiom ------------------------31224624535242564326522544
Cryptococc 25532452253436413215325231224624535242564326554544
Cryptospor 25332552223645513214535421224624535242564526522554
Dictyostel 25532252253423413212225231224624535242564526532554
Drosophila 25532432553422413212235231224624535242564526524254
Encephalit 25535452223456513212233621224624535242564526524544

           55322145125423424236552255552252542242523554333355
           55415163126324624236555355252222542243525554433645
           55322123125324224236655255252252542244553554533645
           21325154126333624236555255252222542264523554233645
           55325156125323624236555255252223542264523554233645
           55315162125323624236555255252222542264523554633645
           55415162126325413226653255252222542254533554533645
           55415164125323624236552255252222542244523554233645

References

Dayhoff, M.O.; Schwartz, R.M. A Model of Evolutionary Change in Proteins. In Atlas of Protein Sequence and Structure; National Biomedical Research Foundation: Washington, DC, USA, 1978.


A Lab of Microbial Systematics and Evolution
