首先,突变的分类方法有很多种,按照其是否会导致癌症进展,可以分为驱动突变(driver mutation)和乘客突变(passenger mutation)。前者在肿瘤细胞中具有选择性生长优势的突变,后者对肿瘤细胞的选择性生长优势无直接或间接影响的突变。目前来说,推断驱动突变的算法有很多,可以参考这篇综述:https://academic.oup.com/bib/article/17/4/642/2240387。总的来说,可以分为以下 5 种:
m_bbv068f1p
大部分的算法,都是基于各大突变注释数据库,比如:COSMIC、TCGA、ICGC、cBioPortal、Cancer3D、dSysMap、ENCODE、NIH Epigenome Roadmap、FANTOM5 等等,这篇综述也进行了一定的比较:https://www.pnas.org/content/113/50/14330
接下来我们使用 MutSigCV 来对我们的数据进行驱动突变分析,并进行一定的比较
参考:
https://captorr.github.io/2018/07/18/mutsigcv/
https://www.jianshu.com/p/c86fadbff4c3
https://software.broadinstitute.org/cancer/cga/mutsig
首先需要下载 MATLABMCR:https://ww2.mathworks.cn/products/compiler/matlab-runtime.html,只要版本大于 2014 就行,这里选择的是 2016a
cd ~/wes_cancer/biosoft
mkdir MatlabMCR
cd MatlabMCR
wget -c http://ssd.mathworks.com/supportfiles/downloads/R2016a/deployment_files/R2016a/installers/glnxa64/MCR_R2016a_glnxa64_installer.zip
unzip MCR_R2016a_glnxa64_installer.zip
./install
安装过程需要选择一个有权限的安装路径,安装好了之后可能要手动赋值一个变量:
LD_LIBRARY_PATH="~/wes_cancer/biosoft/MATLAB/v901/runtime/glnxa64:~/wes_cancer/biosoft/MATLAB/v901/bin/glnxa64:~/wes_cancer/biosoft/MATLAB/v901/sys/os/glnxa64"
然后就是下载 MutSigCV
cd ~/wes_cancer/biosoft
wget -c https://software.broadinstitute.org/cancer/cga/sites/default/files/data/tools/mutsig/MutSigCV_1.41.zip
unzip MutSigCV_1.41.zip
cd MutSigCV_1.41
接下来需要下载几个必备的文件:
wget -c http://www.broadinstitute.org/cancer/cga/sites/default/files/data/tools/mutsig/reference_files/gene.covariates.txt
wget -c http://www.broadinstitute.org/cancer/cga/sites/default/files/data/tools/mutsig/reference_files/exome_full192.coverage.zip
wget -c http://www.broadinstitute.org/cancer/cga/sites/default/files/data/tools/mutsig/reference_files/mutation_type_dictionary_file.txt
按照官网要求,还需要参考基因组的序列文件,但是官网只提供了 hg18 和 hg19 版本,我需要自己制作 hg38 版本的:
sed -n '1,/chr1_KI270706v1_random/p' ../data/Homo_sapiens_assembly38.fasta | grep -v '^>' >../data/chr_files_hg38.txt
最后运行代码,即可获取驱动突变分析结果:
cd ~/wes_cancer/project
../biosoft/MutSigCV_1.41/run_MutSigCV.sh ~/wes_cancer/biosoft/MatlabMCR/v901/ ./7.annotation/gatk/gatk4.1.4.0_merge.maf ../biosoft/MutSigCV_1.41/exome_full192.coverage.txt ../biosoft/MutSigCV_1.41/gene.covariates.txt gatk_merge_mutsig ../biosoft/MutSigCV_1.41/mutation_type_dictionary_file.txt ../data/chr_files_hg38.txt
输出结果有几个文件:
8.9M 5月 18 11:56 gatk_merge_mutsig.mutations.txt
3.3M 5月 18 11:56 gatk_merge_mutsig.coverage.txt
130 5月 18 11:56 gatk_merge_mutsig.categs.txt
1.1M 5月 18 11:56 gatk_merge_mutsig.sig_genes.txt
其中gatk_merge_mutsig.sig_genes.txt
是 MutSigCV 的结果报告:
$ head gatk_merge_mutsig.sig_genes.txt
gene expr reptime hic N_nonsilent N_silent N_noncoding n_nonsilent n_silent n_noncoding nnei x X p q
KIAA0040 NaN NaN NaN 17208 4392 0 8 0 0 31 6 1182048 1.490349e-07 1.591576e-03
CCDC84 1457410 180 55 59856 15096 0 5 0 0 50 0 1262496 1.687600e-07 1.591576e-03
FAM153A 1267336 553 18 58536 14112 0 4 0 0 50 3 1391880 3.217081e-06 1.305224e-02
C8orf41 91648 507 12 85488 26472 0 4 0 0 50 2 1662480 4.561476e-06 1.305224e-02
TTC31 1019591 315 50 89232 26832 0 4 0 0 50 1 1472088 4.837963e-06 1.305224e-02
CREB3 922448 190 48 64176 18480 0 4 0 0 50 4 1475904 5.346962e-06 1.305224e-02
ZIM2 221899 902 -70 92136 22920 0 4 0 0 50 1 1473768 5.467921e-06 1.305224e-02
CCIN 858124 257 46 98592 28776 0 4 0 0 50 1 1572144 5.535887e-06 1.305224e-02
USP50 325027 203 27 59136 14952 0 4 0 0 50 8 1663848 8.386828e-06 1.757693e-02
MutSig输出分析中的“ nnei”,“ x”和“ X”值是算法计算得到的基因背景突变率。我们重点关注的是显著性的 p 值和 q 值,后者为校正过的 p 值
nnei gives the number of neighboring genes that are pooled together to compute the background mutation rate for that gene; these genes are not necessarily adjacent on the genome, but rather they have nearby covariate values. x gives the number of mutated bases in these neigboring genes that are either silent or non-coding, while X gives the total number of bases related to these neighboring genes.
需要注意的是,与MutSig捆绑在一起的协变量文件(gene.covariates.txt和exome_full192.coverage.txt)具有非标准的基因名称(non Hugo_Symbols)。MAF中的Hugo_Symbols与协变量文件中的非Hugo_symbols之间的差异导致MutSig程序忽略此类基因。例如,食管癌中众所周知的驱动基因KMT2D在MutSig协变量文件中表示为MLL2。这导致KMT2D从分析中被忽略,并且在MutSig结果中被表示为微不足道的基因。可以用 R包 maftools 的 prepareMutSig 函数来进行纠正:
library(data.table)
tmp=fread('./7.annotation/gatk/gatk4.1.4.0_merge.maf')
laml = read.maf(maf = tmp)
laml.mutsig.corrected = prepareMutSig(maf = laml)