用 VEP 注释突变数据

生信菜鸟团

发布于 2020-04-28 11:05:35

5.9K0

发布于 2020-04-28 11:05:35

文章被收录于专栏：生信菜鸟团

软件首页：http://asia.ensembl.org/info/docs/tools/vep/script/vep_tutorial.html

安装

软件为 Perl 脚本，依赖：

•gcc, g++ and make•Perl version 5.10 or above recommended (tested on 5.10, 5.14, 5.18, 5.22, 5.26)•Perl packages:•Archive::Zip•DBD::mysql•DBI

从 GitHub 下载即可使用：

git clone https://github.com/Ensembl/ensembl-vep.gitcd ensembl-vep

建议直接用 docker 安装（下面的步骤也以 docker 为例）：

docker pull ensemblorg/ensembl-vep

创建文件夹以存放数据库：

# 在宿主机上创建目录mkdir $HOME/vep_data
# 确保 docker 容器有读写权限so the docker container can write in the directory (VEP output):chmod a+rwx $HOME/vep_data

下载数据库

•下载 GRCh38 的注释数据和 fasta：

docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep perl INSTALL.pl -a cf -s homo_sapiens -y GRCh38

•下载插件：

# 下载所有插件docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep perl INSTALL.pl -a cfp -s homo_sapiens -y GRCh38 -g all# 只下载特定插件docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep perl INSTALL.pl -a cfp -s homo_sapiens -y GRCh38 -g dbNSFP,CADD,G2P

注释数据保存于容器内 /opt/vep/.vep/homo_sapiens 文件夹下，插件保存于容器内 /opt/vep/.vep/Plugins 文件夹下，分别对于宿主机中的 $HOME/vep_data/homo_sapiens 和 $HOME/vep_data/Plugins。

所以除了直接用 perl INSTALL.pl 之外，我们也可以手动下载解压数据库和插件放到 $HOME/vep_data ：

cd $HOME/vep_datacurl -O ftp://ftp.ensembl.org/pub/release-99/variation/vep/homo_sapiens_vep_99_GRCh38.tar.gztar xzf homo_sapiens_vep_99_GRCh38.tar.gz
curl -O ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

数据库中包含的内容：

插件下载地址：https://github.com/Ensembl/VEP_plugins

运行 VEP

运行示例数据

创建容器：

docker run -it --rm -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep /bin/bash

运行示例数据

./vep -i examples/homo_sapiens_GRCh38.vcf --cache --force_overwrite

默认输出为 VEP 格式的文本 variant_effect_output.txt：

head variant_effect_output.txt
## ENSEMBL VARIANT EFFECT PREDICTOR v99.0## Output produced at 2017-03-21 14:51:27## Connected to homo_sapiens_core_99_38 on ensembldb.ensembl.org## Using cache in /homes/user/.vep/homo_sapiens/99_GRCh38## Using API version 99, DB version 99## polyphen version 2.2.2## sift version sift5.2.2## COSMIC version 78## ESP version 20141103## gencode version GENCODE 25## genebuild version 2014-07## HGMD-PUBLIC version 20162## regbuild version 16## assembly version GRCh38.p7## ClinVar version 201610## dbSNP version 147## Column descriptions:## Uploaded_variation : Identifier of uploaded variant## Location : Location of variant in standard coordinate format (chr:start or chr:start-end)## Allele : The variant allele used to calculate the consequence## Gene : Stable ID of affected gene## Feature : Stable ID of feature## Feature_type : Type of feature - Transcript, RegulatoryFeature or MotifFeature## Consequence : Consequence type## cDNA_position : Relative position of base pair in cDNA sequence## CDS_position : Relative position of base pair in coding sequence## Protein_position : Relative position of amino acid in protein## Amino_acids : Reference and variant amino acids## Codons : Reference and variant codon sequence## Existing_variation : Identifier(s) of co-located known variants## Extra column keys:## IMPACT : Subjective impact classification of consequence type## DISTANCE : Shortest distance from variant to transcript## STRAND : Strand of the feature (1/-1)## FLAGS : Transcript quality flags#Uploaded_variation  Location     Allele  Gene             Feature          Feature_type  Consequence        ...rs7289170            22:17181903  G       ENSG00000093072  ENST00000262607  Transcript    synonymous_variant ...rs7289170            22:17181903  G       ENSG00000093072  ENST00000330232  Transcript    synonymous_variant ...

标准结果文本为 14 列，由 TAB 分割：

1.Uploaded variation ：突变 ID2.Location ：位置3.Allele4.Gene ：基因 Ensembl ID5.Feature ：Ensembl stable ID of feature6.Feature type ：feature 类型，包括 Transcript，RegulatoryFeature，MotifFeature三种7.Consequence ：突变造成的结果8.Position in cDNA ：在 cDNA 序列中的相对位置9.Position in CDS ：在 CDS 中的相对位置10.Position in protein ：对应氨基酸在蛋白中的相对位置11.Amino acid change ：氨基酸的改变12.Codon change ：the alternative codons with the variant base in upper case13.Co-located variation ：注释突变 rsid14.Extra ：其他等等

运行时可加上 -vcf 参数，输出 VCF 格式； -o 指定输出文件名。

如何提高运行速度

1. 用 convert_cache.pl 脚本转换数据库文件，建立索引以大幅提高数据库检索速度：

# 转换所有数据库perl convert_cache.pl -species all -version all

2. 用 --fork 参数多线程运行 VEP：

./vep -i my_input.vcf --fork 4 --offline

3. 用 tmpfs 将数据库写入内存。4. 安装 Set::IntervalTree Perl package，提高注释速度。5. 安装 Ensembl::XS Perl package，它包含 VEP 中某些关键子程序的编译版本，运行速度可提高 5-10% 。6. 将输入文件按染色体进行排序。7. 对于非常大的 VCF 文件（如 WGS）建议将文件拆分运行，例如：

tabix -h variants.vcf.gz 12:1000000-20000000 | ./vep --cache --vcf

结果过滤

得到输出结果后，可用 filter_vep 进行结果过滤。

./vep -i in.vcf -o out.txt -cache -everything./filter_vep -i out.txt -o out_filtered.txt -filter "[filter_text]"

也可通过管道使用：

./vep -i in.vcf -o stdout -cache -check_existing | ./filter_vep -filter "not Existing_variation" -o out.txt

-filter 后的内容即是过滤条件，主要包括三个部分：Field 检索域（用 --list 参数可以查看结果中所有可用的检索域），Operator 操作符以及 Value 值。举几个例子：

# match entries where Feature (Transcript) is "ENST00000307301"--filter "Feature is ENST00000307301"
# match entries where Protein_position is less than 10--filter "Protein_position < 10"
# match entries where Consequence contains "stream" (this will match upstream and downstream)--filter "Consequence matches stream"

在一些情况下，你可能还想探究某些字段是否被注释到，如：

# match entries where the gene symbol is defined--filter "SYMBOL"

基于两个 Field 值的比较来筛选数据（需要在作为 Value 的 Field 前加上 # ）：

# match entries where AFR_AF is greater than EUR_AF--filter "AFR_AF > #EUR_AF"

基于字符串过滤：

# filter for missense variants in CCDS transcripts where the variant falls in a protein domain--filter "Consequence is missense_variant and CCDS and DOMAINS"
# find variants where the allele frequency is greater than 10% in either AFR or EUR populations--filter "AFR_AF > 0.1 or EUR_AF > 0.1"
# filter out known variants--filter "not Existing_variation"

操作符

•is (相当于 = ，eq) : 完全相同

# get only transcript consequences--filter "Feature_type is Transcript"

•!= (相当于 ne) : 非

# filter out tolerated SIFT predictions--filter "SIFT != tolerated"

•match (相当于 matches , re , regex) : 可用正则表达式匹配文本

# match stop_gained, stop_lost and stop_retained--filter "Consequence match stop"

•< (相当于 lt) : 小于，注意这里缺失值不视作 0

# find SIFT scores less than 0.1--filter "SIFT < 0.1"

•> (相当于 gt) : 大于

# find variants not in the first exon--filter "Exon > 1"

•<= (相当于 lte) : 小于等于•>= (相当于 gte) : 大于等于•exists (相当于 ex , defined) : 存在•in : 基于列表或文件进行查找

# find variants in a list of gene names--filter "SYMBOL in BRCA1,BRCA2"
# filter using a file of MotifFeatures--filter "Feature in /data/files/motifs_list.txt"