前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >Bioinfo02-用splatter制作模拟单细胞数据

Bioinfo02-用splatter制作模拟单细胞数据

作者头像
北野茶缸子
发布2022-05-19 11:24:58
9930
发布2022-05-19 11:24:58
举报
文章被收录于专栏:北野茶缸子的专栏
  • 参考:
    • Introduction to Splatter (bioconductor.org)[1]
    • Oshlack/splatter: Simple simulation of single-cell RNA sequencing data (github.com)[2]
    • splatPop: simulating single-cell data for populations[3]
    • splatter包参数详解 (qq.com)

准备工作

最近看单细胞算法文章看到了一个包,可以很方便的根据需求生成各种模拟数据,制作ground truth 岂不是很方便?

试试就试试。

直接安装即可,很容易:

代码语言:javascript
复制
BiocManager::install("splatter")
p_load(splatter, scuttle, scater)

1-功能介绍

y1s1,这个包的logo 还挺好看的:

  • Cell group effects: Where multiple, heterogeneous cell-groups are simulated for each individual. These groups could represent different cell-types or the same cell-type before/after a treatment. Group effects include group-specific differential expression (DE) and/or group-specific expression Quantitative Trait Loci (eQTL) effects.
  • Conditional effects between individuals: Where individuals are simulated as belonging to different conditional cohorts (e.g. different treatment groups or groups with different disease statuses). Conditional effects include DE and/or eQTL effects.
  • Batch effect from multiplexed experimental designs: Like in splat, batch effects are simulated by assigning small batch-specific DE effects to all genes. splatPop allows for the simulation of different patterns of batch effects, such as those resulting from multiplexed sequencing designs.

Splatter (splat)是如何进行单细胞数据估计的呢?

splat 模型的核心在于,利用gamma-Poisson 借助cell counts 矩阵来生成基因表达数据。

The core of the Splat model is a gamma-Poisson distribution used to generate a gene by cell matrix of counts. Mean expression levels for each gene are simulated from a gamma distribution[4] and the Biological Coefficient of Variation is used to enforce a mean-variance trend before counts are simulated from a Poisson distribution[5]. Splat also allows you to simulate expression outlier genes (genes with mean expression outside the gamma distribution) and dropout (random knock out of counts based on mean expression). Each cell is given an expected library size (simulated from a log-normal distribution) that makes it easier to match to a given dataset.

2-SplatParams对象

所有splat 模拟相关的参数都存储在SplatParams对象中。

代码语言:javascript
复制
params <- newSimpleParams()
params <- newSimpleParams(nGenes = 200, nCells = 10)

> params
A Params object of class SimpleParams 
Parameters can be (estimable) or [not estimable], 'Default' or  'NOT DEFAULT' 
Secondary parameters are usually set during simulation

Global: 
(GENES)  (CELLS)   [Seed] 
    200       10   977126 

3 additional parameters 

Mean: 
 (Rate)  (Shape) 
    0.3      0.4 

Counts: 
[Dispersion] 
         0.1 

访问及修改参数

我们姑且可以将SplatParams对象理解为splat 模型用于创建模拟单细胞数据的一个参数对象,其包含了这个单细胞模型的全部参数信息,除了基因数、细胞数这些基本信息外,还包括如均值、批次、混杂因素、异常值等等信息。详细的内容可以参考[[SplatParams详细参数介绍]]

访问:

代码语言:javascript
复制
getParam(params, "nGenes") 
#> [1] 10000

修改:

代码语言:javascript
复制
# Set multiple parameters at once (using a list)
params <- setParams(params, update = list(nGenes = 8000, mean.rate = 0.5))
# Extract multiple parameters as a list
getParams(params, c("nGenes", "mean.rate", "mean.shape"))
#> $nGenes
#> [1] 8000
#> 
#> $mean.rate
#> [1] 0.5
#> 
#> $mean.shape
#> [1] 0.6
# Set multiple parameters at once (using additional arguments)
params <- setParams(params, mean.shape = 0.5, de.prob = 0.2)
params
#> A Params object of class SplatParams 
#> Parameters can be (estimable) or [not estimable], 'Default' or  'NOT DEFAULT' 
#> Secondary parameters are usually set during simulation
#> 
#> Global: 
#> (GENES)  (Cells)   [SEED] 
#>    8000      100    81261 
#> 
#> 29 additional parameters 
#> 
#> Batches: 
#>     [Batches]  [Batch Cells]     [Location]        [Scale]       [Remove] 
#>             1            100            0.1            0.1          FALSE 
#> 
#> Mean: 
#>  (RATE)  (SHAPE) 
#>     0.5      0.5 
#> 
#> Library size: 
#> (Location)     (Scale)      (Norm) 
#>         11         0.2       FALSE 
#> 
#> Exprs outliers: 
#> (Probability)     (Location)        (Scale) 
#>          0.05              4            0.5 
#> 
#> Groups: 
#>      [Groups]  [Group Probs] 
#>             1              1 
#> 
#> Diff expr: 
#> [PROBABILITY]    [Down Prob]     [Location]        [Scale] 
#>           0.2            0.5            0.1            0.4 
#> 
#> BCV: 
#> (Common Disp)          (DoF) 
#>           0.1             60 
#> 
#> Dropout: 
#>     [Type]  (Midpoint)     (Shape) 
#>       none           0          -1 
#> 
#> Paths: 
#>         [From]         [Steps]          [Skew]    [Non-linear]  [Sigma Factor] 
#>              0             100             0.5             0.1             0.8

从真实数据估计参数

splat 还可以让我们直接从真实的单细胞数据,singlecellexperiment(sce) 对象中进行参数估计。

创建仿真数据:

代码语言:javascript
复制
set.seed(1)
sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)

> sce
class: SingleCellExperiment 
dim: 2000 200 
metadata(0):
assays(1): counts
rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
rowData names(0):
colnames(200): Cell_001 Cell_002 ... Cell_199 Cell_200
colData names(3): Mutation_Status Cell_Cycle Treatment
reducedDimNames(0):
altExpNames(1): Spikes

splat 估计:

代码语言:javascript
复制
> params <- splatEstimate(sce)
NOTE: Library sizes have been found to be normally distributed instead of log-normal. You may want to check this is correct.
> params
A Params object of class SplatParams 
Parameters can be (estimable) or [not estimable], 'Default' or  'NOT DEFAULT' 
Secondary parameters are usually set during simulation

Global: 
(GENES)  (CELLS)   [Seed] 
   2000      200   977126 

29 additional parameters 

Batches: 
    [BATCHES]  [BATCH CELLS]     [Location]        [Scale] 
            1            200            0.1            0.1 
     [Remove] 
        FALSE 

Mean: 
           (RATE)            (SHAPE) 
0.002962686167343  0.496997730070513 

Library size: 
      (LOCATION)           (SCALE)            (NORM) 
      357331.235  11607.2332293176              TRUE 

Exprs outliers: 
(PROBABILITY)     (Location)        (Scale) 
            0              4            0.5 

Groups: 
     [Groups]  [Group Probs] 
            1              1 

Diff expr: 
[Probability]    [Down Prob]     [Location]        [Scale] 
          0.1            0.5            0.1            0.4 

BCV: 
    (COMMON DISP)              (DOF) 
0.752043426792845   11211.8933424157 

Dropout: 
           [Type]         (MIDPOINT)            (SHAPE) 
             none   2.71153535179343  -1.37209356733765 

Paths: 
        [From]         [Steps]          [Skew]    [Non-linear] 
             0             100             0.5             0.1 
[Sigma Factor] 
           0.8 

splat 从真实(仿真)数据中参数估计过程包括以下步骤:

  • 均值:Mean parameters are estimated by fitting a gamma distribution to the mean expression levels.
  • Library size:Library size parameters are estimated by fitting a log-normal distribution to the library sizes.(个人理解为counts 计数的文库大小)
  • 异常表达值:Expression outlier parameters are estimated by determining the number of outliers and fitting a log-normal distribution to their difference from the median.
  • BCV parameters are estimated using the estimateDisp function from the edgeR package. (混杂变量)
  • Dropout parameters are estimated by checking if dropout is present and fitting a logistic function to the relationship between mean expression and proportion of zeros.

详细信息可以参见:Splat simulation parameters (bioconductor.org)[6]

3-利用splat参数估计结果构建表达矩阵

配置好了SplatParams对象(设定完毕用于仿真的参数结果),就可以利用这个参数对象进行模拟了,也就是函数splatSimulate

代码语言:javascript
复制
> sim <- splatSimulate(params, nGenes = 1000, 
                     batchCells = rep(100,10))
Getting parameters...
Creating simulation object...
Simulating library sizes...
Simulating gene means...
Simulating BCV...
Simulating counts...
Simulating dropout (if needed)...
Sparsifying assays...
Automatically converting to sparse matrices, threshold = 0.95
Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
Skipping 'BCV': estimated sparse size 1.5 * dense matrix
Skipping 'CellMeans': estimated sparse size 1.5 * dense matrix
Skipping 'TrueCounts': estimated sparse size 2.79 * dense matrix
Skipping 'counts': estimated sparse size 2.79 * dense matrix
Done!
> sim
class: SingleCellExperiment 
dim: 1000 200 
metadata(1): Params
assays(6): BatchCellMeans BaseCellMeans ... TrueCounts
  counts
rownames(1000): Gene1 Gene2 ... Gene999 Gene1000
rowData names(4): Gene BaseGeneMean OutlierFactor GeneMean
colnames(200): Cell1 Cell2 ... Cell199 Cell200
colData names(3): Cell Batch ExpLibSize
reducedDimNames(0):
altExpNames(0):

可以看到,我们创建了一个200x1000 的单细胞对象。

代码语言:javascript
复制
> head(rowData(sim))
DataFrame with 6 rows and 4 columns
             Gene BaseGeneMean OutlierFactor  GeneMean
      <character>    <numeric>     <numeric> <numeric>
Gene1       Gene1    367.75129             1 367.75129
Gene2       Gene2    183.59043             1 183.59043
Gene3       Gene3    460.64653             1 460.64653
Gene4       Gene4      5.16081             1   5.16081
Gene5       Gene5     90.97209             1  90.97209
Gene6       Gene6     81.68033             1  81.68033
> head(colData(sim))
DataFrame with 6 rows and 3 columns
             Cell       Batch ExpLibSize
      <character> <character>  <numeric>
Cell1       Cell1      Batch1     364054
Cell2       Cell2      Batch1     337135
Cell3       Cell3      Batch1     353075
Cell4       Cell4      Batch1     354925
Cell5       Cell5      Batch1     357749
Cell6       Cell6      Batch1     356384

我们还可以可视化一下,这里参考包CiteFuse中的visualiseDim 函数:

代码语言:javascript
复制
sim <- logNormCounts(sim)
# Plot PCA
sim <- runPCA(sim)
visualiseDim(sim,
             dimNames = "PCA",
             colour_by = "Batch")

可以看到,splatSimulate 函数输出的也就是一个sce 对象,我们可以非常方便的对其进行后续的单细胞分析。

该函数会输出以下信息:

  • Cell information (colData)
    • Cell - Unique cell identifier.
    • Group - The group or path the cell belongs to.
    • ExpLibSize - The expected library size for that cell.
    • Step (paths only) - How far along the path each cell is.
  • Gene information (rowData)
    • Gene - Unique gene identifier.
    • BaseGeneMean - The base expression level for that gene.
    • OutlierFactor - Expression outlier factor for that gene (1 is not an outlier).
    • GeneMean - Expression level after applying outlier factors.
    • DEFac[Group] - The differential expression factor for each gene in a particular group (1 is not differentially expressed).
    • GeneMean[Group] - Expression level of a gene in a particular group after applying differential expression factors.
  • Gene by cell information (assays)
    • BaseCellMeans - The expression of genes in each cell adjusted for expected library size.
    • BCV - The Biological Coefficient of Variation for each gene in each cell.
    • CellMeans - The expression level of genes in each cell adjusted for BCV.
    • TrueCounts - The simulated counts before dropout.
    • Dropout - Logical matrix showing which counts have been dropped in which cells

其实,也就对应了上面提到的SplatParams对象中设计的相关参数。

4-模拟的两种模式

除了上述单类型的细胞外(single),splatSimulate 函数还可以用于生成多个单细胞分群,或是轨迹数据:

which simulation method to use. Options are "single" which produces a single population, "groups" which produces distinct groups (eg. cell types), or "paths" which selects cells from continuous trajectories (eg. differentiation processes).

4.1-cluster

类似batchCells参数(并不能直接指定batch数目),可以用来控制batch 的数目及各个batch 的大小。group.prob 也可以用来控制各个group 的比例及group 数目:

代码语言:javascript
复制
# group
sim.groups <- splatSimulate(group.prob = c(0.5, 0.5), method = "groups",
                            verbose = FALSE, batchCells = rep(1000,2))
sim.groups <- logNormCounts(sim.groups)
sim.groups <- runPCA(sim.groups)
visualiseSce(sim.groups,
             dimNames = "PCA",
             colour_by = "Group", shape_by = "Batch")

4.2-path(trajectory)

对应method 参数,修改为paths模式:

代码语言:javascript
复制
# path
sim.paths <- splatSimulate(de.prob = 0.2, nGenes = 1000, method = "paths",
                           verbose = FALSE, batchCells = rep(200,2))
sim.paths <- logNormCounts(sim.paths)
sim.paths <- runPCA(sim.paths)
tmp.list$sim.paths.df <- colData(sim.paths)
tmp.list$sim.paths.df <- cbind(tmp.list$sim.paths.df, 
                               reducedDim(sim.paths, "PCA")[,1:2])
tmp.list$sim.paths.df <- as.data.frame(tmp.list$sim.paths.df)
ggplot(tmp.list$sim.paths.df) + 
  geom_point(
    aes(PC1, PC2, color = Step, shape = Batch)
  ) + viridis::scale_colour_viridis()

5-其他模拟

批次

其实我上面的数据中,已经通过batchCells 指定批次了。

可以看到,pca 的方法,成功的捕获了模拟数据中分组的差异(PC1),以及由于批次存在的差异(PC2),而且很明显地分组的差异是要大于批次产生的差异的,这也和大部分的真实情况相符:

其实splat 方法也是一个套件:

Each of the Splatter simulation methods has it’s own convenience function. To simulate a single population use splatSimulateSingle() (equivalent to splatSimulate(method = "single")), to simulate groups use splatSimulateGroups() (equivalent to splatSimulate(method = "groups")) or to simulate paths use splatSimulatePaths() (equivalent to splatSimulate(method = "paths")).

其他方法

足足15 个套装方法:

代码语言:javascript
复制
listSims()
#> Splatter currently contains 15 simulations 
#> 
#> Splat (splat) 
#> DOI: 10.1186/s13059-017-1305-0    GitHub: Oshlack/splatter    Dependencies:  
#> The Splat simulation generates means from a gamma distribution, adjusts them for BCV and generates counts from a gamma-poisson. Dropout and batch effects can be optionally added. 
#> 
#> Splat Single (splatSingle) 
#> DOI: 10.1186/s13059-017-1305-0    GitHub: Oshlack/splatter    Dependencies:  
#> The Splat simulation with a single population. 
#> 
#> Splat Groups (splatGroups) 
#> DOI: 10.1186/s13059-017-1305-0    GitHub: Oshlack/splatter    Dependencies:  
#> The Splat simulation with multiple groups. Each group can have it's own differential expression probability and fold change distribution. 
#> 
#> Splat Paths (splatPaths) 
#> DOI: 10.1186/s13059-017-1305-0    GitHub: Oshlack/splatter    Dependencies:  
#> The Splat simulation with differentiation paths. Each path can have it's own length, skew and probability. Genes can change in non-linear ways. 
#> 
#> Kersplat (kersplat) 
#> DOI:      GitHub: Oshlack/splatter    Dependencies: scuttle, igraph 
#> The Kersplat simulation extends the Splat model by adding a gene network, more complex cell structure, doublets and empty cells (Experimental). 
#> 
#> splatPop (splatPop) 
#> DOI: 10.1186/s13059-021-02546-1   GitHub: Oshlack/splatter    Dependencies: VariantAnnotation, preprocessCore 
#> The splatPop simulation enables splat simulations to be generated for multiple individuals in a population, accounting for correlation structure by simulating expression quantitative trait loci (eQTL). 
#> 
#> Simple (simple) 
#> DOI: 10.1186/s13059-017-1305-0    GitHub: Oshlack/splatter    Dependencies:  
#> A simple simulation with gamma means and negative binomial counts. 
#> 
#> Lun (lun) 
#> DOI: 10.1186/s13059-016-0947-7    GitHub: MarioniLab/Deconvolution2016    Dependencies:  
#> Gamma distributed means and negative binomial counts. Cells are given a size factor and differential expression can be simulated with fixed fold changes. 
#> 
#> Lun 2 (lun2) 
#> DOI: 10.1093/biostatistics/kxw055     GitHub: MarioniLab/PlateEffects2016     Dependencies: scran, scuttle, lme4, pscl, limSolve 
#> Negative binomial counts where the means and dispersions have been sampled from a real dataset. The core feature of the Lun 2 simulation is the addition of plate effects. Differential expression can be added between two groups of plates and optionally a zero-inflated negative-binomial can be used. 
#> 
#> scDD (scDD) 
#> DOI: 10.1186/s13059-016-1077-y    GitHub: kdkorthauer/scDD    Dependencies: scDD 
#> The scDD simulation samples a given dataset and can simulate differentially expressed and differentially distributed genes between two conditions. 
#> 
#> BASiCS (BASiCS) 
#> DOI: 10.1371/journal.pcbi.1004333     GitHub: catavallejos/BASiCS     Dependencies: BASiCS 
#> The BASiCS simulation is based on a bayesian model used to deconvolve biological and technical variation and includes spike-ins and batch effects. 
#> 
#> mfa (mfa) 
#> DOI: 10.12688/wellcomeopenres.11087.1     GitHub: kieranrcampbell/mfa     Dependencies: mfa 
#> The mfa simulation produces a bifurcating pseudotime trajectory. This can optionally include genes with transient changes in expression and added dropout. 
#> 
#> PhenoPath (pheno) 
#> DOI: 10.1038/s41467-018-04696-6   GitHub: kieranrcampbell/phenopath   Dependencies: phenopath 
#> The PhenoPath simulation produces a pseudotime trajectory with different types of genes. 
#> 
#> ZINB-WaVE (zinb) 
#> DOI: 10.1038/s41467-017-02554-5   GitHub: drisso/zinbwave     Dependencies: zinbwave 
#> The ZINB-WaVE simulation simulates counts from a sophisticated zero-inflated negative-binomial distribution including cell and gene-level covariates. 
#> 
#> SparseDC (sparseDC) 
#> DOI: 10.1093/nar/gkx1113      GitHub: cran/SparseDC   Dependencies: SparseDC 
#> The SparseDC simulation simulates a set of clusters across two conditions, where some clusters may be present in only one condition.

比如这个scDD 就考虑了不同的conditions 的细胞:

The scDD simulation samples a given dataset and can simulate differentially expressed and differentially distributed genes between two conditions.

6-与真实数据集比较

各自相比

splat 提供了观察各个单细胞数据集的方法。

compareSCEs 函数接受列表对象:

代码语言:javascript
复制
set.seed(1)
sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)
params <- splatEstimate(sce)
sim1 <- splatSimulate(params, nGenes = 2000)
sim2 <-splatSimulate(nGenes = 2000)
sim3 <- simpleSimulate(nGenes = 2000, verbose = FALSE)
comparison <- compareSCEs(list(sce = sce, sim1 = sim1,
                               sim2 = sim2, sim3 = sim3))

比如我这里比较:

  • mockSCE 生成的模拟数据;
  • 通过mockSCE 模拟结果参数估计splat 生成的数据;
  • splat 两种模式创建的数据。
代码语言:javascript
复制
> head(comparison$ColData)
         Dataset    sum detected  total PctZero
Cell_001     sce 392907     1502 402665   24.90
Cell_002     sce 398904     1509 405090   24.55
Cell_003     sce 358855     1503 365409   24.85
Cell_004     sce 378909     1527 386163   23.65
Cell_005     sce 384063     1532 389848   23.40
Cell_006     sce 370596     1522 378267   23.90
> head(comparison$RowData)
          Dataset    mean detected MeanCounts  VarCounts CVCounts
Gene_0001     sce  11.460     48.5     11.460   950.1994 2.689817
Gene_0002     sce  78.560     92.0     78.560  9041.7049 1.210385
Gene_0003     sce  21.505     55.5     21.505  2641.8090 2.390074
Gene_0004     sce  20.780     57.0     20.780  1827.4890 2.057225
Gene_0005     sce  18.290     50.0     18.290  1979.6140 2.432633
Gene_0006     sce 191.455     99.5    191.455 46423.7367 1.125391
          MedCounts MADCounts   MeanCPM     VarCPM    CVCPM
Gene_0001       0.0    0.0000  29.69738   6420.309 2.698111
Gene_0002      46.0   63.7518 205.45104  62237.453 1.214276
Gene_0003       1.0    1.4826  56.01672  17949.158 2.391687
Gene_0004       2.0    2.9652  54.13480  12419.975 2.058656
Gene_0005       0.5    0.7413  47.92973  13753.695 2.446835
Gene_0006     119.0  126.7623 500.52564 321714.428 1.133206
              MedCPM     MADCPM MeanLogCPM VarLogCPM  CVLogCPM
Gene_0001   0.000000   0.000000   2.183822  6.987662 1.2104552
Gene_0002 119.251041 163.890597   6.125446  7.454117 0.4457182
Gene_0003   2.607953   3.866551   2.792780  9.128253 1.0818252
Gene_0004   5.085582   7.539884   2.973293  9.256661 1.0232681
Gene_0005   1.231218   1.825403   2.613443  9.025811 1.1495559
Gene_0006 308.946214 330.608840   8.044234  3.727668 0.2400125
          MedLogCPM MADLogCPM PctZero
Gene_0001 0.0000000  0.000000    51.5
Gene_0002 6.9098774  2.455602     8.0
Gene_0003 1.8511790  2.744558    44.5
Gene_0004 2.6051412  3.862382    43.0
Gene_0005 0.8958936  1.328252    50.0
Gene_0006 8.2758656  1.782199     0.5

> table(comparison$RowData$Dataset)

 sce sim1 sim2 sim3 
2000 2000 2000 2000 
> table(comparison$ColData$Dataset)

 sce sim1 sim2 sim3 
 200  200  100  100 

详细统计每个datasets 基因与细胞的信息。以及多种绘图结果:

代码语言:javascript
复制
> names(comparison$Plots)
[1] "Means"        "Variances"    "MeanVar"      "LibrarySizes"
[5] "ZerosGene"    "ZerosCell"    "MeanZeros"    "VarGeneCor"  

非常好看有木有,非常讲究的使用notched box plot:

多比一

使用函数diffSCEs。这里要求数据集的维度需要相同,因此重新配置。

并要求这个ref 的长度是一,所以是多比一啦。

Error in diffSCEs(list(sce = sce, sim1 = sim1, sim2 = sim2, sim3 = sim3), : Assertion on 'ref' failed: Must have length 1.

代码语言:javascript
复制
# compare some to ref
set.seed(1)
sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)
params <- splatEstimate(sce)
sim1 <- splatSimulate(params)
sim2 <-splatSimulate(nGenes = 2000, batchCells = 200)
sim3 <- simpleSimulate(nGenes = 2000, nCells = 200, verbose = FALSE)

difference <- diffSCEs(list(sce = sce, sim1 = sim1,
              sim2 = sim2, sim3 = sim3), ref = "sce")

我们就可以和ref 进行比较了:

其他内容

比如加上tpm 与fpkm 数据,对于sce 对象,直接通过scater 包的方法即可:

代码语言:javascript
复制
sim <- simpleSimulate(verbose = FALSE)
sim <- addGeneLengths(sim)
head(rowData(sim))
#> DataFrame with 6 rows and 3 columns
#>              Gene  GeneMean    Length
#>       <character> <numeric> <numeric>
#> Gene1       Gene1 0.5641399       917
#> Gene2       Gene2 0.0764411       765
#> Gene3       Gene3 2.6791742      5972
#> Gene4       Gene4 1.3782005      3491
#> Gene5       Gene5 4.0117653     15311
#> Gene6       Gene6 0.3536760      1190

tpm(sim) <- calculateTPM(sim, rowData(sim)$Length)
tpm(sim)[1:5, 1:5]
#> 5 x 5 sparse Matrix of class "dgCMatrix"
#>           Cell1    Cell2     Cell3     Cell4     Cell5
#> Gene1 342.21897  .         .       169.73637 170.06608
#> Gene2   .        .         .         .         .      
#> Gene3 131.36922  .       187.68277 182.44101  78.34089
#> Gene4  89.89252  .       183.46630  89.17115   .      
#> Gene5  30.74405 20.50798  83.66284  81.32623  40.74211

如果觉得splat 创建的sce 对象有的内容并不需要使用,可以删除一些metadata 或assay,以对对象大小进行压缩:

代码语言:javascript
复制
sim <- splatSimulate()
#> Getting parameters...
#> Creating simulation object...
#> Simulating library sizes...
#> Simulating gene means...
#> Simulating BCV...
#> Simulating counts...
#> Simulating dropout (if needed)...
#> Sparsifying assays...
#> Automatically converting to sparse matrices, threshold = 0.95
#> Skipping 'BatchCellMeans': estimated sparse size 1.5 * dense matrix
#> Skipping 'BaseCellMeans': estimated sparse size 1.5 * dense matrix
#> Skipping 'BCV': estimated sparse size 1.5 * dense matrix
#> Skipping 'CellMeans': estimated sparse size 1.49 * dense matrix
#> Skipping 'TrueCounts': estimated sparse size 1.65 * dense matrix
#> Skipping 'counts': estimated sparse size 1.65 * dense matrix
#> Done!
minimiseSCE(sim)
#> Minimising SingleCellExperiment...
#> Original size: 43.9 Mb
#> Removing all rowData columns
#> Removing all colData columns
#> Removing all metadata items
#> Keeping 1 assays: counts
#> Removing 5 assays: BatchCellMeans, BaseCellMeans, BCV, CellMeans, TrueCounts
#> Sparsifying assays...
#> Automatically converting to sparse matrices, threshold = 0.95
#> Skipping 'counts': estimated sparse size 1.65 * dense matrix
#> Minimised size: 5.3 Mb (12% of original)
#> class: SingleCellExperiment 
#> dim: 10000 100 
#> metadata(0):
#> assays(1): counts
#> rownames(10000): Gene1 Gene2 ... Gene9999 Gene10000
#> rowData names(0):
#> colnames(100): Cell1 Cell2 ... Cell99 Cell100
#> colData names(0):
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

minimiseSCE(sim, rowData.keep = "Gene", colData.keep = c("Cell", "Batch"),
            metadata.keep = TRUE)
#> Minimising SingleCellExperiment...
#> Original size: 43.9 Mb
#> Keeping 1 rowData columns: Gene
#> Removing 3 rowData columns: BaseGeneMean, OutlierFactor, GeneMean
#> Keeping 2 colData columns: Cell, Batch
#> Removing 1 colData columns: ExpLibSize
#> Keeping 1 assays: counts
#> Removing 5 assays: BatchCellMeans, BaseCellMeans, BCV, CellMeans, TrueCounts
#> Sparsifying assays...
#> Automatically converting to sparse matrices, threshold = 0.95
#> Skipping 'counts': estimated sparse size 1.65 * dense matrix
#> Minimised size: 5.9 Mb (14% of original)
#> class: SingleCellExperiment 
#> dim: 10000 100 
#> metadata(1): Params
#> assays(1): counts
#> rownames(10000): Gene1 Gene2 ... Gene9999 Gene10000
#> rowData names(1): Gene
#> colnames(100): Cell1 Cell2 ... Cell99 Cell100
#> colData names(2): Cell Batch
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

默认会删除rowData(sce), colData(sce) and metadata(sce) 。

splat 还提供了组装比较结果中绘图内容的函数:

代码语言:javascript
复制
p1 <- makeCompPanel(comparison)

y1s1,拼的挺丑的,不展示了。

ps:不过对于多比一的图,似乎存在bug:

代码语言:javascript
复制
makeCompPanel(difference):Error in makeCompPanel(difference) : Assertion on 'comp' failed: Must have length 3, but has length 5.

个人感觉splat 套件还挺值得把玩的,后面看看可不可以个性化的调整各个sce 的batch magnitude 亦或是group 间的var 大小。

参考资料

[1]

Introduction to Splatter (bioconductor.org): https://bioconductor.org/packages/devel/bioc/vignettes/splatter/inst/doc/splatter.html

[2]

Oshlack/splatter: Simple simulation of single-cell RNA sequencing data (github.com): https://github.com/Oshlack/splatter

[3]

splatPop: simulating single-cell data for populations: http://www.bioconductor.org/packages/devel/bioc/vignettes/splatter/inst/doc/splatPop.html#2_Quick_start

[4]

gamma distribution: https://en.wikipedia.org/wiki/Gamma_distribution

[5]

Poisson distribution: https://en.wikipedia.org/wiki/Poisson_distribution

[6]

Splat simulation parameters (bioconductor.org): https://bioconductor.org/packages/devel/bioc/vignettes/splatter/inst/doc/splat_params.html#27_Dropout_parameters

本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。
原始发表:2022-04-06,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 北野茶缸子 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 准备工作
  • 1-功能介绍
  • 2-SplatParams对象
    • 访问及修改参数
      • 从真实数据估计参数
      • 3-利用splat参数估计结果构建表达矩阵
      • 4-模拟的两种模式
        • 4.1-cluster
          • 4.2-path(trajectory)
          • 5-其他模拟
            • 批次
              • 其他方法
              • 6-与真实数据集比较
                • 各自相比
                  • 多比一
                    • 参考资料
                • 其他内容
                相关产品与服务
                批量计算
                批量计算(BatchCompute,Batch)是为有大数据计算业务的企业、科研单位等提供高性价比且易用的计算服务。批量计算 Batch 可以根据用户提供的批处理规模,智能地管理作业和调动其所需的最佳资源。有了 Batch 的帮助,您可以将精力集中在如何分析和处理数据结果上。
                领券
                问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档