
arXiv Statistics Daily Digest [2021-08-16]

Author: WeChat official account "arXiv每日学术速递" (arXiv Daily Academic Digest)
Published: 2021-08-24 14:29:01

Update: the H5 page now supports collapsible abstracts for a better reading experience. Visit arxivdaily.com (linked from the original post) for coverage of CS, physics, mathematics, economics, statistics, finance, biology, and electrical engineering, plus search, bookmarking, and other features.

stat (Statistics): 18 papers in total

【1】 Non-parametric estimation of cumulative (residual) extropy with censored observations
Link: https://arxiv.org/abs/2108.06324

Authors: Sudheesh K. K., Sreedevi E. P.
Affiliations: Indian Statistical Institute, Chennai, India; SNGS College, Pattambi, India
Note: We discuss several open problems associated with dynamic and weighted extropy measures
Abstract: Extropy and its properties are explored to quantify uncertainty. In this paper, we obtain alternative expressions for cumulative residual extropy and cumulative extropy, along with simple estimators of both. Asymptotic properties of the proposed estimators are studied. We also present new estimators of cumulative (residual) extropy when the data contain right-censored observations. Finite-sample performance of the estimators is evaluated through Monte Carlo simulation studies, and the proposed estimators are used to analyze different real data sets. Finally, we discuss several open problems associated with dynamic and weighted extropy measures.
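The abstract does not spell out the proposed estimators; as a rough illustration only, here is a naive plug-in estimate of the cumulative residual extropy $-\frac{1}{2}\int_0^\infty \bar F^2(t)\,dt$ of a nonnegative sample, built from the empirical survival function (the definition is standard, but the plug-in form below is my sketch, not the paper's estimator):

```python
import numpy as np

def cumulative_residual_extropy(sample):
    """Naive plug-in estimate of -(1/2) * integral_0^inf S(t)^2 dt for a
    nonnegative sample, where S is the empirical survival function.
    (Illustrative sketch only; not the estimator proposed in the paper.)"""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    grid = np.concatenate([[0.0], x])   # S is piecewise constant between these points
    surv = (n - np.arange(n)) / n       # S(t) = (n - i)/n on [grid[i], grid[i+1])
    return -0.5 * np.sum(surv**2 * np.diff(grid))
```

For a Uniform(0,1) sample the population value is $-\frac{1}{2}\int_0^1 (1-x)^2\,dx = -1/6 \approx -0.167$, which the plug-in estimate approaches for large samples.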

【2】 The application of sub-seasonal to seasonal (S2S) predictions for hydropower forecasting
Link: https://arxiv.org/abs/2108.06269

Authors: Robert M. Graham, Jethro Browell, Douglas Bertram, Christopher J. White
Affiliations: Department of Electronic and Electrical Engineering, University of Strathclyde, Glasgow; Department of Civil and Environmental Engineering, University of Strathclyde, Glasgow; School of Mathematics and Statistics, University of Glasgow
Abstract: Inflow forecasts play an essential role in the management of hydropower reservoirs. Forecasts help operators schedule power generation in advance to maximise economic value, mitigate downstream flood risk, and meet environmental requirements. The horizon of operational inflow forecasts is often limited to ~2 weeks ahead, marking the predictability barrier of deterministic weather forecasts. Reliable inflow forecasts in the sub-seasonal to seasonal (S2S) range would allow operators to take proactive action to mitigate the risks of adverse weather conditions, thereby improving water management and increasing revenue. This study outlines a method of deriving skilful S2S inflow forecasts using a case-study reservoir in the Scottish Highlands. We generate ensemble inflow forecasts by training a linear regression model for the observed inflow onto S2S ensemble precipitation predictions from the European Centre for Medium-Range Weather Forecasts (ECMWF). Subsequently, post-processing techniques from Ensemble Model Output Statistics are applied to derive calibrated S2S probabilistic inflow forecasts, without the application of a separate hydrological model. We find the S2S probabilistic inflow forecasts hold skill relative to climatological forecasts up to 6 weeks ahead. The inflow forecasts hold greater skill during winter compared with summer. The forecasts, however, struggle to predict high summer inflows, even at short lead times. The potential for the S2S probabilistic inflow forecasts to improve water management and deliver increased economic value is confirmed using a stylised cost model. While applied to hydropower forecasting, the results and methods presented here are relevant to the broader fields of water management and S2S forecasting applications.
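The core regression step can be sketched on synthetic data; the Gamma precipitation model, coefficients, and ensemble size below are invented stand-ins, not ECMWF data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for weekly catchment precipitation and observed inflow
# (Gamma model, coefficients, and ensemble size are invented, not ECMWF data).
n_weeks, n_members = 300, 11
precip = rng.gamma(shape=2.0, scale=10.0, size=n_weeks)       # mm / week
inflow = 0.8 * precip + 5.0 + rng.normal(0.0, 3.0, n_weeks)   # m^3 / s

# Regress observed inflow onto precipitation over the training period.
A = np.column_stack([precip, np.ones(n_weeks)])
coef, *_ = np.linalg.lstsq(A, inflow, rcond=None)

# Apply the fitted regression to each member of an S2S precipitation
# ensemble to obtain an ensemble inflow forecast for one target week.
ens_precip = rng.gamma(2.0, 10.0, n_members)
ens_inflow = coef[0] * ens_precip + coef[1]
print("fitted coefficients:", coef)
```

The paper additionally calibrates such ensembles with Ensemble Model Output Statistics, which is omitted here.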

【3】 Data-driven advice for interpreting local and global model predictions in bioinformatics problems
Link: https://arxiv.org/abs/2108.06201

Authors: Markus Loecher, Qi Wu
Affiliations: Berlin School of Economics and Law, Berlin, Germany
Keywords: variable importance; random forests; trees; Gini impurity; SHAP values
Abstract: Tree-based algorithms such as random forests and gradient boosted trees continue to be among the most popular and powerful machine learning models used across multiple disciplines. The conventional wisdom for estimating the impact of a feature in tree-based models is to measure the node-wise reduction of a loss function, which (i) yields only global importance measures and (ii) is known to suffer from severe biases. Conditional feature contributions (CFCs) provide local, case-by-case explanations of a prediction by following the decision path and attributing changes in the expected output of the model to each feature along the path. However, Lundberg et al. pointed out a potential bias of CFCs which depends on the distance from the root of a tree. The by-now immensely popular alternative, SHapley Additive exPlanation (SHAP) values, appears to mitigate this bias but is computationally much more expensive. Here we contribute a thorough comparison of the explanations computed by both methods on a set of 164 publicly available classification problems in order to provide data-driven algorithm recommendations to current researchers. For random forests, we find extremely high similarities and correlations of both local and global SHAP values and CFC scores, leading to very similar rankings and interpretations. Analogous conclusions hold for the fidelity of using global feature importance scores as a proxy for the predictive power associated with each feature.
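One simple way to quantify the kind of SHAP/CFC agreement the study reports is a rank correlation between the two global importance vectors; the vectors below are hypothetical, not taken from the study:

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation between two global importance vectors."""
    ra = np.argsort(np.argsort(a))   # ranks of a
    rb = np.argsort(np.argsort(b))   # ranks of b
    return np.corrcoef(ra, rb)[0, 1]

# Hypothetical global importance scores for five features (not from the study).
shap_imp = np.array([0.40, 0.25, 0.20, 0.10, 0.05])
cfc_imp = np.array([0.38, 0.27, 0.18, 0.11, 0.06])
rho = spearman(shap_imp, cfc_imp)
print(rho)   # identical feature rankings give rho = 1.0
```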

【4】 Stochastic orders and measures of skewness and dispersion based on expectiles
Link: https://arxiv.org/abs/2108.06138

Authors: Andreas Eberl, Bernhard Klar
Affiliations: Institute of Stochastics, Karlsruhe Institute of Technology (KIT), Germany
Abstract: Recently, expectile-based measures of skewness have been introduced which possess quite promising properties (Eberl and Klar, 2021, 2020). However, it remained unanswered whether these measures preserve the convex transformation order of van Zwet, which is a basic requirement for a measure of skewness. These measures are scaled using interexpectile distances, and it is not clear whether these measures of variability preserve the dispersive ordering. The main aim of this paper is to answer both questions in the affirmative. Moreover, we study the interexpectile range in some detail.
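Sample expectiles can be computed with the standard fixed-point (IRLS-style) iteration; the skewness ratio below is a quantile-skewness-style analogue built from expectiles, an assumption of mine rather than the exact interexpectile-scaled measure of Eberl and Klar:

```python
import numpy as np

def expectile(x, tau, tol=1e-12, max_iter=1000):
    """Sample tau-expectile via the standard fixed-point iteration;
    tau = 0.5 recovers the sample mean."""
    x = np.asarray(x, dtype=float)
    e = x.mean()
    for _ in range(max_iter):
        w = np.where(x > e, tau, 1.0 - tau)   # asymmetric least-squares weights
        e_new = np.sum(w * x) / np.sum(w)
        if abs(e_new - e) < tol:
            return e_new
        e = e_new
    return e

def expectile_skewness(x, tau=0.9):
    """Quantile-skewness-style ratio built from expectiles; an analogue of
    (not identical to) the interexpectile-scaled measures in the paper."""
    hi, lo, mid = expectile(x, tau), expectile(x, 1.0 - tau), expectile(x, 0.5)
    return (hi + lo - 2.0 * mid) / (hi - lo)
```

By construction the measure vanishes on symmetric samples and is positive for right-skewed ones.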

【5】 Multi-Stage Graph Peeling Algorithm for Probabilistic Core Decomposition
Link: https://arxiv.org/abs/2108.06094

Authors: Yang Guo, Xuekui Zhang, Fatemeh Esfahani, Venkatesh Srinivasan, Alex Thomo, Li Xing
Affiliations: Dept. of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada; Dept. of Computer Science, University of Saskatchewan, Saskatoon, SK, Canada
Abstract: Mining dense subgraphs whose vertices connect closely with each other is a common task when analyzing graphs. A very popular notion in subgraph analysis is core decomposition. Recently, Esfahani et al. presented a probabilistic core decomposition algorithm based on graph peeling and the Central Limit Theorem (CLT) that is capable of handling very large graphs. Their proposed peeling algorithm (PA) starts from the lowest-degree vertices and recursively deletes them, assigning core numbers and updating the degrees of neighbouring vertices, until it reaches the maximum core. However, in many applications, particularly in biology, more valuable information can be obtained from dense sub-communities, and we are not interested in small cores where vertices do not interact much with others. To make the previous PA focus more on dense subgraphs, we propose a multi-stage graph peeling algorithm (M-PA) that adds a two-stage data-screening procedure before the previous PA. After removing vertices from the graph based on user-defined thresholds, we can largely reduce the graph complexity without affecting the vertices in the subgraphs we are interested in. We show that M-PA is more efficient than the previous PA and, with properly set filtering thresholds, can produce dense subgraphs very similar if not identical to those of the previous PA (in terms of graph density and clustering coefficient).
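The deterministic core of the peeling step can be sketched as follows (this is plain k-core peeling; the paper's probabilistic version works with expected degrees under the CLT, and M-PA's screening would simply drop low-degree vertices before running it):

```python
from collections import defaultdict

def core_numbers(edges):
    """Deterministic k-core peeling: repeatedly remove a minimum-degree
    vertex and record its core number. A simplified, non-probabilistic
    sketch of the peeling algorithm described in the abstract."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    deg = {v: len(nb) for v, nb in adj.items()}
    core, k = {}, 0
    remaining = set(deg)
    while remaining:
        v = min(remaining, key=lambda u: deg[u])   # lowest-degree vertex
        k = max(k, deg[v])                         # core number never decreases
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:                           # update neighbour degrees
            if w in remaining:
                deg[w] -= 1
    return core
```

On a triangle with a pendant vertex, the triangle vertices get core number 2 and the pendant gets 1.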

【6】 Fréchet single index models for object response regression
Link: https://arxiv.org/abs/2108.06058

Authors: Aritra Ghosal, Wendy Meiring, Alexander Petersen
Affiliations: Department of Statistics and Applied Probability, University of California, Santa Barbara; Department of Statistics, Brigham Young University
Note: Presented at JSM on August 9, 2021
Abstract: With the availability of more non-Euclidean data objects, statisticians are faced with the task of developing appropriate statistical methods. For regression models in which the predictors lie in $\mathbb{R}^p$ and the response variables are situated in a metric space, conditional Fréchet means can be used to define the Fréchet regression function. Global and local Fréchet methods have recently been developed for modeling and estimating this regression function as extensions of multiple and local linear regression, respectively. This paper expands on these methodologies by proposing the Fréchet Single Index (FSI) model and utilizing local Fréchet along with $M$-estimation to estimate both the index and the underlying regression function. The method is illustrated by simulations for response objects on the surface of the unit sphere and through an analysis of human mortality data in which lifetable data are represented by distributions of age-of-death, viewed as elements of the Wasserstein space of distributions.
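As a building block for the mortality application: in the one-dimensional Wasserstein space, the Fréchet mean of distributions is obtained by averaging their quantile functions (a standard fact; the grid size and helper name below are my own choices):

```python
import numpy as np

def wasserstein_frechet_mean(samples, grid_size=200):
    """Fréchet mean in 1-D Wasserstein space: average the empirical quantile
    functions of the input samples on a common probability grid. Returns the
    barycenter's quantile function evaluated on that grid."""
    probs = (np.arange(grid_size) + 0.5) / grid_size
    quantiles = np.stack([np.quantile(s, probs) for s in samples])
    return quantiles.mean(axis=0)
```

For example, the barycenter of empirical N(0,1) and N(4,1) samples is close to N(2,1).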

【7】 Generalized Bayes Estimators with Closed Forms for the Normal Mean and Covariance Matrices
Link: https://arxiv.org/abs/2108.06041

Authors: Ryota Yuasa, Tatsuya Kubokawa
Abstract: In the estimation of the mean matrix of a multivariate normal distribution, generalized Bayes estimators with closed forms are provided, and sufficient conditions for their minimaxity are derived relative to both matrix and scalar quadratic loss functions. Generalized Bayes estimators of the covariance matrix are also given in closed form, and dominance properties are discussed for the Stein loss function.
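The paper's closed-form estimators are not given in the abstract. As context only, the classical James-Stein estimator is the prototypical closed-form shrinkage rule that dominates the maximum likelihood estimator of a normal mean under quadratic loss, which a quick simulation illustrates (this is not the paper's method):

```python
import numpy as np

rng = np.random.default_rng(7)
p, reps = 10, 4000
theta = np.full(p, 0.5)                         # true mean vector

x = rng.normal(theta, 1.0, size=(reps, p))      # X ~ N_p(theta, I), one per replicate
js = (1.0 - (p - 2) / np.sum(x**2, axis=1))[:, None] * x   # shrink toward the origin

mle_risk = np.mean(np.sum((x - theta) ** 2, axis=1))       # roughly p = 10
js_risk = np.mean(np.sum((js - theta) ** 2, axis=1))       # strictly smaller for p >= 3
print(mle_risk, js_risk)
```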

【8】 Statistical Learning using Sparse Deep Neural Networks in Empirical Risk Minimization
Link: https://arxiv.org/abs/2108.05990

Authors: Shujie Ma, Mingming Liu
Abstract: We consider a sparse deep ReLU network (SDRN) estimator obtained from empirical risk minimization with a Lipschitz loss function in the presence of a large number of features. Our framework can be applied to a variety of regression and classification problems. The unknown target function to be estimated is assumed to lie in a Korobov space. Functions in this space only need to satisfy a smoothness condition rather than having a compositional structure. We develop non-asymptotic excess risk bounds for our SDRN estimator. We further derive that the SDRN estimator can achieve the same minimax rate of estimation (up to logarithmic factors) as one-dimensional nonparametric regression when the dimension of the features is fixed, and that the estimator has a suboptimal rate when the dimension grows with the sample size. We show that the depth and the total number of nodes and weights of the ReLU network need to grow as the sample size increases to ensure good performance, and we also investigate how fast they should increase with the sample size. These results provide important theoretical guidance and a basis for empirical studies with deep neural networks.
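A toy instance of empirical risk minimization with a (dense, shallow) ReLU network in plain NumPy; this is only a stand-in for the sparse deep networks analyzed in the paper, with squared loss instead of a general Lipschitz loss:

```python
import numpy as np

rng = np.random.default_rng(8)

# One-hidden-layer ReLU regression fitted by gradient descent on the
# empirical squared risk over a smooth target function.
n, h, lr = 256, 32, 0.01
x = rng.uniform(-1.0, 1.0, size=(n, 1))
y = np.sin(np.pi * x[:, 0])                     # smooth target

W1 = rng.normal(0.0, 1.0, (1, h)); b1 = np.zeros(h)
W2 = rng.normal(0.0, 0.1, h);      b2 = 0.0

def forward(inp):
    a = np.maximum(inp @ W1 + b1, 0.0)          # ReLU hidden layer
    return a, a @ W2 + b2

_, pred = forward(x)
loss0 = np.mean((pred - y) ** 2)                # empirical risk at initialization
for _ in range(3000):
    a, pred = forward(x)
    g = 2.0 * (pred - y) / n                    # dLoss/dpred
    gW2 = a.T @ g; gb2 = g.sum()
    ga = np.outer(g, W2) * (a > 0.0)            # backprop through the ReLU
    gW1 = x.T @ ga; gb1 = ga.sum(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
_, pred = forward(x)
loss = np.mean((pred - y) ** 2)
print(loss0, "->", loss)
```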

【9】 Change Point Analysis of Multivariate Data via Multivariate Rank-based Distribution-free Nonparametric Testing Using Measure Transportation
Link: https://arxiv.org/abs/2108.05979

Authors: Amanda Ng
Affiliations: The Bronx High School of Science; mentor: Bodhisattva Sen, Department of Statistics, Columbia University
Note: 21 pages, 2 figures
Abstract: In this paper, I propose a general algorithm for multiple change point analysis via multivariate distribution-free nonparametric testing, based on the concept of ranks defined by measure transportation. Multivariate ranks and the usual one-dimensional ranks share an important property: both are distribution-free. This allows for the creation of nonparametric tests that are distribution-free under the null hypothesis. Here I consider rank energy statistics in the context of the multiple change point problem. I estimate the number of change points and each of their locations within a multivariate series of time-ordered observations. This paper examines the multiple change point question in a broad setting in which the observed distributions and the number of change points are unspecified, rather than assuming that the time series observations follow a parametric model or that there is one change point, as many works in this area assume. The objective is to develop techniques for identifying change points while making as few assumptions as possible. The algorithm described here is based upon energy statistics and has the ability to detect any distributional change. Presented are the theoretical properties of this new algorithm and the conditions under which the approximate number of change points and their locations can be estimated. The newly proposed algorithm can be used to analyze various datasets, including financial and microarray data. It has also been implemented in the R package recp, which is available on CRAN. A section of this paper is dedicated to the execution of this procedure, as well as the use of the recp package.
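A one-dimensional sketch of the single-change-point scan using plain energy statistics (the paper's statistic replaces observations with multivariate ranks obtained by measure transportation, which is omitted here):

```python
import numpy as np

def energy_distance(a, b):
    """Sample energy distance between two 1-D samples."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    dab = np.abs(a[:, None] - b[None, :]).mean()
    daa = np.abs(a[:, None] - a[None, :]).mean()
    dbb = np.abs(b[:, None] - b[None, :]).mean()
    return 2.0 * dab - daa - dbb

def single_change_point(x, min_seg=10):
    """Scan all admissible split points and return the one maximizing the
    size-weighted energy distance between the two segments."""
    n = len(x)
    best_k, best_stat = None, -np.inf
    for k in range(min_seg, n - min_seg):
        stat = (k * (n - k) / n) * energy_distance(x[:k], x[k:])
        if stat > best_stat:
            best_k, best_stat = k, stat
    return best_k, best_stat
```

Multiple change points would be found by applying such a scan recursively, as energy-statistic methods typically do.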

【10】 Efficient reduced-rank methods for Gaussian processes with eigenfunction expansions
Link: https://arxiv.org/abs/2108.05924

Authors: Philip Greengard, Michael O'Neil
Affiliations: Columbia University, New York, NY; Courant Institute, NYU
Abstract: In this work we introduce a reduced-rank algorithm for Gaussian process regression. Our numerical scheme converts a Gaussian process on a user-specified interval to its Karhunen-Loève expansion, the $L^2$-optimal reduced-rank representation. Numerical evaluation of the Karhunen-Loève expansion is performed once during precomputation and involves computing a numerical eigendecomposition of an integral operator whose kernel is the covariance function of the Gaussian process. The Karhunen-Loève expansion is independent of observed data and depends only on the covariance kernel and the size of the interval on which the Gaussian process is defined. The scheme of this paper does not require translation invariance of the covariance kernel. We also introduce a class of fast algorithms for Bayesian fitting of hyperparameters, and demonstrate the performance of our algorithms with numerical experiments in one and two dimensions. Extensions to higher dimensions are mathematically straightforward but suffer from the standard curses of high dimensionality.
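The precomputation step can be sketched by discretizing the covariance operator on a quadrature grid and eigendecomposing it; the RBF kernel, length-scale, and grid below are illustrative choices, not the paper's scheme:

```python
import numpy as np

# Numerical eigendecomposition of a covariance operator on [0, 1],
# discretized with midpoint quadrature (kernel and length-scale are
# illustrative; the paper's quadrature is more sophisticated).
m, ell, rank = 200, 0.2, 30
t = (np.arange(m) + 0.5) / m                   # midpoint quadrature nodes
w = 1.0 / m                                    # equal quadrature weights
K = np.exp(-((t[:, None] - t[None, :]) ** 2) / (2.0 * ell**2))

vals, vecs = np.linalg.eigh(w * K)             # discrete covariance operator
vals, vecs = vals[::-1], vecs[:, ::-1]         # eigenpairs in descending order

# Truncated (reduced-rank) reconstruction of the kernel matrix; the fast
# eigenvalue decay of a smooth kernel makes a small rank sufficient.
K_r = (vecs[:, :rank] * vals[:rank]) @ vecs[:, :rank].T / w
rel_err = np.linalg.norm(K - K_r) / np.linalg.norm(K)
print(rel_err)
```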

【11】 On stochastic expansions of the empirical distribution function of residuals in autoregression schemes
Link: https://arxiv.org/abs/2108.05903

Authors: Michael Boldin
Note: 14 pages
Abstract: We consider a stationary linear AR($p$) model with unknown mean. The autoregression parameters as well as the distribution function (d.f.) $G$ of the innovations are unknown. The observations contain gross errors (outliers). The distribution of the outliers is unknown and arbitrary; their intensity is $\gamma n^{-1/2}$ with an unknown $\gamma$, where $n$ is the sample size. The essential problem in this situation is to test the normality of the innovations. Normality, as is known, ensures the optimality properties of the widely used least squares procedures. To construct and study a Pearson chi-square type test for normality, we estimate the unknown mean and the autoregression parameters. Then, using these estimates, we find the residuals of the autoregression. Based on them, we construct a kind of empirical distribution function (r.e.d.f.), which is a counterpart of the (inaccessible) e.d.f. of the autoregression innovations. Our Pearson statistic is a functional of the r.e.d.f.; its asymptotic distributions under the hypothesis and under local alternatives are determined by the asymptotic behavior of the r.e.d.f. Therefore, studying the asymptotic properties of the r.e.d.f. is a natural and meaningful task. In the present work, we find and substantiate in detail the stochastic expansions of the r.e.d.f. in two situations. In the first one, the d.f. $G(x)$ of the innovations does not depend on $n$; we need this result to investigate the test statistic under the hypothesis. In the second situation, $G(x)$ depends on $n$ and has the form of a mixture $G(x) = A_n(x) = (1-n^{-1/2})G_0(x) + n^{-1/2}H(x)$; we need this result to study the power of the test under local alternatives.
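A sketch of the residual pipeline: simulate an AR(1) with unknown mean (a special case of the AR($p$) setting, without outlier contamination), fit it by least squares, and form a Pearson chi-square statistic for normality from the residuals; the bin edges and model choices are mine:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(11)

# AR(1) with unknown mean and Gaussian innovations.
n, phi, mu = 2000, 0.6, 3.0
x = np.empty(n); x[0] = mu
for t in range(1, n):
    x[t] = mu + phi * (x[t - 1] - mu) + rng.normal()

# Least-squares fit of the AR(1) parameters, then the residuals.
y, ylag = x[1:], x[:-1]
A = np.column_stack([ylag, np.ones(n - 1)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef                            # residual counterparts of innovations

def ncdf(z):                                    # standard normal c.d.f.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Pearson chi-square statistic: observed vs expected normal cell counts.
z = (resid - resid.mean()) / resid.std()
edges = np.array([-np.inf, -1.0, -0.3, 0.3, 1.0, np.inf])   # arbitrary cells
counts, _ = np.histogram(z, bins=edges)
probs = np.diff(np.array([0.0, ncdf(-1.0), ncdf(-0.3), ncdf(0.3), ncdf(1.0), 1.0]))
m = len(z)
chi2 = np.sum((counts - m * probs) ** 2 / (m * probs))
print(coef[0], chi2)
```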

【12】 Self-Calibrating the Look-Elsewhere Effect: Fast Evaluation of the Statistical Significance Using Peak Heights
Link: https://arxiv.org/abs/2108.06333

Authors: Adrian E. Bayer, Uros Seljak, Jakob Robnik
Affiliations: Berkeley Center for Cosmological Physics, University of California, Berkeley, CA, USA; Department of Physics, University of California, Berkeley, CA, USA; Physics Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Note: 12 pages, 7 figures
Abstract: In experiments where one searches a large parameter space for an anomaly, one often finds many spurious noise-induced peaks in the likelihood. This is known as the look-elsewhere effect, and must be corrected for when performing statistical analysis. This paper introduces a method to calibrate the false alarm probability (FAP), or $p$-value, for a given dataset by considering the heights of the highest peaks in the likelihood. In the simplest form of self-calibration, the look-elsewhere-corrected $\chi^2$ of a physical peak is approximated by the $\chi^2$ of the peak minus the $\chi^2$ of the highest noise-induced peak. Generalizing this concept to consider lower peaks provides a fast method to quantify the statistical significance with improved accuracy. In contrast to alternative methods, this approach has negligible computational cost, as peaks in the likelihood are a byproduct of every peak-search analysis. We apply the method to examples from astronomy, including planet detection, periodograms, and cosmology.
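The simplest form of self-calibration can be mimicked on synthetic peak heights; the chi-square values below are simulated, not from a real likelihood scan:

```python
import numpy as np

rng = np.random.default_rng(12)

# Mimic a likelihood scan over many templates: each template's best-fit
# improvement is a chi^2 value (simulated, not from a real scan).
noise_chi2 = rng.chisquare(df=1, size=5000)     # noise-induced peak heights

candidate_chi2 = 40.0                           # the physical peak under test
highest_noise = noise_chi2.max()                # highest noise-induced peak

# Simplest self-calibration: subtract the chi^2 of the highest noise peak
# to obtain a look-elsewhere-corrected significance.
corrected_chi2 = candidate_chi2 - highest_noise
print(highest_noise, corrected_chi2)
```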

【13】 Random Subspace Mixture Models for Interpretable Anomaly Detection
Link: https://arxiv.org/abs/2108.06283

Authors: Cetin Savkli, Catherine Schwartz
Affiliations: Applied Physics Lab, Johns Hopkins University, Laurel, MD, USA; Department of Mathematics, University of Maryland, College Park, MD, USA
Abstract: We present a new subspace-based method to construct probabilistic models for high-dimensional data and highlight its use in anomaly detection. The approach is based on a statistical estimation of the probability density using densities of random subspaces combined with geometric averaging. In selecting random subspaces, equal representation of each attribute is used to ensure correct statistical limits. Gaussian mixture models (GMMs) are used to create the probability density for each subspace, with techniques included to mitigate singularities, allowing the method to handle both numerical and categorical attributes. The number of components for each GMM is determined automatically through the Bayesian information criterion to prevent overfitting. The proposed algorithm attains competitive AUC scores compared with prominent algorithms on benchmark anomaly detection datasets, with the added benefits of being simple, scalable, and interpretable.
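A stripped-down sketch of the subspace-averaging idea, with single Gaussians standing in for the paper's GMMs and plain random sampling instead of the equal-representation subspace selection:

```python
import numpy as np

rng = np.random.default_rng(13)

def gaussian_logpdf(X, mean, cov):
    """Log-density of rows of X under a multivariate Gaussian."""
    d = X.shape[1]
    diff = X - mean
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    mah = np.einsum("ij,jk,ik->i", diff, inv, diff)   # Mahalanobis distances
    return -0.5 * (d * np.log(2 * np.pi) + logdet + mah)

def subspace_anomaly_scores(X, n_subspaces=20, dim=2):
    """Average log-density over random subspaces, i.e. the log of the
    geometric mean of subspace densities; low scores flag anomalies."""
    n, p = X.shape
    logs = np.zeros(n)
    for _ in range(n_subspaces):
        idx = rng.choice(p, size=dim, replace=False)
        Xs = X[:, idx]
        cov = np.cov(Xs, rowvar=False) + 1e-6 * np.eye(dim)  # singularity guard
        logs += gaussian_logpdf(Xs, Xs.mean(axis=0), cov)
    return logs / n_subspaces

X = rng.normal(0.0, 1.0, size=(300, 6))
X[0] = 8.0                                   # an obvious outlier in every attribute
scores = subspace_anomaly_scores(X)
print(scores[0], np.median(scores))
```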

【14】 Ergodic properties of some Markov chain models in random environments
Link: https://arxiv.org/abs/2108.06211

Authors: Lionel Truquet
Abstract: We study ergodic properties of some Markov chain models in random environments, when the random Markov kernels that define the dynamics satisfy the usual drift and small-set conditions but with random coefficients. In particular, we adapt a standard coupling scheme used for obtaining geometric ergodicity of homogeneous Markov chains to the random-environment case, and we prove the existence of a process of randomly invariant probability measures for such chains, in the spirit of Kifer's approach for chains satisfying Doeblin-type conditions. We then deduce ergodic properties of such chains when the environment is itself ergodic. Our results complement and sharpen existing ones by providing quite weak and easily checkable assumptions on the random Markov kernels. As a by-product, we obtain a framework for studying some time series models with strictly exogenous covariates. We illustrate our results with autoregressive time series with functional coefficients and some threshold autoregressive processes.

【15】 A Unified Frequency Domain Cross-Validatory Approach to HAC Standard Error Estimation
Link: https://arxiv.org/abs/2108.06093

Authors: Zhihao Xu, Clifford M. Hurvich
Affiliations: Department of Statistics and Data Science, Yale University; Stern School of Business, New York University
Note: 34 pages, 2 figures
Abstract: We propose a unified frequency domain cross-validation (FDCV) method to obtain an HAC standard error. Our proposed method allows for model/tuning-parameter selection across parametric and nonparametric spectral estimators simultaneously. Our candidate class consists of restricted maximum likelihood (REML) based autoregressive spectral estimators and lag-weights estimators with the Parzen kernel. We provide a method for efficiently computing the REML estimators of the autoregressive models. In simulations, we demonstrate the reliability of our FDCV method compared with the popular HAC estimators of Andrews-Monahan and Newey-West. Supplementary material for the article is available online.
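For reference, the classical Newey-West (Bartlett-kernel) HAC estimator that the paper benchmarks against fits in a few lines; the AR(1) check below is a synthetic sanity test, not the paper's simulation design:

```python
import numpy as np

def newey_west_lrv(u, lags):
    """Newey-West (Bartlett-kernel) estimate of the long-run variance of a
    mean-zero series u; sqrt(lrv / n) is the HAC standard error of the mean."""
    u = np.asarray(u, dtype=float)
    n = len(u)
    lrv = np.dot(u, u) / n                          # gamma_0
    for k in range(1, lags + 1):
        w = 1.0 - k / (lags + 1.0)                  # Bartlett weight
        lrv += 2.0 * w * np.dot(u[k:], u[:-k]) / n  # weighted autocovariances
    return lrv

rng = np.random.default_rng(15)
# AR(1) errors with phi = 0.5: true long-run variance = 1 / (1 - 0.5)^2 = 4.
phi, n = 0.5, 100000
e = rng.normal(size=n)
u = np.empty(n); u[0] = e[0]
for t in range(1, n):
    u[t] = phi * u[t - 1] + e[t]
lrv = newey_west_lrv(u - u.mean(), lags=50)
print(lrv)   # close to 4
```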

【16】 Efficient force field and energy emulation through partition of permutationally equivalent atoms
Link: https://arxiv.org/abs/2108.06072

Authors: Hao Li, Musen Zhou, Jessalyn Sebastian, Jianzhong Wu, Mengyang Gu
Affiliations: Department of Statistics and Applied Probability, University of California, Santa Barbara, CA, USA; Department of Chemical and Environmental Engineering, University of California, Riverside, CA
Abstract: Kernel ridge regression (KRR) that satisfies energy conservation is a popular approach for predicting force fields and molecular potentials, as a way to overcome the computational bottleneck of molecular dynamics simulation. However, the computational complexity of KRR grows cubically in the product of the number of atoms and the number of simulated configurations in the training sample, due to the inversion of a large covariance matrix, which limits its applications to the simulation of small molecules. Here, we introduce the atomized force field (AFF) model, which requires much lower computational cost to achieve the quantum-chemical level of accuracy for predicting atomic forces and potential energies. Through a data-driven partition of the covariance kernel matrix of the force field and an induced-input estimation approach for potential energies, we dramatically reduce the computational complexity of the machine learning algorithm while maintaining high accuracy in predictions. The efficient algorithm extends the limits of its applications to larger molecules under the same computational budget. Using the MD17 dataset and another simulated dataset of larger molecules, we demonstrate that the accuracy of the AFF emulator ranges from 0.01-0.1 kcal mol$^{-1}$ for energies and 0.001-0.2 kcal mol$^{-1}$ Å$^{-1}$ for atomic forces. Most importantly, this accuracy was achieved with less than 5 minutes of computational time for training the AFF emulator and making predictions on held-out molecular configurations. Furthermore, our approach includes uncertainty assessment for the predictions of atomic forces and potentials, useful for developing a sequential design over the chemical input space, with nearly no increase in computational cost.
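Plain kernel ridge regression, the baseline whose cost the AFF model reduces, can be sketched on a synthetic 1-D "potential"; the kernel, length-scale, and data are invented, and the energy-conservation constraint is omitted:

```python
import numpy as np

rng = np.random.default_rng(16)

def rbf(A, B, ell=0.5):
    """Squared-exponential kernel matrix between row sets A and B."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * ell**2))

# Train plain KRR on a smooth synthetic 1-D "potential energy surface".
Xtr = rng.uniform(-2.0, 2.0, size=(200, 1))
ytr = np.sin(2.0 * Xtr[:, 0]) + 0.3 * Xtr[:, 0] ** 2

lam = 1e-6                                     # ridge regularization
alpha = np.linalg.solve(rbf(Xtr, Xtr) + lam * np.eye(len(Xtr)), ytr)

Xte = np.linspace(-1.8, 1.8, 50)[:, None]
pred = rbf(Xte, Xtr) @ alpha
truth = np.sin(2.0 * Xte[:, 0]) + 0.3 * Xte[:, 0] ** 2
rmse = np.sqrt(np.mean((pred - truth) ** 2))
print(rmse)
```

The cubic cost the paper targets comes from the `solve` step, whose matrix side scales with atoms times configurations in the molecular setting.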

【17】 Scalable3-BO: Big Data meets HPC - A scalable asynchronous parallel high-dimensional Bayesian optimization framework on supercomputers
Link: https://arxiv.org/abs/2108.05969

Authors: Anh Tran
Affiliations: Optimization and Uncertainty Quantification, Sandia National Laboratories, Albuquerque, NM
Abstract: Bayesian optimization (BO) is a flexible and powerful framework that is suitable for computationally expensive simulation-based applications and guarantees statistical convergence to the global optimum. While it remains one of the most popular optimization methods, its capability is hindered by the size of the data, the dimensionality of the considered problem, and the sequential nature of the optimization. These scalability issues are intertwined with each other and must be tackled simultaneously. In this work, we propose the Scalable$^3$-BO framework, which employs sparse GP as the underlying surrogate model to cope with Big Data and is equipped with a random embedding to efficiently optimize high-dimensional problems with low effective dimensionality. The Scalable$^3$-BO framework is further leveraged with an asynchronous parallelization feature, which fully exploits the computational resources on HPC within a computational budget. As a result, the proposed Scalable$^3$-BO framework is scalable in three independent respects: with respect to data size, dimensionality, and computational resources on HPC. The goal of this work is to push the frontiers of BO beyond its well-known scalability issues and minimize the wall-clock waiting time for optimizing high-dimensional, computationally expensive applications. We demonstrate the capability of Scalable$^3$-BO with 1 million data points, 10,000-dimensional problems, and 20 concurrent workers in an HPC environment.
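The random-embedding idea can be demonstrated with random search standing in for BO; the dimensions, objective, and embedding scale below are toy choices, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(17)

D, d_embed = 10_000, 4                         # ambient and embedding dimensions
important = rng.choice(D, size=2, replace=False)

def f(x):
    """High-dimensional objective with low effective dimensionality:
    only two (hidden) coordinates matter."""
    return float(np.sum((x[important] - 0.3) ** 2))

A = rng.normal(size=(D, d_embed)) / np.sqrt(d_embed)   # random embedding matrix

# Search the low-dimensional embedded space z and evaluate f at the mapped,
# box-clipped point x = clip(A @ z); random search stands in for BO here.
candidates = [np.zeros(d_embed)] + [rng.uniform(-1, 1, d_embed) for _ in range(300)]
best = min(f(np.clip(A @ z, -1.0, 1.0)) for z in candidates)
baseline = f(np.zeros(D))                      # objective value at the origin
print(best, baseline)
```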

【18】 Higher-Order Expansion and Bartlett Correctability of Distributionally Robust Optimization
Link: https://arxiv.org/abs/2108.05908

Authors: Shengyi He, Henry Lam
Affiliations: Department of Industrial Engineering and Operations Research, Columbia University
Abstract: Distributionally robust optimization (DRO) is a worst-case framework for stochastic optimization under uncertainty that has drawn fast-growing interest in recent years. When the underlying probability distribution is unknown and observed from data, DRO suggests computing the worst-case distribution within a so-called uncertainty set that captures the involved statistical uncertainty. In particular, DRO with an uncertainty set constructed as a statistical-divergence neighborhood ball has been shown to provide a tool for constructing valid confidence intervals for nonparametric functionals, and bears a duality with the empirical likelihood (EL). In this paper, we show how adjusting the ball size of such DRO can reduce higher-order coverage errors, similar to the Bartlett correction. Our correction, which applies to general von Mises differentiable functionals, is more general than the existing EL literature, which focuses only on smooth function models or $M$-estimation. Moreover, we demonstrate a higher-order "self-normalizing" property of DRO regardless of the choice of divergence. Our approach builds on the development of a higher-order expansion of DRO, obtained through an asymptotic analysis of a fixed-point equation arising from the Karush-Kuhn-Tucker conditions.
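For the mean functional with a KL-divergence ball, the DRO worst-case value has the well-known dual $\min_{\lambda>0}\,\lambda\log E_{P_n}[e^{X/\lambda}] + \lambda\delta$, which a coarse grid can evaluate (the ball-size adjustment studied in the paper is not implemented here):

```python
import numpy as np

def kl_dro_worst_mean(x, delta):
    """Worst-case mean over the KL ball {Q : KL(Q || P_n) <= delta} around the
    empirical distribution of x, via the dual
        min_{lam > 0}  lam * log E_n[exp(x / lam)] + lam * delta,
    minimized over a coarse grid of lam values."""
    x = np.asarray(x, dtype=float)
    best = np.inf
    for lam in np.geomspace(1e-3, 1e3, 500):
        s = x / lam
        m = s.max()
        log_mean_exp = m + np.log(np.mean(np.exp(s - m)))   # stabilized
        best = min(best, lam * log_mean_exp + lam * delta)
    return best
```

For small $\delta$ the worst-case mean exceeds the empirical mean by roughly $\sqrt{2\delta}\,\sigma$, which connects the ball size to confidence-interval width.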

Originally published 2021-08-16 by the WeChat official account arXiv每日学术速递, and syndicated via the Tencent Cloud self-media program.
