H2O - Higher-Order pattern-discovery in High-dimensional data

Program

Preliminary Program

Invited Talks

Enno Mammen (Heidelberg University)

Strong Approximations for Robbins-Monro Procedures

The Robbins-Monro algorithm is a recursive, simulation-based stochastic procedure to approximate the zeros of a function that can be written as an expectation. It is known that under some technical assumptions, Gaussian limit theorems approximate the stochastic performance of the algorithm. Here, we are interested in strong approximations for Robbins-Monro procedures. The main tools for obtaining them are local limit theorems, that is, results on the convergence of the density of the algorithm. The analysis relies on a version of parametrix techniques for Markov chains converging to diffusions. The main difficulty that arises here is the fact that the drift is unbounded. The talk is based on joint work with Valentin Konakov, Moscow, and Lorick Huang, Toulouse.
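
For readers unfamiliar with the recursion, the following is a minimal sketch of a scalar Robbins-Monro iteration. It is an illustration only and is not taken from the talk: the target function h(theta) = theta - E[X], the Gaussian noise model, the step sizes 1/n, and the names robbins_monro and noisy_h are all assumptions made for this example.

```python
import random


def robbins_monro(noisy_h, theta0, n_steps, step=lambda n: 1.0 / n):
    """Iterate theta_n = theta_{n-1} - step(n) * H(theta_{n-1}, X_n).

    noisy_h(theta) returns one noisy evaluation H(theta, X) whose
    expectation is the function h whose zero is sought.
    """
    theta = theta0
    for n in range(1, n_steps + 1):
        theta -= step(n) * noisy_h(theta)
    return theta


# Hypothetical example: h(theta) = theta - E[X] with X ~ N(2, 1),
# so the zero of h is theta = 2.
rng = random.Random(0)


def noisy_h(theta):
    return theta - rng.gauss(2.0, 1.0)


estimate = robbins_monro(noisy_h, theta0=0.0, n_steps=20000)
print(estimate)  # close to 2 for this seed and step-size choice
```

With step sizes 1/n and this particular choice of H, the iterate coincides with the running sample mean, which is why the classical Gaussian limit theory applies directly in this toy case.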

Martin Bladt (University of Copenhagen)

Recent Nonparametric Advances in Jump Process Estimation

This work addresses recent advances in nonparametric inference for finite-state jump processes in both Markov and non-Markov settings. We provide flexible methods to estimate state occupation probabilities and transition mechanisms under minimal assumptions. By focusing on conditioning, we show how internal and external information can sharpen individualized and subgroup-specific predictions. To handle high-dimensional data, we introduce adaptive tree- and forest-based learning strategies. Finally, we discuss procedures for transition-rate estimation.

Martina Scolamiero (KTH Stockholm)

p-norms, matchings and functional summaries in persistence

TBA

Alexei Onatski (University of Cambridge)

Extreme singular values of random rectangular Toeplitz matrices

We study extreme singular values of large rectangular random Toeplitz and circulant matrices with independent entries. For Toeplitz matrices, the largest singular value converges to the $2 \to 2$ norm of a bilinear operator built from two scaled sine-kernel operators. For rectangular circulant matrices, it converges to 1. The smallest singular value for both Toeplitz and circulant rectangular matrices converges to zero independently of the aspect ratio. We establish a lower bound on the rate of this convergence, showing that it is faster than any polylogarithmic rate yet slower than any polynomial rate.

Contributed Talks

Marcela Mandarić (University of Split)

Application of TDA for random sets

We present a methodology for detecting outliers and testing the goodness-of-fit of random sets using topological data analysis. We construct a filtration from the sublevel sets of the signed distance function and consider various summary functions of the persistence diagrams derived from the obtained persistent homology. Outliers are detected using functional depths for the summary functions. Global envelope tests, employing these summary functions as test statistics, are used to construct the goodness-of-fit test. We also turn to the asymptotic properties of persistence diagrams obtained from random set models. Specifically, we establish central limit theorems (CLTs) for persistence diagrams associated with germ-grain random set models. The procedures are justified by a simulation study using germ-grain random set models and an application to real data concerning histological images of mastopathic and mammary cancer breast tissue.

Lujia Bai (Ruhr-University Bochum)

Uniform variance reduced simultaneous inference of time-varying correlation networks

This paper proposes a unified framework for inferring large-scale time-varying correlation networks via data-driven time-varying thresholds that can control uncertainty simultaneously. The framework allows the dimension of time series vectors to be fixed or diverging at a high polynomial rate of the sample size. It also allows the time series to exhibit changing temporal characteristics beyond stationarity without specific structural assumptions. Motivated by the practical issue that the confidence band of non-parametric estimators of correlations can exceed their natural domain [−1, 1], we propose a simple uniform variance reduction technique. When applied to the construction of a correlation network, the new device yields more accurate thresholds, which enhance the probability of recovering the time-varying network structures. We broaden the applicability of our method by developing difference-based estimators of cross-correlations that are robust to structural breaks in the time-varying mean functions, and by allowing both a fixed and a diverging number of lags in the correlation functions. We prove the asymptotic validity of the proposed method, especially in achieving accurate family-wise error control when disclosing flexible time-varying network structures. The effectiveness of our method in finite samples is demonstrated through simulation studies and data analysis.

Rikkert Hindriks (VU Amsterdam)

Implications of elliptical symmetry for four-point correlations in EEG brain oscillations

Electroencephalography (EEG) studies on brain dynamics have largely focused on second-order statistical dependencies. Although higher-order dependencies have been observed in local neural circuits, it is not known if they can be observed in EEG data. Furthermore, because the number of dependencies scales exponentially with the order, a simplifying principle is needed. We look for such a principle by exploring the implications of elliptical symmetry. In particular, elliptical symmetry implies that fourth-order complex cumulants are proportional to sums of products of second-order complex cumulants and, furthermore, that the proportionality constant equals excess kurtosis. We assess to what extent these predictions are corroborated in frequency-domain EEG data and discuss the implications for brain dynamics.

Poster session

The first afternoon will close with a poster session, where the participants present their research topics. Knowing about the broad research interests of their peers from the start of the workshop will facilitate discussions and interactions. All conference participants will be able to vote for the prize for the best poster presentation. This will provide additional incentives for excellent presentations and help the development of early-career researchers.

Hans Reimann (Heidelberg University)

Data-Driven Impulse Control in Multiple Dimensions via Non-Parametric Estimation of the Optimal Stopping Rule

We investigate optimal stopping in multiple dimensions and corresponding non-parametric approaches for data-driven impulse control strategies. The analysis can be separated into two steps: understanding the underlying optimal stopping problem in higher dimensions for diffusion processes with known components, and constructing as well as evaluating non-parametric approaches for the case of unknown system components.

The key insights are as follows: Optimal stopping in multiple dimensions can be formulated via an operator for constructing a value function substitute with desirable properties regarding error stability in characterizing quantities. We can reliably estimate such quantities by estimating the unknown components therein. Based on these results, we propose a data-driven strategy and evaluate its proficiency.

Huixiaqing Liu (Vrije Universiteit Amsterdam)

Semiparametric Estimation of Elliptical Copula Generators for Non-Gaussian Dependency Analysis in Intracranial EEG

Understanding higher-order interactions among brain regions requires estimating multivariate probability densities, which is a task that becomes intractable in high dimensions. For functional magnetic resonance imaging (fMRI) data, the Gaussian assumption provides a convenient shortcut, but intracranial electroencephalography (iEEG) signals exhibit heavy-tailed dependencies that violate this assumption.

We address this challenge using elliptical copulas, a flexible family of dependence models that includes the Gaussian as a special case. A key property of elliptical copulas is that their entire high-dimensional structure is determined by a single one-dimensional function, the density generator, effectively reducing a high-dimensional estimation problem to a univariate one. We estimate this generator using a semiparametric kernel method and validate its accuracy on synthetic data with known ground truth. Applied to real iEEG recordings, surrogate-based hypothesis testing confirms that the observed dependencies are significantly non-Gaussian, while formal ellipticity diagnostics support the validity of the elliptical model.

This framework provides a principled, computationally tractable route to information-theoretic measures of neural interaction beyond Gaussian assumptions.

Ali Jalali (Cornell University)

Topological transitions in statistical identification under complex information loss: A new application of topological data analysis to clinical trial evaluations

Topological data analysis (TDA) has developed a geometric framework to address statistical challenges in traditional statistical inference and identification when datasets under study are large, high-dimensional, and structurally complex. In addition to introducing new tools for data scientists, TDA has contributed a broader language and set of objects for studying structural features that may emerge in data and machine learning models. Motivated by persistent challenges in clinical trials conducted in complex care settings where substantial and systematic missing data arise due to patient attrition, I conduct a study of the topology of likelihood-based statistical inference itself as data become progressively incomplete. Instead of treating information loss as a reduction in effective sample size, as is current practice, I model longitudinal missingness as a monotone deformation parameter acting directly on the geometry of the likelihood function. This formulation yields a one-parameter family of likelihood landscapes indexed by increasing information loss (proportion of follow-up outcome data missing) while holding the nominal sample size fixed. Using tools from Morse theory and Reeb graph constructions, I characterize how the critical-point structure of these surfaces evolves under realistic attrition patterns.

A key finding of the simulations is that likelihood-based identification does not degrade smoothly. Beyond a critical level of missingness, the global mode corresponding to the correctly specified parameter degenerates into a saddle-like surface and disappears, giving rise to competing local optima and a transition to a biased identification regime. I further observe an inverse persistence phenomenon where the estimator associated with the unbiased parameter often exhibits lower topological persistence than biased modes arising from relatively stable, selectively observed subpopulations. I conclude that under monotone information loss, the curvature supporting the true parameter erodes more rapidly than the curvature stabilizing certain biased alternatives, rendering the latter topologically more persistent. While not claimed as a universal property, this pattern raises concerns for common statistical practices that favor the most stable or robust estimates under missing data, as such estimates may correspond to biased inferential regimes. Overall, this work demonstrates how TDA can be applied to the topology of statistical inference itself, providing new insights into identification robustness and showing how complex likelihood geometry can arise even in low-dimensional models under progressive information loss.

Patrick Bastian (Aarhus University)

TWIN: Two window inspection for online change point detection

We propose a new class of sequential change point tests, both for changes in the mean parameter and in the overall distribution function. The methodology builds on a two-window inspection scheme (TWIN), which aggregates data into symmetric samples and applies strong weighting to enhance statistical performance. The detector yields logarithmic rather than polynomial detection delays, representing a substantial reduction compared to state-of-the-art alternatives. Delays remain short, even for late changes, where existing methods perform worst. Moreover, the new procedure also attains higher power than current methods across broad classes of local alternatives. For mean changes, we further introduce a self-normalized version of the detector that automatically cancels out temporal dependence, eliminating the need to estimate nuisance parameters. The advantages of our approach are supported by asymptotic theory, simulations and an application to monitoring COVID-19 data. Here, structural breaks associated with new virus variants are detected almost immediately by our new procedures. This indicates potential value for the real-time monitoring of future epidemics. Mathematically, our approach is underpinned by new exponential moment bounds for the global modulus of continuity of the partial sum process, which may be of independent interest beyond change point testing.

Mathis Rost (Chalmers University/Gothenburg University)

Likelihood Approximation for Gibbs Point Processes

Although the likelihood function of a Gibbs point process is typically intractable, it is fundamental for likelihood-based inference, likelihood ratio tests, and Bayesian analysis, making accurate likelihood approximation an important challenge.

In this talk, we present a new method for approximating the likelihood function of Gibbs point processes. Building on recent probabilistic results, we derive a novel likelihood representation expressed entirely in terms of the Papangelou conditional intensity, which is typically tractable, and the void probability, i.e., the probability that a given region contains no points.

We introduce a new algorithm for approximating these void probabilities, based on newly derived structural characteristics of void probabilities for Gibbs processes, and compare its performance to existing state-of-the-art methods. Through a simulation study, we illustrate how this approach enables faster likelihood approximation for a broad class of Gibbs models.

Albertas Dvirnas (Umeå University)

Bridging Matrix Profiles and Empirical Dynamic Modelling in the Search for Patterns and Predictions in Environmental Data

Empirical dynamic modelling (EDM) and matrix profiles offer complementary ways to discover structure in complex time series. EDM reconstructs low-dimensional attractors from high-dimensional observations, enabling local analogue forecasting and causal inference, while matrix profiles provide a scalable, domain-agnostic mechanism for fast motif discovery, anomaly detection, and nearest-neighbour search. This poster explores how these two perspectives can be combined to analyse high-dimensional environmental data, such as multi-species environmental DNA (eDNA) time series.

By interpreting matrix profile subsequences as embedded states in EDM’s reconstructed phase space, we obtain a unified framework for identifying recurrent dynamical patterns and constructing local, interpretable forecasts. The approach naturally extends to streaming settings, where incremental updates to the matrix profile support real-time pattern tracking and prediction as new observations arrive. We illustrate this with examples in seasonal environmental monitoring, highlighting how the joint use of matrix profiles and EDM can reveal candidate mechanisms, regime shifts, and nonlinear dependencies that are obscured by purely statistical or purely mechanistic models. The goal is to position this synergy as a practical toolkit for exploratory analysis and prediction in modern, high-dimensional environmental datasets.
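
To make the link between subsequences and embedded states concrete, here is a minimal sketch of a brute-force matrix profile in which each length-m subsequence is treated as a lag-1 delay-embedding vector. It is an illustration under stated assumptions, not the poster's method: the function names (znorm, naive_matrix_profile), the exclusion zone of m // 2, the toy series, and the quadratic-time search are all choices made for this example, and scalable matrix profile algorithms work differently.

```python
import math


def znorm(window):
    """Z-normalise a window; a constant window maps to all zeros."""
    m = len(window)
    mean = sum(window) / m
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / m)
    if std == 0.0:
        return [0.0] * m
    return [(x - mean) / std for x in window]


def naive_matrix_profile(series, m):
    """Brute-force matrix profile with window length m.

    Each subsequence series[i:i+m] is read as an embedded state.
    For every state, return the distance to, and index of, its
    nearest neighbour outside a trivial-match exclusion zone.
    """
    n_sub = len(series) - m + 1
    states = [znorm(series[i:i + m]) for i in range(n_sub)]
    exclusion = m // 2  # assumed exclusion-zone width
    profile = [math.inf] * n_sub
    index = [-1] * n_sub
    for i in range(n_sub):
        for j in range(n_sub):
            if abs(i - j) <= exclusion:
                continue  # skip trivial matches near i
            dist = math.dist(states[i], states[j])
            if dist < profile[i]:
                profile[i], index[i] = dist, j
    return profile, index


# Toy series containing the motif [0, 1, 3, 1, 0] at positions 0 and 8.
series = [0, 1, 3, 1, 0, 5, 2, 7, 0, 1, 3, 1, 0, 4, 9, 6]
profile, index = naive_matrix_profile(series, m=5)
print(index[0], index[8])
```

The nearest-neighbour index returned for each state is the same object an analogue forecast would use: the successor of the matched subsequence serves as a prediction for the successor of the query.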

Maximilian Rücker (Ulm University)

Estimation and Inference in High-Dimensional Panel Data Models with Interactive Fixed Effects

We develop new econometric methods for estimation and inference in high-dimensional panel data models with interactive fixed effects. Our approach can be regarded as a non-trivial extension of the very popular common correlated effects (CCE) approach. Roughly speaking, we proceed as follows: We first construct a projection device to eliminate the unobserved factors from the model by applying a dimensionality reduction transform to the matrix of cross-sectionally averaged covariates. The unknown parameters are then estimated by applying lasso techniques to the projected model. For inference purposes, we derive a desparsified version of our lasso-type estimator. While the original CCE approach is restricted to the low-dimensional case where the number of regressors is small and fixed, our methods can deal with both low- and high-dimensional situations where the number of regressors is large and may even exceed the overall sample size. We derive theory for our estimation and inference methods both in the large-T case, where the time series length T tends to infinity, and in the small-T case, where T is a fixed natural number. Specifically, we derive the convergence rate of our estimator and show that its desparsified version is asymptotically normal under suitable regularity conditions. The theoretical analysis of the paper is complemented by a simulation study and an empirical application to characteristic-based asset pricing.

Daniel Peer (University of Vienna)

On the Edgeworth expansion of the maxima and the blessings of dimensionality

Let $X_1,\ldots, X_n \in \mathbb{R}^d$ be a sequence of i.i.d. random vectors, where $d$ may be potentially much larger than $n$. A fundamental problem in high-dimensional statistics concerns normal approximations and convergence properties of the maximum statistic $$M_n=\max_{1\leq k\leq d} \frac{1}{\sqrt{n}}\sum_{i=1}^n X_{i,k},$$ whose study was initiated in seminal works by Chernozhukov, Chetverikov and Kato. A next step in understanding the asymptotic properties of $M_n$ and accompanying quantile approximations is the development of Edgeworth-type expansions and corresponding bootstrap methods. A very recent result in this direction was established by Koike, developing an Edgeworth expansion for $\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i$ based on Stein kernels, subject to some regularity conditions. In our project, we view the problem through the lens of Poisson approximations to directly construct an Edgeworth expansion for $M_n$. Our main assumptions are a Cramér-type condition for all pairs of components of $X_i$ and a notion of weak dependence across the dimension. Utilizing this expansion, we obtain second-order approximations for $\mathbb{P}(M_n\leq x)$ and the quantiles of $M_n$. Under suitable uniformity assumptions on the moments across components, we improve these convergence rates to third and higher orders. Furthermore, we extend our results to the studentized case, that is, to the statistic $\max_{1\leq k\leq d} T_{n,k}$, where $T_{n,k}$ are the component-wise Student-t statistics.
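
As a concrete reading of the definition of $M_n$, the following is a minimal sketch that evaluates the maximum statistic for a data matrix stored as a list of rows. The function name max_statistic, the small worked input, and the simulated Gaussian data with d larger than n are assumptions for illustration only; the sketch does not implement the Edgeworth expansion or any bootstrap procedure.

```python
import math
import random


def max_statistic(X):
    """Return M_n = max over k of (1 / sqrt(n)) * sum_i X[i][k].

    X is a list of n observations, each a list of d coordinates.
    """
    n = len(X)
    d = len(X[0])
    col_sums = [sum(row[k] for row in X) for k in range(d)]
    return max(s / math.sqrt(n) for s in col_sums)


# Small worked input: column sums are 4 and 0, so M_n = 4 / sqrt(2).
X = [[1.0, -1.0], [3.0, 1.0]]
m_n = max_statistic(X)

# Hypothetical high-dimensional example with d much larger than n.
rng = random.Random(1)
n, d = 50, 2000
data = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
print(max_statistic(data))
```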

Daria Tieplova (Aarhus University)

Testing approximate sphericity for high-dimensional covariance matrices

Exact testing of model assumptions is often of limited relevance, especially in high-dimensional settings. Structural assumptions on large-dimensional covariance matrices such as sphericity are rarely expected to hold exactly for real data, and practitioners are often primarily interested in whether such model assumptions are approximately satisfied. In this work, we propose a test for approximate sphericity of high-dimensional covariance matrices, where the tolerated level of deviation from sphericity can be chosen by the user. Our test statistic is based on estimators of the largest and smallest eigenvalues of the population covariance matrix in a high-dimensional regime, where the corresponding sample eigenvalues are not consistent. We derive theoretical guarantees showing that the test keeps the prescribed asymptotic level under the null hypothesis and is power consistent under the alternative. Our key theoretical contribution is a joint central limit theorem for the estimators of the extreme eigenvalues of the population covariance matrix, provided the corresponding eigenvalues exceed the critical phase transition threshold.
