Publications
*: Equal contribution. †: Corresponding author.
2025
- [arXiv] Geodesic Difference-in-Differences. Yidong Zhou, Daisuke Kurisu, Taisuke Otsu, and Hans-Georg Müller. arXiv preprint arXiv:2501.17436, 2025
Difference-in-differences (DID) is a widely used quasi-experimental design for causal inference, traditionally applied to scalar or Euclidean outcomes, while extensions to outcomes residing in non-Euclidean spaces remain limited. Existing methods for such outcomes have primarily focused on univariate distributions, leveraging linear operations in the space of quantile functions, but these approaches cannot be directly extended to outcomes in general metric spaces. In this paper, we propose geodesic DID, a novel DID framework for outcomes in geodesic metric spaces, such as distributions, networks, and manifold-valued data. To address the absence of algebraic operations in these spaces, we use geodesics as proxies for differences and introduce the geodesic average treatment effect on the treated (ATT) as the causal estimand. We establish the identification of the geodesic ATT and derive the convergence rate of its sample versions, employing tools from metric geometry and empirical process theory. This framework is further extended to the case of staggered DID settings, allowing for multiple time periods and varying treatment timings. To illustrate the practical utility of geodesic DID, we analyze health impacts of the Soviet Union's collapse using age-at-death distributions and assess effects of U.S. electricity market liberalization on electricity generation compositions.
@article{zhou:25:2, title = {Geodesic Difference-in-Differences}, author = {Zhou, Yidong and Kurisu, Daisuke and Otsu, Taisuke and M{\"u}ller, Hans-Georg}, journal = {arXiv preprint arXiv:2501.17436}, year = {2025}, }
2024
- [JCGS] Wasserstein-Kaplan-Meier Survival Regression. Yidong Zhou and Hans-Georg Müller. Journal of Computational and Graphical Statistics, 2024
Survival analysis plays a pivotal role in medical research, offering valuable insights into the timing of events such as survival time. One common challenge in survival analysis is the necessity to adjust the survival function to account for additional factors, such as age, gender, and ethnicity. We propose an innovative regression model for right-censored survival data across heterogeneous populations, leveraging the Wasserstein space of probability measures. Our approach models the probability measure of survival time and the corresponding non-parametric Kaplan-Meier estimator for each subgroup as elements of the Wasserstein space. The Wasserstein space provides a flexible framework for modeling heterogeneous populations, allowing us to capture complex relationships between covariates and survival times. We address an underexplored aspect by deriving the non-asymptotic convergence rate of the Kaplan-Meier estimator to the underlying probability measure in terms of the Wasserstein metric. The proposed model is supported with a solid theoretical foundation including pointwise and uniform convergence rates, along with an efficient algorithm for model fitting. The proposed model effectively accommodates random variation that may exist in the probability measures across different subgroups, demonstrating superior performance in both simulations and two case studies compared to the Cox proportional hazards model and other alternative models.
@article{zhou:24, title = {Wasserstein-{K}aplan-{M}eier Survival Regression}, author = {Zhou, Yidong and M{\"u}ller, Hans-Georg}, journal = {Journal of Computational and Graphical Statistics}, pages = {1--11}, year = {2024}, publisher = {Taylor \& Francis}, }
- [Biometrics] Wasserstein Regression with Empirical Measures and Density Estimation for Sparse Data. Yidong Zhou and Hans-Georg Müller. Biometrics, 2024
The problem of modeling the relationship between univariate distributions and one or more explanatory variables lately has found increasing interest. Existing approaches proceed by substituting proxy estimated distributions for the typically unknown response distributions. These estimates are obtained from available data but are problematic when for some of the distributions only few data are available. Such situations are common in practice and cannot be addressed with currently available approaches, especially when one aims at density estimates. We show how this and other problems associated with density estimation such as tuning parameter selection and bias issues can be side-stepped when covariates are available. We also introduce a novel version of distribution-response regression that is based on empirical measures. By avoiding the preprocessing step of recovering complete individual response distributions, the proposed approach is applicable when the sample size available for each distribution varies and especially when it is small for some of the distributions but large for others. In this case, one can still obtain consistent distribution estimates even for distributions with only few data by gaining strength across the entire sample of distributions, while traditional approaches where distributions or densities are estimated individually fail, since sparsely sampled densities cannot be consistently estimated. The proposed model is demonstrated to outperform existing approaches through simulations and Environmental Influences on Child Health Outcomes (ECHO) data.
@article{zhou:24:2, title = {Wasserstein Regression with Empirical Measures and Density Estimation for Sparse Data}, author = {Zhou, Yidong and M{\"u}ller, Hans-Georg}, journal = {Biometrics}, volume = {80}, number = {4}, pages = {ujae127}, year = {2024}, }
- [JRSSB] Dynamic Modelling of Sparse Longitudinal Data and Functional Snippets with Stochastic Differential Equations. Yidong Zhou and Hans-Georg Müller. Journal of the Royal Statistical Society Series B: Statistical Methodology, 2024
This paper received the 2024 IMS Hannan Graduate Student Travel Award.
Sparse functional/longitudinal data have attracted widespread interest due to the prevalence of such data in social and life sciences. A prominent scenario where such data are routinely encountered are accelerated longitudinal studies, where subjects are enrolled in the study at a random time and are only tracked for a short amount of time relative to the domain of interest. The statistical analysis of such functional snippets is challenging since information for far-off-diagonal regions of the covariance structure is missing. Our main methodological contribution is to address this challenge by bypassing covariance estimation and instead modelling the underlying process as the solution of a data-adaptive stochastic differential equation. Taking advantage of the interface between Gaussian functional data and stochastic differential equations makes it possible to efficiently reconstruct the target process by estimating its dynamic distribution. The proposed approach allows one to consistently recover forward sample paths from functional snippets at the subject level. We establish the existence and uniqueness of the solution to the proposed data-driven stochastic differential equation and derive rates of convergence for the corresponding estimators. The finite sample performance is demonstrated with simulation studies and functional snippets arising from a growth study and spinal bone mineral density data.
@article{zhou:24:3, title = {Dynamic Modelling of Sparse Longitudinal Data and Functional Snippets with Stochastic Differential Equations}, author = {Zhou, Yidong and M{\"u}ller, Hans-Georg}, journal = {Journal of the Royal Statistical Society Series B: Statistical Methodology}, pages = {qkae116}, year = {2024}, publisher = {Oxford University Press UK}, }
- [arXiv] Deep Fréchet Regression. Su I Iao*, Yidong Zhou*, and Hans-Georg Müller. arXiv preprint arXiv:2407.21407, 2024
This paper was selected as a 2025 Student Paper Award Finalist in the Nonparametric Section of the American Statistical Association.
Advancements in modern science have led to the increasing availability of non-Euclidean data in metric spaces. This paper addresses the challenge of modeling relationships between non-Euclidean responses and multivariate Euclidean predictors. We propose a flexible regression model capable of handling high-dimensional predictors without imposing parametric assumptions. Two primary challenges are addressed: the curse of dimensionality in nonparametric regression and the absence of linear structure in general metric spaces. The former is tackled using deep neural networks, while for the latter we demonstrate the feasibility of mapping the metric space where responses reside to a low-dimensional Euclidean space using manifold learning. We introduce a reverse mapping approach, employing local Fréchet regression, to map the low-dimensional manifold representations back to objects in the original metric space. We develop a theoretical framework, investigating the convergence rate of deep neural networks under dependent sub-Gaussian noise with bias. The convergence rate of the proposed regression model is then obtained by expanding the scope of local Fréchet regression to accommodate multivariate predictors in the presence of errors in predictors. Simulations and case studies show that the proposed model outperforms existing methods for non-Euclidean responses, focusing on the special cases of probability measures and networks.
@article{zhou:24:4, title = {Deep Fr{\'e}chet Regression}, author = {Iao, Su I and Zhou, Yidong and M{\"u}ller, Hans-Georg}, journal = {arXiv preprint arXiv:2407.21407}, year = {2024}, }
- [arXiv] Geodesic Causal Inference. Daisuke Kurisu, Yidong Zhou, Taisuke Otsu, and Hans-Georg Müller. arXiv preprint arXiv:2406.19604, 2024
Adjusting for confounding and imbalance when establishing statistical relationships is an increasingly important task, and causal inference methods have emerged as the most popular tool to achieve this. Causal inference has been developed mainly for regression relationships with scalar responses and also for distributional responses. We introduce here a general framework for causal inference when responses reside in general geodesic metric spaces, where we draw on a novel geodesic calculus that facilitates scalar multiplication for geodesics and the quantification of treatment effects through the concept of geodesic average treatment effect. Using ideas from Fréchet regression, we obtain a doubly robust estimation of the geodesic average treatment effect and results on consistency and rates of convergence for the proposed estimators. We also study uncertainty quantification and inference for the treatment effect. Examples and practical implementations include simulations and data illustrations for compositional responses as encountered for U.S. statewise energy source data, where we study the effect of coal mining; network data corresponding to New York taxi trips, where the effect of the COVID-19 pandemic is of interest; and connectivity networks, where we study the effect of Alzheimer's disease.
@article{zhou:24:5, title = {Geodesic Causal Inference}, author = {Kurisu, Daisuke and Zhou, Yidong and Otsu, Taisuke and M{\"u}ller, Hans-Georg}, journal = {arXiv preprint arXiv:2406.19604}, year = {2024}, }
2023
- [SR] Network Evolution of Regional Brain Volumes in Young Children Reflects Neurocognitive Scores and Mother's Education. Yidong Zhou, Hans-Georg Müller, Changbo Zhu, Yaqing Chen, Jane-Ling Wang, Jonathan O'Muircheartaigh, Muriel Bruchhage, and Sean Deoni. Scientific Reports, 2023
The maturation of regional brain volumes from birth to preadolescence is a critical developmental process that underlies emerging brain structural connectivity and function. Regulated by genes and environment, the coordinated growth of different brain regions plays an important role in cognitive development. Current knowledge about structural network evolution is limited, partly due to the sparse and irregular nature of most longitudinal neuroimaging data. In particular, it is unknown how factors such as mother's education or sex of the child impact the structural network evolution. To address this issue, we propose a method to construct evolving structural networks and study how the evolving connections among brain regions as reflected at the network level are related to maternal education and biological sex of the child and also how they are associated with cognitive development. Our methodology is based on applying local Fréchet regression to longitudinal neuroimaging data acquired from the RESONANCE cohort, a cohort of healthy children (245 females and 309 males) ranging in age from 9 weeks to 10 years. Our findings reveal that sustained highly coordinated volume growth across brain regions is associated with lower maternal education and lower cognitive development. This suggests that higher neurocognitive performance levels in children are associated with increased variability of regional growth patterns as children age.
@article{zhou:23, title = {Network Evolution of Regional Brain Volumes in Young Children Reflects Neurocognitive Scores and Mother's Education}, author = {Zhou, Yidong and M{\"u}ller, Hans-Georg and Zhu, Changbo and Chen, Yaqing and Wang, Jane-Ling and O'Muircheartaigh, Jonathan and Bruchhage, Muriel and Deoni, Sean}, journal = {Scientific Reports}, volume = {13}, number = {1}, pages = {2984}, year = {2023}, publisher = {Nature Publishing Group UK London}, }
2022
- [JMAA] Learning Delay Dynamics for Multivariate Stochastic Processes, with Application to the Prediction of the Growth Rate of COVID-19 Cases in the United States. Paromita Dubey, Yaqing Chen*, Álvaro Gajardo*, Satarupa Bhattacharjee*, Cody Carroll*, Yidong Zhou*, Han Chen*, and Hans-Georg Müller. Journal of Mathematical Analysis and Applications, 2022
Delay differential equations form the underpinning of many complex dynamical systems. The forward problem of solving random differential equations with delay has received increasing attention in recent years. Motivated by the challenge to predict the COVID-19 caseload trajectories for individual states in the U.S., we target here the inverse problem. Given a sample of observed random trajectories obeying an unknown random differential equation model with delay, we use a functional data analysis framework to learn the model parameters that govern the underlying dynamics from the data. We show the existence and uniqueness of the analytical solutions of the population delay random differential equation model when one has discrete time delays in the functional concurrent regression model and also for a second scenario where one has a delay continuum or distributed delay. The latter involves a functional linear regression model with history index. The derivative of the process of interest is modeled using the process itself as predictor and also other functional predictors with predictor-specific delayed impacts. This dynamics learning approach is shown to be well suited to model the growth rate of COVID-19 for the states that are part of the U.S., by pooling information from the individual states, using the case process and concurrently observed economic and mobility data as predictors.
@article{dubey:21, title = {Learning Delay Dynamics for Multivariate Stochastic Processes, with Application to the Prediction of the Growth Rate of {COVID}-19 Cases in the {U}nited {S}tates}, author = {Dubey, Paromita and Chen, Yaqing and Gajardo, {\'A}lvaro and Bhattacharjee, Satarupa and Carroll, Cody and Zhou, Yidong and Chen, Han and M{\"u}ller, Hans-Georg}, journal = {Journal of Mathematical Analysis and Applications}, volume = {514}, number = {2}, pages = {125677}, year = {2022}, publisher = {Elsevier}, }
- [JMLR] Network Regression with Graph Laplacians. Yidong Zhou and Hans-Georg Müller. Journal of Machine Learning Research, 2022
This paper was selected as a 2023 Student Paper Award Finalist in the Nonparametric Section of the American Statistical Association.
Network data are increasingly available in various research fields, motivating statistical analysis for populations of networks, where a network as a whole is viewed as a data point. The study of how a network changes as a function of covariates is often of paramount interest. However, due to the non-Euclidean nature of networks, basic statistical tools available for scalar and vector data are no longer applicable. This motivates an extension of the notion of regression to the case where responses are network data. Here we propose to adopt conditional Fréchet means implemented as M-estimators that depend on weights derived from both global and local least squares regression, extending the Fréchet regression framework to networks that are quantified by their graph Laplacians. The challenge is to characterize the space of graph Laplacians to justify the application of Fréchet regression. This characterization then leads to asymptotic rates of convergence for the corresponding M-estimators by applying empirical process methods. We demonstrate the usefulness and good practical performance of the proposed framework with simulations and with network data arising from resting-state fMRI in neuroimaging, as well as New York taxi records.
@article{zhou:22:2, title = {Network Regression with Graph {L}aplacians}, author = {Zhou, Yidong and M{\"u}ller, Hans-Georg}, journal = {Journal of Machine Learning Research}, volume = {23}, number = {320}, pages = {1--41}, year = {2022}, }