# Statistical methodology

Maximilian Linde

Jorge Tendeiro

Eric-Jan Wagenmakers

Don van Ravenzwaaij

Some important research questions require the ability to find evidence for two conditions being practically equivalent. This is impossible to accomplish within the traditional frequentist null hypothesis significance testing framework; hence, other methodologies must be utilized. We explain and illustrate three approaches for finding evidence for equivalence: The frequentist two one-sided tests procedure, the Bayesian highest density interval region of practical equivalence procedure, and the Bayes factor interval null procedure. We compare the classification performances of these three approaches for various plausible scenarios. The results indicate that the Bayes factor interval null approach compares favorably to the other two approaches in terms of statistical power. Critically, compared to the Bayes factor interval null procedure, the two one-sided tests and the highest density interval region of practical equivalence procedures have limited discrimination capabilities when the sample size is relatively small: specifically, in order to be practically useful, these two methods generally require over 250 cases within each condition when rather large equivalence margins of approximately 0.2 or 0.3 are used; for smaller equivalence margins even more cases are required. Because of these results, we recommend that researchers rely more on the Bayes factor interval null approach for quantifying evidence for equivalence, especially for studies that are constrained on sample size.
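To make the logic of the two one-sided tests procedure concrete, the following is a minimal sketch, not the authors' implementation. It assumes a large-sample normal (z) approximation instead of the usual t distribution, a standardized mean difference with unit standard deviation in each group, and hypothetical helper names (`tost_z`, `se_standardized`). Equivalence is declared only when both one-sided null hypotheses are rejected.

```python
import math
from statistics import NormalDist


def tost_z(mean_diff, std_error, margin, alpha=0.05):
    """Two one-sided tests using a normal approximation (an assumption).

    Returns the larger of the two one-sided p-values and whether
    equivalence within [-margin, +margin] is declared at level alpha.
    """
    z = NormalDist()
    # H0: true difference <= -margin, rejected for large positive z
    p_lower = 1.0 - z.cdf((mean_diff + margin) / std_error)
    # H0: true difference >= +margin, rejected for large negative z
    p_upper = z.cdf((mean_diff - margin) / std_error)
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


def se_standardized(n_per_group):
    """Standard error of a standardized mean difference, assuming SD = 1 per group."""
    return math.sqrt(2.0 / n_per_group)


# Observed difference of exactly zero, equivalence margin of 0.2
p_small, eq_small = tost_z(0.0, se_standardized(50), margin=0.2)
p_large, eq_large = tost_z(0.0, se_standardized(250), margin=0.2)
print(f"n = 50 per group:  p = {p_small:.4f}, equivalent = {eq_small}")
print(f"n = 250 per group: p = {p_large:.4f}, equivalent = {eq_large}")
```

In this toy setup, even a perfectly null observed difference does not support equivalence with 50 cases per condition, whereas it does with 250, which illustrates why the procedure needs large samples.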

Accurate measurement requires maximising the correlation between true scores and measured scores. Classical psychometric concepts such as construct validity and reliability are often difficult to apply in experimental contexts. To overcome this challenge, calibration has recently been suggested as a generic framework for experimental research. In this approach, a calibration experiment is performed to impact the latent attribute in question. The a priori intended true scores can then serve as a criterion, and their correlation with measured scores, termed retrodictive validity, is used to evaluate a measurement method. It has been shown that under plausible assumptions, increasing retrodictive validity is guaranteed to increase measurement accuracy. Since calibration experiments will be performed in finite samples, it is desirable to design them in a way that minimises the sample variance of retrodictive validity estimators. This is the topic of the current note. For arbitrary distributions of true and measured scores, we analytically derive the asymptotic variance of the sample estimator of retrodictive validity. We analyse qualitatively how different distribution features affect estimator variance. Then, we numerically simulate asymptotic and finite-sample estimator variance for various distributions with combinations of feature values. First, we find that it is preferable to use uniformly distributed (if possible discrete) experimental treatments in calibration experiments. Second, inverse sigmoid systematic aberration has a large impact on estimator variance. Finally, reducing imprecision aberration decreases estimator variance in many but not all scenarios. From these findings, we derive recommendations for the design of, and resource investment in, calibration experiments.
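The notion of finite-sample estimator variance can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' analysis: retrodictive validity is estimated as the Pearson correlation between intended and measured scores, treatments take four equally frequent discrete levels, and measurement noise is additive Gaussian. The function names, levels, and noise level are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_variance(n, levels=(0.0, 1.0, 2.0, 3.0), noise_sd=1.0,
                      reps=500, seed=1):
    """Variance of the retrodictive validity estimate across simulated experiments."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        # Uniform discrete treatments: each level is used equally often
        intended = [levels[i % len(levels)] for i in range(n)]
        # Measured scores: intended scores plus Gaussian noise (an assumption)
        measured = [t + rng.gauss(0.0, noise_sd) for t in intended]
        estimates.append(pearson_r(intended, measured))
    mean_est = sum(estimates) / reps
    return sum((e - mean_est) ** 2 for e in estimates) / (reps - 1)


var_small = simulate_variance(n=20)
var_large = simulate_variance(n=200)
print(f"estimator variance, n = 20:  {var_small:.5f}")
print(f"estimator variance, n = 200: {var_large:.5f}")
```

Swapping in a different treatment distribution or noise model in `simulate_variance` is one way to explore how such design choices change the spread of the estimator.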

Julien Musolino

Prof. Pernille Hemmer

Intentional Binding (IB), the subjective underestimation of the time interval between a voluntary action and its associated outcome, is standardly regarded as an implicit measure of the sense of agency. Here, we reanalyzed results from a publicly available IB experiment (Weller et al., 2020) to evaluate three alternative explanations for their results: sequential dependencies, memory (i.e., regression to the mean), and boundary effects. The dataset contained subjective estimates of outcomes for time intervals of 100, 400, and 700 ms. Aggregate results revealed overestimation for the 100 and 400 ms intervals and underestimation for the 700 ms interval. Controlling for sequential dependencies did not change this pattern of results. We then modeled the data using a simple Bayesian model of memory to evaluate the role of expectations over temporal intervals. Summary statistics extracted from the data were used as parameters in the model. The simulation produced a pattern of regression to the mean qualitatively similar to the observed data. Model simulations reproduced the behavioral data for the two longer time intervals, but slightly underestimated the observed overestimation at the shortest time interval. We ruled out IB as the explanation for this overestimation at the shortest time interval since the hallmark of IB is underestimation. Instead, a boundary effect likely accounts for the overestimation. In sum, the results from this dataset can be fully accounted for as manifestations of memory (i.e., regression to the mean) and a boundary effect. Crucially, no appeal to intentional binding or agency measurements of any kind is necessary.
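Regression to the mean in a Bayesian account of memory can be sketched with a conjugate Gaussian model, in which the reported interval is a precision-weighted average of the noisy memory trace and the prior expectation. This is an illustrative assumption about the model's form, not the model fitted in the reanalysis; the prior mean, prior standard deviation, and memory noise below are hypothetical values, not the summary statistics extracted from the dataset.

```python
def posterior_mean_estimate(observed_ms, prior_mean_ms, prior_sd_ms, noise_sd_ms):
    """Posterior mean under a Gaussian prior and Gaussian memory noise.

    The weight on the observation grows as the prior becomes more diffuse
    relative to the memory noise.
    """
    weight = prior_sd_ms ** 2 / (prior_sd_ms ** 2 + noise_sd_ms ** 2)
    return weight * observed_ms + (1.0 - weight) * prior_mean_ms


intervals_ms = [100, 400, 700]
# Hypothetical prior centred on the mean of the presented intervals
prior_mean_ms = sum(intervals_ms) / len(intervals_ms)

estimates = {
    t: posterior_mean_estimate(t, prior_mean_ms, prior_sd_ms=250.0, noise_sd_ms=150.0)
    for t in intervals_ms
}
for t, est in estimates.items():
    print(f"true interval {t} ms -> expected estimate {est:.1f} ms")
```

With these made-up values the shortest interval is overestimated and the longest is underestimated, which is the qualitative signature of regression to the mean. A prior centred exactly on 400 ms predicts no bias at 400 ms, so reproducing the reported overestimation there would require a different prior mean.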

Alexander John Etz

Michael Lee

If two models account for data equally well, it is widely accepted that we should select the simplest one. One way to formalize this principle is through measures of model complexity that quantify the range of outcomes a model predicts. According to Roberts and Pashler (2000), however, this is only part of the story. They emphasize that a simple model is one that is falsifiable because it makes surprising predictions, which requires measuring how likely it is that data could have been observed that the model does not predict. We propose a new measure that includes both of these criteria, based on Kullback-Leibler (KL) divergence. Our measure involves the models’ prior predictive distributions, which correspond to the range of predictions they make, and a data prior, which corresponds to the range of possible observable outcomes in an experiment designed to evaluate the models. We propose that model A is simpler than model B if the KL divergence from the prior predictive distribution of model A to the data prior is greater than that of model B. This measure formalizes the idea that a model is simpler if its predictions are more surprising and more falsifiable. To demonstrate this new measure, we present a worked example involving competing models of the widely studied memory process of free recall. The example involves a data prior based on the empirical regularity provided by the serial position curve. We show how the data prior helps measure aspects of model complexity not captured by measuring the range of predictions made by models, and influences which model is chosen.
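The comparison rule can be sketched for discrete outcome distributions. Several things here are assumptions: the direction of the divergence is taken to be KL(prior predictive || data prior), which is only one reading of "from the prior predictive distribution to the data prior"; the four-outcome distributions are invented for illustration; and the function name is hypothetical. It is not the authors' worked example.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions on the same outcomes.

    Terms with p_i = 0 contribute nothing; q_i must be positive wherever
    p_i is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical data prior over four possible experimental outcomes
data_prior = [0.4, 0.3, 0.2, 0.1]
# Model A concentrates its prior predictive mass on one outcome
model_a = [0.85, 0.05, 0.05, 0.05]
# Model B spreads its prior predictive mass evenly
model_b = [0.25, 0.25, 0.25, 0.25]

kl_a = kl_divergence(model_a, data_prior)
kl_b = kl_divergence(model_b, data_prior)
simpler = "A" if kl_a > kl_b else "B"
print(f"KL for model A: {kl_a:.3f}, KL for model B: {kl_b:.3f}, simpler: {simpler}")
```

Under this reading, the model whose predictions depart more from what the data prior leads one to expect receives the larger divergence and is therefore classed as simpler.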

Dr. Craig Stark

Dr. Shauna Stark

Michael Lee

The Mnemonic Similarity Task (MST: Stark et al., 2019) is a modified recognition memory task designed to place strong demand on pattern separation. The sensitivity and reliability of the MST make it an extremely valuable tool in clinical settings, where it has been used to identify hippocampal dysfunction associated with healthy aging, dementia, schizophrenia, depression, and other disorders. As with any test used in a clinical setting, it is especially important for the MST to be administered as efficiently as possible. We apply adaptive design optimization methods (Myung et al., 2013) to optimize the presentation of test stimuli in accordance with previous responses. This optimization is based on a novel signal-detection model of an individual’s memory capabilities and decision-making processes. We demonstrate that the cognitive model is able to describe people’s behavior and measure their ability to separate patterns. We also demonstrate that the adaptive design optimization approach generally produces a significant reduction in the number of test stimuli needed to provide these measures.
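The core loop of adaptive design optimization can be sketched as follows: pick the stimulus whose response is expected to be most informative about the person's parameters, observe the response, and update the posterior. Everything specific here is an assumption made for illustration: the one-parameter response model in `p_correct` is a toy stand-in and not the signal-detection model described above, and the ability grid, similarity levels, and function names are hypothetical.

```python
import math
from statistics import NormalDist


def p_correct(ability, similarity):
    """Toy response model (an assumption): chance of correctly rejecting a lure."""
    return NormalDist().cdf(ability * (1.0 - similarity))


def entropy(p):
    """Binary entropy in nats, treating 0 * log(0) as 0."""
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)


def expected_information_gain(posterior, abilities, similarity):
    """Mutual information between the response and ability for one design."""
    likelihoods = [p_correct(a, similarity) for a in abilities]
    marginal = sum(w * p for w, p in zip(posterior, likelihoods))
    conditional = sum(w * entropy(p) for w, p in zip(posterior, likelihoods))
    return entropy(marginal) - conditional


def choose_design(posterior, abilities, designs):
    """Select the lure similarity with the highest expected information gain."""
    return max(designs,
               key=lambda s: expected_information_gain(posterior, abilities, s))


def update(posterior, abilities, similarity, correct):
    """Bayesian update of the grid posterior after one observed response."""
    weights = []
    for w, a in zip(posterior, abilities):
        p = p_correct(a, similarity)
        weights.append(w * (p if correct else 1.0 - p))
    total = sum(weights)
    return [w / total for w in weights]


abilities = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]   # hypothetical parameter grid
designs = [0.1, 0.3, 0.5, 0.7, 0.9]          # hypothetical lure similarities
prior = [1.0 / len(abilities)] * len(abilities)

next_design = choose_design(prior, abilities, designs)
posterior = update(prior, abilities, next_design, correct=True)
print("chosen similarity:", next_design)
print("posterior:", [round(w, 3) for w in posterior])
```

Repeating the choose-and-update cycle with the new posterior gives a sequence of stimuli tailored to the individual, which is the mechanism by which fewer trials can suffice.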
