Why Discovering 'Nothing' in Science Can Be So Incredibly Important

In science, as in life, we all like to celebrate the big news.

We confirmed the existence of black holes by the ripples they create in spacetime. We photographed the shadow of a black hole. We figured out how to edit DNA. We found the Higgs boson!

What we don't usually hear about is the years of back-breaking, painstaking hard work that delivers inconclusive results, appearing to provide no evidence for the questions scientists ask – the incremental application of constraints that bring us ever closer to finding answers and making discoveries.

Yet without non-detections – what we call the null result – the progress of science would often be slowed and stymied. Null results drive us forward. They keep us from repeating the same errors, and shape the direction of future studies.

There is, in fact, much that we can learn from nothing.

Often, however, null results don't make it to scientific publications. Not only can this generate significant inefficiencies in the way science is done; it's also an indicator of potentially bigger problems in current scientific publication processes.

"We know that there is this powerfully distorting effect of failing to publish null results," psychologist Marcus Munafò of the University of Bristol told ScienceAlert.

"Solving the problem isn't straightforward, because it's very easy to generate a null result by running a bad experiment. If we were to flood the literature with even more null results by generating poor-quality studies, that wouldn't necessarily help the fundamental problem, which is, ultimately, to get the right answers to important questions."

Defining the problem

The null hypothesis defines the parameters under which the results of a study would be indistinguishable from background noise. Gravitational wave interferometry is a nice, neat example: The signals produced by gravitational waves are very faint, and there are many sources of noise that can affect LIGO's sensors. A confirmed detection could only be made once those sources were conclusively ruled out.

If those sources cannot be ruled out, that is what's called a null result. That doesn't mean gravitational waves were not detected; it just means we can't determine that we've made a detection with any certainty.

[Image: the Virgo interferometer]

This can be really useful, and in some fields – like cosmology, and gravitational wave astronomy – the publication of null results helps scientists to adjust the parameters of future experiments.

In other fields, where results can be more qualitative than quantitative, null results are less valued.

"Part of the problem in a lot of behavioral and medical science is that we can't make quantitative predictions," Munafò explained.

"So, we're just looking for evidence that there is an effect or an association, irrespective of its magnitude, which then creates this problem when, if we fail to find evidence that there is an effect, we haven't put any parameters on whether or not an effect that small would actually matter – biologically, theoretically, clinically. We can't do anything with it."

An extraordinary nothing

When wielded correctly, a null result can yield some extraordinary findings.

One of the most famous examples is the Michelson-Morley experiment, conducted by physicists Albert A. Michelson and Edward W. Morley in 1887. The pair were attempting to detect the velocity of our planet with respect to 'luminiferous aether' – the medium through which light was thought to travel, much as waves travel through water.

As Earth moved through space, they hypothesized, oncoming waves of light rippling through a perfectly still, Universe-wide ocean of aether should move at a slightly different speed to those rippling out at right angles to it. Their experiments were ingenious and painstaking, but of course, they detected nothing of the sort. The null result showed that the speed of light was constant in all reference frames, which Einstein would go on to explain with his special theory of relativity.

[Image: a Michelson interferometer]

In other instances, null results can help us to design instrumentation and future experiments. The detection of colliding black holes via gravitational waves only took place after years of null detections allowed for improvements to the design of the gravitational wave interferometer. Meanwhile, at CERN, physicists have so far made no detection of a dark matter signal in particle collision experiments, which has allowed constraints to be placed on what dark matter could be.

"Null experiments are just a part of the full range of observations," astrophysicist George Smoot III of UC Berkeley told ScienceAlert. "Sometimes you see something new and amazing and sometimes you see there is not."

When it's all about hard numbers, null results are often easier to interpret. In other fields, there can be little incentive to publish.

The implications of non-detection aren't always clear, and studies that do make a significant finding receive more attention, more funding, and are more likely to be cited. Clinical trials with positive results are more likely to be published than those with negative or null results. When it comes to deciding who is going to get a research grant, these things matter.

Scientists, too, are very busy people, with many potential lines of inquiry they could be pursuing. Why chase the null hypothesis when you could be using your time doing research that is more likely to be seen and lead to further research opportunities?

To publish or null

As well as leaving out important context that could help us learn something new about our world, the non-publication of null results can also lead to inefficiency – and, worse, could even discourage young scientists from pursuing a career, as Munafò found first-hand. As a young PhD student, he set about replicating an experiment that had found a certain effect, and thought his results would naturally be the same.

"And it didn't work. I didn't find that effect in my experiment. So as an early career researcher, you think, well, I must have done something wrong, maybe I'm not cut out for this," he said.

"I was fortunate enough to bump into a senior academic who said, 'Oh, yeah, no one could replicate that finding'. If you've been in the field long enough, you find out about this stuff through conversations at conferences, and your own experiences, and so on. But you have to stay in the field long enough to find that out. If you aren't lucky enough to have that person tell you that it's not your fault, it's just the fact that the finding itself is pretty flaky, you might end up leaving the field."

Academic publishing has been grappling with this problem, too. In 2002, a unique project – the Journal of Negative Results in BioMedicine – was established to encourage the publication of results that might not otherwise see the light of day. It closed in 2017, claiming to have succeeded in its mission, as many other journals had followed its lead in publishing more articles with negative or null results.

However, encouraging scientists to bring their negative results to light may sometimes prove close to fruitless. On the one hand, there's the potential for a glut of poorly conceived, poorly designed, poorly conducted studies. But the opposite is possible, too.

In 2014, the Journal of Business and Psychology published a null results special issue, and received surprisingly few submissions. This, the editors deduced, could be because scientists themselves are conditioned to believe that null results are worthless. In 2019, the Berlin Institute of Health announced a reward for replication studies, explicitly welcoming null results, yet only received 22 applications.

These attitudes could change. We've seen that it can happen; Smoot, for instance, has gleaned a great deal of insight from null detections.

[Image: the Crab Nebula]

"Search for antimatter in the cosmic rays – that was a null experiment and convinced me that there was no serious amount of antimatter in our galaxy and likely on a much larger scale, even though there was a great symmetry between matter and antimatter," he said.

"Next null experiment was testing for violation of angular momentum and rotation of the Universe. While it is conceivable, the null result is very important to our worldview and cosmology view and it was the initial motivation for me to use the cosmic microwave background radiation to observe and measure the Universe. That led to more null results, but also some major discoveries."

Ultimately, it might be a slow process. Publication needs to incentivize not null results in and of themselves, but studies designed in such a way that these results can be interpreted and published in their appropriate context. By no means a trivial ask, but one crucial to scientific progress.

"Getting the right answer to the right question matters," Munafò said.

"And sometimes that will mean null results. But I think we need to be careful not to make the publication of a null result an end in itself; it's a means to an end, if it helps us get to the right answer, but it needs more than just the publication of null results to get there.

"Ultimately, what we need are better formulated questions and better designed studies so that our results are solid and informative, irrespective of what they are."


10 Things Your Null Result Might Mean

After the excitement and hard work of running a field experiment is over, it’s not uncommon to hear policymakers and researchers express disappointment when they end up hearing that the intervention did not have a detectable impact. This guide explains that a null result rarely means “the intervention didn’t work,” even though that tends to be the shorthand many people use. Instead, a null result can reflect the myriad design choices that policy implementers and researchers make in the course of developing and testing an intervention. After all, people tend to label hypothesis tests with high p-values as “null results”, and hypothesis tests (as summaries of information about design and data) can produce large p-values for many reasons. Policymakers can make better decisions about what to do with a null result when they understand how and why they got that result.

Imagine you lead the department of education for a government and are wondering about how to boost student attendance. You decide to consider a text message intervention that offers individual students counseling. Counselors at each school can help students address challenges specifically related to school attendance. Your team runs a randomized trial of the intervention, and tells you there is a null result.

How should you understand the null result, and what should you do about it? It could be a result of unmet challenges at several stages of your work – in the way the intervention is designed, the way the intervention is implemented, or the way the study is designed. Below are 10 things to consider when interpreting your null result.

1 Intervention Design

1.1 1. Your intervention theory and approach are mismatched to the problem.

You delivered a counseling intervention because you thought that students needed support to address challenges in their home life. However, students who had the greatest needs never actually met with a counselor, in part because they did not trust adults at the school. The theory of change assumed that absenteeism was a function primarily of a student’s personal decisions or family circumstances and that the offer of counseling without changes to school climate would be sufficient; it did not account appropriately for low levels of trust in teacher-student relationships. Therefore, this null effect does not suggest that counseling per se cannot boost attendance, but that counseling in the absence of other structural or policy changes or in the context of low-trust schools may not be sufficient.

How can you tell if …you have a mismatch between your theory of change and the problem that needs to be solved? List all potential barriers and consider how they connect. Does the intervention as designed address only one of those barriers, and, if so, can it succeed without addressing others? Are there assumptions made about one source or one cause that may undermine the success of the intervention?

1.2 2. Your intervention strength or dosage is too low for the problem or outcome of interest.

After talking to experts, you learn that counseling interventions can build trust, but usually require meetings that are more frequent and regular than your intervention offered to have the potential for an effect. Maybe your “dose” of services is too small.

How can you tell if …you did not have a sufficient “dose”? Even if no existing services tackle your problem of interest, consider what is a minimum level, strength, or dose that is both feasible to implement and could yield an effect. When asking sites what they are willing to take on, beware of defaulting to the lowest dose. The more complex the problem or outcome is to move, the stronger or more comprehensive the intervention may need to be.

1.3 3. Your intervention does not represent a large enough enhancement over usual services.

In your position at the state department of education, you learn that students at the target schools were already receiving some counseling and support services. Even though the existing services were not sufficient to boost attendance to the targeted levels, the new intervention did not add enough content or frequency of the counseling services to reach those levels either—the intervention yielded show-up rates that were about the same as existing services. So this null effect does not reflect that counseling has no effect, but rather that the version of counseling your intervention offered was not effective over and above existing counseling services.

How can you tell if …the relative strength of your intervention was not sufficient to yield an effect? Take stock of the structure and content of existing services, and consider if the extent or form in which clients respond to existing services indicates that the theory of change or approach needs to be revised. If the theory holds, use existing services as a benchmark and consider whether your proposed intervention needs to include something supplementary and/or something complementary.

2 Intervention Implementation

Programs rarely roll out exactly as intended, but some variations are more problematic than others.

2.1 4. Your implementation format was not reliable.

In the schools in your study, counseling interventions sometimes occurred in person, sometimes happened by text message, sometimes by phone. Anticipating and allowing for some variation and adaptation is important. Intervention dosage and strength are often not delivered as designed, nor to as many people as expected.

But unplanned variations in format can reflect a host of selection bias issues, such that you cannot disentangle whether counseling as a concept does not work or whether certain formats of outreach did not work. This is especially important to guard against if you intend to test specific channels or mechanisms critical to your theory of change.

How can you tell if …an unreliable format is the reason for your null? Were you able to specify or standardize formats in a checklist? Could you leave enough discretion but still incentivize fidelity? Pre-specifying what the intervention should look like can help staff and researchers monitor along the way and correct inconsistencies or deviations that may affect the results. This could include a training protocol for those implementing the intervention. If nothing was specified or no one was trained, then the lack of consistency may be part of the explanation.

2.2 5. Your intervention and outcome measure are mismatched to your randomization design.

You expected counseling to be more effective in schools with higher student-to-teacher ratios, but did not block-randomize by class size (for more on block randomization, see our guide on 10 Things to Know About Randomization). If your randomization design does not account for class size, you may not be able to tell whether the intervention is in fact more effective for students in high class size schools.

How can you tell if …you have a mismatch between your intervention and randomization design? Consider whether treatment effects could vary, or service delivery might occur in a cluster, or intervention concepts could spill over, and to what extent your randomization design accounted for that.
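
To make the blocking idea concrete, here is a minimal sketch in Python of block randomization on class size. The student records and the two-category class-size variable are hypothetical; a real study would block on whatever pre-treatment variable is expected to drive effect heterogeneity.

```python
import random
from collections import Counter

# Hypothetical student records: (student_id, class_size_category).
students = [(i, "large" if i % 3 == 0 else "small") for i in range(1, 201)]

def block_randomize(units, block_of, seed=42):
    """Within each block, assign half the units to treatment and half to control."""
    rng = random.Random(seed)
    blocks = {}
    for unit in units:
        blocks.setdefault(block_of(unit), []).append(unit)

    assignment = {}
    for block_units in blocks.values():
        rng.shuffle(block_units)
        half = len(block_units) // 2
        for unit in block_units[:half]:
            assignment[unit[0]] = "treatment"
        for unit in block_units[half:]:
            assignment[unit[0]] = "control"
    return assignment

assignment = block_randomize(students, block_of=lambda s: s[1])

# Check balance: counts of (class size category, assignment) pairs.
print(Counter((s[1], assignment[s[0]]) for s in students))
```

Because treatment and control are balanced within each class-size block, comparisons of the two arms can also be made within small-class and large-class schools, which is what lets you examine the interaction you expected.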

3 Study Design

3.1 6. Your study sample includes people whose behavior could not be moved by the intervention.

You ask schools to randomize students into an intervention or control (business-as-usual) group. Some students in both your intervention and control groups will always attend school, while some students will rarely attend school, regardless of what interventions are or are not offered to them. Your intervention's success depends not just on whether students actually receive the message and/or believe it, but also on whether it can shift behavior among the remaining students – the potential responders whose attendance could actually change.

If the proportion of potential responders is too small, then it may be difficult to detect an effect. In addition, your intervention may need to be targeted and modified in some way to address the needs of potential responders.

How can you tell if …the proportion of potential responders may be too small? Take a look at the pre-intervention attendance rate. If it is extremely low, does that rate reflect low demand or structural barriers that may limit the potential for response? Is it so high that it tells us that most people who could respond have already done so (say, 85% or higher)? Even if there is a large proportion of hypothetical potential responders, is it lower when you consider existing barriers preventing students from using counseling services that your intervention is not addressing?
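
A rough back-of-the-envelope check can make that ceiling concrete. The numbers below are entirely hypothetical; the point is only that a high baseline rate plus a small share of reachable absences caps the effect you could ever hope to detect.

```python
# Hypothetical numbers only: how large could the effect possibly be?
baseline_attendance = 0.92   # pre-intervention attendance rate
share_reachable = 0.40       # assumed share of remaining absences counseling could plausibly address

max_effect = (1 - baseline_attendance) * share_reachable
print(f"Largest plausible effect: {max_effect:.1%}")  # about 3.2 percentage points
```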

3.2 7. Your measure is not validated or reliable: It varies too much and systematically across sites.

As a leader of the state’s department of education, you want to measure the effectiveness of your intervention using survey data on student attitudes related to attendance. You learn that only some schools administer a new survey measuring student attitudes, and those with surveys changed the survey items so that there is not the same wording across surveys or schools. If you observe no statistically significant difference on a survey measure that is newly developed or used by only select schools, it may be difficult to know whether the intervention “has no effect” or whether the outcome is measuring something different in each school because of different wording.

How can you tell if … outcome measurement is the problem? Check to see whether the outcome is (1) collected in the same way across your sites and (2) if it means the same thing to the participants as it means to you. In addition, check on any reporting bias and if your study participants or sites face any pressure from inside or outside of their organizations to report or answer in a particular way.

3.3 8. Your outcome is not validated or reliable: It varies too little.

Given the problems with the survey measure, you then decide to use administrative data from student records to measure whether students show up on time to school. But it turns out that schools used a generic definition of “on time” such that almost every student looks like they arrive on time. An outcome that does not have enough variation in it to detect an effect between intervention and control groups can be especially limiting if your intervention potentially could have had different effects on different types of students, but the outcome measure used in the study lacks the precision to capture the effects on different subgroups.

How can you tell if …your null result arises from measures that are too coarse or subject to response biases? Pressures to report a certain kind of outcome faced by people at your sites could again yield this kind of problem with outcome measurement. So, it is again worth investigating the meaning of the outcomes as reported by the sites from the perspective of those doing the reporting. This problem differs from the kind of ceiling and floor effects discussed elsewhere in this guide; it arises more from the strategic calculations of those producing administrative data and less from the natural behavior of those students whose behavior you are trying to change.

3.4 9. Your statistical power is insufficient to detect an effect for the intervention as implemented.

This may sound obvious to people with experience testing interventions at scale. But researchers and policymakers can fall into two traps:

Thinking about statistical significance rather than what represents a meaningful and feasible effect. Although a study with an incredibly large sample size can detect small effects with precision, one does not want to trade precision for meaning. Moreover, an intervention known to be weak during the intervention design is likely to be weaker when implemented, especially across multiple sites or months. So it may not be sufficient to simply enroll more subjects to study an intervention known to be weak (even though strong research design cannot compensate for a weak intervention in any easy or direct way);

Thinking that the only relevant test statistic for an experiment effect is a difference of means (even though we have long known that differences of means are valid but low-powered test statistics when outcomes do not neatly fit into a normal distribution).

How can you tell if …your null result arises mostly from low statistical power? Recall that statistical power depends on (a) effect size or intervention strength, (b) variability in outcomes, (c) the number of independent observations (often well measured with sample size), and (d) the test statistic you use. The previous discussions pointed out ways to learn whether an intervention you thought might be strong was weak, or whether an outcome that you thought might be clear could turn out to be very noisy.

A formal power analysis could also tell you that, given the variability in your outcome and the size of your effect, you would have needed a larger sample size to detect this effect reliably. For example, if you had known about the variability in administration of the treatment or the variability in the outcome (let alone surprises with missing data) in advance, your pre-field power analysis would have told you to use a different sample size.
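
If you prefer not to rely on closed-form power formulas, a quick simulation can play the same role: pick the smallest effect you would care about, plug in the outcome variability you actually observed, and count how often a study of a given size would detect it. The sketch below is a generic illustration with hypothetical numbers (a two-arm comparison analyzed with a t-test), not a reconstruction of any particular study.

```python
import numpy as np
from scipy import stats

def simulated_power(n_per_arm, control_mean, effect, sd, alpha=0.05, n_sims=2000, seed=1):
    """Share of simulated two-arm trials in which a t-test detects the effect."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(control_mean, sd, n_per_arm)
        treated = rng.normal(control_mean + effect, sd, n_per_arm)
        if stats.ttest_ind(treated, control).pvalue < alpha:
            hits += 1
    return hits / n_sims

# Hypothetical: a 3-percentage-point effect on attendance, with an SD of 15 points.
for n in (100, 400, 1600):
    print(f"n per arm = {n:>4}: power ~ {simulated_power(n, 0.80, 0.03, 0.15):.2f}")
```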

A different test statistic can also change a null result into a positive result if, say, the effect is large but does not shift the mean so much as move people who are extreme, or make moderate students extreme. A classic example of this problem occurs with outcomes that have very long tails – such as those involving money, like annual earnings or auction spending. A t-test might produce a p-value of .20 while a rank-based test produces a p-value of <.01. The t-test uses evidence of a shift in averages (means) to reflect on the null hypothesis of no effects. The rank-based test merely asks whether the treatment group outcomes tend to be larger (or smaller) than the control group outcomes, whether or not they differ in means.
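
The sketch below illustrates that point with simulated data: a heavy-tailed, log-normal "earnings"-style outcome receives a sizeable multiplicative treatment effect, and a t-test on the raw values is compared with a rank-based Mann-Whitney test. The distribution, effect size, and sample size are hypothetical, and exact p-values vary from draw to draw, but the rank-based test typically detects the shift far more reliably than the t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 500

# Heavy-tailed outcome (e.g., annual earnings); treatment raises it by 50% for everyone.
control = rng.lognormal(mean=10.0, sigma=2.0, size=n)
treated = rng.lognormal(mean=10.0, sigma=2.0, size=n) * 1.5

t_p = stats.ttest_ind(treated, control, equal_var=False).pvalue
rank_p = stats.mannwhitneyu(treated, control, alternative="two-sided").pvalue
print(f"t-test p = {t_p:.3g}, rank-based p = {rank_p:.3g}")
```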

4 Not all nulls are the result of a flaw in design or implementation!

4.1 10. Your null needs to be published.

If you addressed all the issues above related to intervention design, sample size and research design, and have a precisely estimated null result, it is time to publish. Your colleagues and other researchers need to learn from this finding, so do not keep it to yourself.

When you have a precise null, you do not have a gap in evidence–you are generating evidence.

What can you do to convince editors and reviewers they should publish your null results? This guide should help you reason about your null results and thus explain their importance. If other studies on your topic exist, you can also contextualize your results; for example, follow some of the ideas from Abadie (2020).

For an example, see how Bhatti et al. (2018) – in their study of a Danish governmental voter turnout intervention – used previous work on face-to-face voter turnout (reported on as a meta-analysis in Bhatti et al. (2016)) to contextualize their own small effects.

If you are unable to find a publication willing to include a study with null results in their journal, you can still contribute to the evidence base on the policy area under examination by making your working paper, data, and/or analysis code publicly available. Many researchers choose to do so via their personal websites; in addition, there are repositories (such as the Open Science Framework) that provide a platform for researchers to share their in-progress and unpublished work.


When and How to Interpret Null Results in NIBS: A Taxonomy Based on Prior Expectations and Experimental Design

Tom A. de Graaf

1 Section Brain Stimulation and Cognition, Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, Netherlands

2 Maastricht Brain Imaging Centre, Maastricht, Netherlands

Alexander T. Sack

Experiments often challenge the null hypothesis that an intervention, for instance application of non-invasive brain stimulation (NIBS), has no effect on an outcome measure. In conventional statistics, a positive result rejects that hypothesis, but a null result is meaningless. Informally, however, researchers often do find null results meaningful to a greater or lesser extent. We present a model to guide interpretation of null results in NIBS research. Along a "gradient of surprise," from Replication nulls through Exploration nulls to Hypothesized nulls, null results can be less or more surprising in the context of prior expectations, research, and theory. This influences to what extent we should credit a null result in this greater context. Orthogonal to this, experimental design choices create a "gradient of interpretability," along which null results of an experiment, considered in isolation, become more informative. This is determined by target localization procedure, neural efficacy checks, and power and effect size evaluations. Along the latter gradient, we concretely propose three "levels of null evidence." With caveats, these proposed levels C, B, and A classify how informative an empirical null result is along concrete criteria. Lastly, to further inform, and help formalize, the inferences drawn from null results, Bayesian statistics can be employed. We discuss how this increasingly common alternative to traditional frequentist inference does allow quantification of the support for the null hypothesis, relative to support for the alternative hypothesis. It is our hope that these considerations can contribute to the ongoing effort to disseminate null findings alongside positive results to promote transparency and reduce publication bias.

Introduction

With the advent of digital-only journals, attention to the downsides of publication bias, and preregistration of experiments, the call for dissemination of null results is becoming louder. And indeed, it appears that null results are more often and more easily accepted for publication (see current issue). We support this development not only because we believe the community should have a more complete view of performed experiments (pragmatic argument), but also because we believe null results can be meaningful (interpretability argument). Seven years ago (de Graaf and Sack, 2011) we argued against a dichotomous distinction of positive and negative findings in non-invasive brain stimulation (NIBS) research, discussing criteria that could raise the interpretability of null results. We opened our paper with the familiar adage: absence of evidence is not evidence of absence. We then spent the remainder of the article arguing against it.

Of course, the core message is sound; absence of evidence does not necessarily, or always, imply evidence of absence. In classical statistics, frequentist inference, null results are formally meaningless: there was insufficient evidence in the dataset to reject the null hypothesis, but that is all one can say. The P-value reflects only the likelihood of obtaining an effect minimally as large as observed in the sample (e.g., difference between condition means), on the assumption that the null hypothesis is true. The P-value does not translate into a likelihood that the alternative hypothesis is true or false, and does not reflect the likelihood that the null hypothesis is true (Wagenmakers, 2007; Wagenmakers et al., 2018b). From a formal statistical point of view, a null result thus never constitutes evidence of absence (i.e., evidence for the null hypothesis). But that fact primarily reflects the constraints and limitations of this dominant statistical framework. Informally, null results can be less or more convincing, less or more meaningful, to an experienced researcher. In this article, we analyze on what grounds such meaning is assigned.

Brain stimulation research methodology has developed rapidly in the last decade, with increasingly sophisticated paradigms and applications ( Sandrini et al., 2010 ; Herrmann et al., 2015 ; Romei et al., 2016 ; Ten Oever et al., 2016 ). But many of those still share a fundamental aim of NIBS: to evaluate the causal role, or functional relevance, of a given brain region/mechanism. Such a brain process is often initially found to correlate to some cognitive, emotional, or behavioral function, in a neuroimaging experiment with fMRI or EEG for example, and is thus possibly of crucial importance. But it might also be epiphenomenal, perform a related but different function, or occur as a consequence of still other brain processes that co-occur and are in fact the actual substrate for the task at hand ( de Graaf et al., 2012 ). In this light, NIBS offers something unique in neuroscience: the ability to bring brain activity or specific brain mechanisms (e.g., oscillations) in specific regions or even whole networks under transient experimental control, allowing assessment of their causal relevance ( Sack, 2006 ; de Graaf and Sack, 2014 ).

But the question whether a brain process is causally relevant has two possible outcomes: either it is (positive finding), or it is not (null finding). Considered this way, an a priori rejection of any outcome that is not a positive finding, i.e., complete rejection of null results as not informative, means one can only accept one of those two outcomes. It means NIBS experiments are a waste of time and resources if the null hypothesis is in fact true. It means that confirming the hypothesis that a brain process is not functionally relevant is not possible, and thereby any experiment to test it is doomed from the start. It has traditionally also meant that NIBS experiments with null results were (more) difficult to publish, preventing transparency, and completeness in the available NIBS literature. In sum, this restrictive view severely limits the usefulness of, and range of experimental questions open to, NIBS research.

If we do not want to categorically reject NIBS null results, we need to reflect on what can make them meaningful. Certain design decisions and parameters particularly contribute to the interpretability of NIBS null results, including the localization procedure, implementation of neural efficacy checks, and power and effect size. So how can one optimize experimental design or planned data analysis approaches prior to an experiment to maximize the interpretability of potential null results? And after obtaining null results, or reading about null results in publications, how can we assess how informative they are? How much do we let them change our beliefs and inform our own work? In this article, we discuss these issues with the still dominant frequentist inference framework in mind. But we also discuss an alternative statistical framework, Bayesian inference, which is not subject to the same formal limitations. We first outline factors that make NIBS null results particularly difficult to interpret, and how to address them. We then present conceptual handholds to evaluate null results along two orthogonal gradients, leading to a classification scheme of “levels of null evidence.” Lastly, we will explain what Bayesian analysis can contribute to such evaluation in a more formal and quantitative fashion.

What Makes Null Results Difficult to Interpret in NIBS?

In de Graaf and Sack (2011), we discussed a perceived "dichotomy of meaningfulness" in transcranial magnetic stimulation (TMS) research: positive results were considered meaningful, negative results were considered meaningless. In part this might be attributable to the constraints of frequentist inference, but we suggested there were additional concrete arguments against null result interpretation specifically in NIBS: the localization argument, the neural efficacy argument, and the power argument. Though then focused on TMS, these arguments largely apply when it comes to other forms of NIBS, such as low-intensity transcranial electrical stimulation (TES).

According to the localization argument, one cannot be sure that the correct anatomical, or more importantly functional, area was stimulated with NIBS. Many NIBS studies still base their target localization on Talairach coordinates, MRI landmarks, or even skull landmarks. This can certainly be appropriate, scientifically and/or practically, but it means that in many participants the "functional hotspot" underlying a task/behavior might not be affected by NIBS. With TMS, additional error comes from shifting or tilting coils, moving participants, or human error in initial coil placement. With TES, selecting electrode montages is not trivial. Even if a large electrode is placed on the skull, almost certainly covering an underlying functional hotspot, the exact individual anatomy as well as reference electrode placement may determine which neurons are most affected and how effectively they are modulated (Miranda et al., 2006; Wagner et al., 2007; Bikson et al., 2010; Moliadze et al., 2010).

The neural efficacy argument appears related but reflects a separate concern. While “localization” refers to the success of cortical targeting, the “neural efficacy” argument reflects uncertainty about whether there was any effective stimulation at all. In each participant, or even the whole sample: did the NIBS actually modulate neural activity? In TMS, the infinite parameter space (number of pulses, pulse shape, pulse width, intensity, (nested) frequency, coil geometry, etc.) allows for a nearly infinite number of protocols ( Robertson et al., 2003 ), many of which might not achieve neural stimulation. Moreover, between participants anatomy differs, in terms of gyrification or distances between coil and cortex ( Stokes et al., 2005 ). The neural efficacy argument thrives in TES methodological discussions as well, with ongoing investigation on how much of the electrical current even reaches the cortex ( Vöröslakos et al., 2018 ). In short, in NIBS one can often doubt in how many participants neural stimulation/modulation was successful, aside from which cortical area was targeted.

The power argument is more general but applies to NIBS research also. Perhaps a positive finding was not obtained, simply because the experimental design lacked statistical power. Aside from the usual sources of noise in experimental data, the methodological uncertainties inherent to NIBS research, of which localization and neural efficacy are examples, only exacerbate this concern. Moreover, it appears that inter-individual differences in response to NIBS are substantial ( Maeda et al., 2000 ). For instance, inter-individual differences in network states ( Rizk et al., 2013 ), structural organization of the corpus callosum ( Chechlacz et al., 2015 ), and neuronal oscillatory parameters ( Goldsworthy et al., 2012 ; Schilberg et al., 2018 ) have been related to NIBS response. If unknown, or not taken into account, such differences can contribute strongly to reduced statistical power on the group level. The power argument leaves one with the uncomfortable question: perhaps more trials per condition, or more participants in the sample, or even a small change in experimental tasks or design, could have yielded a positive result after all. So how meaningful is a null result in NIBS research, really?

To summarize, if no effect was found: (1) perhaps the intended cortical region was not successfully targeted, (2) perhaps no neural stimulation/modulation took place in some or all participants, or (3) perhaps the experiment lacked power. These arguments indeed make it difficult to draw strong conclusions from null results in NIBS research. But in our view they do not make NIBS null results categorically uninformative (“dichotomy of meaningfulness”). Above, we argued why such a dichotomy would be unfortunate and even wasteful, given the original mission of NIBS to determine whether a brain process is, or is not , functionally relevant for a task. Below, we discuss what can be done in terms of experimental design to address these three arguments, and in turn how one can evaluate such design choices in one’s own work, or research published by others, to evaluate how informative a null result is.

What Can Make Null Results Easier to Interpret in NIBS?

We suggest that the null results of any NIBS experiment, considered in isolation, can be interpreted along a “gradient of interpretability.” Where null results fall along this gradient, i.e., how strongly we allow ourselves to interpret them, is determined by the extent to which the localization argument, neural efficacy argument, and power argument can be countered through experimental design decisions and analysis.

One way to combat the localization argument is through hardware selection. Figure-8 TMS coils stimulate more focally than some (older model) circular coils ( Hallett, 2007 ), electrical currents are more concentrated under the central electrode in a high-density TES montage compared to conventional large electrodes ( Edwards et al., 2013 ). But focality does not help if it is aimed at the wrong cortical target. The localization argument against interpreting null results becomes less problematic as the localization procedure becomes more sophisticated. We previously compared the statistical power inherent to different TMS target site localization procedures, including anatomical landmarks, Talairach coordinates, individual MRI-based landmarks, and individual fMRI-based functional localizers, using frameless stereotactic Neuronavigation to determine and maintain coil positioning ( Sack et al., 2009 ). The number of participants required to obtain a statistically significant effect (calculated based on obtained effect size for each method) rose rapidly, from only 5 using individual functional localizer scans, to 47 using the anatomical landmark (EEG 10–20 P4 location) approach. As power rises with localization procedure sophistication, for a given sample size the interpretability of a potential null result rises also (although see e.g., Wagenmakers, 2007 ; Wagenmakers et al., 2015 ). For TES, exciting developments in computational modeling of current flows, in increasingly realistic head models, similarly decrease the strength of the localization argument. Such modeling provides insight about how strongly, and where, electrical currents affect underlying neuronal tissue (e.g., Datta et al., 2009 ).

Clearly such modeling in TES also has a bearing on the neural efficacy argument: modeling increases confidence that current was sufficiently strong to reach and affect the cortical target. This is not to say that all problems are solved for TES, because firstly modeling is not trivial, and secondly a model of current density across cortex does not yet reveal what the neural effects are of this current density. Similarly for TMS, unless it is combined with simultaneous imaging (fMRI, EEG), what happens in the brain after each pulse is principally unknown to us. Maybe the protocol is insufficiently strong to induce action potentials, maybe the induced action potentials are insufficient to induce cognitive/behavioral effects, maybe the stimulation does affect the targeted area but other areas in a network compensate for the insult ( Sack et al., 2005 ). To combat at least some of these concerns, one might implement what we previously called a neural efficacy check ( de Graaf and Sack, 2011 ).

One way to see if NIBS affected brain activity, is to indeed actually measure brain activity, by combining NIBS with fMRI or EEG ( Reithler et al., 2011 ), or all together simultaneously ( Peters et al., 2013 ): the concurrent neuroimaging approach. Specifically for TMS, if the target region has a behavioral marker (motor response, phosphene perception), both the localization and neural effects of a TMS protocol can be verified using such markers independently from the behavioral tasks of interest. Hunting procedures, using independent tasks known to be affected by certain TMS protocols over the target regions, can also be used to achieve the same thing ( Oliver et al., 2009 ). Ideally, however, the experimental design includes not only the main experimental condition of interest, on which a positive or null result can be obtained, but also a second experimental condition on which a positive result is expected. For example, we applied offline rTMS to frontal cortex in a bistable perception paradigm with a passive viewing condition and a voluntary control condition using the same stimuli ( de Graaf et al., 2011b ): two experimental conditions with their own controls. TMS modulated bistable perception in the voluntary control condition (positive result) but not in the passive viewing condition (null result). The presence of the positive finding made the null result for passive bistable viewing more meaningful to us, since it inspired confidence that our TMS protocol and procedures indeed had neural effects with the potential to affect behavior.

In that study, passive bistable viewing behavior was very much unchanged. It was not a case of a small effect in the hypothesized direction not reaching significance. And this is really all one can do against the power argument: ensure sufficient power in the experimental design through a priori sample size calculations, and evaluate the effect size after data collection. The possibility that an even larger sample size, and more reliable estimation through more trials, could eventually lead to a statistically significant positive finding, is irrefutable. In fact, it is inevitable: in classical frequentist hypothesis tests, with large enough sample sizes even true null hypotheses will be rejected ( Wagenmakers, 2007 ). This is not unique to NIBS research. What may be to some extent unique to NIBS research, and indeed an exciting development, is the inclusion of individual markers that predict response to NIBS. As we continue to discover sources of inter-individual variability, we can either select participants ( Drysdale et al., 2017 ), adapt protocols ( Goldsworthy et al., 2012 ), or refine our statistical analyses to increase detection power.

A Gradient of Surprise and a Gradient of Interpretability

In Figure 1, we distill the discussion so far into a conceptual model to help us evaluate null results in NIBS. The model contains two orthogonal axes. Horizontally, the relevant parameters and design decisions in an experiment, discussed above, are mapped along the "gradient of interpretability." Where an experiment ranks along this dimension determines how informative its null result is, considered in isolation. Vertically, the "gradient of surprise" indicates how unexpected the null result is, in the context of prior research and theory.

Figure 1.

Orthogonal gradients of surprise and interpretability. A null result in an experiment aiming to replicate a well-established finding is very surprising. A null result that was hypothesized is not at all surprising. A null result in an exploratory study with no prior expectations can be received neutrally (middle of the continuum, not shown). The level of surprise (vertical axis) essentially reflects how a null finding relates to our prior expectations. From a Bayesian perspective, even without Bayesian statistics, we should let this level of surprise (if justified based on theory or previous research) influence to what extent we let the result "change our beliefs." This explicitly refers to interpretation of a null result in the greater context of knowledge, theory, and prior research. Orthogonal to this, one might evaluate the experiment and its parameters in terms of localization procedure, neural efficacy checks, and power and effect size, together making up a gradient of interpretability (horizontal axis). This continuum reflects how informative we should consider the null result in isolation, ignoring expectations, theory, or previous research. The figure displays our view on how design choices impact the interpretability of a null finding along this dimension (toward the right is more informative, see legend top-right). At the bottom, we schematically visualize how the collective of such "leftward" design choices can place a null result into the "Level C evidence category," which means the null result in isolation is not very informative, while the collective of such "rightward" design choices can place a null result in the highest "Level A evidence" category, which means a null result appears informative and should be taken seriously. A few caveats are important. This figure aims to visualize concepts discussed in more detail in the main text, and how they relate to each other. The visualization of Level C through Level A evidence is meant to make intuitive how they roughly fit into this overview; our proposal for what exactly differentiates Level A through C evidence is in the main text. Lastly, we do not suggest that every design decision "on the right" in this figure is always best for every experiment, or that experiments yielding Level C evidence are somehow inferior. Note also that the figure reflects how design factors influence how informative null findings are; it does not apply to positive findings in a straightforward way.

Along the gradient of surprise , we find Replication nulls on one end, Exploration nulls in the middle, and Hypothesized nulls at the opposite end. Sometimes a positive finding is strongly expected, based on previous research or strong theoretical background, but not found. For instance, a null result in an experiment explicitly designed to replicate a well-established finding seems most surprising (Replication null). On the other hand, some null results can be received more neutrally, if the research was more exploratory and a result could “go either way.” For instance, because research is very original and breaks new ground in uncharted territory, or because competing theories with approximately equal support make opposite predictions for the outcome of an experiment. Thus, Exploration nulls sit in the middle of the continuum. At the other end of the gradient are Hypothesized nulls. These may be something of a rarity still, since not many experiments are explicitly designed to obtain null results (in the experimental condition). Perhaps this is because null results seem less interesting or impactful, or simply because negative findings are a priori considered meaningless for reasons discussed above. Either way, a null result that was expected is of course least surprising.

Along the gradient of interpretability, multiple factors contribute to null result interpretability. Among these are the selected localization procedure, the existence of neural efficacy checks, and the power and effect size. Together, these "design choices" help determine to what extent a null result can be interpreted. But these several factors contribute to null result interpretability along this same dimension somewhat independently, and additively. Combining this with the breadth of approaches, techniques, parameters, and even further design decisions in NIBS research, it seems the landscape of null results is rather complex. Therefore, we propose three concrete "Levels of evidence" for NIBS null results (Levels A, B, and C), as heuristic guidelines to help us determine how informative the null results from an individual experiment are. These Levels of evidence explicitly apply to the gradient of interpretability. They are visually added to Figure 1, but only to make intuitive how they are ordered along this continuum.

Three Levels of Null Evidence

Our heuristic taxonomy of null result interpretability demarcates three levels of null evidence, from least (Level C) to most interpretable (Level A) null results. We want this taxonomy to be useful independent of Bayesian analysis, but will later comment on how Bayesian analysis fits into this scheme.

Level C evidence is assigned to null results from experiments with localization procedures not based on individual anatomy or functional mapping (i.e., landmark- or Talairach-/MNI-coordinate based) and no neural efficacy checks. Often, such null results may result from exploratory studies, studies with exploratory NIBS parameters, high risk studies, small-scale student projects, etc. Such studies might be meaningful to share with the community, for the sake of transparency. They could help others remain aware of what has been tried, which parameters, tasks, cortical sites, have been targeted under which procedures, even if the null result itself cannot be strongly interpreted (the Pragmatic argument; de Graaf and Sack, 2011 ). Opinions may differ, however, on what minimal level of sophistication an experiment requires for the ‘attempt’ to be meaningful to others and therefore published (see also de Graaf et al., 2018 ).

Level B evidence is assigned to null results from experiments with either sophisticated localization procedures (individual MRI- or fMRI-guided neuronavigation for TMS, current simulations in individual anatomy for TES) or neural efficacy checks (behavioral markers, hunting procedure, concurrent neuroimaging). Thus, either through individual neuronavigation/modeling, or through a neural efficacy check, one is relatively sure that the intended cortical region was targeted. For level B evidence, it is not necessary to have the strongest form of a neural efficacy check, demonstrating behavioral effects on a related task by the NIBS protocol used also in the condition leading to the null result. In the absence of Bayesian analysis: there should be clear absence of effects in the data despite – as far as could be determined a priori – sufficient statistical power for the selected sample size. Null results with Level B evidence should, in our view, always be acceptable for publication, irrespective of whether the null result is surprising, exploratory, or hypothesized (not publishing just because of surprise contributes to publication bias).

Level A evidence is assigned to null results with individually calibrated localization procedures (same as in level B) and strong neural efficacy checks. The neural efficacy checks should confirm not only successful targeting of the intended cortical site, but successful neural stimulation/modulation, either by demonstrating behavioral effects in an experimental condition with related stimuli/task and identical NIBS protocol, or by revealing neural effects of the NIBS intervention in appropriate regions/networks/mechanisms through concurrent neuroimaging. Level A classification further requires clear absence of effects in the experimental condition, despite sufficient statistical power for the given sample size. Level A null results constitute sufficiently meaningful contributions to warrant dissemination.

Note that these descriptions of Levels C through A reflect a first proposal, and the criteria are open to debate. It is an unavoidably flexible classification scheme, since multiple factors along the same continuum (see Figure 1) contribute to null result interpretability. Moreover, with continuing developments, certain design choices and procedures may fall between such classifications. For instance, is the cortex-based alignment approach for TMS localization, in which group fMRI-based Talairach coordinates can reliably be mapped onto individual MRI-based brain anatomy, a Level B localization procedure or not (Duecker et al., 2014)? We therefore consider these Levels of evidence a rough classification that may be applied directly, but in equal measure can guide a more informal evaluation of null result meaningfulness.

The Orthogonal Contributions of Interpretability and Surprise

We here discuss more explicitly how this model can guide informal assessment of null results. As mentioned, the gradient of interpretability underlying the Levels of evidence is orthogonal to the gradient of surprise, since it focuses on how informative an individual null finding is in isolation , ignoring our prior expectations or level of surprise by the outcome. But of course, the a priori likelihood of a positive finding or a null result, in other words our level of surprise at the obtained null result, should influence how we interpret and credit such a result in the greater context of previous research and theory.

In fact, this lies at the heart of the Bayesian school of thought: we integrate our prior beliefs with the new data, to come to a new belief. We suggest that evaluating a null result along the two gradients of our model can essentially guide an "informal Bayesian analysis." In the informal procedure of integrating our prior understanding with the new data, we can for instance allocate less or more weight to the new data (null result) and less or more weight to our prior expectations. The weight allocated to the new data is determined by the position of the null result along the gradient of interpretability (horizontal in Figure 1), while the weight of the prior expectations directly relates to the level of surprise instigated by the null result (vertical in Figure 1).

For example, if a Level C null result is very surprising, because a dominant theory and several previous studies predicted a positive finding, we should not be so quick to reject the theory, and seriously consider the possibility that we obtained a Type-II error. And, of course, we should design a new experiment that would yield more informative null results. In contrast, a Level A null result could make us rethink the theory, or inspire follow-up experiments to determine what caused the discrepancy. Yet, in this scenario we still do not outright accept even a Level A null result. But had the very same Level A null result been obtained in the absence of prior expectations, because the experiment addressed a fundamentally new research question, then this Level A Exploration null result could guide our beliefs more strongly, even forming the starting point for a new theory.

These arguments are quite abstract, so perhaps it is useful to consider two more concrete examples. Imagine an experiment that fails to replicate a well-established TMS effect, suppression of visual stimuli by a single occipital TMS pulse at 100 ms after stimulus onset (Amassian et al., 1989; de Graaf et al., 2011a,b,c, 2015). This Replication null would very much surprise us, given the extensive literature supporting this effect (de Graaf et al., 2014). Independently of this surprise, we would consider the null result less meaningful if the TMS coil was simply positioned just a few centimeters above the inion, versus if the TMS coil elicited phosphenes in the retinotopic stimulus location (neural efficacy check based on perceptual marker), or was neuronavigated to an individual hotspot in V2 corresponding to the retinotopic stimulus location (sophisticated localization procedure). Similarly, if an imaginary fMRI experiment found that appreciation of magic tricks scales with posterior parietal cortical BOLD activation, one might follow it up with an exploratory TMS experiment. Not finding any reduced appreciation for magic tricks after inhibitory TMS, an Exploration null result, might not necessarily surprise us. But, again independently of our lack of surprise, we would find the null result much more informative if we had neuronavigated the TMS coil to individual hotspots as compared to placing the TMS coil over the EEG 10-20 P4 location (sophisticated localization procedure), or if the reduction in magic appreciation was zero as compared to a reduction in magic appreciation that was in the right direction but just failed to reach significance (effect size evaluation).

The Bayesian Inference Approach

In sum, positioning a null result along the gradients of interpretability and surprise can help us interpret that null result and decide to what extent we should let it change our beliefs. We even called this an “informal Bayesian analysis.” But, as mentioned previously, null results formally cannot be interpreted at all. At least, not in the conventional frequentist statistical framework of classical null hypothesis testing yielding P-values. Some examples of the limitations of P-values (Wagenmakers, 2007; Wagenmakers et al., 2018b):

  • 1. we cannot interpret any outcome as supporting the null hypothesis;
  • 2. many find it difficult to correctly interpret P-values (Nickerson, 2000; Verdam et al., 2014);
  • 3. hypothesis tests are biased against the null hypothesis, since the P-value ignores both the predictions of and the a priori likelihood of the alternative hypothesis, only testing predictions of the null hypothesis;
  • 4. P-values not only fail to quantify evidence for the null hypothesis, they also fail to quantify evidence for any alternative hypothesis;
  • 5. the entire analysis framework relies on imaginary replications that form a sampling distribution, required to obtain a P-value.

Perhaps due to such limitations, P -values are increasingly reported alongside additional statistics, such as effect sizes and confidence intervals. But they remain the mainstay of inference in NIBS research. Classical null hypothesis testing does not consider the prior likelihood of null or alternative hypotheses. It does not account for how likely any outcome is based on previous research, solid theoretical foundations, or just common sense. As a result, the same P -value coming out of the same hypothesis test might convince an experienced researcher in one case, but not at all in another case. This applies to positive findings as well as negative findings. As such, experienced researchers do not merely take P -values at face value, but generally evaluate them in greater context already.
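To make this reporting practice concrete, here is a minimal sketch (in Python, with simulated data rather than results from any real study) of how a paired t-test P-value might be reported together with an effect size and a confidence interval:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical paired reaction times (ms) for 40 participants: real TMS vs. sham.
sham = rng.normal(500, 50, 40)
real = sham + rng.normal(0, 20, 40)   # simulated with no true effect

diff = real - sham
n = diff.size

t_stat, p_val = stats.ttest_rel(real, sham)        # classical paired t-test
cohens_d = diff.mean() / diff.std(ddof=1)          # within-subject effect size
se = diff.std(ddof=1) / np.sqrt(n)
ci = diff.mean() + np.array([-1, 1]) * stats.t.ppf(0.975, n - 1) * se  # 95% CI of the mean difference

print(f"t({n - 1}) = {t_stat:.2f}, p = {p_val:.3f}, d = {cohens_d:.2f}, "
      f"95% CI of difference = [{ci[0]:.1f}, {ci[1]:.1f}] ms")
```

Reporting the interval alongside the P-value makes explicit how large an effect the data can still accommodate, which is exactly the information a bare P-value omits.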

The Bayesian approach is fundamentally different, and its formal implementation as an alternative statistical approach is becoming increasingly widespread. It can incorporate those prior beliefs we mentioned, formalizing in the Bayesian analysis framework some of what experienced researchers informally do when dealing with frequentist analysis outcomes. Model priors reflect the a priori likelihood of null or alternative hypotheses. But also prior expectations about particular parameters, such as an effect size in a population, can be quantified under the null and alternative hypothesis, prior to data collection. The analysis takes those priors, then takes the collected data from an experiment, and integrates these sources of information (updates the prior distribution) to yield a “posterior distribution.” The posterior distribution is our updated belief about the effect under investigation, yields a certain “credible interval” when it comes to an estimated parameter (a 95% credible interval for a parameter means we are 95% sure the parameter lies in that interval), and can directly be evaluated for hypothesis testing.
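As a minimal sketch of this updating step, assuming a simple conjugate normal model and hypothetical numbers (this is not the analysis of any particular study), the prior and the data can be combined analytically into a posterior distribution and a 95% credible interval:

```python
import numpy as np
from scipy import stats

# Hypothetical prior belief about an effect size (Cohen's d): centered on 0, fairly vague.
prior_mean, prior_sd = 0.0, 1.0

# Hypothetical data: observed mean effect size and its standard error from an experiment.
obs_effect, obs_se = 0.05, 0.15

# Conjugate normal-normal update: a precision-weighted combination of prior and data.
prior_prec = 1 / prior_sd**2
data_prec = 1 / obs_se**2
post_prec = prior_prec + data_prec
post_mean = (prior_prec * prior_mean + data_prec * obs_effect) / post_prec
post_sd = np.sqrt(1 / post_prec)

# 95% credible interval: the central 95% of the posterior distribution.
ci_low, ci_high = stats.norm.ppf([0.025, 0.975], loc=post_mean, scale=post_sd)
print(f"posterior: {post_mean:.3f} +/- {post_sd:.3f}, "
      f"95% credible interval [{ci_low:.3f}, {ci_high:.3f}]")
```

The more precise the data (smaller standard error), the more the posterior is pulled toward the observed effect and away from the prior, which is the informal weighting described above made quantitative.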

Bayesian hypothesis testing uses the priors and the available data to weigh the evidence for and against the null and alternative hypotheses. The analysis can yield a single numerical value called the Bayes Factor , which quantifies the relative evidence. It quantifies which hypothesis is more supported by the data and to what extent, or in other words, which of the two hypotheses best predicted the data and by how much. This approach, as any, has its own limitations; the quantification of priors may not always be trivial, for example, and just as the classical P < 0.05 criterion for classification of a significant effect is arbitrary, the Bayes Factor (BF) is descriptively interpreted as providing “anecdotal,” “moderate,” or “strong” evidence for either hypothesis at some arbitrary boundaries. However, the approach does offer benefits and capabilities not afforded by classical statistics. In the context of the current discussion, Bayesian analysis has one particularly important advantage: in Bayesian analysis, one can interpret a null result and formally draw conclusions based on it (see also in this issue: Biel and Friedrich, 2018 ).

This article is not intended to be a (Bayesian) statistics tutorial, but it may be useful to provide an example with just a little more concrete information. In Figure 2 we show generated data from a simple fictional TMS experiment, measuring reaction times in a real TMS condition as compared to a sham TMS condition (2A). The figure shows the outcomes of a standard paired-samples t-test, as well as the outcomes of a Bayesian equivalent paired-samples test (2B). These analyses were all performed with the free and open-source statistical software package JASP (JASP Team, 2018; Wagenmakers et al., 2018a,b). A relevant outcome of the Bayesian test is the BF. The Bayes Factor BF10 reflects how much more likely the data were to occur under the alternative hypothesis as compared to the null hypothesis (2C). Bayes Factor BF01 is the inverse (1/BF10). If the BF is larger than 3, the evidence is considered “moderate”; if larger than 10, the evidence is considered “strong” (Wagenmakers et al., 2018a). If one is worried about the concept of priors, and how much influence a chosen prior has on the outcome of the analysis, this can simply be evaluated quantitatively (2D). See the figure caption for further details.

Figure 2.

Bayesian analysis to assess support for a null hypothesis. (A) Results of two fictional within-subject conditions in a (random-walk) generated dataset with 40 virtual observers. There does not seem to be a meaningful difference in RT between both conditions. (B) A traditional paired-samples t -test provides no reason to reject the null hypothesis ( P > 0.05). But it can provide no evidence to accept the null hypothesis. The Bayesian paired-samples t -test equivalent yields a BF01 = 5.858. This means that one is 5.858 times more likely to obtain the current data if the null hypothesis is true, than if the alternative hypothesis is true. In the recommended interpretational framework, this constitutes “moderate” evidence for the null hypothesis (BF > 3, but < 10 which would constitute “strong” evidence). (C) A plot of the prior distribution (dashed) and posterior distribution (solid), reflecting probability density (vertical axis) of effect sizes in the population (Cohen’s d, horizontal axis). In these tests, the prior distribution is conventionally centered around 0, but its width can be set by the user to reflect strength of prior expectations. The width of the posterior distribution reflects confidence about effect size based on the prior and the data: the horizontal bar outlines the “credible interval” which contains 95% of the posterior distribution density. This width/interval will be smaller with increasing sample size. The median of the posterior distribution (0.002) is a point estimate of the real effect size. BF10 here is simply the inverse of BF01, and the “wedge-chart” visualizes how much more likely one is to obtain the current data given H0 versus H1. (D) Since setting a prior (width) is not always straightforward, one can plot how much the outcome of the analysis (BF01, vertical axis) depends on the selected prior (horizontal axis). At the top are the BF values for a few points on the plot (user-selected prior, in this case the JASP-provided default of Cauchy prior width = 0.707, wide, and ultrawide prior). Clearly the evidence for the null hypothesis (likelihood of obtaining current data if null hypothesis is true) ranges for most reasonable priors from moderate to strong.
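The following sketch is not JASP's implementation, but it illustrates one way such a Bayes factor, and the prior-robustness check in panel (D), could be approximated in Python. It assumes a Cauchy prior on the standardized effect size (as in the default JZS test) and uses hypothetical inputs: a near-zero t statistic from 40 observers.

```python
import numpy as np
from scipy import stats

def jzs_bf01(t_obs: float, n: int, cauchy_scale: float = 0.707,
             draws: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of BF01 for a one-sample/paired t-test with a
    Cauchy (JZS-style) prior on the standardized effect size.

    Under H1, the effect size delta ~ Cauchy(0, cauchy_scale); given delta,
    the t statistic follows a noncentral t with nc = delta * sqrt(n).
    BF10 is the prior-averaged likelihood of t_obs under H1 divided by its
    likelihood under H0 (a central t); BF01 is the reciprocal.
    """
    rng = np.random.default_rng(seed)
    df = n - 1
    delta = stats.cauchy.rvs(loc=0, scale=cauchy_scale, size=draws, random_state=rng)
    like_h1 = stats.nct.pdf(t_obs, df, delta * np.sqrt(n)).mean()
    like_h0 = stats.t.pdf(t_obs, df)
    return like_h0 / like_h1

# Hypothetical numbers in the spirit of the fictional experiment above
# (40 virtual observers, a t statistic near zero); not the values behind the figure.
for width in (0.707, 1.0, 1.414):   # default, "wide", and "ultrawide" Cauchy priors
    print(f"prior width {width}: BF01 ~ {jzs_bf01(t_obs=0.01, n=40, cauchy_scale=width):.2f}")
```

With inputs like these, the estimate should land in the same “moderate evidence for the null” range as the output described above, and widening the prior strengthens the evidence for the null, mirroring the robustness plot in panel (D).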

Bayesian Statistics and Levels of Evidence

Bayesian statistics can formally and quantitatively evaluate the strength of evidence of a null result. Since this is what we informally intended to achieve with a classification of Levels of null evidence, should we not simply demand Bayesian statistics instead? Or require certain ranges of BFs before we accept null results as Level B or A evidence? This is an interesting question, but currently we think not. Firstly, because Bayesian analysis is not yet commonplace, although it is increasingly implemented. This does not mean we doubt the approach, but we would like our model to be helpful also to researchers not ready to commit to Bayesian analysis. Secondly, the BF quantifies the support for the null (and alternative) hypothesis in the data. The gradient of interpretability, however, (also) ranks how much faith we should put in those data in the first place, based on certain experimental design decisions. As such, the results of Bayesian analysis (or classical inference) only become meaningful once we reach a certain level of confidence in the experimental design and its power to yield meaningful data.

Having said that, we do strongly advocate the application of Bayesian analysis, at least to complement classical hypothesis tests when presenting null results. We are hesitant to propose, for example, that a Level A classification can only be extended to a null result backed up by Bayesian statistics. We would, however, suggest that, to build a case against a null hypothesis based on a null result, adding Bayesian analysis is preferred and would usually make the case stronger. A Level A null result comes from an experiment with such a sophisticated design that its data should be considered meaningful. Therefore, formal Bayesian quantification of support for the null hypothesis would be meaningful also.

This is an opinion paper, meant to provoke thought, insight, and discussion. Some of the ideas, and especially the criteria for classification of null result interpretability, are somewhat arbitrary and may change with new insights, developments, or time. But the fundamental idea underlying it is that not all null results are created equal. It may be useful to reflect on what it is about null results, and especially about the design of the NIBS experiments underlying them, that allows us to consider them less or more meaningful. Our proposals here provide concrete heuristics, but are open to amendment, correction, or expansion.

Author Contributions

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

We would like to thank Eric-Jan Wagenmakers, Sanne ten Oever, and two reviewers for helpful comments on earlier drafts of the manuscript. The views expressed in this article and any remaining errors are our own.

Funding. This work was supported by the Netherlands Organization for Scientific Research (NWO, Veni to TDG 451-13-024; Vici to AS 453-15-008).

  • Amassian V. E., Cracco R. Q., Maccabee P. J., Cracco J. B., Rudell A., Eberle L. (1989). Suppression of visual perception by magnetic coil stimulation of human occipital cortex. Electroencephalogr. Clin. Neurophysiol. 74 458–462.
  • Biel A. L., Friedrich E. V. C. (2018). Why you should report bayes factors in your transcranial brain stimulation studies. Front. Psychol. 9:274. 10.3389/fpsyg.2018.01125
  • Bikson M., Datta A., Rahman A., Scaturro J. (2010). Electrode montages for tDCS and weak transcranial electrical stimulation: role of “return” electrode’s position and size. Clin. Neurophysiol. 121 1976–1978. 10.1016/j.clinph.2010.05.020
  • Chechlacz M., Humphreys G. W., Sotiropoulos S. N., Kennard C., Cazzoli D. (2015). Structural organization of the corpus callosum predicts attentional shifts after continuous theta burst stimulation. J. Neurosci. 35 15353–15368. 10.1523/JNEUROSCI.2610-15.2015
  • Datta A., Bansal V., Diaz J., Patel J., Reato D., Bikson M. (2009). Gyri-precise head model of transcranial direct current stimulation: improved spatial focality using a ring electrode versus conventional rectangular pad. Brain Stimul. 2 201–207. 10.1016/j.brs.2009.03.005
  • de Graaf T. A., Cornelsen S., Jacobs C., Sack A. T. (2011a). TMS effects on subjective and objective measures of vision: stimulation intensity and pre-versus post-stimulus masking. Conscious. Cogn. 20 1244–1255. 10.1016/j.concog.2011.04.012
  • de Graaf T. A., de Jong M. C., Goebel R., van Ee R., Sack A. T. (2011b). On the functional relevance of frontal cortex for passive and voluntarily controlled bistable vision. Cereb. Cortex 21 2322–2331. 10.1093/cercor/bhr015
  • de Graaf T. A., Herring J., Sack A. T. (2011c). A chronometric exploration of high-resolution “sensitive TMS masking” effects on subjective and objective measures of vision. Exp. Brain Res. 209 19–27. 10.1007/s00221-010-2512-z
  • de Graaf T. A., Duecker F., Fernholz M. H. P., Sack A. T. (2015). Spatially specific vs. unspecific disruption of visual orientation perception using chronometric pre-stimulus TMS. Front. Behav. Neurosci. 9:5. 10.3389/fnbeh.2015.00005
  • de Graaf T. A., Hsieh P.-J., Sack A. T. (2012). The “correlates” in neural correlates of consciousness. Neurosci. Biobehav. Rev. 36 191–197. 10.1016/j.neubiorev.2011.05.012
  • de Graaf T. A., Koivisto M., Jacobs C., Sack A. T. (2014). The chronometry of visual perception: review of occipital TMS masking studies. Neurosci. Biobehav. Rev. 45 295–304. 10.1016/j.neubiorev.2014.06.017
  • de Graaf T. A., Sack A. T. (2011). Null results in TMS: from absence of evidence to evidence of absence. Neurosci. Biobehav. Rev. 35 871–877. 10.1016/j.neubiorev.2010.10.006
  • de Graaf T. A., Sack A. T. (2014). Using brain stimulation to disentangle neural correlates of conscious vision. Front. Psychol. 5:1019. 10.3389/fpsyg.2014.01019
  • de Graaf T. A., van den Hurk J., Duecker F., Sack A. T. (2018). Where are the fMRI correlates of phosphene perception? Front. Neurosci. 12:883. 10.3389/fnins.2018.00883
  • Drysdale A. T., Grosenick L., Downar J., Dunlop K., Mansouri F., Meng Y., et al. (2017). Resting-state connectivity biomarkers define neurophysiological subtypes of depression. Nat. Med. 23 28–38. 10.1038/nm.4246
  • Duecker F., Frost M. A., Graaf T. A., Graewe B., Jacobs C., Goebel R., et al. (2014). The cortex-based alignment approach to TMS coil positioning. J. Cogn. Neurosci. 26 2321–2329. 10.1162/jocn_a_00635
  • Edwards D., Cortes M., Datta A., Minhas P., Wassermann E. M., Bikson M. (2013). Physiological and modeling evidence for focal transcranial electrical brain stimulation in humans: a basis for high-definition tDCS. NeuroImage 74 266–275. 10.1016/j.neuroimage.2013.01.042
  • Goldsworthy M. R., Pitcher J. B., Ridding M. C. (2012). A comparison of two different continuous theta burst stimulation paradigms applied to the human primary motor cortex. Clin. Neurophysiol. 123 2256–2263. 10.1016/j.clinph.2012.05.001
  • Hallett M. (2007). Transcranial magnetic stimulation: a primer. Neuron 55 187–199. 10.1016/j.neuron.2007.06.026
  • Herrmann C. S., Strüber D., Helfrich R. F., Engel A. K. (2015). EEG oscillations: from correlation to causality. Int. J. Psychophysiol. 103 12–21. 10.1016/j.ijpsycho.2015.02.003
  • JASP Team (2018). JASP Version 0.9 [Computer software].
  • Maeda F., Keenan J. P., Tormos J. M., Topka H., Pascual-Leone A. (2000). Interindividual variability of the modulatory effects of repetitive transcranial magnetic stimulation on cortical excitability. Exp. Brain Res. 133 425–430. 10.1007/s002210000432
  • Miranda P. C., Lomarev M., Hallett M. (2006). Modeling the current distribution during transcranial direct current stimulation. Clin. Neurophysiol. 117 1623–1629. 10.1016/j.clinph.2006.04.009
  • Moliadze V., Antal A., Paulus W. (2010). Electrode-distance dependent after-effects of transcranial direct and random noise stimulation with extracephalic reference electrodes. Clin. Neurophysiol. 121 2165–2171. 10.1016/j.clinph.2010.04.033
  • Nickerson R. S. (2000). Null hypothesis significance testing: a review of an old and continuing controversy. Psychol. Methods 5 241–301. 10.1037/1082-989X.5.2.241
  • Oliver R., Bjoertomt O., Driver J., Greenwood R., Rothwell J. (2009). Novel “hunting” method using transcranial magnetic stimulation over parietal cortex disrupts visuospatial sensitivity in relation to motor thresholds. Neuropsychologia 47 3152–3161. 10.1016/j.neuropsychologia.2009.07.017
  • Peters J. C., Reithler J., Schuhmann T., de Graaf T. A., Uludag K., Goebel R., et al. (2013). On the feasibility of concurrent human TMS-EEG-fMRI measurements. J. Neurophysiol. 109 1214–1227. 10.1152/jn.00071.2012
  • Reithler J., Peters J. C., Sack A. T. (2011). Multimodal transcranial magnetic stimulation: using concurrent neuroimaging to reveal the neural network dynamics of noninvasive brain stimulation. Progr. Neurobiol. 94 149–165. 10.1016/j.pneurobio.2011.04.004
  • Rizk S., Ptak R., Nyffeler T., Schnider A., Guggisberg A. G. (2013). Network mechanisms of responsiveness to continuous theta-burst stimulation. Eur. J. Neurosci. 38 3230–3238. 10.1111/ejn.12334
  • Robertson E. R., Théoret H., Pascual-Leone A. (2003). Studies in cognition: the problems solved and created by transcranial magnetic stimulation. J. Cogn. Neurosci. 15 948–960. 10.1162/089892903770007344
  • Romei V., Thut G., Silvanto J. (2016). Information-based approaches of noninvasive transcranial brain stimulation. Trends Neurosci. 39 782–795. 10.1016/j.tins.2016.09.001
  • Sack A. T. (2006). Transcranial magnetic stimulation, causal structure–function mapping and networks of functional relevance. Curr. Opin. Neurobiol. 16 593–599. 10.1016/j.conb.2006.06.016
  • Sack A. T., Camprodon J. A., Pascual-Leone A., Goebel R. (2005). The dynamics of interhemispheric compensatory processes in mental imagery. Science 308 702–704. 10.1126/science.1107784
  • Sack A. T., Cohen Kadosh R., Schuhmann T., Moerel M., Walsh V., Goebel R. (2009). Optimizing functional accuracy of TMS in cognitive studies: a comparison of methods. J. Cogn. Neurosci. 21 207–221. 10.1162/jocn.2009.21126
  • Sandrini M., Umilta C., Rusconi E. (2010). The use of transcranial magnetic stimulation in cognitive neuroscience: a new synthesis of methodological issues. Neurosci. Biobehav. Rev. 35 516–536. 10.1016/j.neubiorev.2010.06.005
  • Schilberg L., Engelen T., ten Oever S., Schuhmann T., de Gelder B., de Graaf T. A., et al. (2018). Phase of beta-frequency tACS over primary motor cortex modulates corticospinal excitability. Cortex 103 142–152. 10.1016/j.cortex.2018.03.001
  • Stokes M. G., Chambers C. D., Gould I. C., Henderson T. R., Janko N. E., Allen N. B., et al. (2005). Simple metric for scaling motor threshold based on scalp-cortex distance: application to studies using transcranial magnetic stimulation. J. Neurophysiol. 94 4520–4527. 10.1152/jn.00067.2005
  • Ten Oever S., de Graaf T. A., Bonnemayer C., Ronner J., Sack A. T., Riecke L. (2016). Stimulus presentation at specific neuronal oscillatory phases experimentally controlled with tACS: implementation and applications. Front. Cell. Neurosci. 10:240. 10.3389/fncel.2016.00240
  • Verdam M. G. E., Oort F. J., Sprangers M. A. G. (2014). Significance, truth and proof of p values: reminders about common misconceptions regarding null hypothesis significance testing. Q. Life Res. 23 5–7. 10.1007/s11136-013-0437-2
  • Vöröslakos M., Takeuchi Y., Brinyiczki K., Zombori T., Oliva A., Fernández-Ruiz A., et al. (2018). Direct effects of transcranial electric stimulation on brain circuits in rats and humans. Nat. Commun. 9:483. 10.1038/s41467-018-02928-3
  • Wagenmakers E.-J. (2007). A practical solution to the pervasive problems of p values. Psychon. Bull. Rev. 14 779–804. 10.3758/BF03194105
  • Wagenmakers E.-J., Love J., Marsman M., Jamil T., Ly A., Verhagen J., et al. (2018a). Bayesian inference for psychology. Part II: example applications with JASP. Psychon. Bull. Rev. 25 58–76. 10.3758/s13423-017-1323-7
  • Wagenmakers E.-J., Marsman M., Jamil T., Ly A., Verhagen J., Love J., et al. (2018b). Bayesian inference for psychology. Part I: theoretical advantages and practical ramifications. Psychon. Bull. Rev. 25 35–57. 10.3758/s13423-017-1343-3
  • Wagenmakers E.-J., Verhagen J., Ly A., Bakker M., Lee M. D., Matzke D., et al. (2015). A power fallacy. Behav. Res. Methods 47 913–917. 10.3758/s13428-014-0517-4
  • Wagner T., Fregni F., Fecteau S., Grodzinsky A., Zahn M., Pascual-Leone A. (2007). Transcranial direct current stimulation: a computer-based human model study. NeuroImage 35 1113–1124. 10.1016/j.neuroimage.2007.01.027


The value of null results.

Until very recently I had never heard of Open Research or what it meant. I came across an article in The Conversation with a headline that caught my attention: ‘Medical research is broken: here’s how to fix it’. On reading the article I learned about the reproducibility crisis. I was astounded to read that as much as 85% of medical research is ‘wasted’ and that about 50% of medical research is not published (Lloyd & Bradley, 2020). No domain is immune to the reproducibility crisis. For example, in psychology, in an attempt to reproduce the findings of 100 studies, the Open Science Collaboration found that only 36% of them could be reproduced (Gilbert & Strohminger, 2015). When another researcher cannot replicate an experiment, we naturally question the validity of the results of the original experiment. This has serious implications for research and science, as replicating studies is crucial to verify findings, to advance knowledge and to build a robust evidence-based literature.

So how could that be? The relatively high prevalence of questionable research practices (QRPs), such as selectively reporting hypotheses and excluding data post hoc (John et al., 2012), as well as the non-publication of failed studies (‘null results’), all seem to play a role in maintaining the reproducibility crisis. QRPs such as excluding data after looking at the impact of doing so on the results can significantly improve the probability of finding evidence in support of a hypothesis, making it harder to subsequently replicate the study (John et al., 2012). QRPs, although problematic and widespread, are not unequivocally unacceptable, but they call for greater transparency in the research process. The factor I am most intrigued by is the non-publication of null results. A null result is a result that does not support the experimental hypothesis and is therefore difficult to publish, because many perceive these results as less interesting.

Journal editors tend to publish studies that are statistically significant, which are more likely to be cited and therefore raise the reach and impact of their journals (Fanelli, 2010). Journal editors thus unwittingly contribute to a vicious cycle: researchers are more likely to engage in QRPs to get a positive result so that their research has a higher chance of being cited and accepted for publication by editors (Wenzel, 2016). Unfortunately, this unhelpful pattern works against the publication of many rigorous, well-conducted studies that yield a null result.

Publishing only successful studies falsely gives the impression that experiments almost always work and that research gets it right every time. Research is usually, by nature, exploratory; some studies will yield positive results, but null results form an integral part of the process of exploration. Many researchers would be willing to spend the time and effort to publish a result that does not support their hypothesis; however, they face the barrier of finding a journal willing to publish it. Many journals and editors steer away from null findings because null findings are cited less frequently (Mlinarić, Horvat, & Šupak Smolčić, 2017). This means that there is a publication bias against the null result, whereby successful studies are more likely to be published. Since null results do not always support what we expected to find, they can be difficult to interpret; however, they are an integral part of scientific research. The endeavour of research involves keeping track of our unsuccessful attempts and learning from them. As researchers, we reflect on and review what worked and what did not work, make the necessary amendments and try again. It takes us many trials to refine and master a new skill. The same is true in science; many trials are often necessary to understand our findings.

Unfortunately, the non-publication of null results means that useful data are kept away from collective knowledge. Publishing null results helps the scientific community by allowing researchers to build on previous studies. The publication of unsuccessful attempts has the potential to save other researchers’ time. Indeed, a new researcher may use a study protocol similar to that of a previously unpublished null study and risk having their findings go against their predictions, leading to another null result.

I also wonder about the influence of a researcher’s experience and belief system on their willingness to publish their null results. It is conceivable that some researchers may fear exposing that their theory had to change in light of the null findings, which could undermine their previous papers. This may be particularly true at the start of a career, when the desire to be published in a renowned journal is highly enticing, which may encourage researchers to hold back from sharing novel hypotheses or to engage in QRPs. Moreover, researchers who publish a null result that challenges the current theoretical understanding of a particular issue may fear the consequences of questioning the status quo. It takes time and resources to publish results that do not support a carefully thought out hypothesis. It is understandable that researchers are reluctant to publish their null results, as there is little incentive to do so. Furthermore, established and new researchers alike may wish to preserve their status and not publish a null result that would contradict previous findings and may attract ‘the wrath’ of the original researcher (Nature editorial, 2017).

In the field of psychology, progress would not have happened if researchers had not carefully considered why their results went against their predictions. That is, null findings have the potential to move theory forward and challenge current theoretical understanding. For instance, a comprehensive study on candidate genes for depression challenged previous studies that had seemingly demonstrated an association between those genes and major depression. This comprehensive study yielded a null result. Because the study had been rigorously conducted, it suggested that there was no significant association between the candidate genes and major depression (Border et al., 2019). More recently, the null findings of a study on approximate arithmetic training challenged the claim that such training improves arithmetic fluency in adults, a claim that had been supported by several training studies with positive results (Szkudlarek, Park & Brannon, 2020).

Commitment to open research practices by universities, funders and publishers is starting to have an impact on this issue by raising the visibility of null findings among students and researchers. The Registered Report, a publishing model that works on the basis of ‘in-principle acceptance’ (a study is accepted for publication before its outcome is known), can help alleviate the publication bias. In-principle acceptance is likely to increase the publication of null results because the criterion for publication is not the result itself but the quality of the research (Soderberg et al., 2020).

In order to challenge the publication bias, a group of editors from prestigious journals (European Journal of Work and Organizational Psychology, Journal of Business & Psychology, etc.) have committed to publishing null results arising from rigorously conducted research. This is part of a new initiative to enhance the integrity and quality of research (Wenzel, 2016). Perhaps another solution to encourage the scientific community to share their null results would be to publish a summary of all studies in an online ‘null results’ journal, which would enable other researchers to access them. An additional solution could be to add a mandatory section in every paper summarising the previously known null results, what was learned from them, and how they led to the publication of the conclusive study. Pre-registration – a time-stamped record of a study design and analysis plan created prior to data collection ( https://www.ukrn.org/primers/ ; Farran, 2020) – could also help that process by encouraging researchers to be transparent about their protocol, data analysis and outcome expectations from the outset.

It takes a level of confidence to publicly admit that, after rigorously conducting a study, our results do not support our experimental hypothesis. I wonder if that ‘admission’ could be eased by re-affirming that research is often exploratory: it would be unrealistic to always expect a positive result. Changing the narrative of null results into a story of learning and progress may support researchers in sharing their null results with the wider scientific community. By taking the focus of research off the results and moving it onto the quality of the study protocol and how rigorously a study has been conducted, I believe we can reduce the prevalence of QRPs and give null findings their rightful place in the world of scientific research.

I believe that increasing co-operation could also be helpful. The publication of null results carries the risk that other researchers, using a similar protocol, obtain a positive outcome and potentially take credit away from the original researcher. A change in culture where collaboration is valued above competition may help that movement. We also need a common framework for the replication of studies, with standardised protocols that include the publication and/or consideration of null results.

Null findings tend not to be published, which I feel is a waste of resources. The publication of these findings would benefit the wider scientific community. Research at its best is a collective effort, with collaboration between organisations, increased transparency, and a smaller emphasis on competition and protection of resources.

Border, R., Johnson, E. C., Evans, L. M., Smolen, A., Berley, N., Sullivan, P. F., & Keller, M. C. (2019). No Support for Historical Candidate Gene or Candidate Gene-by-Interaction Hypotheses for Major Depression Across Multiple Large Samples. The American journal of psychiatry , 176(5), 376–387. https://doi.org/10.1176/appi.ajp.2018.18070881

Fanelli, D. (2010). “Positive” results increase down the Hierarchy of the Sciences. PloS one , 5(4), e10068. https://doi.org/10.1371/journal.pone.0010068

Farran E. K., (2020). Is pre-registration for you? Retrieved from https://blogs.surrey.ac.uk/doctoralcollege/2020/01/06/guest-blog-prof-emily-farran-is-pre-registration-for-you/

Gilbert, E., Strohminger, N. (2015) We found only one-third of published psychology research is reliable – now what? The conversation . https://theconversation.com/we-found-only-one-third-of-published-psychology-research-is-reliable-now-what-46596

 John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the Prevalence of Questionable Research Practices With Incentives for Truth Telling. Psychological Science , 23(5), 524–532. https://doi.org/10.1177/0956797611430953

Lloyd, K. E., Bradley, S. (2020). Medical research is broken: here’s how we can fix it. The Conversation. https://theconversation.com/medical-research-is-broken-heres-how-we-can-fix-it-145281

Mlinarić, A., Horvat, M., & Šupak Smolčić, V. (2017). Dealing with the positive publication bias: Why you should really publish your negative results. Biochemia medica , 27(3), 030201. https://doi.org/10.11613/BM.2017.030201

Rewarding negative results keeps science on track. Nature 551, 414 (2017). https://doi.org/10.1038/d41586-017-07325-2

Soderberg, C. K., Errington, T. M., Schiavone, S. R., Bottesini, J. G., Singleton Thorn, F., Vazire, S., … Nosek, B. A. (2020, November 16). Initial Evidence of Research Quality of Registered Reports Compared to the Traditional Publishing Model. https://doi.org/10.31222/osf.io/7x9vy

Szkudlarek, E., Park, J., & Brannon , E. (2020). Failure to replicate the benefit of approximate arithmetic training for symbolic arithmetic fluency in adults. Cognition . 207. 104521. 10.1016/j.cognition.2020.104521.

Wenzel, R. (2016). Business journals to tackle publication bias, will publish ‘null’ results. The conversation. https://theconversation.com/business-journals-to-tackle-publication-bias-will-publish-null-results-52818

Written by Badri Bechlem



Michelson-Morley Experiment

The null result.

The Michelson-Morley experiment is a perfect example of a null experiment, one in which something that was expected to happen is not observed. The consequences of their observations for the development of physics were profound. Having proven that there could be no stationary ether, physicists tried to advance new theories that would save the ether concept. Michelson himself suggested that the ether might move, at least near the Earth. Others studied the possibility that rigid objects might actually contract as they traveled. But it was Einstein's theory of special relativity that finally explained their results.

The significance of the Michelson-Morley experiment was not assimilated by the scientific community until after Einstein presented his theory. In fact, when Michelson was awarded the Nobel prize in physics in 1907, the first American to receive that honor, it was for his measurements of the standard meter using his interferometer. The ether wind experiment was not mentioned.

There has also been some controversy as to how the experiment affected the development of special relativity. Einstein commented that the experiment had only a negligible effect on the formulation of his theory. Clearly it was not a starting point for him. Yet the experiment has been repeated by others over many years, upholding the original results in every case. Even if special relativity did not spring directly from its results, the Michelson-Morley experiment has convinced many scientists of the accuracy of Einstein's theory and has remained one of the foundations upon which relativity stands.



Absolute space —The concept that space exists independently of the objects that occupy it.

Diffraction —The deviation from a straight path that occurs when a wave passes through an aperture.

Interference —The change in intensity caused by mixing two or more beams of light.

Interference fringes —Alternating dark and bright lines produced by the mixing of two beams of light in an interferometer.

Luminiferous ether —A hypothetical medium proposed to explain the propagation of light. The Michelson-Morley experiment made it necessary to abandon this hypothesis.

Interferometer —An instrument designed to divide a beam of visible light into two beams which travel along different paths until they recombine for observation of the interference fringes that are produced. Interferometers are used to make precision measurements of distances.

Special relativity —The part of Einstein's theory of relativity that deals only with nonaccelerating (inertial) reference frames.


‘Null’ research findings aren’t empty of meaning. Let’s publish them

By Anupam B. Jena Nov. 10, 2017


Every medical researcher dreams of doing studies or conducting clinical trials that generate results so compelling they change how diseases are treated or health policy is written. In reality, we are lucky if the results are even a little bit positive, and often end up with “null” results, meaning that the effect of a policy, drug, or clinical intervention that we tested is no different than that of some alternative.

“Null” comes from the null hypothesis, the bedrock of the scientific method. Say I want to test whether the switch to daylight saving time affects the outcomes of surgery because surgeons may be slightly more fatigued in the days following the transition due to lost sleep. I set up a null hypothesis — surgery-related deaths are no different in the days immediately before the switch to daylight saving time compared to the days immediately after it — and then try to nullify, or disprove, it to show that there was indeed a difference. (Read on to see the answer, though you can probably guess from the headline what it is.) Disproving the null hypothesis is standard operating procedure in science.
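As a sketch of what “trying to nullify” such a hypothesis can look like in practice, the comparison might be set up as a simple contingency-table test; the counts below are made up purely for illustration and are not the author’s data:

```python
from scipy.stats import chi2_contingency

# Entirely hypothetical counts, only to illustrate the structure of the test:
# surgical deaths vs. survivals in the days just before and just after the switch.
table = [[120, 9880],    # before DST: deaths, survivals
         [130, 9870]]    # after DST:  deaths, survivals

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.3f}")
# A large p-value means we fail to reject the null hypothesis of equal death rates --
# a null result, not proof that the rates are identical.
```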


Null results are exceedingly common. Yet they aren’t nearly as likely to get published as “positive” results, even though they should be. In an analysis of nearly 30,000 presentations made at scientific conferences, fewer than half were ultimately published in peer-reviewed journals, and negative or null results were far less likely to be published than positive results. Clinical trials with positive findings are published more often and sooner than negative or null trials.

That’s a shame, because publishing null results is an important endeavor. Some null results represent potentially important discoveries, such as finding that paying hospitals for performance based on the quality of their outcomes has no effect on actually improving quality. The majority of research questions, though, don’t fall into this category. Leaving null results unpublished can also result in other researchers conducting the same study, wasting time and resources.


Some unpublished null findings are on important topics, like whether public reporting of physicians’ outcomes leads physicians to “game the system” and alter the care that they provide patients. Others come from explorations of quirkier topics.

Here are a few of each from my own unpublished research.

Daughters and life expectancy. Daughters are more likely than sons to provide care to their ailing parents. Does that mean being blessed with a daughter translates into greater life expectancy? Using data from the U.S. Health and Retirement Study , I compared mortality rates among adults with one daughter versus those with one son. There was no difference. Ditto for families with two daughters versus two sons.

Daylight saving time and surgical mortality. The switch to daylight saving time in the spring has been linked to increased driving accidents immediately after the transition, attributed to fatigue from the hour of lost sleep. I investigated whether this time switch affects the care provided by surgeons by studying operative mortality in the days after the transition. U.S. health insurance claims data from 2002 to 2012 showed no increase in operation-related deaths in the days after the transition to daylight saving time compared to the days just before it.

Tubal ligations and son preference. A preference for sons has been documented in developing countries such as China and India as well as in the United States . When I was a medical student rotating in obstetrics, I heard a patient ask her obstetrician, “Please tie my tubes,” because she had finally had a son. Years later, I investigated whether that observation could be systematically true using health insurance claims data from the U.S. Among women who had recently given birth, there was no difference in later tubal ligation rates between those giving birth to sons versus daughters.

Gaming the reporting of heart surgery deaths. One strategy for improving the delivery of health care is public reporting of doctors’ outcomes. Some evidence suggests that doctors may game the system by choosing healthier patients who are less likely to experience poor outcomes. One important metric is 30-day mortality after coronary artery bypass graft surgery or placement of an artery-opening stent. I wanted to know if heart surgeons were trying to avoid bad scores on 30-day mortality by ordering intensive interventions to keep patients who had experienced one or more complications from the procedure alive beyond the 30-day mark to avoid being dinged in the publicly reported statistics. I hypothesized that in states with public reporting, such as New York, deaths would be higher on post-procedure days 31 to 35 than on days 25 to 29 if doctors chose to keep patients alive by extreme measures. The data didn’t back that up — there was no evidence that cardiac surgeons or cardiologists attempt to game public reporting in this way.


Halloween and hospitalization for high blood sugar. Children consume massive amounts of candy on and immediately after Halloween. Does this onslaught of candy consumption increase the number of episodes of seriously high blood sugar among children with type 1 or type 2 diabetes? I looked at emergency department use and hospitalization for hyperglycemia (high blood sugar) among children between the ages of 5 and 18 years in the eight weeks before Halloween versus the eight weeks after, using as a control group adults aged 35 and older to account for any seasonal trends in hospitalizations. There was no increase in emergency visits for hyperglycemia or hospitalizations for it among either adults or children in the days following Halloween.

The 2008 stock market crash and surgeons’ quality of care. During a three-week period in 2008, the Dow Jones Industrial Average fell 3,000 points, or nearly 25 percent of the Dow’s value. The sharp, massive decline in wealth for many Americans, particularly those with enough money to be heavily invested in stocks, had the potential to create immediate and significant stress. Was this acute, financial stress large enough to throw surgeons off their game? Using U.S. health insurance claims data for 2007 and 2008 that included patient deaths, I analyzed whether weekly 30-day postoperative mortality rates rose in the month following the crash, using 2007 as a control for seasonal trends. There were nearly identical 30-day mortality rates by week in both 2007 and 2008, suggesting that the stock market crash, while stressful, did not distract surgeons from their work.

The bottom line

Not reporting null research findings likely reflects competing priorities of scientific journals and researchers. With limited resources and space, journals prefer to publish positive findings and select only the most important null findings. Many researchers aren’t keen to publish null findings because the effort required to do so may not ultimately be rewarded by acceptance of the research into a scientific journal.

There are a few opportunities for researchers to publish null findings. For example, the Journal of Articles in Support of the Null Hypothesis has been publishing twice a year since 2002, and the Public Library of Science occasionally publishes negative and null results in its Missing Pieces collection. Perhaps a newly announced prize for publishing negative scientific results will spur researchers to pay more attention to this kind of work. The 10,000 Euro prize , initially aimed at neuroscience, is being sponsored by the European College of Neuropsychopharmacology’s Preclinical Data Forum .

For many researchers, though, the effort required to publish articles in these forums may not be worth the lift, particularly since the amount of effort required to write up a positive study is the same as for a null study.

The scientific community could benefit from more reporting of null findings, even if the reports were briefer and had less detail than would be needed for peer review. I’m not sure how we could accomplish that, but would welcome any ideas.


Anupam B. Jena, MD, is an economist, physician, and associate professor of health care policy and medicine at Harvard Medical School. He has received consulting fees from Pfizer, Hill Rom Services, Bristol Myers Squibb, Novartis Pharmaceuticals, Vertex Pharmaceuticals, and Precision Health Economics, a company providing consulting services to the life sciences industry.



Negative Results, Null Results, or No Results?


What happens when a study produces evidence that doesn't support a scientific hypothesis?

Scientists have a few different ways of describing this event. Sometimes, the results of such a study are called 'null results'. They may also be called 'negative results'. In my opinion, both of these terms are useful, although I slightly prefer 'null' on the grounds that the term 'negative' tends to draw an unfavorable contrast with 'positive' results. Whereas, my impression is that 'null' makes it clear that these are results in their own right, as they are evidence consistent with the null hypothesis.

Yet there's another way of talking about evidence inconsistent with a hypothesis - such results are sometimes treated as not being results at all. In this way of speaking, to "get a result" in a certain study means to find a positive result. To "get no results" or "find nothing" means to find only null results - which, on this view, have no value of their own, serving only to mark the absence of some (positive) findings.

This 'non-result' idiom is common usage in science - at least in my experience - but in my view, it's misleading and harmful. A null result is still a result, and it contributes just as much to our knowledge of the world as a positive result does. We may be disappointed by null results, and we may feel that they are not as exciting as the results we hoped to find, but these are really nothing more than our own subjective responses. The view that null results aren't really results is at the root of much of publication bias and motivates p-hacking.

The only true "non"-results are results that are so low quality that they are uninformative, either due to poor experimental design or errors in data collection. These failed results may be, on the face of it, either positive or negative.


Null results should produce answers, not excuses


Annette N. Brown

I recently served on a Center for Global Development (CGD) panel to discuss a new study of the effects of community-based education on learning outcomes in Afghanistan (Burde, Middleton, and Samii 2016). This exemplary randomized evaluation finds some important positive results. But the authors do one thing in the study that almost all impact evaluation researchers do – where they have null results, they make what I will call, for the sake of argument, excuses.

With the investment that goes into individual impact evaluations, especially large field trials, null results should produce answers, not excuses. We are quick to conclude that something works when we get positive results, but loathe to come right out and say that something doesn’t work when we get null results. Instead we say things like: perhaps the implementation lacked fidelity; or maybe the design of the intervention didn’t fully match the theory of change; or we just didn’t have enough observations. The U.S. Department of Education commissioned this useful brief to help people sort through the possible explanations for null results.

My argument here applies mostly to evaluations of experimental interventions (which may or may not be experimental, or randomized, evaluations). That is, I am talking about evaluations of new interventions that we want to test to see if they work. In the case of the community-based education evaluation, the study included experiments on two within-treatment variations: one to look at the effects of adding a community participation component to the education intervention, and another to look at the effects of requiring minimum qualifications in teacher selection when that requirement leads to teachers being hired from outside the community.

The authors motivate the first variation, community participation, by explaining that it is a concept commonly incorporated into NGO-administered programs, although the evidence of its effectiveness is mixed. In Afghanistan, they implemented community participation using community libraries, adult reading groups and a poster campaign. Their test of community participation produces “no statistically significant effect on access, learning outcomes, measures of parental support to education, or trust in service providers.” The authors go on to explain, “these weak results could be due to one of two possible factors: the enhanced package provided did not adequately address the reasons why some parents persistently do not send their children to school, or the activities were not implemented at the appropriate ‘dosage’ to be effective.”

What we would like the authors to be able to say, given the investment in conducting this experiment, is: “the finding of no statistically significant effect demonstrates that community participation is not an effective enhancement for community-based education in Afghanistan.” But there are indeed open questions about activity selection and implementation, so I understand why the authors do not feel comfortable drawing this definitive conclusion.

What can we do to make sure that null results do tell us what doesn’t work? I recommend three steps.

First , we should make sure that the intervention we are evaluating is the best possible design based on the underlying theory of change and for the context where we’re working. This recommendation points to the need for formative research . Formative research can include collection of quantitative and/or qualitative data, which can be analyzed using quantitative and/or qualitative methods, and it should produce the information necessary to design the best possible intervention. In this Afghanistan case, formative research would ask the question why are parents not sending their children to school and would explore with these parents and other parents in Afghanistan what services or activities they would like to see. Put differently, a hypothesis is an educated guess, and formative research is the education behind the guess. A strong hypothesis should give us answers both when it is rejected and when it is not rejected. Some examples of formative research conducted to inform intervention design come from the International Initiative for Impact Evaluation’s (3ie’s) HIV Self-testing Thematic Window. In both Kenya (formative research summary here ) and Zambia (formative study to be published soon, overview here ) 3ie provided grants for formative research in advance of funding studies of pilot interventions. This blog post summarizes two formative research studies conducted by the mSTAR project of FHI 360 to inform the design of a mobile money salary payments program in Liberia.

The second step to help ensure answers from null results is to make sure that the intervention as designed can be implemented with fidelity in the context for which it was designed. This step often requires formative evaluation . A formative evaluation involves implementing the intervention on a small scale and collecting data on the implementation itself and about those who receive the intervention. A formative evaluation is not the same as piloting the intervention to test whether the intervention causes the desired outcome or impact—that measurement of an attributable effect requires a counterfactual.

A formative evaluation addresses two questions:

  • can the intervention be implemented, and
  • will the target beneficiaries take it up?

The first question often focuses on the implementation process, and the formative evaluation uses process evaluation methods. Sometimes, though, it is more of a science question, and the formative evaluation is more like a feasibility study looking at whether the technology actually works in the field. Here is an example of a formative evaluation of hearing screening procedures in Ecuadorian schools designed to learn how well a new screening technology can be implemented.

The second formative evaluation question, whether recipients take up the intervention, should not be in too much doubt if the formative research was good. But we often see impact evaluations of new or pilot programs where the main finding is that the participants did not “use” the intervention. In the case of the Afghanistan study, for example, “only about 35% of household respondents and community leaders in enhancement villages report [the presence of] libraries and adult reading groups.” Those take-up data suggest that the null result from the impact evaluation reflects the fact that community participation didn’t happen, more than it reflects that community participation doesn’t work. But we shouldn’t need a big study with a counterfactual to test whether participants will take up a new intervention. We can do that with a formative evaluation.

Why don’t we see more formative research and formative evaluation? They take a lot of time and a lot of money. Often, by the time impact evaluators are brought in, there is already a timeline for starting the project and no patience for spending a year collecting and analyzing formative research data and conducting a formative evaluation of the intervention before the full study is launched. Skipping these steps is not much of a problem when the study yields positive results, but it leaves us with excuses when the study yields null results.

The third step to getting answers from null results is ensuring there is enough statistical power to estimate a meaningful minimum detectable effect. This step is well known, and it is fairly common now to see power calculations in impact evaluation proposals or protocols, at least for the primary outcome of interest. What we rarely see are ex-post power calculations to explore whether the study as implemented still had enough power to measure something meaningful. If the final sample size turns out to be smaller than planned, or other assumed parameters are different in reality (e.g., the intra-cluster correlation is higher), a null effect could simply be the result of too little power. This finding doesn’t give us the answer that the intervention doesn’t work, but at least we know why we don’t know. Ex-post power calculations can also be useful for understanding null results in tests of heterogeneous outcomes, where the applicable sample is often smaller than the full sample determined by the ex-ante power calculations. Examples of how ex-post power calculations can be used are here and here. A minimal sketch of such a calculation follows below.
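As a rough illustration of what an ex-post check can look like, here is a minimal R sketch of a minimum detectable effect (MDE) calculation for a two-arm cluster-randomized design, using the standard design-effect approximation. All numbers are invented placeholders, not values from the Afghanistan study or any other evaluation discussed here.

# Minimal sketch (assumptions noted above): MDE in standard-deviation units
# for a two-arm cluster-randomized design, using the design-effect approximation.
mde_cluster <- function(n_clusters_per_arm, cluster_size, icc,
                        alpha = 0.05, power = 0.80) {
  n_per_arm <- n_clusters_per_arm * cluster_size
  deff <- 1 + (cluster_size - 1) * icc      # design effect from clustering
  n_eff <- n_per_arm / deff                 # effective sample size per arm
  z <- qnorm(1 - alpha / 2) + qnorm(power)
  z * sqrt(2 / n_eff)                       # MDE in standard deviations
}

# Planned assumptions vs. what was realized (illustrative numbers only):
mde_cluster(n_clusters_per_arm = 40, cluster_size = 25, icc = 0.05)  # roughly 0.19 SD
mde_cluster(n_clusters_per_arm = 32, cluster_size = 25, icc = 0.15)  # roughly 0.30 SD

If the realized MDE is much larger than the effect the program could plausibly produce, the null result tells us more about lost power than about the intervention.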

As I said at the beginning, there are many features that make the Burde, Middleton, and Samii evaluation exemplary, features not seen in all (or even most) development impact evaluations:

  • First, the study overall is highly policy relevant. The primary intervention being evaluated, community-based education, is something the Afghan government is looking to take over from the NGOs eventually. The second of the two within-treatment variations is also highly policy relevant. If and when the government does take over community-based education, the minimum qualifications requirement for teachers would likely apply, so how this requirement affects recruitment is extremely relevant.
  • Second, the authors pay particular attention to the ecological validity of the experimental interventions, which improves the usefulness of the pilot for informing the intervention at scale.
  • Third, the study includes a detailed cost effectiveness analysis allowing a comparison of the different community-based education packages in terms of effect per dollar invested.
  • Fourth, the study includes analysis of the influence of other NGO activities in these communities. In spite of random assignment, the presence of related NGO activities is not perfectly balanced across the study arms, so understanding the attributable impact of the intervention requires analysis of other NGO activities.
  • Fifth, the study was conducted in close collaboration with the Afghanistan Ministry of Education, which benefits the policy relevance and the policy take-up.

The high quality of this study means that the authors can make specific policy recommendations based on the positive results that they do find, and this quality was appreciated by the Afghan Deputy Minister of Education in his comments at CGD. But null results happen. They happened here and they happen in many studies. To get the most out of our impact evaluations, we need to set them up so that null results give us answers.

Photo credit: FHI 360



Null & Alternative Hypotheses | Definitions, Templates & Examples

Published on May 6, 2022 by Shaun Turney. Revised on June 22, 2023.

The null and alternative hypotheses are two competing claims that researchers weigh evidence for and against using a statistical test :

  • Null hypothesis (H0): There’s no effect in the population.
  • Alternative hypothesis (Ha or H1): There’s an effect in the population.

Table of contents

  • Answering your research question with hypotheses
  • What is a null hypothesis?
  • What is an alternative hypothesis?
  • Similarities and differences between null and alternative hypotheses
  • How to write null and alternative hypotheses
  • Other interesting articles
  • Frequently asked questions

The null and alternative hypotheses offer competing answers to your research question . When the research question asks “Does the independent variable affect the dependent variable?”:

  • The null hypothesis (H0) answers “No, there’s no effect in the population.”
  • The alternative hypothesis (Ha) answers “Yes, there is an effect in the population.”

The null and alternative hypotheses are always claims about the population. That’s because the goal of hypothesis testing is to make inferences about a population based on a sample. Often, we infer whether there’s an effect in the population by looking at differences between groups or relationships between variables in the sample. It’s critical for your research to write strong hypotheses.

You can use a statistical test to decide whether the evidence favors the null or alternative hypothesis. Each type of statistical test comes with a specific way of phrasing the null and alternative hypothesis. However, the hypotheses can also be phrased in a general way that applies to any test.
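For instance, here is a minimal R sketch, using simulated data purely for illustration, of a two-sample t test weighing the null hypothesis (no difference between group means) against the alternative (some difference):

# Minimal sketch with simulated data (not real measurements)
set.seed(1)
flossing     <- rnorm(50, mean = 3.0, sd = 1.5)   # cavities in a flossing group
non_flossing <- rnorm(50, mean = 3.6, sd = 1.5)   # cavities in a non-flossing group

# H0: mu1 = mu2 (no difference in mean cavities); Ha: mu1 != mu2
result <- t.test(flossing, non_flossing)
result$p.value   # if p <= 0.05 we reject H0; otherwise we fail to reject H0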


The null hypothesis is the claim that there’s no effect in the population.

If the sample provides enough evidence against the claim that there’s no effect in the population (p ≤ α), then we can reject the null hypothesis. Otherwise, we fail to reject the null hypothesis.

Although “fail to reject” may sound awkward, it’s the only wording that statisticians accept. Be careful not to say you “prove” or “accept” the null hypothesis.

Null hypotheses often include phrases such as “no effect,” “no difference,” or “no relationship.” When written in mathematical terms, they always include an equality (usually =, but sometimes ≥ or ≤).

You can never know with complete certainty whether there is an effect in the population. Some percentage of the time, your inference about the population will be incorrect. When you incorrectly reject the null hypothesis, it’s called a type I error. When you incorrectly fail to reject it, it’s a type II error.
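To make the error rates concrete, here is a small simulation sketch in R (invented data, not from any study discussed here): when the null hypothesis is actually true, roughly α = 5% of tests will still reject it, which is the type I error rate.

# Simulate 10,000 studies in which H0 is true (both groups come from the same population)
set.seed(42)
p_values <- replicate(10000, {
  x <- rnorm(30)
  y <- rnorm(30)
  t.test(x, y)$p.value
})
mean(p_values <= 0.05)   # close to 0.05: the type I error rate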

Examples of null hypotheses

The table below gives examples of research questions and null hypotheses. There’s always more than one way to answer a research question, but these null hypotheses can help you get started.

Research question: Does tooth flossing affect the number of cavities?
Null hypothesis (H0): Tooth flossing has no effect on the number of cavities.
Test: Two-sample t test. The mean number of cavities per person does not differ between the flossing group (µ1) and the non-flossing group (µ2) in the population; µ1 = µ2.

Research question: Does the amount of text highlighted in the textbook affect exam scores?
Null hypothesis (H0): The amount of text highlighted in the textbook has no effect on exam scores.
Test: Simple linear regression. There is no relationship between the amount of text highlighted and exam scores in the population; β = 0.

Research question: Does daily meditation decrease the incidence of depression?
Null hypothesis (H0): Daily meditation does not decrease the incidence of depression.*
Test: Two-proportions test. The proportion of people with depression in the daily-meditation group (p1) is greater than or equal to the proportion in the no-meditation group (p2) in the population; p1 ≥ p2.

*Note that some researchers prefer to always write the null hypothesis in terms of “no effect” and “=”. It would be fine to say that daily meditation has no effect on the incidence of depression and p1 = p2.

The alternative hypothesis (Ha) is the other answer to your research question. It claims that there’s an effect in the population.

Often, your alternative hypothesis is the same as your research hypothesis. In other words, it’s the claim that you expect or hope will be true.

The alternative hypothesis is the complement to the null hypothesis. Null and alternative hypotheses are exhaustive, meaning that together they cover every possible outcome. They are also mutually exclusive, meaning that only one can be true at a time.

Alternative hypotheses often include phrases such as “an effect,” “a difference,” or “a relationship.” When alternative hypotheses are written in mathematical terms, they always include an inequality (usually ≠, but sometimes < or >). As with null hypotheses, there are many acceptable ways to phrase an alternative hypothesis.

Examples of alternative hypotheses

The table below gives examples of research questions and alternative hypotheses to help you get started with formulating your own.

Research question: Does tooth flossing affect the number of cavities?
Alternative hypothesis (Ha): Tooth flossing has an effect on the number of cavities.
Test: Two-sample t test. The mean number of cavities per person differs between the flossing group (µ1) and the non-flossing group (µ2) in the population; µ1 ≠ µ2.

Research question: Does the amount of text highlighted in a textbook affect exam scores?
Alternative hypothesis (Ha): The amount of text highlighted in the textbook has an effect on exam scores.
Test: Simple linear regression. There is a relationship between the amount of text highlighted and exam scores in the population; β ≠ 0.

Research question: Does daily meditation decrease the incidence of depression?
Alternative hypothesis (Ha): Daily meditation decreases the incidence of depression.
Test: Two-proportions test. The proportion of people with depression in the daily-meditation group (p1) is less than the proportion in the no-meditation group (p2) in the population; p1 < p2.

Null and alternative hypotheses are similar in some ways:

  • They’re both answers to the research question.
  • They both make claims about the population.
  • They’re both evaluated by statistical tests.

However, there are important differences between the two types of hypotheses, summarized in the following table.

Null hypothesis (H0): A claim that there is no effect in the population. Written with an equality symbol (=, ≥, or ≤). When the test is significant, it is rejected; otherwise, we fail to reject it.

Alternative hypothesis (Ha): A claim that there is an effect in the population. Written with an inequality symbol (≠, <, or >). When the test is significant, it is supported; otherwise, it is not supported.


To help you write your hypotheses, you can use the template sentences below. If you know which statistical test you’re going to use, you can use the test-specific template sentences. Otherwise, you can use the general template sentences.

General template sentences

The only things you need to know to use these general template sentences are your dependent and independent variables. To write your research question, null hypothesis, and alternative hypothesis, fill in the following sentences with your variables:

Does [independent variable] affect [dependent variable]?

  • Null hypothesis (H0): [Independent variable] does not affect [dependent variable].
  • Alternative hypothesis (Ha): [Independent variable] affects [dependent variable].

Test-specific template sentences

Once you know the statistical test you’ll be using, you can write your hypotheses in a more precise and mathematical way specific to the test you chose. The table below provides template sentences for common statistical tests.

Two-sample t test (two groups):
  • H0: The mean dependent variable does not differ between group 1 (µ1) and group 2 (µ2) in the population; µ1 = µ2.
  • Ha: The mean dependent variable differs between group 1 (µ1) and group 2 (µ2) in the population; µ1 ≠ µ2.

ANOVA (three groups):
  • H0: The mean dependent variable does not differ between group 1 (µ1), group 2 (µ2), and group 3 (µ3) in the population; µ1 = µ2 = µ3.
  • Ha: The mean dependent variables of group 1 (µ1), group 2 (µ2), and group 3 (µ3) are not all equal in the population.

Correlation test:
  • H0: There is no correlation between the independent variable and the dependent variable in the population; ρ = 0.
  • Ha: There is a correlation between the independent variable and the dependent variable in the population; ρ ≠ 0.

Simple linear regression:
  • H0: There is no relationship between the independent variable and the dependent variable in the population; β = 0.
  • Ha: There is a relationship between the independent variable and the dependent variable in the population; β ≠ 0.

Two-proportions test:
  • H0: The dependent variable, expressed as a proportion, does not differ between group 1 (p1) and group 2 (p2) in the population; p1 = p2.
  • Ha: The dependent variable, expressed as a proportion, differs between group 1 (p1) and group 2 (p2) in the population; p1 ≠ p2.

A short R sketch following this table shows the functions that run each of these tests.
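If you work in R, the following minimal sketch shows which base-R functions correspond to each of these hypothesis pairs. The data frame and variable names are made up purely for illustration.

# Simulated data, purely for illustration
set.seed(7)
dat <- data.frame(
  group  = rep(c("a", "b"), each = 50),
  group3 = rep(c("a", "b", "c"), length.out = 100),
  x      = rnorm(100),
  score  = rnorm(100)
)
dat$y <- 0.3 * dat$x + rnorm(100)

t.test(score ~ group, data = dat)          # t test: mu1 = mu2 vs. mu1 != mu2
summary(aov(score ~ group3, data = dat))   # ANOVA: all means equal vs. not all equal
cor.test(dat$x, dat$y)                     # correlation: rho = 0 vs. rho != 0
summary(lm(y ~ x, data = dat))             # regression: beta = 0 vs. beta != 0
prop.test(c(18, 30), c(100, 100))          # two proportions: p1 = p2 vs. p1 != p2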

Note: The template sentences above assume that you’re performing two-tailed tests, since the alternative hypotheses use ≠. Two-tailed tests are appropriate for most studies.

If you want to know more about statistics, methodology, or research bias, make sure to check out some of our other articles with explanations and examples.

  • Normal distribution
  • Descriptive statistics
  • Measures of central tendency
  • Correlation coefficient

Methodology

  • Cluster sampling
  • Stratified sampling
  • Types of interviews
  • Cohort study
  • Thematic analysis

Research bias

  • Implicit bias
  • Cognitive bias
  • Survivorship bias
  • Availability heuristic
  • Nonresponse bias
  • Regression to the mean

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is used by scientists to test specific predictions, called hypotheses , by calculating how likely it is that a pattern or relationship between variables could have arisen by chance.

Null and alternative hypotheses are used in statistical hypothesis testing . The null hypothesis of a test always predicts no effect or no relationship between variables, while the alternative hypothesis states your research prediction of an effect or relationship.

The null hypothesis is often abbreviated as H0. When the null hypothesis is written using mathematical symbols, it always includes an equality symbol (usually =, but sometimes ≥ or ≤).

The alternative hypothesis is often abbreviated as Ha or H1. When the alternative hypothesis is written using mathematical symbols, it always includes an inequality symbol (usually ≠, but sometimes < or >).

A research hypothesis is your proposed answer to your research question. The research hypothesis usually includes an explanation (“ x affects y because …”).

A statistical hypothesis, on the other hand, is a mathematical statement about a population parameter. Statistical hypotheses always come in pairs: the null and alternative hypotheses . In a well-designed study , the statistical hypotheses correspond logically to the research hypothesis.



Replication of “null results” – Absence of evidence or evidence of absence?

Samuel Pawel, Rachel Heyard, Charlotte Micheloud, Leonhard Held

Epidemiology, Biostatistics and Prevention Institute, Center for Reproducible Science, University of Zurich, Switzerland

https://doi.org/10.7554/eLife.92311.2
In several large-scale replication projects, statistically non-significant results in both the original and the replication study have been interpreted as a “replication success”. Here we discuss the logical problems with this approach: Non-significance in both studies does not ensure that the studies provide evidence for the absence of an effect and “replication success” can virtually always be achieved if the sample sizes are small enough. In addition, the relevant error rates are not controlled. We show how methods, such as equivalence testing and Bayes factors, can be used to adequately quantify the evidence for the absence of an effect and how they can be applied in the replication setting. Using data from the Reproducibility Project: Cancer Biology, the Experimental Philosophy Replicability Project, and the Reproducibility Project: Psychology we illustrate that many original and replication studies with “null results” are in fact inconclusive. We conclude that it is important to also replicate studies with statistically non-significant results, but that they should be designed, analyzed, and interpreted appropriately.

eLife assessment

By assessing what it means to replicate a null finding, and by proposing two methods that can be used to evaluate whether null findings have been replicated (frequentist equivalence testing, and Bayes factors), this article represents an important contribution to work on reproducibility. Through a compelling re-analysis of results from the Reproducibility Project: Cancer Biology, the authors demonstrate that even when 'replication success' is reduced to a single criterion, different methods to assess replication of a null finding can lead to different conclusions.

  • https://doi.org/10.7554/eLife.92311.2.sa2


Introduction

Absence of evidence is not evidence of absence – the title of the 1995 paper by Douglas Altman and Martin Bland has since become a mantra in the statistical and medical literature ( Altman and Bland, 1995 ). Yet, the misconception that a statistically non-significant result indicates evidence for the absence of an effect is unfortunately still widespread ( Greenland, 2011 ; Makin and de Xivry, 2019 ). Such a “null result” – typically characterized by a p -value p > 0.05 for the null hypothesis of an absent effect – may also occur if an effect is actually present. For example, if the sample size of a study is chosen to detect an assumed effect with a power of 80%, null results will incorrectly occur 20% of the time when the assumed effect is actually present. If the power of the study is lower, null results will occur more often. In general, the lower the power of a study, the greater the ambiguity of a null result. To put a null result in context, it is therefore critical to know whether the study was adequately powered and under what assumed effect the power was calculated ( Hoenig and Heisey, 2001 ; Greenland, 2012 ). However, if the goal of a study is to explicitly quantify the evidence for the absence of an effect, more appropriate methods designed for this task, such as equivalence testing ( Wellek, 2010 ; Lakens, 2017 ; Senn, 2021 ) or Bayes factors ( Kass and Raftery, 1995 ; Goodman, 1999 , 2005 ; Dienes, 2014 ; Keysers et al., 2020 ), should be used from the outset.

The interpretation of null results becomes even more complicated in the setting of replication studies. In a replication study, researchers attempt to repeat an original study as closely as possible in order to assess whether consistent results can be obtained with new data ( National Academies of Sciences, Engineering, and Medicine, 2019 ). In the last decade, various large-scale replication projects have been conducted in diverse fields, from the biomedical to the social sciences ( Prinz et al., 2011 ; Begley and Ellis, 2012 ; Klein et al., 2014 ; Open Science Collaboration, 2015 ; Camerer et al., 2016 , 2018 ; Klein et al., 2018 ; Cova et al., 2018 ; Errington et al., 2021 , among others). The majority of these projects reported alarmingly low replicability rates across a broad spectrum of criteria for quantifying replicability. While most of these projects restricted their focus on original studies with statistically significant results (“positive results”), the Reproducibility Project: Cancer Biology (RPCB, Errington et al., 2021 ), the Experimental Philosophy Replicability Project (EPRP, Cova et al., 2018 ), and the Reproducibility Project: Psychology (RPP, Open Science Collaboration, 2015 ) also attempted to replicate some original studies with null results – either non-significant or interpreted as showing no evidence for a meaningful effect by the original authors.

Although the EPRP and RPP interpreted non-significant results in both original and replication study as a “replication success” for some individual replications (see, for example, the replication of McCann (2005 , replication report: https://osf.io/wcm7n ) or the replication of Ranganath and Nosek (2008 , replication report: https://osf.io/9xt25 )), they excluded the original null results in the calculation of an overall replicability rate based on significance. In contrast, the RPCB explicitly defined null results in both the original and the replication study as a criterion for “replication success”. According to this “non-significance” criterion, 11/15 = 73% replications of original null effects were successful. Four additional criteria were used to provide a more nuanced assessment of replication success for original null results: (i) whether the original effect estimate was included in the 95% confidence interval of the replication effect estimate (success rate 11/15 = 73%), (ii) whether the replication effect estimate was included in the 95% confidence interval of the original effect estimate (success rate 12/15 = 80%), (iii) whether the replication effect estimate was included in the 95% prediction interval based on the original effect estimate (success rate 12/15 = 80%), (iv) and whether the p -value obtained from combining the original and replication effect estimate with a meta-analysis was non-significant (success rate 10/15 = 67%). Criteria (i) to (iii) are useful for assessing compatibility in effect estimates between the original and the replication study. Their suitability has been extensively discussed in the literature. The prediction interval criterion (iii) or equivalent criteria (e.g., the Q -test) are usually recommended because they account for the uncertainty from both studies and have adequate error rates when the true effect sizes are the same ( Patil et al., 2016 ; Mathur and VanderWeele, 2020 ; Schauer and Hedges, 2021 ).

While the effect estimate criteria (i) to (iii) can be applied regardless of whether or not the original study was non-significant, the “meta-analytic non-significance” criterion (iv) and the aforementioned non-significance criterion refer specifically to original null results. We believe that there are several logical problems with both, and that it is important to highlight and address them, especially since the non-significance criterion has already been used in three replication projects without much scrutiny. It is crucial to note that it is not our intention to diminish the enormously important contributions of the RPCB, the EPRP, and the RPP, but rather to build on their work and provide recommendations for ongoing and future replication projects (e.g., Amaral et al., 2019 ; Murphy et al., 2022 ).

The logical problems with the non-significance criterion are as follows: First, if the original study had low statistical power, a non-significant result is highly inconclusive and does not provide evidence for the absence of an effect. It is then unclear what exactly the goal of the replication should be – to replicate the inconclusiveness of the original result? On the other hand, if the original study was adequately powered, a non-significant result may indeed provide some evidence for the absence of an effect when analyzed with appropriate methods, so that the goal of the replication is clearer. However, the criterion by itself does not distinguish between these two cases. Second, with this criterion researchers can virtually always achieve replication success by conducting a replication study with a very small sample size, such that the p -value is non-significant and the result is inconclusive. This is because the null hypothesis under which the p -value is computed is misaligned with the goal of inference, which is to quantify the evidence for the absence of an effect. Third, the criterion does not control the error of falsely claiming the absence of an effect at a predetermined rate. This is in contrast to the standard criterion for replication success, which requires significance from both studies (also known as the two-trials rule, see Section 12.2.8 in Senn, 2021 ), and ensures that the error of falsely claiming the presence of an effect is controlled at a rate equal to the squared significance level (for example, 5% × 5% = 0.25% for a 5% significance level). The non-significance criterion may be intended to complement the two-trials rule for null results. However, it fails to do so in this respect, which may be required by regulators and funders. These logical problems are equally applicable to the meta-analytic non-significance criterion.

In the following, we present two principled approaches for analyzing replication studies of null results – frequentist equivalence testing and Bayesian hypothesis testing – that can address the limitations of the non-significance criterion. We use the null results replicated in the RPCB, RPP, and EPRP to illustrate the problems of the non-significance criterion and how they can be addressed. We conclude the paper with practical recommendations for analyzing replication studies of original null results, including simple R code for applying the proposed methods.

Null results from the Reproducibility Project: Cancer Biology

Figure 1 shows effect estimates on standardized mean difference (SMD) scale with 95% confidence intervals from two RPCB study pairs. In both study pairs, the original and replication studies are “null results” and therefore meet the non-significance criterion for replication success (the two-sided p -values are greater than 0.05 in both the original and the replication study). The same is true when applying the meta-analytic non-significance criterion (the two-sided p -values of the meta-analyses p MA are greater than 0.05). However, intuition would suggest that the conclusions in the two pairs are very different.

Figure 1. Two examples of original and replication study pairs which meet the non-significance replication success criterion from the Reproducibility Project: Cancer Biology (Errington et al., 2021). Shown are standardized mean difference effect estimates with 95% confidence intervals, sample sizes n, and two-sided p-values p for the null hypothesis that the effect is absent. Effect estimate, 95% confidence interval, and p-value from a fixed-effect meta-analysis pMA of original and replication study are shown in gray.

The original study from Dawson et al. (2011) and its replication both show large effect estimates in magnitude, but due to the very small sample sizes, the uncertainty of these estimates is large, too. With such low sample sizes, the results seem inconclusive. In contrast, the effect estimates from Goetz et al. (2011) and its replication are much smaller in magnitude and their uncertainty is also smaller because the studies used larger sample sizes. Intuitively, the results seem to provide more evidence for a zero (or negligibly small) effect. While these two examples show the qualitative difference between absence of evidence and evidence of absence, we will now discuss how the two can be quantitatively distinguished.

Methods for assessing replicability of null results

There are both frequentist and Bayesian methods that can be used for assessing evidence for the absence of an effect. Anderson and Maxwell (2016) provide an excellent summary in the context of replication studies in psychology. We now briefly discuss two possible approaches – frequentist equivalence testing and Bayesian hypothesis testing – and their application to the RPCB, EPRP, and RPP data.

Frequentist equivalence testing

Equivalence testing was developed in the context of clinical trials to assess whether a new treatment – typically cheaper or with fewer side effects than the established treatment – is practically equivalent to the established treatment ( Wellek, 2010 ; Lakens, 2017 ). The method can also be used to assess whether an effect is practically equivalent to an absent effect, usually zero. Using equivalence testing as a way to put non-significant results into context has been suggested by several authors ( Hauck and Anderson, 1986 ; Campbell and Gustafson, 2018 ). The main challenge is to specify the margin Δ > 0 that defines an equivalence range [−Δ, +Δ] in which an effect is considered as absent for practical purposes. The goal is then to reject the null hypothesis that the true effect is outside the equivalence range. This is in contrast to the usual null hypotheses of superiority tests which state that the effect is zero or smaller than zero, see Figure 2 for an illustration.

Figure 2. Null hypothesis (H0) and alternative hypothesis (H1) for superiority and equivalence tests (with equivalence margin Δ > 0).

To ensure that the null hypothesis is falsely rejected at most α × 100% of the time, the standard approach is to declare equivalence if the (1 − 2α) × 100% confidence interval for the effect is contained within the equivalence range, for example, a 90% confidence interval for α = 0.05 (Westlake, 1972). This procedure is equivalent to declaring equivalence when two one-sided tests (TOST) for the null hypotheses of the effect being greater/smaller than +Δ and −Δ are both significant at level α (Schuirmann, 1987). A quantitative measure of evidence for the absence of an effect is then given by the maximum of the two one-sided p-values – the TOST p-value (Greenland, 2023, section 4.4). In case a dichotomous replication success criterion for null results is desired, it is natural to require that both the original and the replication TOST p-values are smaller than some level α (conventionally α = 0.05). Equivalently, the criterion would require the (1 − 2α) × 100% confidence intervals of the original and the replication to be included in the equivalence region. In contrast to the non-significance criterion, this criterion controls the error of falsely claiming replication success at level α² when there is a true effect outside the equivalence margin, thus complementing the usual two-trials rule in drug regulation (Senn, 2021, Section 12.2.8).
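To make this concrete, here is a minimal R sketch (an illustration under the assumptions above, not the authors' analysis code) of the TOST p-value for a single study, given an effect estimate on SMD scale, its standard error, and an equivalence margin Δ; the numeric inputs are invented:

# TOST p-value for one study (sketch; estimate and se on SMD scale, equivalence to zero)
p_tost <- function(estimate, se, Delta) {
  p_upper <- pnorm((estimate - Delta) / se)      # one-sided test of H0: effect >= +Delta
  p_lower <- 1 - pnorm((estimate + Delta) / se)  # one-sided test of H0: effect <= -Delta
  max(p_upper, p_lower)                          # TOST p-value
}

p_tost(estimate = 0.1, se = 0.25, Delta = 0.74)  # about 0.005: estimate fits within the margin
p_tost(estimate = 0.9, se = 0.60, Delta = 0.74)  # about 0.6: inconclusive

Requiring this p-value to be at most α in both the original and the replication study then gives the dichotomous criterion described above.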

Returning to the RPCB data, Figure 3 shows the standardized mean difference effect estimates with 90% confidence intervals for all 15 effects that were treated as null results by the RPCB. Most of them showed non-significant p-values (p > 0.05) in the original study. It is noteworthy, however, that two effects from the second experiment of original paper 48 were regarded as null results despite their statistical significance. According to the non-significance criterion (requiring p > 0.05 in both the original and the replication study), there are 11 “successes” out of the 15 null effects in total, as reported in Table 1 from Errington et al. (2021).

Figure 3. Effect estimates on standardized mean difference (SMD) scale with 90% confidence interval for the 15 “null results” and their replication studies from the Reproducibility Project: Cancer Biology (Errington et al., 2021). The title above each plot indicates the original paper, experiment and effect numbers. Two original effect estimates from original paper 48 were statistically significant at p < 0.05, but were interpreted as null results by the original authors and therefore treated as null results by the RPCB. The two examples from Figure 1 are indicated in the plot titles. The dashed gray line represents the value of no effect (SMD = 0), while the dotted red lines represent the equivalence range with a margin of Δ = 0.74, classified as “liberal” by Wellek (2010, Table 1.1). The p-value pTOST is the maximum of the two one-sided p-values for the null hypotheses of the effect being greater/less than +Δ and −Δ, respectively. The Bayes factor BF01 quantifies the evidence for the null hypothesis H0: SMD = 0 against the alternative H1: SMD ≠ 0 with a normal unit-information prior assigned to the SMD under H1.

We will now apply equivalence testing to the RPCB data. The dotted red lines in Figure 3 represent an equivalence range for the margin Δ = 0.74, which Wellek (2010 , Table 1.1) classifies as “liberal”. However, even with this generous margin, only 4 of the 15 study pairs are able to establish replication success at the 5% level, in the sense that both the original and the replication 90% confidence interval fall within the equivalence range (or, equivalently, that their TOST p -values are smaller than 0.05). For the remaining 11 studies, the situation remains inconclusive and there is no evidence for the absence or the presence of the effect. For instance, the previously discussed example from Goetz et al. (2011) marginally fails the criterion ( p TOST = 0.06 in the original study and p TOST = 0.04 in the replication), while the example from Dawson et al. (2011) is a clearer failure ( p TOST = 0.75 in the original study and p TOST = 0.88 in the replication) as both effect estimates even lie outside the equivalence margin.

The post-hoc specification of equivalence margins is controversial. Ideally, the margin should be specified on a case-by-case basis in a pre-registered protocol before the studies are conducted by researchers familiar with the subject matter. In the social and medical sciences, the conventions of Cohen (1992) are typically used to classify SMD effect sizes (SMD = 0.2 small, SMD = 0.5 medium, SMD = 0.8 large). While effect sizes are typically larger in preclinical research, it seems unrealistic to specify margins larger than 1 on SMD scale to represent effect sizes that are absent for practical purposes. It could also be argued that the chosen margin Δ = 0.74 is too lax compared to margins commonly used in clinical research ( Lange and Freitag, 2005 ). We therefore report a sensitivity analysis regarding the choice of the margin in Figure 4 in Appendix A . This analysis shows that for realistic margins between 0 and 1, the proportion of replication successes remains below 50% for the conventional α = 0.05 level. To achieve a success rate of 11/15 = 73%, as was achieved with the non-significance criterion from the RPCB, unrealistic margins of Δ > 2 are required.

Appendix B shows similar equivalence test analyses for the four study pairs with original null results from the RPP and EPRP. Three study pair results turn out to be inconclusive due to the large uncertainty around their effect estimates.

Bayesian hypothesis testing

The distinction between absence of evidence and evidence of absence is naturally built into the Bayesian approach to hypothesis testing. A central measure of evidence is the Bayes factor ( Kass and Raftery, 1995 ; Goodman, 1999 ; Dienes, 2014 ; Keysers et al., 2020 ), which is the updating factor of the prior odds to the posterior odds of the null hypothesis H 0 versus the alternative hypothesis H 1

\[
\underbrace{\frac{\Pr(H_0 \mid \text{data})}{\Pr(H_1 \mid \text{data})}}_{\text{posterior odds}}
= \underbrace{\frac{f(\text{data} \mid H_0)}{f(\text{data} \mid H_1)}}_{\mathrm{BF}_{01}}
\times \underbrace{\frac{\Pr(H_0)}{\Pr(H_1)}}_{\text{prior odds}}
\]

The Bayes factor BF01 quantifies how much the observed data have increased or decreased the probability Pr(H0) of the null hypothesis relative to the probability Pr(H1) of the alternative. As such, Bayes factors are direct measures of evidence for the null hypothesis, in contrast to p-values, which are only indirect measures of evidence as they are computed under the assumption that the null hypothesis is true (Held and Ott, 2018). If the null hypothesis states the absence of an effect, a Bayes factor greater than one (BF01 > 1) indicates evidence for the absence of the effect and a Bayes factor smaller than one (BF01 < 1) indicates evidence for the presence of the effect, whereas a Bayes factor not much different from one indicates absence of evidence for either hypothesis (BF01 ≈ 1). Bayes factors are quantitative summaries of the evidence provided by the data in favor of the null hypothesis as opposed to the alternative hypothesis. If a dichotomous criterion for successful replication of a null result is desired, it seems natural to require a Bayes factor larger than some level γ > 1 from both studies, for example, γ = 3 or γ = 10, which are conventional levels for “substantial” and “strong” evidence, respectively (Jeffreys, 1961). In contrast to the non-significance criterion, this criterion provides a genuine measure of evidence that can distinguish absence of evidence from evidence of absence.

The main challenge with Bayes factors is the specification of the effect under the alternative hypothesis H 1 . The assumed effect under H 1 is directly related to the Bayes factor, and researchers who assume different effects will end up with different Bayes factors. Instead of specifying a single effect, one therefore typically specifies a “prior distribution” of plausible effects. Importantly, the prior distribution, like the equivalence margin, should be determined by researchers with subject knowledge and before the data are collected.

To compute the Bayes factors for the RPCB null results, we used the observed effect estimates as the data and assumed a normal sampling distribution for them (Dienes, 2014), as typically done in a meta-analysis. The Bayes factors BF01 shown in Figure 3 then quantify the evidence for the null hypothesis of no effect against the alternative hypothesis that there is an effect, using a normal “unit-information” prior distribution (Kass and Wasserman, 1995) for the effect size under the alternative H1; see Appendix C for further details on the calculation of these Bayes factors. We see that in most cases there is no substantial evidence for either the absence or the presence of an effect, as with the equivalence tests. For instance, with a lenient Bayes factor threshold of 3, only 1 of the 15 replications is successful, in the sense of having BF01 > 3 in both the original and the replication study. The Bayes factors for the two previously discussed examples are consistent with our intuitions – in the Goetz et al. (2011) example there is indeed substantial evidence for the absence of an effect (BF01 = 5 in the original study and BF01 = 4.1 in the replication), while in the Dawson et al. (2011) example there is even anecdotal evidence for the presence of an effect, though the Bayes factors are very close to one due to the small sample sizes (BF01 = 1/1.1 in the original study and BF01 = 1/1.8 in the replication).
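To convey the flavor of this calculation, the following minimal R sketch (an illustration under the stated assumptions, not the authors' analysis code) computes BF01 for an effect estimate with standard error se under a normal prior N(0, s²) for the SMD under H1, with s = 2 corresponding to the unit-information prior; the numeric inputs are invented:

# BF01 for H0: SMD = 0 vs. H1: SMD ~ N(prior_mean, prior_sd^2), normal likelihood for the estimate
bf01 <- function(estimate, se, prior_mean = 0, prior_sd = 2) {
  dnorm(estimate, mean = 0, sd = se) /                                 # likelihood under H0
    dnorm(estimate, mean = prior_mean, sd = sqrt(se^2 + prior_sd^2))   # marginal likelihood under H1
}

bf01(estimate = 0.05, se = 0.15)   # BF01 well above 1: evidence for the null
bf01(estimate = 0.90, se = 0.60)   # BF01 close to 1: little evidence either way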

As with the equivalence margin, the choice of the prior distribution for the SMD under the alternative H 1 is debatable. The normal unit-information prior seems to be a reasonable default choice, as it implies that small to large effects are plausible under the alternative, but other normal priors with smaller/larger standard deviations could have been considered to make the test more sensitive to smaller/larger true effect sizes. The sensitivity analysis in Appendix A therefore also includes an analysis on the effect of varying prior standard deviations and the Bayes factor thresholds. However, again, to achieve replication success for a larger proportion of replications than the observed 1/15 = 7%, unreasonably large prior standard deviations have to be specified.

Of note, among the 15 RPCB null results, there are three interesting cases (the three effects from original paper 48 by Lin et al., 2012 and its replication by Lewis et al., 2018 ) where the Bayes factor is qualitatively different from the equivalence test, revealing a fundamental difference between the two approaches. The Bayes factor is concerned with testing whether the effect is exactly zero , whereas the equivalence test is concerned with whether the effect is within an interval around zero . Due to the very large sample size in the original study ( n = 514) and the replication ( n = 1′153), the data are incompatible with an exactly zero effect, but compatible with effects within the equivalence range. Apart from this example, however, both approaches lead to the same qualitative conclusion – most RPCB null results are highly ambiguous.

Appendix B also shows Bayes factor analyses for the four study pairs with original null results from the RPP and EPRP. In contrast to the RPCB results, most Bayes factors indicate non-anecdotal evidence for a null effect in cases where the non-significance criterion was met, possibly because of the larger sample sizes and smaller effects in these fields.

Conclusions

The concept of “replication success” is inherently multifaceted. Reducing it to a single criterion seems to be an oversimplification. Nevertheless, we believe that the “non-significance” criterion – declaring a replication as successful if both the original and the replication study produce nonsignificant results – is not fit for purpose. This criterion does not ensure that both studies provide evidence for the absence of an effect, it can be easily achieved for any outcome if the studies have sufficiently small sample sizes, and it does not control the relevant error rates. While it is important to replicate original studies with null results, we believe that they should be analyzed using more informative approaches. Box 1 summarizes our recommendations.

Box 1: Recommendations for analyzing replication studies of original null results

The effect size θ 0 represents the value of no effect, typically θ 0 = 0.

Equivalence test

Specify a margin Δ > 0 that defines an equivalence range [ θ 0 − Δ, θ 0 + Δ] in which effects are considered absent for practical purposes.

Compute the TOST p-values for the original (i = o) and replication (i = r) data,

\[
p_{\mathrm{TOST},\,i} = \max\left\{ \Phi\!\left(\frac{\hat{\theta}_i - \theta_0 - \Delta}{\sigma_i}\right),\; 1 - \Phi\!\left(\frac{\hat{\theta}_i - \theta_0 + \Delta}{\sigma_i}\right) \right\},
\]

with θ̂_i the effect estimate, σ_i its standard error, and Φ(⋅) the cumulative distribution function of the standard normal distribution.


Declare replication success at level α if pTOST,o ≤ α and pTOST,r ≤ α; conventionally α = 0.05.

Perform a sensitivity analysis with respect to the margin Δ. For example, visualize the TOST p -values for different margins to assess the robustness of the conclusions.

Bayes factor

Specify a prior distribution for the effect size θ that represents plausible values under the alternative hypothesis that there is an effect (H1: θ ≠ θ0). For example, specify the mean m and standard deviation s of a normal distribution θ | H1 ∼ N(m, s²).

Compute the Bayes factors contrasting H0: θ = θ0 to H1: θ ≠ θ0 for the original (i = o) and replication (i = r) data. Assuming a normal prior distribution θ | H1 ∼ N(m, s²) and an approximately normal likelihood for the effect estimate θ̂_i with standard error σ_i, the Bayes factor is

\[
\mathrm{BF}_{01,\,i} = \sqrt{1 + \frac{s^2}{\sigma_i^2}}\, \exp\!\left[ -\frac{1}{2}\left\{ \frac{(\hat{\theta}_i - \theta_0)^2}{\sigma_i^2} - \frac{(\hat{\theta}_i - m)^2}{\sigma_i^2 + s^2} \right\} \right].
\]

Declare replication success at level γ > 1 if BF01,o ≥ γ and BF01,r ≥ γ; conventionally γ = 3 (substantial evidence) or γ = 10 (strong evidence).

Perform a sensitivity analysis with respect to the prior distribution. For example, visualize the Bayes factors for different prior standard deviations to assess the robustness of the conclusions.

Our reanalysis of the RPCB studies with original null results showed that for most studies that meet the non-significance criterion, the conclusions are much more ambiguous – both with frequentist and Bayesian analyses. While the exact success rate depends on the equivalence margin and the prior distribution, our sensitivity analyses show that even with unrealistically liberal choices, the success rate remains below 40% which is substantially lower than the 73% success rate based on the non-significance criterion.

This is not unexpected, as a study typically requires larger sample sizes to detect the absence of an effect than to detect its presence ( Matthews, 2006 , Section 11.5.3). Of note, the RPCB sample sizes were chosen so that each replication had at least 80% power to detect the original effect estimate based on a standard superiority test. However, the design of replication studies should ideally align with the planned analysis ( Anderson and Kelley, 2022 ) so if the goal of the study is to find evidence for the absence of an effect, the replication sample size should be determined based on a test for equivalence, see Flight and Julious (2015) and Pawel et al. (2023) for frequentist and Bayesian approaches, respectively.

Our reanalysis of the RPP and EPRP studies with original null results showed that Bayes factors indeed indicate some evidence for no effect in cases where the non-significance criterion was satisfied, possibly due to the smaller effects and typically larger sample sizes in these fields compared to cancer biology. On the other hand, in most cases the precision of the effect estimates was still limited so that only one study pair achieved replication success with the equivalence testing approach. However, it is important to note that the conclusions from the RPP and EPRP analyses are merely anecdotal, as there were only four study pairs with original null results to analyze.

For both the equivalence test and the Bayes factor approach, it is critical that the equivalence margin and the prior distribution are specified independently of the data, ideally before the original and replication studies are conducted. Typically, however, the original studies were designed to find evidence for the presence of an effect, and the goal of replicating the “null result” was formulated only after failure to do so. It is therefore important that margins and prior distributions are motivated from historical data and/or field conventions ( Campbell and Gustafson, 2021 ), and that sensitivity analyses regarding their choice are reported.

In addition, when analyzing a single pair of original and replication studies, we recommend interpreting Bayes factors and TOST p -values as quantitative measures of evidence and discourage dichotomizing them into “success” or “failure”. For example, two TOST p -values p TOST = 0.049 and p TOST = 0.051 carry similar evidential weight regardless of one being slightly smaller and the other being slightly larger than 0.05. On the other hand, when more than one pair of original and replication studies are analyzed, dichotomization may be required for computing an overall success rate. In this case, the rate may be computed for different thresholds that correspond to qualitatively different levels of evidence (e.g., 1, 3, and 10 for Bayes factors, or 0.05, 0.01, and 0.005 for p -values).

Researchers may also ask whether the equivalence test or the Bayes factor is “better”. We believe that this is the wrong question to ask, because both methods address different questions and are better in different senses; the equivalence test is calibrated to have certain frequentist error rates, which the Bayes factor is not. The Bayes factor, on the other hand, seems to be a more natural measure of evidence as it treats the null and alternative hypotheses symmetrically and represents the factor by which rational agents should update their beliefs in light of the data. Replication success is ideally evaluated along multiple dimensions, as nicely exemplified by the RPCB, EPRP, and RPP. Replications that are successful on multiple criteria provide more convincing support for the original finding, while replications that are successful on fewer criteria require closer examination. Fortunately, the use of multiple methods is already standard practice in replication assessment, so our proposal to use both of them does not require a major paradigm shift.

While the equivalence test and the Bayes factor are two principled methods for analyzing original and replication studies with null results, they are not the only possible methods for doing so. A straightforward extension would be to first synthesize the original and replication effect estimates with a meta-analysis, and then apply the equivalence and Bayes factor tests to the meta-analytic estimate similar to the meta-analytic non-significance criterion used by the RPCB. This could potentially improve the power of the tests, but consideration must be given to the threshold used for the p -values/Bayes factors, as naive use of the same thresholds as in the standard approaches may make the tests too liberal ( Shun et al., 2005 ). Furthermore, there are various advanced methods for quantifying evidence for absent effects which could potentially improve on the more basic approaches considered here ( Lindley, 1998 ; Johnson and Rossell, 2010 ; Morey and Rouder, 2011 ; Kruschke, 2018 ; Stahel, 2021 ; Micheloud and Held, 2023 ; Izbicki et al., 2023 ).

Acknowledgements

We thank the RPCB, EPRP, and RPP contributors for their tremendous efforts and for making their data publicly available. We thank Maya Mathur for helpful advice on data preparation. We thank Benjamin Ineichen for helpful comments on drafts of the manuscript. We thank the three reviewers and the reviewing editor for useful comments that substantially improved the paper. Our acknowledgment of these individuals does not imply their endorsement of our work. We thank the Swiss National Science Foundation for financial support (grant #189295).

Software and data

The code and data to reproduce our analyses are openly available at https://gitlab.uzh.ch/samuel.pawel/rsAbsence. A snapshot of the repository at the time of writing is available at https://doi.org/10.5281/zenodo.7906792. We used the statistical programming language R version 4.3.2 (R Core Team, 2022) for analyses. The R packages ggplot2 (Wickham, 2016), dplyr (Wickham et al., 2022), knitr (Xie, 2022), and reporttools (Rufibach, 2009) were used for plotting, data preparation, dynamic reporting, and formatting, respectively. The data from the RPCB were obtained by downloading the files from https://github.com/mayamathur/rpcb (commit a1e0c63) and extracting the relevant variables as indicated in the R script preprocess-rpcb-data.R, which is available in our git repository. The RPP and EPRP data were obtained from the RProjects data set available in the R package ReplicationSuccess (Held, 2020); see the package documentation (https://CRAN.R-project.org/package=ReplicationSuccess) for details on data extraction.

Appendix A: Sensitivity analyses

The post-hoc specification of equivalence margins Δ and prior distribution for the SMD under the alternative H 1 is debatable. Commonly used margins in clinical research are much more stringent ( Lange and Freitag, 2005 ); for instance, in oncology, a margin of Δ = log(1.3) is commonly used for log odds/hazard ratios, whereas in bioequivalence studies a margin of Δ = log(1.25) is the convention ( Senn, 2021 , Chapter 22). These margins would translate into margins of Δ = 0.14 and Δ = 0.12 on the SMD scale, respectively, using the SMD = (√3/π)log OR conversion ( Cooper et al., 2019 , p. 233). Similarly, for the Bayes factor we specified a normal unit-information prior under the alternative while other normal priors with smaller/larger standard deviations could have been considered. Here, we therefore investigate the sensitivity of our conclusions with respect to these parameters.
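As a quick arithmetic check of these conversions (with log denoting the natural logarithm):

\[
\frac{\sqrt{3}}{\pi}\,\log(1.3) \approx 0.551 \times 0.262 \approx 0.14,
\qquad
\frac{\sqrt{3}}{\pi}\,\log(1.25) \approx 0.551 \times 0.223 \approx 0.12.
\]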

The top plot of Figure 4 shows the number of successful replications as a function of the margin Δ and for different TOST p -value thresholds. Such an “equivalence curve” approach was first proposed by Hauck and Anderson (1986) . We see that for realistic margins between 0 and 1, the proportion of replication successes remains below 50% for the conventional α = 0.05 level. To achieve a success rate of 11/15 = 73%, as was achieved with the non-significance criterion from the RPCB, unrealistic margins of Δ > 2 are required. Changing the success criterion to a more lenient level ( α = 0.1) or a more stringent level ( α = 0.01) hardly changes the conclusion.

Figure 4. Number of successful replications of original null results in the RPCB as a function of the margin Δ of the equivalence test (pTOST ≤ α in both studies for α = 0.1, 0.05, 0.01) or the standard deviation of the zero-mean normal prior distribution for the SMD effect size under the alternative H1 of the Bayes factor test (BF01 ≥ γ in both studies for γ = 3, 6, 10).

The bottom plot of Figure 4 shows a sensitivity analysis regarding the choice of the prior standard deviation and the Bayes factor threshold. It is uncommon to specify prior standard deviations larger than the unit-information standard deviation of 2, as this corresponds to the assumption of very large effect sizes under the alternatives. However, to achieve replication success for a larger proportion of replications than the observed 1/15 = 7%, unreasonably large prior standard deviations have to be specified. For instance, a standard deviation of roughly 5 is required to achieve replication success in 50% of the replications at a lenient Bayes factor threshold of γ = 3. The standard deviation needs to be almost 20 so that the same success rate 11/15 = 73% as with the nonsignificance criterion is achieved. The necessary standard deviations are even higher for stricter Bayes factor thresholds, such as γ = 6 or γ = 10.

Appendix B: Null results from the RPP and EPRP

Here we perform equivalence test and Bayes factor analyses for the three original null results from the Reproducibility Project: Psychology ( Eastwick and Finkel, 2008 ; Ranganath and Nosek, 2008 ; Reynolds and Besner, 2008 ) and the original null result from the Reproducibility Project: Experimental Philosophy ( McCann, 2005 ). To enable comparison of effect sizes across different studies, both the RPP and the EPRP provided effect estimates as Fisher z -transformed correlations which we use in the following.

Figure 5 shows effect estimates with 90% confidence intervals, two-sided p -values for the null hypothesis that the effect size is zero, TOST p -values for a margin of Δ = 0.2, and Bayes factors using a normal prior centered around zero with a standard deviation of 2. We see that all replications except the replication of Ranganath and Nosek (2008) would be considered successful with the non-significance criterion, as the original and replication p -values are greater than 0.05. In all three cases, the Bayes factors also indicate substantial (BF 01 > 3) to strong evidence (BF 01 > 10) for the null hypothesis of no effect. Compared to the Bayes factors in the RPCB, the evidence is stronger, possibly due to the mostly larger sample sizes in the RPP and EPRP.
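For a single study pair, the quantities shown in Figure 5 can be computed from the Fisher z estimate and its standard error 1/√(n − 3). A minimal R sketch with hypothetical numbers (the Bayes factor formula itself is spelled out in Appendix C):

z_est <- 0.10                 # assumed Fisher z-transformed correlation estimate
n     <- 120                  # assumed sample size
se    <- 1 / sqrt(n - 3)
## two-sided p-value for H0: z = 0
p_null <- 2 * pnorm(abs(z_est) / se, lower.tail = FALSE)
## TOST p-value for the equivalence margin Delta = 0.2
Delta  <- 0.2
p_tost <- max(pnorm((z_est - Delta) / se),                       # H0: z >= +Delta
              pnorm((z_est + Delta) / se, lower.tail = FALSE))   # H0: z <= -Delta
c(p_null = p_null, p_TOST = p_tost)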

Figure 5. Effect estimates on the Fisher z-transformed correlation scale with 90% confidence interval for the “null results” and their replication studies from the Reproducibility Project: Psychology (RPP, Open Science Collaboration, 2015) and the Experimental Philosophy Replicability Project (EPRP, Cova et al., 2018). The dashed gray line represents the value of no effect (z = 0), while the dotted red lines represent the equivalence range with a margin of Δ = 0.74. The p-value p_TOST is the maximum of the two one-sided p-values for the null hypotheses of the effect being greater/less than +Δ and −Δ, respectively. The Bayes factor BF01 quantifies the evidence for the null hypothesis H0: z = 0 against the alternative H1: z ≠ 0 with normal prior centered around zero and standard deviation of 2 assigned to the effect size under H1.

Interestingly, the opposite conclusion is reached when we analyze the data using an equivalence test with a margin of Δ = 0.2 (which may be considered liberal as it represents a small to medium effect based on the Cohen, 1992 convention). In this case, equivalence at the 5% level can only be established for the Ranganath and Nosek (2008) original study and its replication simultaneously, as the confidence intervals from the other studies are too wide to be included in the equivalence range. Furthermore, the Ranganath and Nosek (2008) replication also illustrates the conceptual difference between testing for an exactly zero effect versus testing for an effect within an interval around zero . That is, the Bayes factor indicates no evidence for a zero effect (because the estimate is clearly not zero), but the equivalence test indicates evidence for a negligible effect (because the estimate is clearly within the equivalence range).

As before, the particular choices of the equivalence margin Δ for the equivalence test and prior standard deviation of the Bayes factor are debatable. We therefore report sensitivity analyses in Figure 6 which show the TOST p -values and Bayes factors of original and replication studies for a range of margins and prior standard deviations, respectively. Apart from the Ranganath and Nosek (2008) study pair, all studies require large margins of about Δ = 0.4 to establish replication success at the 5% level (in the sense of original and replication TOST p -values being smaller than 0.05). On the other hand, in all but the Ranganath and Nosek (2008) replication, the data provide substantial evidence for a null effect (BF 01 > 3) for prior standard deviations of about one, while larger prior standard deviations of about three are required for the data to indicate strong evidence (BF 01 > 10) for a null effect, whereas the data from the Ranganath and Nosek (2008) replication provide very strong evidence against a null effect for all prior standard deviations considered.

Figure 6. Sensitivity analyses for the “null results” and their replication studies from the Reproducibility Project: Psychology (RPP, Open Science Collaboration, 2015) and the Experimental Philosophy Replicability Project (EPRP, Cova et al., 2018). The Bayes factor of the replication of Ranganath and Nosek (2008) decreases very quickly and is only shown for a limited range.

Appendix C: Technical details on Bayes factors

We assume that effect estimates are normally distributed around an unknown effect size θ with known variance equal to their squared standard error, i.e.,

θ̂_i | θ ~ N(θ, σ_i²), where σ_i denotes the standard error of θ̂_i,

for original ( i = o ) and replication ( i = r ). This framework is similar to meta-analysis and can be applied to many types of effect sizes and data ( Spiegelhalter et al., 2004 , Section 2.4). We want to quantify the evidence for the null hypothesis that the effect size is equal to a null effect ( H 0 : θ = θ 0 , typically θ 0 = 0) against the alternative hypothesis that the effect size is non-null ( H 1 : θ ≠ θ 0 ). This requires specification of a prior distribution for the effect size under the alternative, and we will assume a normal prior θ | H 1 ~ N( m, s 2 ) in the following. The Bayes factor based on an effect estimate is then given by the ratio of its likelihood under the null hypothesis to its marginal likelihood under the alternative hypothesis, i.e.,

BF01 = N(θ̂_i; θ0, σ_i²) / N(θ̂_i; m, s² + σ_i²), where N(x; μ, v) denotes the normal density with mean μ and variance v evaluated at x.
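In R, this formula amounts to a one-line function (a sketch; estimate and se denote the effect estimate and its standard error, null the null value θ0, and m and s the prior mean and standard deviation under H1):

bf01 <- function(estimate, se, null = 0, m = 0, s) {
  ## likelihood under H0 divided by the marginal likelihood under H1
  dnorm(estimate, mean = null, sd = se) /
    dnorm(estimate, mean = m, sd = sqrt(s^2 + se^2))
}
## e.g. a zero-centered normal prior with unit-information standard deviation 2
# bf01(estimate = 0.1, se = 0.3, s = 2)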


Article and author information

Samuel Pawel (for correspondence)

Version history

  • Preprint posted : May 8, 2023
  • Sent for peer review : September 8, 2023
  • Reviewed Preprint version 1 : November 22, 2023
  • Reviewed Preprint version 2 : February 16, 2024
  • Version of Record published : May 13, 2024

© 2023, Pawel et al.

This article is distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use and redistribution provided that the original author and source are credited.


25 February 2020

In praise of replication studies and null results


Timothy Ray Brown

Timothy Brown was the first person with HIV to be successfully treated with transplants of HIV-resistant stem cells. A case study published in Nature replicated this result in another patient, showing that Brown’s cure was not an anomaly. Credit: T.J. Kirkpatrick/Getty

The Berlin Institute of Health last year launched an initiative with the words, “Publish your NULL results — Fight the negative publication bias! Publish your Replication Study — Fight the replication crises!”

The institute is offering its researchers €1,000 (US$1,085) for publishing either the results of replication studies — which repeat an experiment — or a null result, in which the outcome is different from that expected. But Twitter, it seems, took more notice than the thousands of eligible scientists at the translational-research institute. The offer to pay for such studies has so far attracted only 22 applicants — all of whom received the award.

Replication studies are important. How else could researchers be sure that some experimental drugs fail to relieve disease symptoms in mice? Or that current experimental tests of dark-matter theories are drawing a blank 1 ? But publishing this work is not always a priority for researchers, funders or editors — something that must change.


Two strategies can be used to encourage the publication of such work. First, institutions should actively encourage their researchers, through both words and actions. Aside from offering cash upfront, the Berlin Institute of Health has an app and advisers to help researchers to work out which journals, preprint servers and other outlets they should be contacting to publish replication studies and data. The app includes information on estimated publication fees and timelines, as well as formatting and peer-review requirements.

Second, more journals need to emphasize to the research community the benefits of publishing replications and null results. Researchers who report null results can help to steer grant money towards more fruitful studies. Replications are also invaluable for establishing what is necessary for reliable measurements, such as fundamental constants. And wider airing of null results will, eventually, prompt communities to revise their theories to better accommodate reality.

At Nature , replication studies are held to the same high standards as all our published papers. We welcome the submission of studies that provide insights into previously published results; those that can move a field forwards and those that might provide evidence of a transformative advance. We also encourage the research community as a whole to view the publication of such work as having real value.

For example, it had been thought that the presence of bacteria in the human placenta caused complications in pregnancy such as pre-eclampsia. However, a paper published last year found no evidence for the presence of a microbiome in the placenta 2 , suggesting that researchers might need to look elsewhere to understand such conditions.


Null results are also an essential counterweight to the scientific focus on positive findings, which can cause conclusions to be accepted too readily. This is what happened when repeated studies suggested that a variation called 5-HTTLPR, which affects the expression of the serotonin transporter gene, was a major contributor to depression. For many, it took a comprehensive null study to shake these entrenched assumptions and show that this is not the case 3 .

By contrast, confirmatory replications give researchers confidence that work is reliable and worth building on. This was the case when Nature published a case study describing someone with HIV who went into remission after receiving transplants of HIV-resistant stem cells. That person — who did not wish to be identified and has come to be known as the ‘London patient’ — was not the first to be freed of the virus in this way. That was Timothy Brown, who was treated in Berlin, but the work with the London patient showed that Brown’s cure was not an anomaly 4 .

Not all null results and replications are equally important or informative, but, as a whole, they are undervalued. If researchers assume that replications or null results will be dismissed, then it is our role as journals to show that this is not the case. At the same time, more institutions and funders must step up and support replications — for example, by explicitly making them part of evaluation criteria. We can all do more. Change cannot come soon enough.

Nature 578 , 489-490 (2020)

doi: https://doi.org/10.1038/d41586-020-00530-6

1. Smorra, C. et al. Nature 575, 310–314 (2019).

2. de Goffau, M. C. et al. Nature 572, 329–334 (2019).

3. Border, R. et al. Am. J. Psychiatry 176, 376–387 (2019).

4. Gupta, R. K. et al. Nature 568, 244–248 (2019).


How to Write About Negative (Or Null) Results in Academic Research

ScienceEditor

Researchers are often disappointed when their work yields "negative" results, meaning that the null hypothesis cannot be rejected. However, negative results are essential for research to progress. Negative results tell researchers that they are on the wrong path, or that their current techniques are ineffective. This is a natural and necessary part of discovering something that was previously unknown. Solving problems that lead to negative results is an integral part of being an effective researcher. Publishing negative results that are the result of rigorous research contributes to scientific progress.

There are three main reasons for negative results:

  • The original hypothesis was incorrect
  • The findings of a published report cannot be replicated
  • Technical problems

Here, we will discuss how to write about negative results, first focusing on the most common reason: technical problems.

Writing about technical problems

Technical problems might include faulty reagents, inappropriate study design, and insufficient statistical power. Most researchers would prefer to resolve technical problems before presenting their work, and focus instead on their convincing results. In reality, researchers often need to present their work at a conference or to a thesis committee before some problems can be resolved.

When presenting at a conference, the objective should be to clearly describe your overall research goal and why it is important, your preliminary results, the current problem, and how previously published work is informing the steps you are taking to resolve the problem. Here, you want to take advantage of the collective expertise at the conference. By being straightforward about your difficulties, you increase the chance that someone can help you find a solution.

When presenting to a thesis committee, much of what you discuss will be the same (overall research goal and why it is important, results, problem(s) and possible solutions). Your primary goal is to show that you are well prepared to move forward in your research career, despite the recent difficulties. The thesis defense is a defined stopping point, so most thesis students should write about solutions they would pursue if they were to continue the work. For example, "To resolve this problem, it would be advisable to increase the survey area by a factor of 4, and then…" In contrast, researchers who will be continuing their work should write about possible solutions using present and future tense. For example, "To resolve this problem, we are currently testing a wider variety of standards, and will then conduct preliminary experiments to determine…"

Putting the "re" in "research"

Whether you are presenting at a conference, defending a thesis, applying for funding, or simply trying to make progress in your research, you will often need to search through the academic literature to determine the best path forward. This is especially true when you get unexpected results—either positive or negative. When trying to resolve a technical problem, you should often find yourself carefully reading the materials and methods sections of papers that address similar research questions, or that used similar techniques to explore very different problems. For example, a single computer algorithm might be adapted to address research questions in many different fields.

In searching through published papers and less formal methods of communication—such as conference abstracts—you may come to appreciate the important details that good researchers will include when discussing technical problems or other negative results. For example, "We found that participants were more likely to complete the process when light refreshments were provided between the two sessions." By including this information, the authors may help other researchers save time and resources.

Thus, you are advised to be as thorough as possible in reviewing the relevant literature, to find the most promising solutions for technical problems. When presenting your work, show that you have carefully considered the possibilities, and have developed a realistic plan for moving forward. This will help a thesis committee view your efforts favorably, and can also convince possible collaborators or advisors to invest time in helping you.

Publishing negative results

Negative results due to technical problems may be acceptable for a conference presentation or a thesis at the undergraduate or master's degree level. Negative results due to technical problems are not sufficient for publication, a Ph.D. dissertation, or tenure. In those situations, you will need to resolve the technical problem and generate high quality results (either positive or negative) that stand up to rigorous analysis. Depending on the research field, high quality negative results might include multiple readouts and narrow confidence intervals.

Researchers are often reluctant to publish negative results, especially if their data don't support an interesting alternative hypothesis. Traditionally, journals have been reluctant to publish negative results that are not paired with positive results, even if the study is well designed and the results have sufficient statistical power. This is starting to change, especially for medical research, but publishing negative results can still be an uphill battle.

Not publishing high quality negative results is a disservice to the scientific community and the people who support it (including taxpayers), since other scientists may need to repeat the work. For studies involving animal research or human tissue samples, not publishing would squander significant sacrifices. For research involving medical treatments—especially studies that contradict a published report—not publishing negative results leads to an inaccurate understanding of treatment efficacy.

So how can researchers write about negative results in a way that reflects their importance? Let's consider a common reason for negative results: the original hypothesis was incorrect.

Writing about negative results when the original hypothesis was incorrect

Researchers should be comfortable with being wrong some of the time, such as when results don't support an initial hypothesis. After all, research wouldn't be necessary if we already knew the answer to every possible question. The next step is usually to revise the hypothesis after reconsidering the available data, reading through the relevant literature, and consulting with colleagues.

Ideally, a revised hypothesis will lead to results that allow you to reject a (revised) null hypothesis. The negative results can then be reported alongside the positive results, possibly bolstering the significance of both. For example, "The DNA mutations in region A had a significant effect on gene expression, while the mutations outside of region A had no effect." Don't forget to include important details about how you overcame technical problems, so that other researchers don't need to reinvent the wheel.

Unfortunately, it isn't always possible to pair negative results with related positive results. For example, imagine a year-long study on the effect of COVID-19 shelter-in-place orders on the mental health of avid video game players compared to people who don't play video games. Despite using well-established tools for measuring mental health, having a large sample size, and comparing multiple subpopulations (e.g. gamers who live alone vs. gamers who live with others), no significant differences were identified. There is no way to modify and repeat this study because the same shelter-in-place conditions no longer exist. So how can this research be presented effectively?

Writing when you only have negative results

When you write a scientific paper to report negative results, the sections will be the same as for any other paper: Introduction, Materials and Methods, Results and Discussion. In the introduction, you should prepare your reader for the possibility of negative results. You can highlight gaps or inconsistencies in past research, and point to data that could indicate an incomplete understanding of the situation.

In the example about video game players, you might highlight data showing that gamers are statistically very similar to large chunks of the population in terms of age, education, marital status, etc. You might discuss how the stigma associated with playing video games might be unfair and harmful to people in certain situations. You could discuss research showing the benefits of playing video games, and contrast gaming with engaging in social media, which is another modern hobby. Putting a positive spin on negative results can make the difference between a published manuscript and rejection.

In a paper that focuses on negative results—especially one that contradicts published findings—the research design and data analysis must be impeccable. You may need to collaborate with other researchers to ensure that your methods are sound, and apply multiple methods of data analysis.

As long as the research is rigorous, negative results should be used to inform and guide future experiments. This is how science improves our understanding of the world.

Why Scientists Should Celebrate Failed Experiments

No losers here: all data is good data

Reporters hate facts that are too good to check—as the phrase in the industry goes. The too-good-to-check fact is the funny or ironic or otherwise delicious detail that just ignites a story and that, if it turns out not to be true, would leave the whole narrative poorer for its absence. It must be checked anyway, of course, and if it doesn’t hold up it has to be cut—with regrets maybe, but cut all the same.

Social scientists face something even more challenging. They develop an intriguing hypothesis, devise a study to test it, assemble a sample group, then run the experiment. If the theory is proven, off goes your paper to the most prestigious journals you can think of. But what if it isn’t proven? Suppose the answer to a thought-provoking question like, “Do toddlers whose parents watch football or other violent sports become more physically aggressive?” turns out to be simply, “Nope.”

Do you still try to publish these so-called null results? Do you even go to the bother of writing them up—an exceedingly slow and painstaking process regardless of what the findings are? Or do you just go on to something else, assuming that no one’s going to be interested in a cool idea that turns out not to be true?

That’s a question that plagues whole fields of science, raising the specter of what’s known as publishing bias—scientists self-censoring so that they effectively pick and choose what sees print and what doesn’t. There’s nothing fraudulent or unethical about dropping an experiment that doesn’t work out as you thought it would, but it does come at a cost. Null results, after all, are still results, and once they’re in the literature, they help other researchers avoid experimental avenues that have already proven to be dead ends. Now a new paper in the journal Science , conducted by a team of researchers at Stanford University, shows that publication bias in the social sciences may be more widespread than anyone knew.

The investigators looked at 221 studies conducted from 2002 to 2012 and made available to them by a research collective known as TESS (Time-Sharing Experiments in the Social Sciences), a National Science Foundation program that makes it easier for researchers to assemble a nationally representative sample group. The best thing about TESS—at least for studies of publication bias—is that the complete history of every experiment is available and searchable, whether it was ever published or not.

When the Stanford investigators reviewed the papers, they found just what they suspected—and feared. Roughly 50% of the 221 studies wound up seeing publication, but that total included only 20% of the ones with null results. That compared unfavorably to the 60% of those studies with strong positive results that were published, and the 50% with mixed results. Worse, one of the reasons so few null results ever saw print is that a significant majority of them, 65%, were never even written up in the first place.

The Stanford investigators went one more—very illuminating—step and contacted as many of the researchers of the null studies as they could via e-mail, asking them why they had not proceeded with the studies. Among the answers: “The unfortunate reality of the publishing world [is] that null effects do not tell a clear story.” There was also: “We determined that there was nothing there that we could publish in a professional journal” and “[the study] was mostly a disappointing wash.” Added one especially baleful scientist: “[The] data were buried in the graveyard of statistical findings.” Among all of the explanations, however, the most telling—if least colorful—was this: “The hypotheses of the study were not confirmed.”

That, all by itself, lays bare the misguided thinking behind publication bias. No less a researcher than Jonas Salk once argued to his lab staff that there is no such thing as a failed experiment, because learning what doesn’t work is a necessary step to learning what does. Salk, history showed, did pretty well for himself. Social scientists—disappointed though they may sometimes be—might want to follow his lead.


Detecting Change Points of Covariance Matrices in High Dimensions

Testing for change points in sequences of high-dimensional covariance matrices is an important and equally challenging problem in statistical methodology with applications in various fields. Motivated by the observation that even in cases where the ratio between dimension and sample size is as small as 0.05, tests based on fixed-dimension asymptotics do not keep their preassigned level, we propose to derive critical values of test statistics using an asymptotic regime where the dimension diverges at the same rate as the sample size.

This paper introduces a novel and well-founded statistical methodology for detecting change points in a sequence of high-dimensional covariance matrices. Our approach utilizes a min-type statistic based on a sequential process of likelihood ratio statistics. This is used to construct a test for the hypothesis of the existence of a change point with a corresponding estimator for its location. We provide theoretical guarantees for these inference tools by thoroughly analyzing the asymptotic properties of the sequential process of likelihood ratio statistics in the case where the dimension and sample size converge with the same rate to infinity. In particular, we prove weak convergence towards a Gaussian process under the null hypothesis of no change. To identify the challenging dependency structure between consecutive test statistics, we employ tools from random matrix theory and stochastic processes. Moreover, we show that the new test attains power under a class of alternatives reflecting changes in the bulk of the spectrum, and we prove consistency of the estimator for the change-point location.

1 Introduction

Having its origins in quality control (see Wald, 1945 ; Page, 1954 , for two early references) , change point detection has been an extremely active field of research until today with numerous applications in finance, genetics, seismology or sports to name just a few. In the last decade, a large part of the literature on change point detection considers the problem of detecting a change point in a high-dimensional sequence of means (see Jirak, 2015 ; Cho and Fryzlewicz, 2015 ; Dette and Gösmann, 2020 ; Enikeeva and Harchaoui, 2019 ; Liu et al., 2020 ; Liu, Gao and Samworth, 2021 ; Chen, Wang and Wu, 2022 ; Wang et al., 2022 ; Zhang, Wang and Shao, 2022 , among many others) .

Compared to the huge amount of work on the change-point problem for a sequence of high-dimensional means, the literature on the problem of detecting structural breaks in the corresponding covariance matrices is more scarce. For the low dimensional setting we refer to Chen and Gupta ( 2004 ), Lavielle and Teyssiere ( 2006 ), Galeano and Peña ( 2007 ), Aue et al. ( 2009 ) and Dette and Wied ( 2016 ), among others, who study different methods and aspects of the change point problem under the assumption that the sample size converges to infinity while the dimension is fixed. We also refer to Theorem 1.1.2 in Csörgő and Horváth ( 1997 ), who provide a test statistic and its asymptotic distribution under the null hypothesis for normally distributed data. However, even in cases where the ratio between dimension and sample size is rather small, statistical guarantees derived from fixed-dimension asymptotics can be misleading. As an example, we display in Table 1 the simulated type I error of two commonly used tests for a change point in a sequence of covariance matrices. The first method (CH) is based on sequential likelihood ratio statistics, where the critical values have been determined by classical asymptotic arguments assuming that the dimension is fixed (see Theorem 1.1.2 in Csörgő and Horváth, 1997 ). The second approach (AHHR) is a test proposed by Aue et al. ( 2009 ), which is based on a quadratic form of the vectorized CUSUM statistic of the empirical covariance matrix. Again, the determination of critical values relies on fixed-dimensional asymptotics. We observe that even in the case where the ratio between the dimension and sample size is as small as 0.05, the nominal level α = 0.05 of the CH test is exceeded by more than a factor of three. On the other hand, the AHHR test provides a reasonable approximation of the nominal level only if the ratio between dimension and sample size is 0.025. Note that this test requires the inversion of an estimate of a high-dimensional covariance matrix and is only applicable if the sample size is larger than the squared dimension.

Dimension                5      10     15     20     25
Empirical level (CH)     0.05   0.16   0.39   0.82   1.00
Empirical level (AHHR)   0.03   0.01   0.00   -      -

Meanwhile, several authors have also discussed the problem of estimating a change point in a sequence of covariance matrices in the high-dimensional regime. For example, Avanesov and Buzun ( 2018 ) propose a multiscale approach to estimate multiple change points, while Wang, Yu and Rinaldo ( 2021 ) investigate the optimality of binary and wild binary segmentation for multiple change point detection. We also mention the work of Dette, Pan and Yang ( 2022 ) , who propose a two-stage approach to detect the location of a change point in a sequence of very high-dimensional covariance matrices. Li and Gao ( 2024 ) pursue a similar approach to develop a change-point test for high-dimensional correlation matrices.

With respect to testing, the literature on change point detection is rather scarce. In principle, one can develop change point analysis based on a vectorization of the covariance matrices using inference tools for a sequence of means. This approach essentially boils down to comparing the matrices before and after the change point with respect to a vector norm. However, in general, this approach does not yield an asymptotically distribution free test statistic. Moreover, as pointed out by Ryan and Killick ( 2023 ), such distances do not reflect the geometry induced on the space of positive definite matrices. These authors propose a change-point test based on an alternative distance defined on the space of positive definite matrices, which sequentially compares the multivariate ratio Σ₁⁻¹Σ₂ of the two covariance matrices Σ₁ and Σ₂ before and after a potential change point with the identity matrix. As a consequence, under the null hypothesis of no change point, their test statistic is independent of the underlying covariance structure, which makes it possible to derive quantiles for statistical testing in the regime where the dimension diverges at the same rate as the sample size. However, the approach of these authors is based on a combination of a point-wise limit theorem from random matrix theory with a Bonferroni correction. Therefore, as pointed out in Section 4 of Ryan and Killick ( 2023 ), the resulting test may be conservative in applications. Moreover, this methodology is tailored to centered data, and it is demonstrated in Zheng, Bai and Yao ( 2015 ) that an empirical centering introduces a non-negligible bias in the central limit theorem for the corresponding linear spectral statistic.

In this paper, we propose an alternative test for detecting a change point in a sequence of high-dimensional covariance matrices, which takes the strong dependence between consecutive test statistics into account and thereby avoids the drawbacks of previous works. Our approach is based on a sequential process of likelihood ratio test (LRT) statistics, where the dimension of the data grows at the same rate as the sample size. We combine tools from random matrix theory and stochastic processes to develop and analyze statistical methodology for change point analysis in the covariance structure of high-dimensional data. Random matrix theory is a common tool to investigate asymptotic properties of LRTs in high-dimensional scenarios for classical testing problems. An early reference in this direction is Jiang and Yang ( 2013 ), who establish central limit theorems for several classical LRT statistics under the null hypotheses. Since this seminal work, numerous researchers have investigated related problems (see Jiang and Qi, 2015 ; Dette and Dörnemann, 2020 ; Bao et al., 2022 ; Dörnemann, 2023 ; Heiny and Parolya, 2023 ; Parolya, Heiny and Kurowicka, 2024 , among others) . None of these papers considers sequential LRT statistics to develop change point analysis. Moreover, these references do not contain any consistency results. In general, the body of literature concerning LRTs under alternative hypotheses in the high-dimensional regime remains sparse; for some exceptions, see Chen and Jiang ( 2018 ), Bodnar, Dette and Parolya ( 2019 ) and Bai and Zhang ( 2023 ). Having this line of literature in mind, we can summarize the main contributions of this paper.

Enhancing the power of the min-type statistic. We propose a novel methodology to test for a change point in a sequence of high-dimensional covariance matrices based on a minimum of sequential LRT statistics. Under the null hypothesis, this statistic admits a simple limiting distribution in the regime where the dimension diverges proportionally to the sample size. Unlike most other approaches, the new test does not require the estimation of the population covariance matrix. This result facilitates the introduction of a simple asymptotic testing procedure with favorable finite-sample properties. Most notably, our approach takes the strong dependency structure between consecutive test statistics into account, whose analysis has been recognized as a challenging problem in the literature (see Ryan and Killick, 2023 ) and which has not been addressed in previous works.

Power analysis and change-point estimation. We also study asymptotic properties of the sequential process under alternative hypotheses to identify conditions which guarantee the consistency of the test. The techniques developed in this paper should also prove valuable for studying consistency in other problems.

Besides developing a testing methodology with desirable properties, we also introduce a new estimator of the change-point location based on the LRT principle. Our comprehensive analysis of the process of LRT statistics enables us to provide theoretical guarantees under which the estimator admits consistency.

Mathematical challenges. Investigating sequential statistics introduces new challenges in random matrix theory compared to consideration of the common (non-sequential) LRT, namely (i) the convergence of the finite-dimensional distributions and (ii) the asymptotic tightness of the sequential log-LRT statistics. Indeed, the weak convergence result implied by (i) and (ii) is a novel, technically challenging contribution, given that sequential LRT statistics have not been studied in such a functional and high-dimensional framework before.

  • Regarding (i), for fixed t₁, t₂ ∈ [0, 1] the corresponding LRT statistics are highly correlated, and a nuanced analysis is required in order to determine their covariance. Regarding (ii), we show asymptotic equicontinuity of the sequential log-LRT statistics by deriving uniform inequalities for the moments of the corresponding increments.

The remaining part of this work is structured as follows. In Section 2 , we present the new method to detect and estimate a change-point in a high-dimensional covariance structure, and provide the main theoretical guarantees. In numerical experiments given in Section 3 , we compare the finite-sample size properties of our test as well as the change-point estimator to other approaches. The proofs of our theoretical results are deferred to Section 4 and the supplementary material.

2 Change point analysis by a sequential LRT process

(2.1)
(2.2)

where the location t⋆ ∈ (t₀, 1 − t₀) of the change point is unknown and t₀ > 0 is a positive constant. We define

(2.3)
(2.4)

as the sample covariance matrix calculated from the full sample and consider the statistic

(2.5)

of the unobserved random variable x₁₁, which can be represented by formula (9.8.6) in Bai and Silverstein ( 2010 ) in the form

(2.6)

For its estimation, we therefore introduce the quantities

and define the estimator

(2.7)
(2.8)

respectively. With these quantities we consider the standardized LRT and define the min-type statistic

Under the alternative H₁, we will show (under appropriate assumptions) that

(2.9)
(2.10)

By the log-concavity of the mapping A ↦ log|A| defined on the set of positive definite matrices, we have Eₙ ≤ 0 with equality if and only if Σ₁ = Σₙ (which means that there is no change point). Therefore, we propose to reject the null hypothesis ( 2.1 ) in favor of ( 2.2 ) for small values of M_n^cen. To find corresponding quantiles, we first determine the asymptotic distribution of the test statistic M_n^cen under the null hypothesis in the high-dimensional regime, where the dimension diverges at the same rate as the sample size. For this purpose, we make the following assumptions.
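To illustrate the general construction (this is only a sketch of the unstandardized ingredient, not the centered and standardized statistic analyzed in the paper), the classical Gaussian log-likelihood-ratio statistic for a covariance change at the candidate split point k = ⌊nt⌋ can be evaluated along a grid of split points as follows; the R function below assumes zero-mean data and requires k > p and n − k > p so that all sample covariance matrices are invertible:

seq_loglrt <- function(X, t0 = 0.25) {
  ## X: n x p data matrix, assumed to have mean zero
  n <- nrow(X); p <- ncol(X)
  logdet <- function(A) as.numeric(determinant(A, logarithm = TRUE)$modulus)
  ks <- seq(floor(n * t0), floor(n * (1 - t0)))   # candidate split points
  vapply(ks, function(k) {
    S1 <- crossprod(X[1:k, , drop = FALSE]) / k              # covariance before the split
    S2 <- crossprod(X[(k + 1):n, , drop = FALSE]) / (n - k)  # covariance after the split
    S  <- crossprod(X) / n                                   # covariance of the full sample
    ## -2 log likelihood ratio for "no change" vs "change at k" (up to constants);
    ## larger values point towards a change at k
    n * logdet(S) - k * logdet(S1) - (n - k) * logdet(S2)
  }, numeric(1))
}

In the spirit of ( 2.16 ), a change-point location could then be read off from the split point at which this sequential statistic is most extreme, although this is again only an illustration of the idea rather than the estimator defined in the paper.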

(A1) y_n = p/n → y ∈ (0, 1) as n → ∞ such that y < t₀ ∧ (1 − t₀) for some t₀ ∈ (0, 1).

(A2) 𝔼|x₁₁|^(4+δ) < ∞ for some δ > 0.

(A3) We have, uniformly with respect to n ∈ ℕ,

Theorem 1 .

If Assumptions (A1), (A2) with δ > 4, and (A3) are satisfied, then we have under H₀

(2.11)

where (Z(t))_{t ∈ [t₀, 1−t₀]} denotes a centered Gaussian process with covariance kernel

(2.12)

for t₀ ≤ t₁ ≤ t₂ ≤ 1 − t₀.

A proof of this result can be found in Section 4.1 . Note that the limiting distribution in ( 2.11 ) contains no nuisance parameters. Consequently, if q_α denotes the α-quantile of the limit distribution, the decision rule which rejects the null hypothesis in ( 2.1 ) whenever

(2.13)

holds defines a test with asymptotic level α.

The matrices Σ₁ and Σₙ in ( 2.2 ) are simultaneously diagonalizable, that is, there exists an orthogonal matrix Q ∈ ℝ^{p×p} such that

(2.14)

where λ₁(Σ₁) ≥ … ≥ λ_p(Σ₁) and λ₁(Σₙ) ≥ … ≥ λ_p(Σₙ) denote the ordered eigenvalues of the matrices Σ₁ and Σₙ, respectively. Suppose that these eigenvalues satisfy (uniformly with respect to n ∈ ℕ)

(2.15)

where t⋆ denotes the change point in ( 2.2 ) and Eₙ(t⋆) is defined in ( 2.10 ).

Our next result shows that under this additional assumption, the test defined in ( 2.13 ) is consistent. A proof of this result can be found in Section A.1 .

Theorem 2 .

Suppose that Assumptions (A1) and (A2) with δ > 0 are satisfied. Then, we have under Assumption (A4)

The theoretical guarantees for the test ( 2.13 ) are summarized in the following corollary, which is an immediate consequence of Theorems 1 and 2 .

Corollary 1 .

Assume that Assumptions (A1) and (A2) are satisfied.

(Level control) Suppose that Assumption (A3) holds and that the parameter δ in Assumption (A2) satisfies δ > 4. Then, we have under H₀

(Consistency) Suppose that Assumption (A4) holds. Then, we have

If H₀ is rejected by the test ( 2.13 ), it is natural to ask for the location of the change point. For this purpose, we propose the following estimator.

(2.16)

To ensure the convergence of this change-point estimator, we need the following assumption on the empirical spectral distributions of the matrices Σ₁ and Σₙ under the alternative H₁.

The measure

converges weakly to some measure G.

Theorem 3 .

Suppose that Assumptions (A1) and (A2) with δ > 0 are satisfied and that Assumption (A4) and the assumption on the spectral distributions above hold. Then, we have under the alternative

The proof can be found in Section A.2 .

(2) The condition ( 2.15 ) ensures that the sequential likelihood ratio test statistic is consistent under the alternative. In fact, by ( 2.9 ) (which holds under Assumption (A4) ), the term Eₙ(t⋆) determines the power of the test ( 2.13 ). This condition is akin to conditions (1.7) and (1.11) in Chen and Jiang ( 2018 ), ensuring the consistency of likelihood ratio tests for other testing problems.

3 Finite-sample properties

3.1 Data-scientific aspects

For testing, we recommend choosing t₀ > (p/n + 0.05) ∨ 0.2. If the user is primarily interested in estimating the change point location, they may select t₀ closer to the critical threshold p/n ∨ (1 − p/n).

Non-simultaneously diagonalizable matrices Σ₁ and Σₙ: As explained in Remark 1 , the technical assumption ( 2.14 ) allows us to derive theoretical guarantees of the proposed methodology under a structural break in the covariance structure. In our numerical study, we found that change point detection and estimation are robust even under a larger class of alternatives. For instance, in the right panels of Figure 1 below, we report results for non-normally distributed data and randomly generated covariance matrices under the alternative with prescribed eigenvalues, see model ( 3.2 ).

3.2 Numerical experiments for change-point detection

In the following, we provide numerical results on the performance of the new test ( 2.13 ) in comparison to the test proposed by Ryan and Killick ( 2023 ). All reported results are based on 500 simulation runs, and the nominal level is α = 0.05. The change-point location is chosen as t⋆ = 0.5.

We first consider independent standard normally distributed entries (x₁₁ ~ 𝒩(0, 1)) in the matrix X and

(3.1)

Next, we investigate the robustness of the new method with respect to the assumption of simultaneously diagonalizable covariance matrices before and after the change point. For this purpose, we consider the case where the matrices Σ₁ and Σₙ before and after the change point are randomly generated with a prescribed spectrum, that is

(3.2)
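A random covariance matrix with a prescribed spectrum can be generated by conjugating a diagonal matrix of eigenvalues with a random orthogonal matrix. The following R sketch uses placeholder spectra and sizes (assumptions for illustration only, not the paper's models (3.1) and (3.2)) and generates a sample with a covariance change at t⋆ = 0.5:

rand_cov <- function(lambda) {
  ## covariance matrix with prescribed eigenvalues lambda and random eigenvectors
  ## (orthogonal matrix obtained from the QR decomposition of a Gaussian matrix)
  p <- length(lambda)
  Q <- qr.Q(qr(matrix(rnorm(p * p), p, p)))
  Q %*% diag(lambda) %*% t(Q)
}
n <- 500; p <- 50; delta <- 1.5
Sigma1 <- rand_cov(rep(1, p))                            # placeholder spectrum before the change
Sigman <- rand_cov(c(rep(delta, p / 2), rep(1, p / 2)))  # placeholder spectrum after the change
k <- n / 2                                               # change at t* = 0.5
X <- rbind(matrix(rnorm(k * p), k) %*% chol(Sigma1),
           matrix(rnorm((n - k) * p), n - k) %*% chol(Sigman))

The sequential statistic sketched in Section 2 could then be evaluated on the (centered) rows of X.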

Figure 1

3.3 Numerical experiments for the change-point estimation

In this section, we compare the new change point estimator τ̂⋆ in ( 2.16 ) with the estimators proposed by Aue et al. ( 2009 ) (AHHR) and Ryan and Killick ( 2023 ) (RK). All results are again based on 500 simulation runs.

Note that a sample size larger than (p + 1)p/2 + 1 = 1276 would be required to calculate the AHHR estimator (some results for the AHHR estimator can be found in Table 4 ). We observe from the upper part of Table 2 that the new estimator ( 2.16 ) outperforms the RK estimator in all three cases under consideration. The smaller mean squared error of the new estimator ( 2.16 ) is caused by both a smaller bias and variance. In particular, the RK estimator admits a significant bias for moderately strong signals δ ≈ 1.5.

In Table 3 , we display the results of the two estimators for model ( 3.2 ) with uniformly distributed data. The results are similar to those presented in Table 2 for model ( 3.1 ). It is noteworthy that the new estimator performs well, even though the technical assumption (A4) is not satisfied. Again, our method outperforms the alternative RK approach in terms of smaller mean squared error.

The AHHR estimator requires a sample size larger than p(p + 1)/2 + 1, and for this reason, we consider the cases (n, p) = (200, 10) (top) and (n, p) = (200, 15) (bottom). As the dimension is relatively small compared to the sample size, we choose t₀ = 0.1. We observe that, even in such cases, the estimator AHHR admits a significant bias resulting in a larger MSE compared to the other two methods. Interestingly, the bias of RK increases as the signal strength δ increases from moderate to large values. In contrast, the new estimator τ̂⋆ has decreasing bias and standard deviation as δ increases. Moreover, the new estimator always outperforms RK, indicated by a smaller mean squared error, and AHHR in the case (n, p) = (200, 15). For (n, p) = (200, 10), we observe that the mean squared error of AHHR is smaller for weak signal strength δ. However, even for large δ, this method admits a significant bias and is therefore outperformed by our method.

Table 2: Simulated mean, standard deviation (sd) and mean squared error (MSE) of the new estimator $\hat{\tau}^{\star}$ (new) and the RK estimator for model (3.1); the columns correspond to the signal strength δ, and the three blocks correspond to the three cases under consideration (top to bottom).

δ           1.1    1.2    1.3    1.4    1.5    1.6    1.7    1.8    1.9    2
new   mean  0.499  0.499  0.506  0.493  0.496  0.494  0.497  0.496  0.499  0.499
      sd    0.123  0.112  0.091  0.066  0.054  0.034  0.025  0.018  0.010  0.008
      MSE   0.015  0.012  0.008  0.004  0.003  0.001  0.001  0.000  0.000  0.000
RK    mean  0.529  0.497  0.425  0.422  0.436  0.462  0.473  0.479  0.487  0.491
      sd    0.203  0.198  0.157  0.105  0.073  0.048  0.042  0.033  0.023  0.016
      MSE   0.042  0.039  0.030  0.017  0.009  0.004  0.002  0.001  0.001  0.000
new   mean  0.507  0.499  0.497  0.496  0.498  0.496  0.497  0.498  0.497  0.498
      sd    0.129  0.111  0.097  0.073  0.055  0.039  0.024  0.016  0.011  0.008
      MSE   0.017  0.012  0.009  0.005  0.003  0.002  0.001  0.000  0.000  0.000
RK    mean  0.515  0.501  0.420  0.395  0.413  0.431  0.450  0.460  0.468  0.474
      sd    0.211  0.210  0.161  0.096  0.074  0.058  0.044  0.040  0.033  0.026
      MSE   0.045  0.044  0.032  0.020  0.013  0.008  0.004  0.003  0.002  0.001
new   mean  0.495  0.495  0.494  0.494  0.496  0.498  0.499  0.499  0.499  0.500
      sd    0.124  0.106  0.082  0.055  0.038  0.023  0.013  0.010  0.006  0.005
      MSE   0.015  0.011  0.007  0.003  0.001  0.001  0.000  0.000  0.000  0.000
RK    mean  0.551  0.486  0.414  0.404  0.429  0.451  0.464  0.471  0.476  0.483
      sd    0.204  0.199  0.136  0.077  0.057  0.038  0.033  0.027  0.024  0.019
      MSE   0.044  0.040  0.026  0.015  0.008  0.004  0.002  0.002  0.001  0.001
Table 3: Simulated mean, standard deviation (sd) and mean squared error (MSE) of the new estimator $\hat{\tau}^{\star}$ (new) and the RK estimator for model (3.2) with uniformly distributed data; the columns correspond to the signal strength δ, and the three blocks correspond to the three cases under consideration (top to bottom).

δ           1.1    1.2    1.3    1.4    1.5    1.6    1.7    1.8    1.9    2
new   mean  0.507  0.499  0.498  0.496  0.498  0.500  0.499  0.500  0.500  0.500
      sd    0.123  0.087  0.055  0.033  0.009  0.006  0.004  0.003  0.003  0.001
      MSE   0.015  0.008  0.003  0.001  0.000  0.000  0.000  0.000  0.000  0.000
RK    mean  0.548  0.518  0.454  0.473  0.490  0.495  0.497  0.498  0.498  0.498
      sd    0.195  0.193  0.106  0.048  0.021  0.012  0.007  0.006  0.006  0.007
      MSE   0.040  0.037  0.013  0.003  0.001  0.000  0.000  0.000  0.000  0.000
new   mean  0.496  0.501  0.497  0.498  0.497  0.499  0.500  0.500  0.500  0.500
      sd    0.118  0.091  0.056  0.033  0.018  0.008  0.004  0.004  0.002  0.001
      MSE   0.014  0.008  0.003  0.001  0.000  0.000  0.000  0.000  0.000  0.000
RK    mean  0.548  0.505  0.428  0.458  0.481  0.490  0.495  0.496  0.497  0.498
      sd    0.198  0.199  0.109  0.052  0.032  0.019  0.010  0.008  0.008  0.005
      MSE   0.041  0.039  0.017  0.004  0.001  0.000  0.000  0.000  0.000  0.000
new   mean  0.495  0.501  0.499  0.501  0.499  0.499  0.500  0.500  0.500  0.500
      sd    0.106  0.075  0.036  0.015  0.007  0.006  0.001  0.002  0.001  0.001
      MSE   0.011  0.006  0.001  0.000  0.000  0.000  0.000  0.000  0.000  0.000
RK    mean  0.590  0.475  0.443  0.474  0.492  0.496  0.498  0.498  0.499  0.499
      sd    0.187  0.176  0.069  0.038  0.016  0.009  0.006  0.004  0.003  0.003
      MSE   0.043  0.032  0.008  0.002  0.000  0.000  0.000  0.000  0.000  0.000
Table 4: Simulated mean, standard deviation (sd) and mean squared error (MSE) of the new estimator $\hat{\tau}^{\star}$ (new), the RK estimator and the AHHR estimator for the cases $(n,p)=(200,10)$ (top block) and $(n,p)=(200,15)$ (bottom block); the columns correspond to the signal strength δ.

δ           1.1    1.2    1.3    1.4    1.5    1.6    1.7    1.8    1.9    2
new   mean  0.497  0.491  0.475  0.452  0.446  0.428  0.421  0.415  0.415  0.407
      sd    0.167  0.152  0.150  0.133  0.124  0.111  0.081  0.080  0.066  0.048
      MSE   0.037  0.031  0.028  0.020  0.017  0.013  0.007  0.007  0.005  0.002
RK    mean  0.475  0.455  0.454  0.417  0.401  0.359  0.358  0.367  0.345  0.353
      sd    0.287  0.278  0.272  0.247  0.231  0.201  0.176  0.155  0.128  0.114
      MSE   0.088  0.080  0.077  0.061  0.053  0.042  0.033  0.025  0.020  0.015
AHHR  mean  0.512  0.516  0.507  0.512  0.506  0.495  0.486  0.486  0.482  0.478
      sd    0.087  0.088  0.085  0.083  0.079  0.076  0.071  0.071  0.073  0.068
      MSE   0.020  0.021  0.019  0.019  0.017  0.015  0.012  0.012  0.012  0.011
new   mean  0.507  0.482  0.468  0.452  0.447  0.432  0.420  0.419  0.414  0.410
      sd    0.158  0.159  0.147  0.141  0.132  0.109  0.090  0.082  0.068  0.058
      MSE   0.036  0.032  0.026  0.022  0.020  0.013  0.009  0.007  0.005  0.003
RK    mean  0.483  0.463  0.451  0.409  0.407  0.379  0.365  0.358  0.354  0.355
      sd    0.294  0.301  0.293  0.275  0.255  0.237  0.211  0.196  0.183  0.165
      MSE   0.093  0.095  0.088  0.076  0.065  0.057  0.046  0.040  0.036  0.029
AHHR  mean  0.505  0.507  0.507  0.503  0.502  0.499  0.504  0.495  0.499  0.491
      sd    0.066  0.061  0.060  0.056  0.060  0.058  0.056  0.057  0.052  0.058
      MSE   0.015  0.015  0.015  0.014  0.014  0.013  0.014  0.012  0.012  0.012

4 Proofs of main results under the null hypothesis

4.1 Proof of Theorem 1

Consider the sequential likelihood ratio statistics

(4.1)

and the corresponding centered process

where the centering term is defined as

In the following theorem, we provide the convergence of the finite-dimensional distributions of $(\mathbf{\Lambda}_{n})_{n\in\mathbb{N}}$.

Theorem 4 .

where $(Z(t))_{t\in[t_{0},1-t_{0}]}$ denotes the Gaussian process defined in Theorem 1.

The asymptotic tightness of $(\mathbf{\Lambda}_{n})_{n\in\mathbb{N}}$ is given in the next theorem.

Theorem 5 .

Suppose that Assumptions (A1), (A2) with $\delta>4$ and (A3) are satisfied, and that $\mathbb{E}[x_{11}]=0$. Then, the sequence $(\mathbf{\Lambda}_{n})_{n\in\mathbb{N}}$ is asymptotically tight in the space $\ell^{\infty}([t_{0},1-t_{0}])$.

Proofs of these statements can be found in Sections 4.1.1 and 4.1.3, respectively. The weak convergence of $(\mathbf{\Lambda}_{n})_{n\in\mathbb{N}}$ towards a Gaussian process then follows from the convergence of the finite-dimensional distributions (Theorem 4) and the tightness result (Theorem 5).

Corollary 2 .

Suppose that Assumptions (A1), (A2) with $\delta>4$ and (A3) are satisfied, and that $\mathbb{E}[x_{11}]=0$. Then, we have under the null hypothesis $H_{0}$ of no change point

where $(Z(t))_{t\in[t_{0},1-t_{0}]}$ denotes the centered Gaussian process defined in Theorem 1.

Before continuing with the proof of Theorem 1, we comment on how our theoretical result fits into the existing literature.

  • $(\Lambda_{n,t})_{t}$, let alone considering the process convergence.
  • $(\log\Lambda_{n,t})_{t\in[t_{0},1-t_{0}]}$.

With these preparations, we are in a position to prove Theorem 1 .

Proof of Theorem 1 .

Note that $\log\left(1-\overline{\mathbf{y}}^{\top}\hat{\mathbf{\Sigma}}^{-1}\overline{\mathbf{y}}\right)=\log(1-p/n)+o(1)$ almost surely (see Section 4.3.1 in Heiny and Parolya, 2023). Thus, we obtain

(4.2)

Similarly, one can show that

(4.3)
(4.4)
(4.5)

which follows by a Taylor expansion. Then, we calculate

(4.6)
(4.7)

Next, we aim to show that

(4.8)

is asymptotically tight. Note that

(4.9)

almost surely. By Theorem 5 and (4.9), we conclude that (4.8) is asymptotically tight. Combining this with (4.7), it follows from Theorem 1.5.4 in Van Der Vaart and Wellner (1996) that

The proof of Theorem 1 concludes by an application of the continuous mapping theorem. ∎

4.1.1 Proof of Theorem 4 - weak convergence of finite-dimensional distributions

In the following, we prove Theorem 4 , and the necessary auxiliary results are stated in Section 4.1.2 .

Proof of Theorem 4 .

For the sake of convenience, we restrict ourselves to the case $k=2$. Then, using the Cramér–Wold theorem, it suffices to show that

Moreover, let $\mathbf{P}(i;j:k)$ for $1\leq i\leq p$ and $1\leq j\leq k\leq n$ denote the projection matrix on the orthogonal complement of

$\mathbf{X}_{i,j:k}=(\mathbf{b}_{1,j:k},\ldots,\mathbf{b}_{i,j:k})^{\top}\in\mathbb{R}^{i\times(k-j+1)}$, then

$\mathbf{X}_{1:\lfloor nt\rfloor}^{\top}$, $\mathbf{X}_{(\lfloor nt\rfloor+1):n}^{\top}$ and $\mathbf{X}_{n}^{\top}$ (see Section 2 of Wang, Han and Pan (2018) for more details), respectively, we have

Thus, under the null hypothesis of no change point, the likelihood ratio statistic does not depend on $\mathbf{\Sigma}$ and we may write

(4.10)
(4.11)
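
This invariance can be checked numerically for the classical Gaussian likelihood ratio: the following Python snippet (our own illustration, using a log-determinant form of the statistic rather than the exact expression above) confirms that transforming the data by an arbitrary invertible matrix, and hence changing $\mathbf{\Sigma}$, leaves the statistic unchanged.

import numpy as np

def two_log_lr(Y, k):
    # 2 * log likelihood ratio for a covariance change at index k, based on
    # (non-centered) segment sample covariances.
    n = Y.shape[0]
    _, full = np.linalg.slogdet(Y.T @ Y / n)
    _, left = np.linalg.slogdet(Y[:k].T @ Y[:k] / k)
    _, right = np.linalg.slogdet(Y[k:].T @ Y[k:] / (n - k))
    return n * full - k * left - (n - k) * right

rng = np.random.default_rng(2)
n, p, k = 300, 20, 150
X = rng.standard_normal((n, p))        # data with identity covariance
A = rng.standard_normal((p, p))        # arbitrary invertible transformation
print(np.isclose(two_log_lr(X, k), two_log_lr(X @ A.T, k)))  # True: no dependence on Sigma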

Next, we define for $1\leq i\leq p$ and $t\in\{t_{1},t_{2}\}$

(4.12)
(4.13)
(4.14)
(4.15)

Using Stirling’s formula

a straightforward calculation gives

(4.16)
(4.17)
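
For the reader's convenience, we recall the standard form of Stirling's formula for the log-gamma function that underlies such calculations (stated here only as a reminder; the precise variant used for (4.16) and (4.17) is the one applied in the original derivation):

$$\log\Gamma(x)=\Big(x-\tfrac{1}{2}\Big)\log x - x + \tfrac{1}{2}\log(2\pi) + O\big(x^{-1}\big),\qquad x\to\infty.$$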

Combining ( 4.11 ) and ( 4.16 ) gives the representation

where we applied Lemma 2 and Lemma 1 for the last estimate, which are given in Section 4.1.2 below. Defining

(4.18)

it remains to show that

(4.19)
(4.20)
(4.21)

By the CLT for martingale differences (see, for example, Corollary 3.1 in Hall and Heyde, 1980), these statements imply (4.19). Regarding (4.21), we have, by Lemma B.26 in Bai and Silverstein (2010), for $\varepsilon>0$

(4.22)

In particular, we have $\mathbf{P}^{j_{1}:k_{1}}(i-1;j_{1}:k_{1})=\mathbf{P}(i-1;j_{1}:k_{1})$ and

(4.23)

$k_{2}-j_{2}+1=(k_{1}-j_{1}+1)\vee(k_{2}-j_{2}+1)$

(4.24)

$k_{2}-j_{2}+1=(k_{1}-j_{1}+1)\vee(k_{2}-j_{2}+1)$, it follows from Lemma 3 in Section 4.1.2 below that

(4.25)

$\sigma^{2}(1,\lfloor nt_{1}\rfloor,\lfloor nt_{2}\rfloor+1,n)=0$. Combining (4.23) and (4.25) gives

$\sigma_{1}^{2}(\lfloor nt_{1}\rfloor+1,n,1,\lfloor nt_{2}\rfloor)$. For all remaining $\sigma_{1}^{2}$-terms, we use (4.24) and obtain

If $t_{1}=t_{2}=t$, then we get

4.1.2 Auxiliary results for the proof of Theorem 4

The convergence of the finite-dimensional distributions is facilitated by the following auxiliary results, whose proofs are postponed to the supplementary material. To begin with, we have a result on the quadratic term appearing in the expansion of the test statistic.

As $n\to\infty$, it holds for $t\in[t_{0},1-t_{0}]$

The following result shows that the logarithmic terms are negligible at a $\delta$-dependent rate. It will also be used in Section 4.1.3 when proving the asymptotic tightness given in Theorem 5.

Assume that (A1) and (A2) with some $\delta>0$ are satisfied. Then, it holds for all $t\in[t_{0},1-t_{0}]$

$Y_{i,(\lfloor nt\rfloor+1):n}$ are defined in (4.14) and (4.15), respectively.

In the following lemma, we provide an approximation for $\sigma_{2}^{2}$ appearing in (4.25).

$k_{2}-j_{2}+1=(k_{1}-j_{1}+1)\vee(k_{2}-j_{2}+1)$. It holds

Moreover, we have for $t_{0}\leq t_{1}<t_{2}\leq 1-t_{0}$

We conclude this section with an approximation of $\sigma_{1}^{2}$ defined below (4.23).

If $t_{1}<t_{2}$, then we have

4.1.3 Proof of Theorem 5 - asymptotic tightness of $(\mathbf{\Lambda}_{n})_{n\in\mathbb{N}}$

for some $d>0$.

Next, we need a uniform result on the quadratic terms, which is provided in the next lemma.

(4.26)
(4.27)

Finally, we recall Lemma 2 given in Section 4.1.2 on the logarithmic terms. Using these auxiliary results, we are in a position to give a proof of Theorem 5.

Proof of Theorem 5 .

Note that under the moment assumption (A2) with $\delta>4$, we have by Lemma 2 and Lemma 6

(4.28)

for some $d>0$, which may be chosen such that it coincides with the $d>0$ from Lemma 5. Note that if $t-r<1/n$, we have $\lfloor nr\rfloor=\lfloor ns\rfloor$ or $\lfloor ns\rfloor=\lfloor nt\rfloor$, and thus $m(r,s,t)=0$ almost surely. If $t-r\geq 1/n$, it holds for all $\lambda>0$, by Lemma 5 and (4.28),

(4.29)

Similarly, we get

(4.30)

Combining (4.29) and (4.30) with Corollary A.4 in Dette and Tomecki (2019), we have for $\lfloor mt_{0}\rfloor\leq j\leq\lfloor m(1-t_{0})\rfloor$

(4.31)

This implies

This work was partially supported by the DFG Research unit 5381 Mathematical Statistics in the Information Age , project number 460867398, and the Aarhus University Research Foundation (AUFF).

Supplementary Material. The supplementary material contains the proofs of the main results under the alternative hypothesis (Theorem 2 and Theorem 3) and their auxiliary results. Moreover, we provide the proofs of the auxiliary results needed in the proof of Theorem 1 (Lemmas 1, 3 and 4).

References

  • Aue, A., Hörmann, S., Horváth, L. and Reimherr, M. (2009). Break detection in the covariance structure of multivariate time series models. The Annals of Statistics 37, 4046–4087.
  • Avanesov, V. and Buzun, N. (2018). Change-point detection in high-dimensional covariance structure. Electronic Journal of Statistics 12, 3254–3294. doi:10.1214/18-EJS1484
  • Bai, Z. and Saranadasa, H. (1996). Effect of high dimension: by an example of a two sample problem. Statistica Sinica, 311–329.
  • Bai, Z. and Silverstein, J. W. (2010). Spectral Analysis of Large Dimensional Random Matrices, Vol. 20. Springer.
  • Bai, Y. and Zhang, Y. (2023). The moderate deviation principles of likelihood ratio tests under alternative hypothesis. Random Matrices: Theory and Applications, 2350003.
  • Bao, Z., Hu, J., Xu, X. and Zhang, X. (2022). Spectral statistics of sample block correlation matrices. arXiv preprint arXiv:2207.06107.
  • Bodnar, T., Dette, H. and Parolya, N. (2019). Testing for independence of large dimensional vectors. The Annals of Statistics 47, 2977–3008. doi:10.1214/18-AOS1771
  • Chen, J. and Gupta, A. K. (2004). Statistical inference of covariance change points in Gaussian model. Statistics 38, 17–28. doi:10.1080/0233188032000158817
  • Chen, H. and Jiang, T. (2018). A study of two high-dimensional likelihood ratio tests under alternative hypotheses. Random Matrices: Theory and Applications 7, 1750016.
  • Chen, L., Wang, W. and Wu, W. B. (2022). Inference of breakpoints in high-dimensional time series. Journal of the American Statistical Association 117, 1951–1963. doi:10.1080/01621459.2021.1893178
  • Cho, H. and Fryzlewicz, P. (2015). Multiple-change-point detection for high dimensional time series via sparsified binary segmentation. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 77, 475–507.
  • Csörgő, M. and Horváth, L. (1997). Limit Theorems in Change-Point Analysis. Wiley Series in Probability and Statistics. Wiley, Chichester, UK.
  • Dette, H. and Dörnemann, N. (2020). Likelihood ratio tests for many groups in high dimensions. Journal of Multivariate Analysis 178, 104605.
  • Dette, H. and Gösmann, J. (2020). A likelihood ratio approach to sequential change point detection for a general class of parameters. Journal of the American Statistical Association 115, 1361–1377. doi:10.1080/01621459.2019.1630562
  • Dette, H., Pan, G. and Yang, Q. (2022). Estimating a change point in a sequence of very high-dimensional covariance matrices. Journal of the American Statistical Association 117, 444–454. doi:10.1080/01621459.2020.1785477
  • Dette, H. and Tomecki, D. (2019). Determinants of block Hankel matrices for random matrix-valued measures. Stochastic Processes and their Applications 129, 5200–5235.
  • Dette, H. and Wied, D. (2016). Detecting relevant changes in time series models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78, 371–394. doi:10.1111/rssb.12121
  • Dörnemann, N. (2022). Asymptotics for linear spectral statistics of sample covariance matrices. Dissertation, Ruhr-Universität Bochum.
  • Dörnemann, N. (2023). Likelihood ratio tests under model misspecification in high dimensions. Journal of Multivariate Analysis 193, 105122.
  • Dörnemann, N. and Dette, H. (2023). Linear spectral statistics of sequential sample covariance matrices. Annales de l'IHP Probabilités et Statistiques, to appear.
  • Dörnemann, N. and Paul, D. (2024). Detecting spectral breaks in spiked covariance models. arXiv preprint arXiv:2404.19176.
  • Enikeeva, F. and Harchaoui, Z. (2019). High-dimensional change-point detection under sparse alternatives. The Annals of Statistics 47, 2051–2079. doi:10.1214/18-AOS1740
  • Galeano, P. and Peña, D. (2007). Covariance changes detection in multivariate time series. Journal of Statistical Planning and Inference 137, 194–211. doi:10.1016/j.jspi.2005.09.003
  • Guo, W. and Qi, Y. (2024). Asymptotic distributions for likelihood ratio tests for the equality of covariance matrices. Metrika 87, 247–279.
  • Hall, P. and Heyde, C. C. (1980). Martingale Limit Theory and Its Application. Academic Press.
  • Heiny, J. and Parolya, N. (2023). Log determinant of large correlation matrices under infinite fourth moment. Annales de l'Institut Henri Poincaré - Probabilités et Statistiques. https://www.e-publications.org/ims/submission/AIHP/user/submissionFile/54486?confirm=029ab46f
  • Jiang, T. and Qi, Y. (2015). Likelihood ratio tests for high-dimensional normal distributions. Scandinavian Journal of Statistics 42, 988–1009.
  • Jiang, T. and Yang, F. (2013). Central limit theorems for classical likelihood ratio tests for high-dimensional normal distributions. The Annals of Statistics 41, 2029–2074.
  • Jirak, M. (2015). Uniform change point tests in high dimension. The Annals of Statistics 43, 2451–2483. doi:10.1214/15-AOS1347
  • Lavielle, M. and Teyssiere, G. (2006). Detection of multiple change-points in multivariate time series. Lithuanian Mathematical Journal 46, 287–306.
  • Li, Y. (2003). A martingale inequality and large deviations. Statistics & Probability Letters 62, 317–321.
  • Li, J. and Chen, S. X. (2012). Two sample tests for high-dimensional covariance matrices. The Annals of Statistics 40, 908–940.
  • Li, Z. and Gao, J. (2024). Efficient change point detection and estimation in high-dimensional correlation matrices. Electronic Journal of Statistics 18, 942–979.
  • Li, W., Li, Z. and Yao, J. (2018). Joint central limit theorem for eigenvalue statistics from several dependent large dimensional sample covariance matrices with application. Scandinavian Journal of Statistics 45, 699–728.
  • Liu, H., Aue, A. and Paul, D. (2015). On the Marčenko–Pastur law for linear time series. The Annals of Statistics 43, 675–712. doi:10.1214/14-AOS1294
  • Liu, H., Gao, C. and Samworth, R. J. (2021). Minimax rates in sparse, high-dimensional change point detection. The Annals of Statistics 49, 1081–1112. doi:10.1214/20-AOS1994
  • Liu, B., Zhou, C., Zhang, X. and Liu, Y. (2020). A unified data-adaptive framework for high dimensional change point detection. Journal of the Royal Statistical Society Series B: Statistical Methodology 82, 933–963. doi:10.1111/rssb.12375
  • Lopes, M. E., Blandino, A. and Aue, A. (2019). Bootstrapping spectral statistics in high dimensions. Biometrika 106, 781–801.
  • Page, E. S. (1954). Continuous inspection schemes. Biometrika 41, 100–115.
  • Parolya, N., Heiny, J. and Kurowicka, D. (2024). Logarithmic law of large random correlation matrices. Bernoulli 30, 346–370.
  • Ryan, S. and Killick, R. (2023). Detecting changes in covariance via random matrix theory. Technometrics 65, 480–491. doi:10.1080/00401706.2023.2183261
  • Van Der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer.
  • Wald, A. (1945). Sequential tests of statistical hypotheses. Annals of Mathematical Statistics 16, 117–186.
  • Wang, X., Han, X. and Pan, G. (2018). The logarithmic law of sample covariance matrices near singularity. Bernoulli 24, 80–114.
  • Wang, D., Yu, Y. and Rinaldo, A. (2021). Optimal covariance change point localization in high dimensions. Bernoulli 27, 554–575. doi:10.3150/20-BEJ1249
  • Wang, R., Zhu, C., Volgushev, S. and Shao, X. (2022). Inference for change points in high-dimensional data via self-normalization. The Annals of Statistics 50, 781–806.
  • Yin, Y., Zheng, S. and Zou, T. (2023). Central limit theorem of linear spectral statistics of high-dimensional sample correlation matrices. Bernoulli 29, 984–1006.
  • Zhang, Y., Wang, R. and Shao, X. (2022). Adaptive inference for change points in high-dimensional data. Journal of the American Statistical Association 117, 1751–1762. doi:10.1080/01621459.2021.1884562
  • Zheng, S., Bai, Z. and Yao, J. (2015). Substitution principle for CLT of linear spectral statistics of high-dimensional sample covariance matrices with applications to hypothesis testing. Annals of Statistics 43, 546–591.

A Supplementary Material

A.1 Proof of Theorem 2

Again, we start with an analogous result for the non-centered version, which will be the main tool in the proof.

Theorem 6 .

Assume that Assumptions (A1) and (A2) with $\delta>0$ are satisfied, and that $\mathbb{E}[x_{11}]=0$. Then, we have under Assumption (A4)

Proof of Theorem 6 .

By Theorem 4 , we have

(A.1)

In the following, we use the notations

(A.2)

where $1\leq i\leq p$. As a preparation, we show that

(A.3)

For this purpose, we first note that for $1\leq i\leq p$

Next, we decompose

(A.4)

The primary difference between $\tilde{X}_{i}$ and $\tilde{Y}_{i}$ in comparison to $X_{i}$ and $Y_{i}$, as defined in equations (4.12) and (4.14), respectively, stems from the fact that the components of $\mathbf{c}_{i}$ exhibit varying variances under $H_{1}$, yet they remain independent. In Section 4.1.1, we have shown that

Through a careful examination of the arguments presented in Section 4.1.1, it becomes apparent that this routine remains applicable even in the presence of heterogeneous variances. Specifically, we obtain, along the lines of the proofs of (4.19), Lemma 1 and Lemma 2,

which proves ( A.3 ). Next, we obtain with this estimate

(A.5)

In particular, in the case $\mathbf{\Sigma}_{1}=\ldots=\mathbf{\Sigma}_{n}=\mathbf{I}$, we have

(A.6)

Then, we have using ( A.1 )-( A.6 )

(A.7)

Then, the assertion follows from Assumption (A4) . ∎

Proof of Theorem 2 .

where $\hat{\mathbf{I}}^{\textnormal{cen}}$ is defined as $\hat{\mathbf{\Sigma}}^{\textnormal{cen}}$ in (2.4) with the specific choice $\mathbf{\Sigma}_{1}=\ldots=\mathbf{\Sigma}_{n}=\mathbf{I}$. In the same way, $\hat{\mathbf{I}}^{\textnormal{cen}}_{i:j}$ is defined through (2.3) by setting $\mathbf{\Sigma}=\mathbf{I}$. Then, we have by Theorem 4 and (4.2)

In the proof of Theorem 6, we have seen that the latter term diverges to $-\infty$ under Assumption (A4). ∎

A.2 Proof of Theorem 3

In this section, we prove that $\hat{\tau}^{\star}$ in (2.16) is a consistent estimator of the true change point $t^{\star}$, as claimed in Theorem 3. To begin with, we consider this problem for the non-centered sample covariance estimator (and the corresponding sequential process) and define

(A.8)

Then, we have the following consistency result, which is proven later in this section.

Theorem 7 .

Suppose that Assumptions (A1) - (A1) are satisfied. Then, we have

To prove Theorem 7 , we will make use of the notation

and the following auxiliary result.

It holds, as $n\to\infty$,

Proof of Lemma 7 .

For the sake of brevity, we concentrate on a proof for the first assertion, since the second statement can be shown in a similar fashion. To begin with, we decompose similarly to ( A.4 )

Similarly to the proof of Lemma 2 , one can show that

(A.9)

Then, a combination of Lemma 2.2.2 in Van Der Vaart and Wellner (1996) and (A.9) gives for any $\varepsilon>0$

(A.10)

where we used $\delta>4$ for the last estimate. By Lemma B.26 in Bai and Silverstein (2010), we get uniformly with respect to $t\in[t_{0},1-t_{0}]$

(A.11)
(A.12)

Using ( A.11 ) and applying once again Lemma 2.2.2 in Van Der Vaart and Wellner ( 1996 ) , we obtain

(A.13)

Similarly to ( A.13 ), one can show using ( A.12 )

(A.14)

Then, by combining ( A.10 ), ( A.13 ) and ( A.14 ), the assertion follows. ∎

Now we are in a position to give the proof of Theorem 7.

Proof of Theorem 7 .

Assume that $H_{1}$ holds with change point $t^{\star}\in[t_{0},1-t_{0}]$ and let $t\in[t_{0},1-t_{0}]$. Then, we may decompose

(A.15)

By ( A.5 ), we have

(A.16)

Consider the case $t\leq t^{\star}$. Using Lemma 7, we obtain uniformly with respect to $t\in[t_{0},1-t_{0}]$

(A.17)

Similarly to (A.17), by an application of Lemma 7, we have uniformly with respect to $t\in[t_{0},1-t_{0}]$

(A.18)

Using (4.16), we get uniformly with respect to $t\in[t_{0},1-t_{0}]$

(A.19)

Combining (A.15)-(A.19), we see that in the case $t\leq t^{\star}$

By assumption (A1), this shows that, if $t\leq t^{\star}$, we have

where the constant $c_{t^{*}}$ and the $o_{\mathbb{P}}(1)$-term do not depend on $t$. Here, the function $f_{l}$ is given by

This, together with the elementary inequality $x/(1+x)\leq\log(1+x)$, yields the inequality

(A.20)

Next, we consider the case $t>t^{\star}$. Then, we have

(A.21)
(A.22)

uniformly with respect to $t\in[t_{0},1-t_{0}]$. Thus, we obtain for the case $t>t^{\star}$, using (A.15), (A.19), (A.16), (A.21) and (A.22),

uniformly with respect to $t\in[t_{0},1-t_{0}]$. Consequently, if we define the function

then we have in the case $t>t^{\star}$

where the constant $c_{t^{*}}^{\prime}$ and the $o_{\mathbb{P}}(1)$-term do not depend on $t$. Then, we get similarly to (A.20)

(A.23)

Note that the function

is continuous and thus has a unique minimizer $t^{\star}$. Then, (A.20) and (A.23) combined with Corollary 3.2.3 in Van Der Vaart and Wellner (1996) imply that $\hat{t}^{\star}$ is a consistent estimator for $t^{\star}$. ∎
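
For orientation, the consistency argument just applied can be summarized as follows (a rough paraphrase of the argmax-type result; the precise conditions are those of Corollary 3.2.3 in Van Der Vaart and Wellner (1996)): if a criterion process $M_{n}$ converges uniformly in probability to a deterministic function $M$ with a well-separated minimum at $t^{\star}$, then any sequence of (near-)minimizers of $M_{n}$ converges to $t^{\star}$ in probability, that is,

$$\sup_{t\in[t_{0},1-t_{0}]}\big|M_{n}(t)-M(t)\big|\xrightarrow{\ \mathbb{P}\ }0
\quad\text{and}\quad
\inf_{|t-t^{\star}|\geq\varepsilon}M(t)>M(t^{\star})\ \text{for all }\varepsilon>0
\qquad\Longrightarrow\qquad
\hat{t}^{\star}\xrightarrow{\ \mathbb{P}\ }t^{\star}.$$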

Finally, we may conclude this section with the proof of Theorem 3 .

Proof of Theorem 3 .

Note that by ( 4.9 ), we have

almost surely, uniformly with respect to $t\in[t_{0},1-t_{0}]$. Thus, by applying Corollary 3.2.3 in Van Der Vaart and Wellner (1996) as in the proof of Theorem 7, the assertion of Theorem 3 follows.

A.3 Auxiliary results

Next, we prove Lemma 4, which provides an approximation for the quantity $\sigma_{1}^{2}$ defined below (4.23).

Proof of Lemma 4 .

Recalling the representation of $\sigma_{1}^{2}$ below (4.23), we obtain

(A.24)

for $1\leq i\leq p$, $1\leq j<k\leq p$. Then, we may write (replacing for a moment $i$ by $i-1$)

(A.25)

Calculation of $S_{i,1}$

By an application of the Sherman-Morrison formula, we obtain

(A.26)
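
Recall that the Sherman–Morrison formula states that $(\mathbf{A}+\mathbf{u}\mathbf{v}^{\top})^{-1}=\mathbf{A}^{-1}-\mathbf{A}^{-1}\mathbf{u}\mathbf{v}^{\top}\mathbf{A}^{-1}/(1+\mathbf{v}^{\top}\mathbf{A}^{-1}\mathbf{u})$ for an invertible matrix $\mathbf{A}$ and vectors with $1+\mathbf{v}^{\top}\mathbf{A}^{-1}\mathbf{u}\neq 0$. The following short numerical check (our own illustration, not part of the proof) verifies this rank-one update identity.

import numpy as np

rng = np.random.default_rng(3)
p = 5
A = rng.standard_normal((p, p)) + p * np.eye(p)      # well-conditioned invertible matrix
u, v = rng.standard_normal(p), rng.standard_normal(p)
A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(u, v))               # inverse of the rank-one update
rhs = A_inv - (A_inv @ np.outer(u, v) @ A_inv) / (1.0 + v @ A_inv @ u)
print(np.allclose(lhs, rhs))                          # True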

$k-j+1$, yields

which implies by the i.i.d. assumption,

(A.27)
(A.28)

uniformly with respect to $1\leq i\leq p$. Using (A.26), (A.27) and Lemma B.26 in Bai and Silverstein (2010), we get for the first term

Combining this with ( A.28 ), we get

(A.29)

uniformly with respect to $1\leq i\leq p$.

Calculation of $S_{i,2}$

Similarly to the previous step, we may show that

(A.30)

Calculation of $S_{i,3}$

These terms will be further investigated in the following steps.

Calculation of $\mathbf{S}_{i,3,1}$

Applying similar techniques as in the previous steps, we get

In the following, we use a general form of ( A.28 ), namely,

(A.31)

uniformly with respect to $1\leq i\leq p$. This gives

(A.32)

Calculation of $\mathbf{S}_{i,3,2}$

Again applying similar techniques as in the previous steps, especially ( A.26 ), ( A.27 ) and ( A.31 ), we get

(A.33)

Thus, we need to compute

(A.34)

Combining ( A.33 ), ( A.34 ) and Lemma 8 , we get

(A.35)

Using ( A.24 ) and ( A.25 ), we obtain

(A.36)
(A.37)

Using ( A.35 ), we get for the second term

(A.38)

which concludes the proof. ∎

For $t_{2}>t_{1}$, we have

Proof of Lemma 8 .

$\mathbf{S}_{i,(\lfloor nt_{1}\rfloor+1):n}$,
