**II. (May 28): N-P and Fisherian Tests, Severe Testing:** How to avoid fallacies of tests

**Reading:**

**SIST: Excursion 3 Tour I** (focus on pages up to p. 152): 3.1, 3.2, 3.3

**Recommended:** Excursion 2 Tour II pp. 92-100 (Sections 2.4-2.7)

*Optional:* I will (try to) answer questions on demarcation of science, induction, falsification, Popper from Excursion 2 Tour II (Section 2.3)

(Use comments on this blog for queries we don’t get to in the seminar. The first comment you write is sent to moderation to be approved; after that it’s automatic.)

*Handout*:

*Areas Under the Standard Normal Curve*

A 5-minute refresher on means, variance, standard deviations, the Normal distribution, and the standard normal
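The handout's table values can be checked without a lookup: a minimal Python sketch (my addition, not part of the handout) computes areas under the standard Normal curve from the error function in the standard library.

```python
import math

def std_normal_cdf(z):
    """Area under the standard Normal curve to the left of z,
    via the identity Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# A couple of familiar table entries:
print(round(std_normal_cdf(1.96), 3))  # 0.975
print(round(std_normal_cdf(1.0), 3))   # 0.841
```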

**General Info Items:**

**- References: Captain’s Bibliography**

**- Souvenirs:** Meeting 1: A-D; **Meeting 2 Souvenirs:** (E) An Array of Questions, Problems, Models, (I) So What Is a Statistical Test, Really?, (J) UMP Tests, (K) Probativism

[Souvenirs from optional pages (they’re free): (F) Getting Free of Popperian Constraints on Language, (G) The Current State of Play in Psychology, (H) Solving Induction Is Showing Methods with Error Control]

**- Summaries of 16 Tours (abstracts & keywords)**

**- Excerpts & Mementos on Error Statistics Philosophy Blog**

- Mementos from Excursion 2 Tour II: Falsification, Pseudoscience, Induction 2.3-2.7

**Mayo Memos for Meeting 2:**

5/27 Today (27 May) is the statistician Allan Birnbaum’s birthday. I put up a blogpost (on my Error Statistics Philosophy blog) with a volume on foundations of statistics that *Synthese* published in his honor in 1977.

5/27 Sam Fletcher’s review essay of my book SIST is up at the journal *Philosophy of Science*.

**Slides & Video Links for Meeting 2:**

**Slides:**

Meeting #2 main slides (PDF)

Supplemental slides (Likelihoodist vs. Significance Tester w/ Bernoulli Trials) (PDF)

Thanks for the seminar today. I wanted to ask something along the lines of Konstantinos’ previous comment (https://phil-stat-wars.com/2020/05/22/meeting-1-may-21/#comment-6), “is severity the best we can do?”

One of the major drivers pushing me to investigate foundations as deeply as I can is the experience of many years being the statistician consulted by various health researchers. The statistical ritual is a little part of their largely inductive procedure. Your Souvenir E spells it out nicely (https://philstatwars.files.wordpress.com/2020/05/sist-souvenir-e.pdf).

Now, one paper I appreciate is Platt’s “Strong Inference” from 1964. In that, he conceptualises (successful, quantitative) scientific practice as an iterative sequence of (quasi-)deductive investigations, with inductive bridges from them to the next investigation. This sounds about right to me. This links to Duhem’s problem: our falsification points to an unknown problem in a long spiral staircase of deductive (but that’s ok, we can use severity) and inductive (hmmm…) steps.

I wondered what your thoughts were on the potential to protect ourselves against inflated error rates arising from the inductive parts: the generation of the hypothesis for the next test. One problem that severity doesn’t solve is assuming stasis in a complex system, and a lot of efforts at evidence-based policy are undermined by this.

Thanks for your comment. Just on your last point (I’ll come back to the rest later), severity wouldn’t assume stasis. (I can’t tell if you’re suggesting it would.) Rather, it would direct researchers to include the threat of erroneously assuming it in their repertoires of error.

Thanks for this. I think that some of the safeguards we can put in place against wrong inferences are quantitative and can be built into our calculations, while others are more nebulous and just require the researcher to sit and ponder. The former belong in the deductive parts of Platt’s spiral and the latter are somewhat in the inductive parts. A simple example might be response bias to a survey. The only safeguard is that the researcher do their job well, and I suppose we have peer review, though I groan inwardly at the thought.

I think you are absolutely right to make a big deal of the fact that Bayes, likelihood principle and N-P testing all provide a logic of some kind, which can be executed on the calculator, but fail to safeguard against BENT via the inductive parts, which are executed in the brain. So, in looking to strengthen them, we could either look to regulate researcher behaviour (which has been quite successful in clinical trials) or to promote more complex designs and/or analyses that can account for many problems that manifest in bias or non-exchangeability: stick and carrot, respectively.

These seem (relatively!) easy to develop for the computational / deductive parts of the spiral, but not for the cerebral / inductive parts. So, severity might tell the researcher to think of society or whatever as an adaptive system, but only in a vague, conversational way. There’s no guarantee they would act on that. They could step from one over-simplified deductive investigation (quantitatively severe within its own context) to the next. In that setting, severity is another tick-box exercise.

The complex systems point is in this paper (http://robertgrantstats.co.uk/papers/complex-systems-explanation-policy.pdf) although it is a little too brief. (The first draft’s “Near Future” scenarios had fictitious statements by “Prime Minister Boris Johnson”, which at the time was a ludicrous concept, about following the science. Happily, Rick Hood convinced me to cut them out.)

I don’t know if there is any hope for a way forward in that setting. My approach over the last 5-10 years (in context of health and social care research, mostly) has been more complex models: latent variables + Bayes. But that doesn’t solve every practical data problem, as Stephen Senn would be sure to point out, introduces some new ones of its own, and still leaves the cerebral-inductive parts open to bad practice.

Robert: I’m distinguishing “logics of evidence” based on the likelihood principle from N-P testing & other error statistics methods. The latter, but not the former, call for considering the capabilities of the tools to probe erroneous interpretations of data. In formal statistical settings, we look to error probabilities to assess and control those capabilities. Excursion 2, Tour I discusses the “logic” of inference as I’m using the term.

I will just let other participants react further to your comment.

I learned today that Ronald Giere died. He was a very close personal & professional friend for many years, & was the major person who encouraged me in philosophy of science when I first started out. There’s a paper of his on Birnbaum (with whom he worked) on my current blogpost (on my error statistics philosophy blog) https://errorstatistics.files.wordpress.com/2020/05/giere_allan-birnbaums-conception-of-statistical-evidence-red.pdf

Giere gave me all of his files on Phil Stat! (several file cabinets full). It was only thanks to the “hidden Neyman” papers in this treasure trove that I discovered quite a few things. You can read about it in several related posts below. They deal with Neyman, power, and an animal I call “shpower”, so you might find them of interest.

Neyman’s nursery 1:

https://errorstatistics.com/2011/10/22/the-will-to-understand-power-neymans-nursery/

Neyman’s nursery 2:

https://errorstatistics.com/2011/11/09/neymans-nursery-2-power-and-severity-continuation-of-oct-22-post/

Neyman’s nursery 3:

https://errorstatistics.com/2011/11/12/neymans-nursery-nn-3-shpower-vs-power/

Neyman’s nursery 4:

https://errorstatistics.com/2011/11/15/logic-takes-a-bit-of-a-hit-nn-4-continuing-shpower-observed-power-vs-power/

Neyman’s nursery 5:

https://errorstatistics.com/2011/11/18/neymans-nursery-nn5-final-post/

Remind me to explain about Neyman’s “inductive behavior,” something a bit beyond what I said during today’s meeting.

Dear Professor Mayo,

Here is the question I was trying to ask during the seminar, but I didn’t explain it well, so I’ll give it another go!

Question: PDF

Margherita:

If you email me your comment with the notation you want, pdf or screen shot, I can link it to your comment. Then I can write out the explanation clearly. Thank you!

Margherita: Your pdf is up, but remember that the SE of M, written SE(M), is sigma over the square root of n. So in your case, with sigma = 10,000 and sample size n = 100, we have SE(M) = 1,000. Thus we’d need M = 1150 to exceed 150 by 1 SE(M). Revise your example to illustrate the point you’re after and send it back. Thanks.

Remember, too, that a SEV assessment always refers to a test, the value of the outcome or test statistic, and a claim or inference of interest. It is only if the test in question is understood that we can allude to SEV(claim C). Do send your updated question & I’ll address it.
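For readers following the arithmetic, here is a minimal sketch (my illustration, assuming the usual 1-sided Normal test of a fixed hypothesis mu = 150 and the severity formula SEV(mu > mu1) = Phi((M − mu1)/SE(M)) from SIST’s treatment of that test):

```python
import math

def phi(z):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

sigma, n = 10_000, 100
se = sigma / math.sqrt(n)   # SE(M) = 10,000 / 10 = 1,000
m_obs = 1150                # exactly 1 SE(M) above the fixed hypothesis 150

# Severity for the claim mu > 150, given M = 1150:
sev = phi((m_obs - 150) / se)
print(se, round(sev, 2))    # 1000.0 0.84
```

With M one SE(M) above 150, the claim mu > 150 passes with severity about 0.84.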

I’m going to investigate the best way to put symbols in comments, because I know it can be done.

Margherita sent an update:

Margherita: Thank you for your interesting comment. I thought we might have figured out good ways to get symbols in comments by now, but this won’t be pursued until Monday. I will avoid them, but try to paste from the book. The large sample size clearly gives a more sensitive test, so much so that even a statistically significant difference with a small p-value may indicate a trivially small discrepancy, if it occurred with a sufficiently large sample size. What counts as trivially small depends on the example. In our 1-sided test, we are looking for discrepancies in excess of a fixed test hypothesis. (I notice you use the word “estimate”, and while tests & estimates are interrelated, the issue here concerns a test with a fixed test or null hypothesis.)

It’s not that we don’t want sensitive tests; we do. We just want to distinguish the discrepancies indicated in relation to the sensitivity. Cox, Good, and others have proposed formulas to lower the required p-value as sample sizes increase. My point is simply that even without such a move, we can still distinguish the discrepancies well or poorly indicated by taking account of the sample size. I probably should have waited until the meeting where we explicitly discuss large n and Jeffreys-Lindley (w/ the opening joke by Jackie Mason) (4th meeting, at present) to raise this example, because it doesn’t occur yet in SIST. There (SIST p. 239), we need to address a criticism such as Kadane’s.

Another way to look at it: a severity assessment always requires reporting the inferences that are poorly indicated. In the case of finding a small p-value in a 1-sided test & inferring an indication or evidence of a positive discrepancy, you want to deliberately report the discrepancies that are poorly indicated. With two results that are just statistically significant at level p, the discrepancies poorly indicated by the result achieved with the larger sample size are closer to the null than those from the smaller sample size.
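That last sentence can be illustrated numerically. A sketch (my own illustration, not from SIST; it reuses sigma = 10,000 from the earlier example, and 1.645 and 1.2816 are the standard Normal quantiles for 0.95 and 0.90): take two just-significant 1-sided results, at n = 100 and n = 10,000, and compute for each the largest mu1 such that the claim mu > mu1 is indicated with severity about 0.9.

```python
import math

def phi(z):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def well_indicated_bound(n, sigma=10_000, mu0=150, z_alpha=1.645, z_sev=1.2816):
    """For a just-significant 1-sided result at sample size n, return
    (m_obs, mu1): the observed mean at the ~0.05 cutoff, and the largest
    mu1 with SEV(mu > mu1) = Phi(z_sev) ~ 0.9."""
    se = sigma / math.sqrt(n)
    m_obs = mu0 + z_alpha * se   # mean sitting right at the p ~ 0.05 cutoff
    mu1 = m_obs - z_sev * se     # severity ~0.9 bound on the discrepancy
    return m_obs, mu1

for n in (100, 10_000):
    m, mu1 = well_indicated_bound(n)
    print(n, round(m, 1), round(mu1, 1))
```

At n = 100 the severity-0.9 bound is about 513; at n = 10,000 it is about 186. The larger sample’s just-significant result licenses only discrepancies much closer to the null, and discrepancies beyond it are poorly indicated.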

Dear Professor Mayo, thank you so much for your reply. I think I now have a better understanding of how a severity assessment, i.e. the reporting of discrepancies that are well or poorly indicated, allows us to distinguish between more or less sensitive tests (without having to make moves such as lowering the required p-value as the sample size increases); so I think my worries have been answered, thank you!

But I have another quick question (sorry!): in all the examples we have looked at so far, we have evaluated the severity with which a composite hypothesis (e.g. mu>152) passes a test T with data x. Is it also possible to evaluate the severity with which a point hypothesis (e.g. mu=152) passes a test with data x?

(I ask this because I read the following passage from “Of War or Peace? Essay Review of Statistical Inference as Severe Testing” by Fletcher (2020):

“Mayo has a separate description of the severity criteria needed for evidence of a composite hypothesis logically stronger than the negation (265–6, 351–2): essentially, the severity for each simple component of the hypothesis must be sufficiently high” (p.4)

And I found it confusing…!)

Margherita:

Thank you so much for your comments. No, what Fletcher writes in that sentence isn’t correct, if he means severity for statistical point hypotheses. I will have to read the context of the statement in his review (which I only discovered was out the other day).

I’m afraid the context doesn’t help, and there are some general, important, issues with his defn. of severity. Fletcher’s review is interesting, and I know he worked very hard on it. I did send him some very detailed, constructive, suggested improvements–most notably on the defn of severity– after he sent me this some months ago. It may be instructive to take those issues up separately in a later seminar meeting, with sufficient attention to detail.