Back in the last quarter of last fall semester, when we got to hypothesis testing, I kept wondering: how is a human supposed to memorize all these rules, and is there any logic behind them (
So in the end, with homework/midterm/final each worth roughly a third, a 94 on the midterm, and wjd curving grades generously, I walked away with an A- (, which was pretty humiliating.
A year later, after learning statistical decision theory, I'm gradually coming to understand everything (
A very short episode; just a few casual remarks.
12.8 upd: oops, I mixed things up: the pivot rule is a linear programming thing; this one is called a pivotal quantity, 枢轴量 in Chinese (
Statistical Decision Theory
A fast-and-frugal review of decision theory.
Nonrandomized Decision
Let \(X\) be a sample from a population \(P \in \mathcal P\). A statistical decision is an action that we take after observing \(X\), for example, a conclusion about \(P\) or about some characteristic of \(P\) based on the observation. We use \(\mathbb A\) to denote the set of allowable actions, and let \(\mathcal F_{\mathbb A}\) be a \(\sigma\)-field on \(\mathbb A\). Then the measurable space \((\mathbb A, \mathcal F_{\mathbb A})\) is called the action space.
Let \(\mathcal X\) be the range of \(X\) and \(\mathcal F_{\mathcal X}\) be a \(\sigma\)-field on \(\mathcal X\). A decision rule is a measurable function (actually a statistic) \(T\) from \((\mathcal X, \mathcal F_{\mathcal X})\) to \((\mathbb A, \mathcal F_{\mathbb A})\). If a decision rule \(T\) is chosen, then we take the action \(T(x)\) once \(X = x\) is observed. That's where the name "decision rule" comes from.
In statistical decision theory, we set a criterion using a loss function \(L\), which is a function from \(\mathcal P \times \mathbb A\) to \([0, +\infty)\), and is Borel on \((\mathbb A, \mathcal F_{\mathbb A})\) for each fixed \(P \in \mathcal P\). If \(X=x\) is observed and our decision rule is \(T\), then our "loss" in making the decision \(T(x)\) is \(L(P, T(x))\); viewed as a function of the random sample \(X\), the loss \(L(P, T(X))\) is a random variable whose distribution depends on \(P\).
The average loss for the decision rule \(T\), which is called the risk of \(T\), is defined to be
\[R_T(P) = E[L(P,T(X))] = \int_{\mathcal X} L(P, T(x)) dP_X(x).\]
If the family \(\mathcal P\) is a parametric family indexed by \(\theta\), then the loss function and the risk can also be written as \(L(\theta, T(X))\) and \(R_T(\theta)\).
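As a quick illustration (my own addition, just the textbook special case): in a parametric estimation problem where \(\mathbb A\) is the parameter space and the loss is the squared error \(L(\theta, a) = (a - \theta)^2\), the risk of an estimator \(T(X)\) is simply its mean squared error,
\[R_T(\theta) = E\left[(T(X) - \theta)^2\right] = \operatorname{Var}(T(X)) + \left(E[T(X)] - \theta\right)^2.\]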
Randomized Decision
Sometimes it is more useful to consider randomized decision rules. A randomized decision rule is a function \(\delta\) on \(\mathcal X \times \mathcal F_{\mathbb A}\) such that, for every \(A \in \mathcal F_{\mathbb A}\), \(\delta(\cdot, A)\) is a Borel function and, for every \(x \in \mathcal X\), \(\delta(x, \cdot)\) is a probability measure on \((\mathbb A, \mathcal F_{\mathbb A})\).
The nonrandomized decision rule \(T\) previously discussed can be viewed as a special randomized decision rule with \(\delta (x, \{a\}) = I_{\{a\}}(T(x)), a \in \mathbb A, x \in \mathcal X\).
The loss function for a randomized rule \(\delta\) is defined as \(L(P,\delta,x) = \int_{\mathbb A} L(P,a) d\delta (x,a)\), and the risk is then
\[R_\delta(P) = E[L(P,\delta,X)] = \int_{\mathcal X} \int_{\mathbb A} L(P,a) d\delta (x,a) dP_X(x).\]
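As a sanity check (not spelled out above): plugging the degenerate rule \(\delta(x, \cdot)\) concentrated at \(T(x)\) into these definitions recovers the nonrandomized loss and risk,
\[L(P,\delta,x) = \int_{\mathbb A} L(P,a) \, d\delta (x,a) = L(P, T(x)), \qquad R_\delta(P) = \int_{\mathcal X} L(P, T(x)) \, dP_X(x) = R_T(P).\]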
Rule Comparison
A rule \(T_1(X)\) is as good as \(T_2(X)\) if and only if \(R_{T_1}(P) \leq R_{T_2}(P)\) for any \(P \in \mathcal P\).
A rule \(T_1(X)\) is better than \(T_2(X)\) if and only if \(R_{T_1}(P) \leq R_{T_2}(P)\) for any \(P \in \mathcal P\), and there exists at least one \(P \in \mathcal P\) s.t. \(R_{T_1} (P) < R_{T_2}(P)\).
Two decision rules \(T_1\) and \(T_2\) are equivalent if and only if \(R_{T_1}(P) = R_{T_2}(P)\) holds for any \(P\in \mathcal P\).
If there is a decision rule \(T^*\) that is as good as any other rule in \(\mathfrak F\), a class of allowable decision rules, then \(T^*\) is said to be \(\mathfrak F\)-optimal (or optimal if \(\mathfrak F\) contains all possible rules).
Hypothesis Testing
Fundamental Settings of Nonrandomized Testing
seen-them-all-before-yet-somehow-both-strange-and-familiar.jpg
Hypothesis Testing Problem
Let \(\mathcal P\) be a family of distributions, \(\mathcal P_0 \subset \mathcal P\) and \(\mathcal P_1 = \mathcal P \setminus \mathcal P_0\). A hypothesis testing problem can be formulated as that of deciding which of the following two statements is true:
\[H_0 : P \in \mathcal P_0 \quad \text{versus} \quad H_1 : P \in \mathcal P_1\]
The action space for this problem contains only two elements, i.e., \(\mathbb A = \{0,1\}\), where \(0\) is the action of accepting \(H_0\) and \(1\) is the action of accepting \(H_1\). A decision rule is called a test, and must have the form \(T(X) = I_C(X)\), in which \(C \in \mathcal F_{\mathcal X}\) is called the rejection region (because if \(X\in C\) we take \(T(X) = 1\), i.e. reject \(H_0\)).
Loss and Risk
A simple loss function for the problem is the \(0-1\) loss: \(L(P,a)=0\) if a correct decision is made and \(L(P,a)=1\) otherwise. Under this loss, the risk is
\[R_T(P) = \begin{cases} P(T(X)=1) = P(X\in C)& \quad P \in \mathcal P_0 \\ P(T(X)=0) = P(X \in C^c) &\quad P \in \mathcal P_1 \end{cases} = P(X \in C) I_{\mathcal P_0}(P) + P(X \notin C)I_{\mathcal P_1}(P).\]
There are two types of statistical errors we may commit: rejecting \(H_0\) when \(H_0\) is true (called the type I error) and accepting \(H_0\) when \(H_0\) is wrong (called the type II error).
In statistical inference, a test \(T\), which is a statistic from \(\mathcal X\) to \(\{0,1\}\), is assessed by the probabilities of making two types of errors (w.r.t. the \(0-1\) loss function):
\[\alpha_T(P) = P(T(X)=1) = P(X\in C) \quad P\in \mathcal P_0\]
\[1-\alpha_T(P) = P(T(X)=0) = P(X \notin C) \quad P \in \mathcal P_1\]
These two error probabilities cannot be minimized, or even both bounded by a fixed \(\alpha \in (0,1)\), simultaneously when the sample size is fixed: shrinking the rejection region \(C\) decreases the type I error probability but inflates the type II error probability, and vice versa.
How to Reach the Optimal Test
A common approach to finding an optimal test is to assign a small bound \(\alpha\) to one of the error probabilities (which also leads to a small rejection region), say, \(\alpha_T(P), P \in \mathcal P_0\), and then attempt to minimize the other one subject to
\[\sup_{P \in \mathcal P_0} \alpha_T(P) \leq \alpha.\]
The small bound \(\alpha\) is called the level of significance, and the left side is called the size of the test \(T\).
Actually we're using the minimax rule w.r.t. the type II error, i.e. to minimize \(\sup_{P \in \mathcal P_1} \left(1- \alpha_T(P)\right)\) under the constraint \(\sup_{P\in \mathcal P_0} \alpha_T(P) \leq \alpha\), which will give a minimax rule \(T_\alpha^*(X)\) as the test at level of significance \(\alpha\).
The Famous P-value
It's good practice to report, after observing \(x\), the smallest possible level of significance at which \(H_0\) would be rejected, i.e. \(\hat \alpha(x) = \inf \{\alpha \in (0,1) : T_\alpha^*(x)=1 \}\). Such an \(\hat \alpha (x)\) is also a statistic depending on the observed \(x\), and is called the p-value for the tests \(T_\alpha ^*\).
The test can then be written as \(T_\alpha^* (x) = I_{(0,\alpha)} (\hat \alpha(x))\), i.e. rejecting \(H_0\) at level \(\alpha\) is equivalent to the p-value being smaller than \(\alpha\).
Example of a Nonrandomized Parametric Test
Let's wrap up all this rambling with an example.
Example 1: Let \(X_1,X_2, \cdots, X_n\) be i.i.d. from the \(N(\mu,\sigma^2)\) distribution with an unknown \(\mu \in \mathbb R\) and a known \(\sigma^2 >0\). Consider the hypotheses \(H_0 : \mu \leq \mu_0 \quad \text{versus} \quad H_1 : \mu > \mu_0\), where \(\mu_0\) is a fixed constant. Since the sample mean \(\bar X\) is sufficient for \(\mu \in \mathbb R\), it is reasonable to consider the following class of tests: \(T_c(X) = I_{(c,\infty)}(\bar X)\), i.e. reject \(H_0\) when \(\bar X > c\).
By the property of normal distributions, \(\alpha_{T_c}(\mu) = P(T_c(X)=1) = 1-\Phi \left(\frac{\sqrt n (c - \mu)}{\sigma} \right)\), which is increasing in \(\mu\). For a given level of significance \(\alpha\), the type I error constraint \(\sup _{\mu \leq \mu_0} \alpha_{T_c}(\mu) = 1-\Phi \left(\frac{\sqrt n (c-\mu_0)}{\sigma} \right) \leq \alpha\) holds if and only if \(c \geq \sigma z_{1-\alpha} / \sqrt n + \mu_0\), where \(z_{1-\alpha}\) is the \((1-\alpha)\)-quantile of \(N(0,1)\).
The next step is to minimize the type II error probability \(1-\alpha_{T_c}(\mu)=\Phi \left(\frac{\sqrt n (c - \mu)}{\sigma} \right)\) over \(\mu > \mu_0\) subject to \(c \geq \sigma z_{1-\alpha} / \sqrt n + \mu_0\). Since it is increasing in \(c\), the optimal choice is \(c_\alpha ^*= \sigma z_{1-\alpha} / \sqrt n + \mu_0\), and the resulting test is \(T_{c_\alpha ^*} (X) = I_{(c_\alpha ^*, \infty)}(\bar X)\).
According to the definition of p-value \(\hat \alpha(x) = \inf\{\alpha \in (0,1): T_{c_\alpha ^*} (x) =1\}\), we can obtain \(\sigma z_{1-\hat \alpha(x)} / \sqrt n + \mu_0= \bar x\) for any observed \(x\). Thus \(\hat \alpha(x) =1-\Phi(\frac{\sqrt n}{\sigma} \left(\bar x - \mu_0 \right))\) as a function of the observed data.
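Below is a minimal numerical sketch of Example 1 (my own illustration: the data are simulated, and `mu0`, `sigma`, `n`, `alpha` are arbitrary choices), computing the threshold \(c_\alpha^*\), the test decision, and the p-value \(\hat \alpha(x)\).

```python
import numpy as np
from scipy.stats import norm

# --- arbitrary setup for illustration (not from the notes) ---
rng = np.random.default_rng(0)
mu0, sigma, n, alpha = 0.0, 2.0, 25, 0.05
x = rng.normal(loc=0.8, scale=sigma, size=n)  # simulated sample with true mu = 0.8
xbar = x.mean()

# optimal threshold: c*_alpha = sigma * z_{1-alpha} / sqrt(n) + mu0
c_star = sigma * norm.ppf(1 - alpha) / np.sqrt(n) + mu0

# the test T_{c*}(X) = I_{(c*, inf)}(xbar): reject H0 iff xbar > c*
reject = xbar > c_star

# p-value: hat_alpha(x) = 1 - Phi(sqrt(n) * (xbar - mu0) / sigma)
p_value = 1 - norm.cdf(np.sqrt(n) * (xbar - mu0) / sigma)

print(f"xbar = {xbar:.4f}, c* = {c_star:.4f}, reject H0: {reject}")
print(f"p-value = {p_value:.4f}  (reject at level alpha iff p-value < alpha)")
```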
Randomized Test
In Example 1, the equality in \(\sup_{P \in \mathcal P_0} \alpha_T(P) \leq \alpha\) can always be achieved by a suitable choice of \(c\). This is not true in general. In such cases where the equality can't be attained, we may consider randomized tests.
A randomized decision rule is a probability measure \(\delta(x,\cdot)\) on the action space for any fixed \(x\). Since the action space \((\mathbb A, \mathcal F_{\mathbb A} )\) contains only two points \(\mathbb A = \{0,1 \}\), any randomized test of a hypothesis testing problem is equivalent to a statistic \(T(X) \in [0,1]\), with \(T(x) = \delta(x,\{1\})\) and \(1-T(x) = \delta (x,\{0\})\); in other words, \(T(x)\) is the probability of rejecting \(H_0\) when \(X = x\) is observed.
A nonrandomized test is then the special case where \(T(X)\) never takes values in \((0,1)\), i.e. \(E[I_{(0,1)}(T(X))] = 0\).
For any randomized test \(T(X)\), we define the type I error probability to be \(\alpha_T(P) = E[T(X)], P \in \mathcal P_0\), and the type II error probability to be \(1-\alpha_T(P) = 1-E[T(X)], P \in \mathcal P_1\). The optimization procedure is the same as in the nonrandomized case.
Example 2: Assume that the sample \(X\) is from a binomial distribution \(B(\theta,n)\) with an unknown \(\theta \in (0,1)\) and a fixed integer \(n >1\). Consider the hypotheses \(H_0 : \theta \in (0, \theta_0] \quad \text{versus} \quad H_1 : \theta \in (\theta_0 ,1)\) and the following class of randomized tests:
\[T_{j,q}(X) = \begin{cases} 1 \quad & X>j \\ q \quad & X=j \\ 0 \quad & X <j \end{cases}\]
where \(j=0,1,\cdots,n-1\) and \(q \in [0,1]\). Then
\[\alpha_{T_{j,q}}(\theta) = P(X>j) + qP(X=j), \; 0 < \theta \leq \theta_0.\]
For any \(\alpha \in (0,1)\) there exists an integer \(j\) and \(q \in (0,1)\) such that the size of \(T_{j,q}\) (attained at \(\theta = \theta_0\), since the rejection probability is increasing in \(\theta\)) is exactly \(\alpha\), i.e. the bound \(\alpha\) can be attained exactly by a randomized test.
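A small sketch of how such a pair \((j, q)\) can be found numerically (my own illustration; `n`, `theta0`, `alpha` are arbitrary choices and scipy is assumed to be available): take the smallest \(j\) with \(P_{\theta_0}(X > j) \leq \alpha\), then solve \(P_{\theta_0}(X > j) + q P_{\theta_0}(X = j) = \alpha\) for \(q\).

```python
from scipy.stats import binom

def exact_size_binomial_test(n, theta0, alpha):
    """Return (j, q) with P_{theta0}(X > j) + q * P_{theta0}(X = j) == alpha.

    Since the rejection probability of T_{j,q} is increasing in theta,
    its size is attained at theta0, so this makes the size exactly alpha.
    """
    for j in range(n + 1):
        tail = binom.sf(j, n, theta0)          # P_{theta0}(X > j)
        if tail <= alpha:
            q = (alpha - tail) / binom.pmf(j, n, theta0)
            return j, q
    raise RuntimeError("unreachable for alpha in (0, 1)")

# arbitrary illustration values (not from the notes)
n, theta0, alpha = 10, 0.3, 0.05
j, q = exact_size_binomial_test(n, theta0, alpha)
size = binom.sf(j, n, theta0) + q * binom.pmf(j, n, theta0)
print(f"j = {j}, q = {q:.4f}, size = {size:.4f}")  # size equals alpha exactly
```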
Confidence Sets
Let \(\theta\) be a \(k\)-vector of unknown parameters related to the unknown population \(P \in \mathcal P\), and let \(C(X)\) be a set in the range of \(\theta\) that depends only on the sample \(X\). If \(\inf_{P \in \mathcal P} P(\theta \in C(X)) \geq 1-\alpha\), then \(C(X)\) is called a confidence set for \(\theta\) with level of significance \(1-\alpha\).
Actually, if the constraint above holds, the coverage probability of \(C(X)\) is at least \(1-\alpha\), although once we observe \(X=x\), the realized set \(C(x)\) either covers \(\theta\) or does not. To be more concrete, the coverage probability means that if we independently draw \(n\) observations of \(X\) from the population and construct the corresponding \(n\) confidence sets, then on average at least \(n(1-\alpha)\) of them cover \(\theta\).
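To see this frequency interpretation numerically, here is a tiny simulation (my own sketch; the interval \(\bar X \pm z_{1-\alpha/2}\, \sigma / \sqrt n\) for a normal mean with known \(\sigma\) is just one convenient choice of \(C(X)\)).

```python
import numpy as np
from scipy.stats import norm

# arbitrary illustration values (not from the notes)
rng = np.random.default_rng(1)
mu, sigma, n, alpha, reps = 1.0, 2.0, 30, 0.05, 10_000

z = norm.ppf(1 - alpha / 2)
covered = 0
for _ in range(reps):
    x = rng.normal(mu, sigma, size=n)
    half_width = z * sigma / np.sqrt(n)
    lo, hi = x.mean() - half_width, x.mean() + half_width
    covered += (lo <= mu <= hi)  # does this realized C(x) cover the true mu?

print(f"empirical coverage = {covered / reps:.3f}  (nominal: {1 - alpha})")
```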
Comments
Some irresponsible hot takes / rants (
Back in statistical inference, when wjd said "in general we lean toward rejecting \(H_0\)", I was genuinely stunned: can't I just swap \(H_0\) and \(H_1\) then (. If someone had told me up front that a decision rule is just an indicator function, with \(0\) corresponding to \(H_0\) and \(1\) corresponding to \(H_1\), would I really have spent a whole year never quite remembering which one is the type I error and which is the type II error (x
The other thing I couldn't wrap my head around was the randomized test. From the example she gave, I assumed it was just for tests like \(H_0: \theta = \theta _0 \; \text{versus} \; H_1 : \theta \neq \theta_0\), where you assign the point \(\theta = \theta_0\) some value \(q \in (0,1)\). The idea that the upper bound can't be attained exactly, which is a bit wasteful, so you top it up at a single point, was, I feel, never really explained,
though I can't rule out that classes had already moved online by then and I just wasn't paying attention (but then again, you can't really teach randomized rules first and then the minimax rule to explain why we shrink \(\sup \alpha_T(P)\); before any of that you'd have to explain what a measure even is, and honestly I suspect a lot of people finish the statistics minor without ever fully figuring out probability spaces and \(\sigma\)-fields,
for which dwl, who kept reassuring everyone that this and that wouldn't be on the exam, takes full responsibility. But sometimes I also admire how these minor courses manage to keep the whole story entirely at the elementary level and still explain it clearly (probably; I often can't follow, but everyone else seems to get it). It feels like I just never learned the essence before: the first four weeks of statistics looked a bit hard, but once you pick up the exponential-family tricks you steamroll everything, since nobody really knows conditional expectation anyway
(dwl's fault again), so of course no one actually computes sufficient statistics and UMVUEs from the definitions; through the second half of the semester I kept wondering whether this kind of statistics has anything to do with mathematics at all (: after the basic concepts it was just one example after another of hypothesis tests on normal distribution families, and in the end you still get examined on them like eight-legged essays. Maybe it's true that, restricted to the elementary level, there really isn't much left to learn (. Anyway, despite all the complaining, I still think well of last year's statistical inference course; it's just that if you really want to learn some mathematics,
rather than just steamrolling everyone for fun (and sometimes only getting counter-killed instead), you probably shouldn't take the statistics-minor courses, the vibe just doesn't match (. Which is to say, around this time last year I really had no confidence in myself (not this year either (