What Does Advanced Mathematical Statistics I Add on Top of Statistical Inference

Coasting through this one. Among the chaotic pile of courses this semester, the Statistics Center's offering looks like the easy little brother (x

Lecture 1

I sent Weichi Wu two emails for this course: one last semester asking how advanced it would be and whether I could enroll manually, and one a few days ago because I genuinely could not find the classroom anywhere. Both were read and never replied to. In the end I got the room from a friend from the class of 2019 who went to the Statistics Center, and just ambushed the class in person.

I had assumed he didn't welcome manual enrollment, but when I asked after class he said he never saw the emails ( and that anyone who wants in can just enroll. What. Then he asked whether I had seen today's material before and warned me the course might get quite hard later on, probably taking me for a statistics-minor student. Too lazy to explain again, and with no idea what would reassure him, my brain froze and I blurted out "it's fine, I'm from the math department" ( a free trial of standing up straight, so to speak.

The first lecture was extremely typical: probability theory & Lebesgue measure. I zoned out and worked on the thought questions for my intro-to-life-sciences class (

Still, I picked up an abuse of notation I had never really noticed before (and having written it up, the Chinese-English mix looks awful):

Product Measure

For a family of \(\sigma\)-finite measure spaces \((\Omega_i, \mathcal F_i,\mu_i)_{i=1}^n\), there exists a unique product measure \(\mu_1 \times \mu_2 \times \cdots \times \mu_n\) on \(\Omega_1 \times \Omega_2 \times \cdots \times \Omega_n\) such that

\[\mu_1 \times \mu_2 \times \cdots \times \mu_n (A_1 \times A_2 \times \cdots \times A_n) = \mu_1(A_1) \mu_2(A_2) \cdots \mu_n(A_n)\]

This is a very basic theorem; the only real notational trap is how the \(\sigma\)-field is written.

The most common product measure is Lebesgue measure on \(\mathbb R^d\), which I usually just write as \((\mathbb R^d , \mathcal R^d,m)\). But note that \(\mathcal R^d\) here is not the plain Cartesian product: for \(d \geq 2\), the Cartesian product of \(d\) copies of \(\mathcal R\) is not a \(\sigma\)-field (it is not even closed under complements). What is actually meant is the generated \(\sigma\)-field, \(\sigma(\mathcal R \times \mathcal R \times \cdots \times \mathcal R)\), which is what the shorthand \(\mathcal R^d\) denotes.

The same caveat applies to the general form.

In fact I never actually misunderstood this ( I had simply never thought about it. I have always understood \(\mathcal R^d\) from the angle of a "basis" (loosely speaking): analogously to the one-dimensional case, first define \(\mathcal A = \{(a_1 ,b_1) \times (a_2, b_2) \times \cdots \times (a_d ,b_d) \}\) and then take \(\mathcal R^d = \sigma(\mathcal A)\). Sometimes \(\mathcal R^d\) gets written as \(\mathcal B(\mathbb R^d)\), which is just as well, since when I write fast my \(\mathcal R\) and \(\mathbb R\) look identical anyway (
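The defining identity is easy to sanity-check on finite spaces, where a measure is just a table of weights. A minimal sketch (the dict representation and the toy numbers are my own):

```python
from itertools import product

def product_measure(mu1, mu2):
    """Product of two measures on finite spaces, each a dict point -> mass."""
    return {(w1, w2): m1 * m2
            for w1, m1 in mu1.items()
            for w2, m2 in mu2.items()}

def measure(mu, A):
    """mu(A) for a finite measure mu and a set A of points."""
    return sum(m for w, m in mu.items() if w in A)

mu1 = {0: 0.5, 1: 1.5}            # a (non-probability) measure
mu2 = {'a': 2.0, 'b': 3.0}
mu = product_measure(mu1, mu2)

# mu1 x mu2 (A1 x A2) = mu1(A1) * mu2(A2) on measurable rectangles
A1, A2 = {1}, {'a', 'b'}
rect = set(product(A1, A2))
assert measure(mu, rect) == measure(mu1, A1) * measure(mu2, A2)
```

On infinite spaces the rectangles only generate the product \(\sigma\)-field, which is exactly the point above.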

Lecture 2

Still real analysis and probability review; coasting.

Two small things I had never paid attention to before, one conceptual and one computational; worth a quick note.

Radon-Nikodym Derivative

You again ( a quick recap:

Let \(\nu, \lambda\) be two measures on \((\Omega, \mathcal F)\) and \(\nu\) be \(\sigma\)-finite. If \(\lambda \ll \nu\), then there exists a nonnegative Borel measurable function \(f\) on \(\Omega\) such that \(\lambda(A) = \int_A f \; d \nu\) for all \(A \in \mathcal F\). Furthermore, \(f\) is unique a.e. \(\nu\).

The last time I used the R-N theorem was for conditional expectation, but it already shows up as early as random variables: if \(P(A) = \int_A f \; d \nu\) and \(\int f \; d \nu =1\) hold for some \(f \geq 0\) a.e. \(\nu\), then \(f\) is the probability density function of the probability measure \(P\) with respect to \(\nu\). In essence, \(P\) is exactly the measure that the random variable \(X\) corresponding to \(f\) induces on \((\mathbb R , \mathcal B)\).

In particular, if the distribution function \(F\) of a random variable \(X\) is absolutely continuous, i.e. \(F(x) = \int_{-\infty} ^x f(y) \; d y\) for \(x \in \mathbb R\), then the probability measure \(P\) corresponding to \(F\) satisfies \(P(A) = \int_A f \; dm\), where \(m\) is Lebesgue measure on \(\mathbb R\). This is the foundation of probability density functions: \(f\) here is called the probability density function of \(P\) (or \(F\)) with respect to Lebesgue measure, which is precisely the R-N derivative. When I first learned (measure-theoretic) probability I understood none of this; my head was full of the intro-course style of integration, the notation was a mess, and I did not care at all. How embarrassing (

Here is a simple and fun example worth writing out, if only to get used to the notation.

Example: Let \(F_i\) be a c.d.f. having a Lebesgue p.d.f. \(f_i\), \(i=1,2\). Assume that there is a \(c \in \mathbb R\) such that \(F_1(c) < F_2(c)\). Define

\[F(x) = \begin{cases} F_1(x), & -\infty < x <c \\ F_2(x), & c \leq x < + \infty \end{cases}\]

Show that the probability measure \(P\) corresponding to \(F\) satisfies \(P \ll m + \delta_c\) and find \(dP / d(m+ \delta_c)\).

(in which \(\delta_c(A) = \begin{cases} 1, & c \in A \\ 0, & c \notin A \end{cases}\) for \(A \in \mathcal B\).)

Solution: Take

\[f(x)=dP/d(m+\delta_c) = \mathbb 1_{(-\infty, c)} (x) f_1(x)+ \mathbb 1_{(c , +\infty)}(x) f_2(x) + \mathbb 1_{\{c\}}(x) (F_2(c)-F_1(c))\]

and verify that it's the desired expression of Radon-Nikodym derivative.

For any \(A \in \mathcal B\), there is

\[\begin{aligned} \int_A f(x) \; d(m +\delta_c) &= \int_A \mathbb 1_{(-\infty, c)} (x) f_1(x)+ \mathbb 1_{(c , +\infty)}(x) f_2(x) + \mathbb 1_{\{c\}}(x) (F_2(c)-F_1(c)) \; d(m +\delta_c)\\ &= \int_A \mathbb 1_{(-\infty, c)} (x) f_1(x) \; dm +\int_A \mathbb 1_{(c , +\infty)}(x) f_2(x) \; dm + \int_A \mathbb 1_{\{c\}}(x) (F_2(c)-F_1(c)) \; d\delta_c \\ &= \int_A \mathbb 1_{(-\infty, c)} (x) f_1(x) \; dm +\int_A \mathbb 1_{(c , +\infty)}(x) f_2(x) \; dm + (F_2(c)-F_1(c))\,\delta_c(A) \\ &=P(A) \end{aligned}\]

Moreover, \(f(x)\) is a nonnegative Borel function, thus \(P \ll (m + \delta_c)\) and \(dP / d(m+\delta_c)=f(x)\) is the R-N derivative.
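The mixed derivative can also be checked numerically. Below is a sketch with my own concrete choice \(F_1 =\) c.d.f. of Uniform\((0,1)\), \(F_2 =\) c.d.f. of Uniform\((0,\tfrac 1 2)\), and \(c = 0.5\), so that \(F_1(c) = 0.5 < 1 = F_2(c)\):

```python
# Concrete instance (my own choice): F1 = Uniform(0,1), F2 = Uniform(0,0.5), c = 0.5.
c = 0.5
f1 = lambda x: 1.0 if 0 < x < 1 else 0.0        # Lebesgue p.d.f. of F1
f2 = lambda x: 2.0 if 0 < x < 0.5 else 0.0      # Lebesgue p.d.f. of F2
F1 = lambda x: min(max(x, 0.0), 1.0)
F2 = lambda x: min(max(2 * x, 0.0), 1.0)
F = lambda x: F1(x) if x < c else F2(x)         # the patched c.d.f.

jump = F2(c) - F1(c)                            # mass carried by the atom at c

def f(x):
    """Candidate R-N derivative dP/d(m + delta_c)."""
    if x < c:
        return f1(x)
    if x > c:
        return f2(x)
    return jump

def P(x, steps=200_000):
    """P((-inf, x]) = integral of f dm + f(c) * delta_c((-inf, x])."""
    lo = -1.0                                   # both densities vanish below 0
    h = (x - lo) / steps
    lebesgue = sum(f(lo + (i + 0.5) * h) for i in range(steps)) * h
    return lebesgue + (jump if x >= c else 0.0)

for x in (0.2, 0.49, 0.5, 0.7, 2.0):
    assert abs(P(x) - F(x)) < 1e-3              # P matches the patched c.d.f.
```

The midpoint-rule integral handles the Lebesgue part and the indicator handles the atom, mirroring the three terms of \(f\) above.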

Interchange of Differentiation and Integration

Just noting this down.

Let \((\Omega, \mathcal F, \nu)\) be a measure space and for any fixed \(\theta \in \mathbb R\), let \(f(\omega,\theta)\) be a Borel function on \(\Omega\). Suppose that \(\partial f(\omega, \theta) / \partial \theta\) exists a.e. for \(\theta \in (a,b) \subset \mathbb R\) and that \(|\partial f(\omega, \theta) / \partial \theta| \leq g(\omega)\) a.e., where \(g\) is an integrable function on \(\Omega\). Then, for each \(\theta \in (a,b)\), \(\partial f(\omega, \theta) / \partial \theta\) is integrable and,

\[\frac{d}{d\theta} \int f(\omega,\theta) \; d \nu = \int \frac{\partial f(\omega, \theta)}{\partial \theta} \; d \nu\]
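A quick numerical sanity check, with my own toy choice \(f(\omega, \theta) = e^{-\theta \omega}\) on \(\Omega = [0,1]\) with Lebesgue measure; here \(|\partial f/\partial \theta| = \omega e^{-\theta\omega} \leq \omega =: g(\omega)\) for \(\theta > 0\), which is integrable:

```python
import math

def integrate(g, a=0.0, b=1.0, steps=100_000):
    """Midpoint-rule integral of g over [a, b]."""
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h

I = lambda theta: integrate(lambda w: math.exp(-theta * w))

theta, eps = 0.7, 1e-4
lhs = (I(theta + eps) - I(theta - eps)) / (2 * eps)    # d/dtheta of the integral
rhs = integrate(lambda w: -w * math.exp(-theta * w))   # integral of d/dtheta
assert abs(lhs - rhs) < 1e-6
```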

Lecture 3

The semester is \(\frac 1 4\) over; why are we still doing probability theory? Isn't this course called mathematical statistics? (scratching head

A quick review of a few conditional expectation facts.

Conditional Expectation

Let \(X\) be a random \(n\)-vector and \(Y\) a random \(m\)-vector. Suppose that \((X,Y)\) has a joint p.d.f. \(f(x,y)\) w.r.t. \(\nu \times \lambda\), where \(\nu\) and \(\lambda\) are \(\sigma\)-finite measures on \((\mathbb R^n, \mathcal B^n)\) and \((\mathbb R^m , \mathcal B^m)\) respectively. Let \(g(x,y)\) be a Borel function on \(\mathbb R^{m+n}\) for which \(E|g(X,Y)|<+\infty\). Then:

\[E[g(X,Y) | Y] = h(Y) = \frac{\int g(x,Y) f(x,Y) d \nu(x)}{\int f(x,Y) d \nu(x)} \quad a.s.\]

This is trivial, and yet any problem involving a slightly concrete p.d.f. computation still trips me up; dwl's intro probability course really did lasting damage.
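On a finite grid with \(\nu = \lambda =\) counting measure the formula is just a weighted average, which makes it harder to get lost. A toy sketch (the joint p.m.f. and \(g\) are my own numbers):

```python
# Joint p.m.f. of (X, Y) w.r.t. counting measure x counting measure.
f = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.4}
g = lambda x, y: x * (y + 1)                     # any Borel g with E|g| finite

def h(y):
    """E[g(X, Y) | Y = y] via the ratio of nu-integrals (sums here)."""
    num = sum(g(x, yy) * p for (x, yy), p in f.items() if yy == y)
    den = sum(p for (x, yy), p in f.items() if yy == y)
    return num / den

assert abs(h(0) - (0.3 / 0.4)) < 1e-12           # (0*0.1 + 1*0.3) / (0.1+0.3)
assert abs(h(1) - (0.8 / 0.6)) < 1e-12           # (0*0.2 + 2*0.4) / (0.2+0.4)
```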

Dominated Convergence Theorem

In the last Probability Theory 2 lecture we covered a lemma:

Suppose \(Y_n \to Y \ a.s.\) and \(|Y_n | \leq Z\) for all \(n\) where \(EZ < \infty\). If \(\mathcal F_n \uparrow \mathcal F_{\infty}\) then \(E(Y_n | \mathcal F_n) \to E(Y|\mathcal F_{\infty}) \quad a.s.\)

Roughly, the proof splits into two steps, \(E(Y_n|\mathcal F_n) \to E(Y|\mathcal F_n)\) and then \(E(Y|\mathcal F_n) \to E(Y|\mathcal F_{\infty})\), glued together by the triangle inequality. The first step mirrors the dominated convergence theorem for conditional expectations. I did learn this when studying conditional expectation, but of course remembered none of it when it mattered (

Suppose \(Y_n \to Y \ a.s.\) and \(|Y_n | \leq Z\) for all \(n\) where \(E|Z| < \infty\). Then \(E(Y_n | \mathcal F) \to E(Y|\mathcal F) \quad a.s.\)

Conditional Independence

Rather confusing (

Definition: We say given \(Y_1\), \(X\) and \(Y_2\) are conditionally independent iff

\[P(A|Y_1,Y_2) = P(A|Y_1) \; \text{a.s. for any }A \in \sigma(X).\]

Lemma: If \(Y_2\) and \((X,Y_1)\) are independent, then given \(Y_1\), \(X\) and \(Y_2\) are conditionally independent. Suppose \(E|X| < +\infty\) then:

\[E(X|Y_1,Y_2) = E(X| Y_1) \; a.s.\]

Proof: First of all, \(E(X|Y_1)\) is \(\sigma(Y_1,Y_2)\)-measurable since \(\sigma(Y_1) \subset \sigma(Y_1,Y_2)\). It remains to show that the two conditional expectations agree a.s. as \(\sigma(Y_1,Y_2)\)-measurable functions.

Then it suffices, by a standard \(\pi\)-\(\lambda\) argument, to show for product sets \(B = B_1 \times B_2 \in \mathcal B^{n_1+ n_2}\),

\[\int_{(Y_1,Y_2) \in B} X dP = \int_{(Y_1,Y_2) \in B} E(X|Y_1,Y_2)dP = \int_{(Y_1,Y_2) \in B} E(X|Y_1) dP\]

Note that

\[\begin{aligned}\int_{(Y_1,Y_2) \in B} E(X|Y_1)\,dP &= \int E(X|Y_1) \mathbb 1_{(Y_1 \in B_1)} \mathbb 1_{(Y_2 \in B_2)}\,dP = \int E(X|Y_1) \mathbb 1_{(Y_1 \in B_1)}\,dP \int \mathbb 1_{(Y_2 \in B_2)}\,dP \\ & = E(X; Y_1 \in B_1)\, P(Y_2 \in B_2) = \int X \mathbb 1_{(Y_1 \in B_1)}\,dP \int \mathbb 1_{(Y_2 \in B_2)}\,dP = \int_{(Y_1,Y_2) \in B}X\, dP \end{aligned}\]

The factorizations above follow from the independence of \(Y_2\) and \((X,Y_1)\); reassembling uses the defining property of \(E(X|Y_1)\).
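The lemma is transparent on finite spaces: independence of \(Y_2\) from \((X, Y_1)\) makes the extra conditioning on \(Y_2\) cancel out of the ratio. A toy sketch (the p.m.f. is my own construction):

```python
from itertools import product

pXY1 = {(0, 0): 0.1, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.4}  # (X, Y1) dependent
pY2 = {0: 0.6, 1: 0.4}                                       # Y2 independent of both

joint = {(x, y1, y2): p * q
         for (x, y1), p in pXY1.items() for y2, q in pY2.items()}

def E_X_given(fixed):
    """E[X | coordinates in `fixed`], e.g. {1: 0} conditions on Y1 = 0."""
    num = sum(w[0] * p for w, p in joint.items()
              if all(w[i] == v for i, v in fixed.items()))
    den = sum(p for w, p in joint.items()
              if all(w[i] == v for i, v in fixed.items()))
    return num / den

# E(X | Y1, Y2) = E(X | Y1) pointwise, as the lemma predicts
for y1, y2 in product((0, 1), (0, 1)):
    assert abs(E_X_given({1: y1, 2: y2}) - E_X_given({1: y1})) < 1e-12
```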

Lecture 4

No way people have already forgotten all those analytic results about convergence in distribution, right? (

Let me just recite the menu; there is not much else to write, since most of this was learned before and merely forgotten (

Weak Convergence

Analytical Properties

  • (Polya's Theorem) If \(F_n \stackrel{w}{\to} F\) and \(F\) is continuous in \(R^k\), then \(\lim_{n\to \infty} \sup_{x \in R^k}|F_n(x) - F(x)| =0\).

    Proof (sketched in one dimension): partition \(-\infty = x_0 < x_1 < \cdots < x_k = +\infty\) s.t. \(F(x_i) = \frac i k\), so that \(|F_n(x) - F(x)|\) can be bounded by finitely many terms.

    For any \(x \in R\), consider \(x_i \leq x < x_{i+1}\); therefore

    \[F_n(x) - F(x) \leq F_n(x_{i+1})- F(x_i) = F_n(x_{i+1})- F(x_{i+1})+\frac 1 k\]

    \[F_n(x)-F(x) \geq F_n(x_i) - F(x_{i+1}) = F_n(x_i) - F(x_{i})-\frac 1k,\]

    and \(|F_n(x)-F(x)| \leq \max_{i=0,1,\cdots,k} |F_n(x_i) - F(x_i)|+\frac 1k\). By taking \(n \to \infty\) the desired result follows.

  • (Skorohod's Theorem) If \(X_n \stackrel{w}{\to}X\), then there are random vectors \(Y,Y_1,\cdots\) defined on a common probability space such that \(P_Y = P_X\), \(P_{Y_n} = P_{X_n}\) holds for any \(n \in \mathbb Z^+\), and \(Y_n \stackrel{a.s.}{\to} Y\).

  • If \(X_n \stackrel{w}{\to} X\), then there is a subsequence \(\{X_{nj}, j =1,2,\cdots\}\) such that \(X_{nj} \stackrel{a.s.}{\to} X\) as \(j \to \infty\).
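Polya's theorem above is easy to watch numerically. A toy sketch with my own choice \(F_n(x) = \Phi(x - 1/n)\), which converges weakly to the continuous \(F = \Phi\):

```python
import math

Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))   # N(0,1) c.d.f.

def sup_diff(n):
    """Grid approximation of sup_x |F_n(x) - F(x)| with F_n(x) = Phi(x - 1/n)."""
    xs = [-6 + 12 * i / 4000 for i in range(4001)]
    return max(abs(Phi(x - 1.0 / n) - Phi(x)) for x in xs)

d10, d100, d1000 = sup_diff(10), sup_diff(100), sup_diff(1000)
assert d10 > d100 > d1000          # the uniform distance shrinks
assert d1000 < 1e-3                # roughly phi(0)/n, as expected
```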

Tightness

A sequence \(\{P_n\}\) of probability measures on \((R^k,B^k)\) is tight iff for every \(\varepsilon >0\), there exists a compact set \(C \subset R^k\) s.t. \(\inf_n P_n (C) > 1-\varepsilon\). That is, \(P_n(C)\) is large uniformly in \(n\).

If \(\{X_n\}\) is a sequence of random \(k\)-vectors, then the tightness of \(\{P_{X_n}\}\) is the same as the boundedness of \(\{\|X_n\|\}\) in probability.

  • Tightness of \(\{P_n\}\) is a necessary and sufficient condition that every subsequence \(\{P_{n_i}\}\) contains a further subsequence \(\{P_{n_{i_j}}\}\) converging weakly to some probability measure \(P\) as \(j \to \infty\).
  • If \(\{P_n\}\) is tight and if each subsequence that converges weakly converges to the same probability measure \(P\), then \(P_n \stackrel{w}{\to} P\).

Characteristic Function

  • \(X_n \stackrel{w}{\to} X\) is equivalent to any of the following conditions:

    • \(E[h(X_n)] \to E[h(X)]\) for every bounded continuous function \(h\).
    • \(\lim \sup_n P_{X_n}(C) \leq P_X(C)\) holds for any closed set \(C \subset R^k\).
    • \(\lim \inf_n P_{X_n}(C) \geq P_X(C)\) holds for any open set \(C\subset R^k\).
  • (Cramer-Wold) \(X_n \stackrel{d}{\to}X\) iff \(c^TX_n \stackrel{d}{\to} c^TX\) for every \(c \in R^k\).

  • (Levy-Cramer) Let \(\phi_X, \phi_{X_1}, \cdots\) be the ch.f.'s of \(X,X_1,X_2,\cdots\) respectively. \(X_n \stackrel{d}{\to} X\) iff \(\lim_{n\to \infty} \phi_{X_n}(t) = \phi_X(t)\) for all \(t \in R^k\).

    The proof is worth remembering.

  • (Scheffe) Let \(\{f_n\}\) be a sequence of p.d.f.'s on \(R^k\) w.r.t. a measure \(\nu\). Suppose that \(\lim_{n \to \infty}f_n(x) = f(x)\) a.e.\(\nu\) and \(f(x)\) is a p.d.f. w.r.t. \(\nu\). Then \(\lim_{n \to \infty} \int|f_n(x)-f(x)| d \nu=0\).

Mapping

  • Let \(X,X_1,\cdots\) be random \(k\)-vectors defined on a probability space and \(g\) be a measurable function from \((R^k, B^k)\) to \((R^i, B^i)\). Suppose that \(g\) is continuous a.s. \(P_X\), then:
    • \(X_n \stackrel{a.s.}{\to} X\) implies \(g(X_n) \stackrel{a.s.}{\to}g(X)\)
    • \(X_n \stackrel{p}{\to} X\) implies \(g(X_n) \stackrel{p}{\to}g(X)\)
    • \(X_n \stackrel{w}{\to} X\) implies \(g(X_n) \stackrel{w}{\to}g(X)\)
  • (Slutsky) Let \(X,X_1,\cdots, Y, Y_1,\cdots\) be random variables on a probability space. Suppose \(X_n \stackrel{w}{\to} X\) and \(Y_n \stackrel{w}{\to} c\), where \(c\) is a real number. Then
    • \(X_n + Y_n \stackrel{w}{\to} X +c\)
    • \(X_nY_n \stackrel{w}{\to} cX\)
    • \(X_n / Y_n \stackrel{w}{\to} X /c\) if \(c \neq 0\)
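Slutsky can be eyeballed by simulation. A sketch under my own toy construction: \(X_n\) a standardized mean of uniforms (so \(X_n \stackrel{w}{\to} N(0,1)\) by the CLT) and \(Y_n = c + O_p(1/n) \stackrel{p}{\to} c\); then \(X_n + Y_n\) should look like \(N(c, 1)\):

```python
import random
import statistics

random.seed(0)
c, n, reps = 2.0, 400, 5000

def draw():
    u = [random.random() for _ in range(n)]
    xn = (sum(u) / n - 0.5) / ((1 / 12) ** 0.5 / n ** 0.5)  # approx N(0,1)
    yn = c + random.gauss(0, 1) / n                          # -> c in probability
    return xn + yn

s = [draw() for _ in range(reps)]
assert abs(statistics.mean(s) - c) < 0.1     # limit N(c, 1): mean near c
assert abs(statistics.stdev(s) - 1.0) < 0.1  # limit N(c, 1): sd near 1
```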

Notations from Calculus

So, I really do not like big \(O\) and little \(o\), but they are just too common (

The same notation exists for sequences: given \(\{a_n\}, \{b_n\}\), we write \(a_n = O(b_n)\) if \(|a_n| \leq C|b_n|\) holds for all \(n\) and some constant \(C\), and \(a_n = o(b_n)\) if \(a_n / b_n \to 0\) as \(n\to \infty\). Probability has analogous notions:

Let \(X_1, X_2, \cdots\) be random vectors and \(Y_1,Y_2,\cdots\) be random variables defined on a common probability space.

  • \(X_n = O(Y_n)\) a.s. iff \(P(\omega: \|X_n(\omega)\| \leq C|Y_n(\omega)| \text{ holds for any }n \geq 1)=1\), where the constant \(C\) may depend on \(\omega\)
  • \(X_n = o(Y_n)\) a.s. iff \(X_n / Y_n \stackrel{a.s.}{\to} 0\).
  • \(X_n = O_p(Y_n)\) iff for any \(\varepsilon>0\), there is a constant \(C_\varepsilon>0\) s.t. \(\sup_n P(\omega : \|X_n(\omega) \| \geq C_\varepsilon |Y_n(\omega)|) <\varepsilon\)
  • \(X_n = o_p(Y_n)\) iff \(X_n / Y_n \stackrel{p}{\to} 0\).

Therefore

  • \(X_n = o_p(Y_n)\) implies \(X_n = O_p(Y_n)\)
  • \(X_n = O_p(Y_n)\) and \(Y_n = O_p(Z_n)\) implies \(X_n = O_p(Z_n)\)
  • \(X_n = O_p(Y_n)\) does not imply \(Y_n = O_p(X_n)\)
  • The same conclusions can be obtained if \(O_p(\cdot),o_p(\cdot)\) are replaced by \(O(\cdot)\) a.s., \(o(\cdot)\) a.s..
  • If \(X_n \stackrel{w}{\to} X\) then \(X_n = O_p(1)\)
  • Since \(a_n = O(1)\) means \(\{a_n\}\) is bounded, \(\{X_n\}\) is said to be bounded in probability iff \(X_n = O_p(1)\).
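The claim that weak convergence implies \(O_p(1)\) can be poked at by Monte Carlo: for \(\sqrt n(\bar X_n - \mu)\), one constant \(C\) keeps the tail probability small uniformly in \(n\). A sketch (the uniform sample and the constants are my own choices):

```python
import random

random.seed(1)
C, reps = 4.0, 2000
sigma = (1 / 12) ** 0.5                     # sd of Uniform(0, 1)

def tail_prob(n):
    """Monte Carlo estimate of P(|sqrt(n)(Xbar_n - 1/2)| >= C * sigma)."""
    hits = 0
    for _ in range(reps):
        xbar = sum(random.random() for _ in range(n)) / n
        if abs(n ** 0.5 * (xbar - 0.5)) >= C * sigma:
            hits += 1
    return hits / reps

# sqrt(n)(Xbar_n - mu) converges weakly (CLT), hence is O_p(1):
worst = max(tail_prob(n) for n in (50, 200, 800))
assert worst < 0.01
```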

Delta Method

Let \(X_1,X_2,\cdots\) and \(Y\) be random \(k\)-vectors satisfying \(a_n(X_n - c) \stackrel{w}{\to}Y\), where \(c \in R^k\) and \(\{a_n\}\) is a sequence of positive numbers with \(\lim_{n \to \infty} a_n =\infty\). Let \(g\) be a function from \(R^k\) to \(R\).

  • If \(g\) is differentiable at \(c\), then \(a_n [g(X_n) - g(c)] \stackrel{w}{\to} [\nabla g(c)]^TY\).

  • Suppose \(g\) has continuous partial derivatives of order \(m >1\) in a neighborhood of \(c\), with all the partial derivatives of order \(j\), \(1 \leq j \leq m-1\) vanishing at \(c\), but with the \(m\)-th order partial derivatives not all vanishing at \(c\). Then

    \[a_n^m [g(X_n)-g(c)] \stackrel{w}{\to} \frac{1}{m!} \sum_{i_1=1}^k \cdots \sum_{i_m=1}^k \frac{\partial^m g}{\partial x_{i_1} \cdots \partial x_{i_m}}\mid _{x=c} Y_{i_1}\cdots Y_{i_m} \]

  • If \(Y\) has the \(N_k(0,\Sigma)\) distribution, then \(a_n [g(X_n) - g(c)] \stackrel{w}{\to} N(0,[\nabla g(c)]^T\Sigma \nabla g(c))\)
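The first bullet is easy to check by simulation. A sketch with my own toy choice \(X_n = \bar X_n\) for Bernoulli\((p)\) samples, \(a_n = \sqrt n\), \(c = p\) and \(g(x) = x^2\), so the limit is \(N(0, (2p)^2 p(1-p))\):

```python
import random
import statistics

random.seed(2)
p, n, reps = 0.3, 2000, 3000
g = lambda x: x * x                          # g'(p) = 2p

def one_draw():
    xbar = sum(random.random() < p for _ in range(n)) / n
    return n ** 0.5 * (g(xbar) - g(p))       # a_n [g(X_n) - g(c)]

draws = [one_draw() for _ in range(reps)]
target_sd = 2 * p * (p * (1 - p)) ** 0.5     # |g'(p)| times sd of one Bernoulli
assert abs(statistics.mean(draws)) < 0.05
assert abs(statistics.stdev(draws) - target_sd) < 0.05
```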

