2019-05-04

18.6501x Bayesian Statistics (Unit 5) Checklist

edX Statistics Data Science Bayes MITx

What you learned

Lec 17: Introduction to Bayesian Statistics

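frequentist
- Classical statistics, i.e. the statistics covered through Unit 4. The term is used in contrast to Bayesian.
frequentist vs bayesian
- bayesian
  - Characteristics
    - Encode the prior belief as a prior distribution, update it with the data, and obtain the posterior distribution
  - true parameter
    - Modeled as a r.v., which expresses the uncertainty regarding the true parameter
  - What you specify
    - set of possible parameters
    - prior distribution π(theta)
- frequentist
  - Characteristics
    - Estimate from the data alone
    - Treat the true parameter theta as fixed and estimate it (MLE, MM, M-estimation)
  - true parameter
    - Not a r.v.
  - What you specify
    - statistical model for the observations
      - set of possible parameters
      - probability model
Beta distribution
- The beta distribution as a statistical tool
- A distribution whose shape can be bent freely, like wirework
- Convenient for expressing a distribution that reflects a prior belief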


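Designing the prior
- If the parameter is a probability p, use e.g. a uniform or a beta distribution
- Choose a distribution that reflects the prior belief
prior and posterior
- prior
  - Conventionally denoted by π
- data/experiment
  - Ln(·|theta) is the conditional joint likelihood, i.e. the joint pdf/pmf of the data conditional on theta, with theta held fixed.
  - This is the same as the frequentist likelihood
- posterior
  - The distribution of theta conditional on the data X1, ..., Xn is called the posterior
  - It is proportional to the likelihood times the prior (before normalization), which is written in proportionality notation: π(theta | X1, ..., Xn) ∝ π(theta) Ln(X1, ..., Xn | theta)
  - The normalization constant is the constant that makes the posterior integrate to 1
- you have a likelihood
  - frequentist
    - maximize this thing
  - Bayesian
    - multiply the likelihood by a prior to get a posterior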
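As a rough sketch of "posterior ∝ likelihood × prior", the following evaluates the unnormalized posterior on a grid and normalizes it numerically. The data (7 successes in 10 Bernoulli trials), the grid size, and the function names are assumptions made up for illustration.

```python
def grid_posterior(data, prior, grid):
    """Posterior weights on `grid`: likelihood times prior, then normalized to sum to 1."""
    k, n = sum(data), len(data)
    # Unnormalized posterior: Bernoulli likelihood L_n(data | p) times the prior pi(p).
    unnormalized = [p ** k * (1 - p) ** (n - k) * prior(p) for p in grid]
    total = sum(unnormalized)  # stands in for the normalization constant
    return [u / total for u in unnormalized]


def uniform_prior(p):
    """pi(p) proportional to 1 on (0, 1)."""
    return 1.0


# Hypothetical data: 7 successes in 10 Bernoulli trials.
data = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
grid = [(i + 0.5) / 1000 for i in range(1000)]  # midpoints of a grid on (0, 1)

posterior = grid_posterior(data, uniform_prior, grid)
posterior_mean = sum(p * w for p, w in zip(grid, posterior))
map_estimate = grid[max(range(len(grid)), key=lambda i: posterior[i])]
print(posterior_mean, map_estimate)
```

With the uniform prior the exact posterior is Beta(8, 4), so the grid values should land near its mean 8/12 and its mode 0.7, which is also the MLE 7/10.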
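Non-informative priors
- The Bayesian approach can still be used without prior information. How should the prior be chosen in that case?
  1. constant pdf : π(θ) ∝ 1
  2. bounded parameter space : uniform
  3. unbounded parameter space : no proper pdf can be defined
    - improper prior : a measurable, non-negative function π(θ) that is not integrable, i.e. its integral diverges instead of converging to a finite value.
    - The Bayesian steps can still be applied with an improper prior.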
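A worked example of the Bayesian steps with an improper prior. The Gaussian model with known variance 1 is an assumption chosen for illustration; it is not taken from the notes above.

```latex
% Assumed model: X_1, ..., X_n i.i.d. N(\theta, 1), improper prior \pi(\theta) \propto 1 on \mathbb{R}.
\pi(\theta \mid X_1, \dots, X_n)
  \propto \pi(\theta)\, L_n(X_1, \dots, X_n \mid \theta)
  \propto \exp\!\Big(-\tfrac{1}{2} \sum_{i=1}^{n} (X_i - \theta)^2\Big)
  \propto \exp\!\Big(-\tfrac{n}{2} (\theta - \bar{X}_n)^2\Big)
```

The last step drops the factor that does not depend on θ, using Σ(X_i − θ)² = Σ(X_i − X̄_n)² + n(X̄_n − θ)². The result is the kernel of N(X̄_n, 1/n), so the posterior is a proper distribution even though the prior is not.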

What you noticed

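Be careful when reading the prior π(theta): it is a distribution over theta, so theta takes the place that x usually has. Do not mistake theta for a parameter of that distribution.
Get used to proportionality notation. The basic idea is to drop every factor that does not depend on the parameter and work with the simplified form.
Proportionality notation is quite important in the process of computing the posterior distribution for a parameter of interest.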

Lec 18: Jeffreys Prior and Bayesian Confidence Interval

Explain the important factors involved in choosing a prior distribution.
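- For a Bernoulli experiment
  - prior
    1. Beta(a,a) : the informative case, when some prior information is available before the experiment
      - Well suited to describing the distribution of a single parameter that represents a probability
    2. Uniform : the non-informative case
      - The maximum a posteriori (MAP) estimate then coincides with the MLE
Distinguish between conjugate priors and non-conjugate priors.
- conjugate : the prior and the posterior belong to the same family of distributions
  - The Beta distribution is particularly well suited to Bayesian analysis: with a Bernoulli likelihood, a Beta prior yields a Beta posterior.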
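The conjugacy of the Beta prior for Bernoulli data can be checked directly in proportionality notation. The general Beta(a, b) prior below is an assumption for illustration; setting b = a gives the Beta(a, a) case from the notes.

```latex
% Assumed setup: X_1, ..., X_n i.i.d. Ber(p), prior p ~ Beta(a, b).
\pi(p) \propto p^{a-1} (1-p)^{b-1}, \qquad
L_n(X_1, \dots, X_n \mid p) = p^{\sum_i X_i} (1-p)^{n - \sum_i X_i}

\pi(p \mid X_1, \dots, X_n)
  \propto p^{a + \sum_i X_i - 1} (1-p)^{b + n - \sum_i X_i - 1}
```

This is the kernel of Beta(a + ΣX_i, b + n − ΣX_i), so the posterior stays in the Beta family.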
Compute Jeffreys Prior and understand the intuition behind its significance.
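- Jeffreys Prior
  - Def
    - πJ(θ) ∝ √(det I(θ))
      - Defined through the Fisher information I(θ). When d = 1 it is simply the square root of the Fisher information.
  - Intuition
    - Another prior for the non-informative case
    - The idea seems to be that a prior tied to the statistical model (distribution) of the observations is convenient to have
    - The distribution of the experiment determines the prior
    - This prior depends on the statistical model used for the observation data and the likelihood function.
  - property
    1. The Jeffreys prior gives more weight to values of theta whose MLE estimate has less uncertainty.
    2. As a result, the Jeffreys prior yields more weight to values of theta where the data has more information towards deciding the parameter.
    3. The Fisher information can be taken as a proxy for how much, at a particular parameter value theta, equivalent shifts to the parameter would influence the data. Thus, the Jeffreys prior gives more weight to regions where the potential outcomes are more sensitive to slight changes in theta.
  - The above is somewhat reminiscent of the Q factor of a resonant circuit, in that the sharpness of the shape is tied to the sensitivity.
  - In other words, the larger the Fisher information at theta, the more sensitive the data are there and the more weight the Jeffreys prior assigns
  - reparametrization invariance
    - Not fully understood yet
    - When the parameter is re-expressed through a new parameter, the Jeffreys prior is invariant under the change of parameter
    - Simply substituting the new parameter into the Jeffreys prior is not enough. The Fisher information has to be recomputed in the new parameter, and a term with the derivative of the original parameter with respect to the new one appears, so the change of variables needs care (although it is only high-school-level calculus).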
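A small numerical check of the reparametrization property, as a sketch under assumptions: the model is Ber(p), the new parameter is the log-odds eta = log(p / (1 - p)), and all function names are hypothetical. The Fisher information is recomputed in eta as E[score²] and compared with the Jeffreys prior in p times |dp/d eta|.

```python
import math


def fisher_info_p(p):
    """Fisher information of Ber(p) in the parameter p, computed as E[score^2]."""
    score = {1: 1.0 / p, 0: -1.0 / (1.0 - p)}  # d/dp log f(x; p) for x = 1 and x = 0
    return p * score[1] ** 2 + (1.0 - p) * score[0] ** 2


def fisher_info_eta(eta):
    """Fisher information of the same model in the log-odds eta = log(p / (1 - p))."""
    p = 1.0 / (1.0 + math.exp(-eta))
    score = {1: 1.0 - p, 0: -p}  # d/d eta log f(x; eta) = x - p
    return p * score[1] ** 2 + (1.0 - p) * score[0] ** 2


def jeffreys_p(p):
    """Unnormalized Jeffreys prior in p: sqrt(I(p)) = 1 / sqrt(p (1 - p))."""
    return math.sqrt(fisher_info_p(p))


def jeffreys_eta(eta):
    """Unnormalized Jeffreys prior in eta, from the Fisher information recomputed in eta."""
    return math.sqrt(fisher_info_eta(eta))


p = 0.2
eta = math.log(p / (1.0 - p))
dp_deta = p * (1.0 - p)  # derivative of p with respect to eta

# Change of variables for densities: prior in eta = prior in p times |dp/d eta|.
print(jeffreys_eta(eta), jeffreys_p(p) * dp_deta)
```

The two printed values agree, which is the invariance statement for this one-dimensional case. In p, the unnormalized prior 1 / √(p(1 − p)) is the kernel of Beta(1/2, 1/2).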
Apply Bayesian statistics in simple estimation and inference problems.
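- Bayesian confidence region
  - The frequentist C.I. and the Bayesian confidence region are clearly distinct concepts
  - A Bayesian confidence region is a random subset R of the parameter space, defined through the posterior
  - Computing it is simple: take a region with posterior probability 1-α (which is why the concrete difference from a C.I. is still not entirely clear to me)
  - Also, R depends on the prior
- Bayesian estimation
  - Parameters can be estimated in the Bayesian framework as well, just as in the frequentist one
  - [1] Bayes estimator
    - The expectation of the parameter, taking the posterior as its pdf/pmf
    - In other words, the posterior mean
    - Depends on the prior
    - Computing it in practice
      1. Integrate directly (though this is usually hard to do by hand)
      2. Recognize the posterior as a well-known distribution
        - The posterior is often a Beta or a Gamma distribution. In that case, read off the values that correspond to the parameters of the Beta or Gamma distribution and substitute them into the formula for its expectation (this was the most common pattern in the problems)
  - [2] Maximum a posteriori (MAP)
    - The parameter value that maximizes the posterior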
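The "recognize the posterior" route can be sketched for the Beta case. The prior Beta(2, 2), the data (7 successes in 10 trials), and the helper names are assumptions for illustration. The credible region is taken as an equal-tailed interval computed from a grid CDF, which is one possible choice of region with posterior probability 1 − α.

```python
def beta_posterior_params(a, b, data):
    """Beta(a, b) prior with Bernoulli data gives a Beta(a + k, b + n - k) posterior."""
    k, n = sum(data), len(data)
    return a + k, b + n - k


def beta_mean(a, b):
    """Bayes estimator (posterior mean) of a Beta(a, b) posterior."""
    return a / (a + b)


def beta_mode(a, b):
    """MAP estimate (posterior mode) of a Beta(a, b) posterior; valid for a > 1 and b > 1."""
    return (a - 1) / (a + b - 2)


def beta_credible_interval(a, b, alpha=0.05, steps=20_000):
    """Equal-tailed interval with posterior probability about 1 - alpha, via a grid CDF."""
    grid = [(i + 0.5) / steps for i in range(steps)]
    density = [p ** (a - 1) * (1 - p) ** (b - 1) for p in grid]  # unnormalized
    total = sum(density)
    lower = upper = None
    cumulative = 0.0
    for p, d in zip(grid, density):
        cumulative += d / total
        if lower is None and cumulative >= alpha / 2:
            lower = p
        if cumulative >= 1 - alpha / 2:
            upper = p
            break
    return lower, upper


# Hypothetical data: 7 successes in 10 Bernoulli trials, prior Beta(2, 2).
data = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
a_post, b_post = beta_posterior_params(2, 2, data)
lower, upper = beta_credible_interval(a_post, b_post)
print(beta_mean(a_post, b_post), beta_mode(a_post, b_post), (lower, upper))
```

Changing the prior parameters changes all three outputs, which illustrates the dependence on the prior noted above.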
Compare and contrast results from Bayesian and frequentist statistical methods.

What you noticed

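In Bayesian problems, the Beta and Gamma distributions come up a lot
Whether a prior is proper or improper is decided by integrating it over the parameter space and checking whether the integral converges. If it does not converge, the prior cannot be normalized.
I was caught off guard when an inverse Gamma distribution showed up; I had not noticed it.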


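Other

Whenever the Beta and Gamma functions come up, string theory textbooks soon get pulled out, so I felt like reading a bit of string theory.
The book below is not a reference but one I would like to read (fitting, since it is from MIT, and apparently it can be read with undergraduate-level quantum mechanics and electromagnetism).

References

初級講座弦理論基礎編 (Japanese edition of A First Course in String Theory)

Author: B.ツヴィーバッハ, Barton Zwiebach, 樺沢宇紀
Publisher: 丸善プラネット
Release date: 2013/09/01
Format: Book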

2019-03-22

18.6501x Fundamentals of Statistics (Unit 3) Checklist

MITx Data Science Statistics edX

Unit3 Methods for estimation

What you learned

Lec8: Distance measures between distributions

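Through Unit 2, the estimator was chosen intuitively as the sample average
This unit covers methods for determining an optimal estimator
So far we dealt almost exclusively with cases where the sample average converges to the parameter by the LLN.
That is why the sample average could be used as the estimator.
But what should the estimator be when the sample average does not converge to the parameter?
Broadly, there are three approaches.
1. Maximum likelihood estimation
2. Method of moments
3. M-estimators
Total variation distance (TV)
- This is a distance in the proper sense
Kullback-Leibler divergence (KL)
- Also well known as relative entropy
- The problem of minimizing the distance between probability measures is approached through KL
- TV is a distance, but KL does not satisfy the definition of a distance, so it is called a divergence
- Minimizing the KL ⇔ maximizing the likelihood. This is the maximum likelihood principle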
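To see why TV is a distance while KL is only a divergence, here is a minimal sketch for pmfs on a finite support. The two Bernoulli distributions and the function names are assumptions picked for illustration.

```python
import math


def total_variation(p, q):
    """TV distance between two pmfs given as dicts over the same finite support."""
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)


def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x)); terms with p(x) = 0 contribute 0."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)


ber_a = {0: 0.5, 1: 0.5}  # Ber(0.5)
ber_b = {0: 0.9, 1: 0.1}  # Ber(0.1)

print(total_variation(ber_a, ber_b), total_variation(ber_b, ber_a))  # symmetric
print(kl_divergence(ber_a, ber_b), kl_divergence(ber_b, ber_a))      # not symmetric
```

Symmetry is one of the axioms of a distance. KL fails it here, and it also fails the triangle inequality in general.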
Likelihood
- A function that takes the data and the parameter as arguments
- Its value can be read as a probability or a probability density: the joint pmf or joint pdf

Lec9: Introduction to Maximum Likelihood Estimation

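Compute the likelihood for the following random variables
- Bernoulli
- Poisson
- Gaussian
- Exponential
- Uniform
Maximum likelihood estimator (MLE)
- The log-likelihood is convenient in actual calculations, so it is used often
Most textbooks are written in terms of minimizing, but this course proceeds with maximizing
Determining whether a function is concave or convex
- Introduction of the gradient
- Introduction of the Hessian matrix
- Determine concave/convex from the Hessian matrix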
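A sketch of the eigenvalue test for a 2x2 Hessian. The example function f(x, y) = -x² - xy - y² and the helper names are assumptions for illustration. Its Hessian is constant, so checking one matrix settles the question; for a general function the sign condition has to hold at every point.

```python
import math


def eigenvalues_sym_2x2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first."""
    center = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return center - radius, center + radius


def classify(a, b, d):
    """Classify a function whose Hessian is the constant matrix [[a, b], [b, d]]."""
    smallest, largest = eigenvalues_sym_2x2(a, b, d)
    if largest < 0:
        return "strictly concave"  # negative definite
    if largest <= 0:
        return "concave"           # negative semi-definite
    if smallest > 0:
        return "strictly convex"   # positive definite
    if smallest >= 0:
        return "convex"            # positive semi-definite
    return "neither"


# Hessian of the hypothetical f(x, y) = -x^2 - x*y - y^2 is [[-2, -1], [-1, -2]].
print(eigenvalues_sym_2x2(-2.0, -1.0, -2.0), classify(-2.0, -1.0, -2.0))
```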
Actually compute the MLE
- Bernoulli
- Poisson
- Gaussian
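The closed-form MLEs for these three models can be sketched as below, together with a brute-force check that the Bernoulli log-likelihood peaks at the sample average. The data and the function names are made up for illustration.

```python
import math


def mle_bernoulli(xs):
    """MLE of p for Ber(p): the sample average."""
    return sum(xs) / len(xs)


def mle_poisson(xs):
    """MLE of lambda for Poisson(lambda): the sample average."""
    return sum(xs) / len(xs)


def mle_gaussian(xs):
    """MLE of (mu, sigma^2) for N(mu, sigma^2): sample mean and variance with divisor n."""
    n = len(xs)
    mu = sum(xs) / n
    return mu, sum((x - mu) ** 2 for x in xs) / n


def bernoulli_log_likelihood(p, xs):
    """Log-likelihood of Ber(p) for 0/1 data."""
    k, n = sum(xs), len(xs)
    return k * math.log(p) + (n - k) * math.log(1 - p)


flips = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical Bernoulli sample
grid = [i / 1000 for i in range(1, 1000)]
numeric_argmax = max(grid, key=lambda p: bernoulli_log_likelihood(p, flips))

print(mle_bernoulli(flips), numeric_argmax)
print(mle_poisson([2, 0, 3, 1]), mle_gaussian([1.0, 2.0, 3.0, 4.0]))
```

Note that the Gaussian variance MLE divides by n, not n − 1, so it is not the unbiased sample variance.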

What you noticed

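Concavity/convexity is judged from the Hessian matrix, either by reducing it to a number (e.g. the quadratic form x^T H x) or by computing its eigenvalues
The log-likelihood is used in most MLE calculations because it is convenient.
With a single parameter, it is the same as high-school calculus.

Other

Some review of vector calculus is needed
- gradient (∇)
- Vector fields, scalar fields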

References

自然科学の統計学 (基礎統計学)

Author: 東京大学教養学部統計学教室
Publisher: 東京大学出版会
Release date: 1992/08/01
Format: Book

プログラミングのための線形代数

Author: 平岡和幸, 堀玄
Publisher: オーム社
Release date: 2004/10/01
Format: Book

2019-02-20

18.6501x Fundamentals of Statistics (Unit 1-2) Checklist

MITx Statistics Data Science edX

Unit1 Introduction to Statistics

What you learned

Lec1: What is statistics

Lec2: Probability Redux

Sample average
- Used as an estimator
probabilistic tools
1. LLN(Laws(weak and strong) of large numbers)
  - a.s. convergence
  - Convergence in probability
2. CLT(Central limit theorem)
  - Convergence in distribution
3. Hoeffding's inequality
  - Can be used even when the sample size n is small (even n = 1 works)
  - An alternative for when the CLT cannot be used, though it is not as sharp as the CLT
4. Consistent estimator
5. Gaussian distribution
  - PDF, CDF
  - Affine transformation
  - Standardization
  - Symmetry
  - Table(CDF of Standard normal distribution)
  - Quantiles
6. Three types of convergence
  1. Almost surely(a.s.) convergence
  2. Convergence in probability
  3. Convergence in distribution
7. Addition, multiplication, division
  - Almost surely(a.s.) convergence and Convergence in probability
8. Addition, multiplication, division (Slutsky's theorem)
  - Convergence in distribution
9. Continuous mapping theorem
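To compare tools 2 and 3 concretely, the sketch below computes the half-width of a level 1 − alpha interval around the sample average from Hoeffding's inequality and from the CLT. It assumes i.i.d. observations bounded in [a, b]; the numbers n = 100, alpha = 0.05, sigma = 0.5 and the function names are illustrative assumptions.

```python
import math
from statistics import NormalDist


def hoeffding_half_width(n, alpha, a=0.0, b=1.0):
    """Half-width eps with P(|mean - mu| >= eps) <= alpha, valid for any n >= 1."""
    # Solve 2 * exp(-2 * n * eps^2 / (b - a)^2) = alpha for eps.
    return (b - a) * math.sqrt(math.log(2.0 / alpha) / (2.0 * n))


def clt_half_width(n, alpha, sigma):
    """Asymptotic half-width q_{alpha/2} * sigma / sqrt(n) from the CLT."""
    q = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return q * sigma / math.sqrt(n)


n, alpha = 100, 0.05
# sigma = 0.5 is the largest possible standard deviation of a Bernoulli r.v.
print(hoeffding_half_width(n, alpha), clt_half_width(n, alpha, sigma=0.5))
```

The Hoeffding half-width is wider, which is the price for a guarantee that holds at every n rather than only asymptotically.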

What you noticed

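Applying the CLT to the sample average gives convergence in distribution to a Gaussian distribution. The sampled r.v.s do not need to be Gaussian; any distribution with finite variance works
When the sample size is too small to apply the CLT, use Hoeffding's inequality
Both the CLT and Hoeffding's inequality are used to measure how close the estimator, the sample average, is to the unknown population expectation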

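Other

Linear algebra needs review
- Matrix products
- Inner product, cross product
- Linear independence, linear dependence
- Rank and how to compute it
  - When it gets tedious, use WolframAlpha: www.wolframalpha.com

References

a.s. ja.wikipedia.org
Hoeffding's inequality seetheworld1992.hatenablog.com
On convergence in probability kriver-1.hatenablog.com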

Unit2 Parametric Inference

What you learned

Lec3: Parametric Statistical Models

Trinity of statistical inference
1. Estimation
2. Confidence intervals
3. Hypothesis testing
The goal of statistics is to learn the distribution of a r.v.
discrete r.v.s


A statistical model is a pair consisting of a sample space and a family of probability distributions.
well specified
parametric
non-parametric
semi parametric is a hybrid model
- nuisance parameter
Linear regression model
Cox proportional hazards model (a survival model)
identifiable

Lec4: Parametric Estimation and Confidence Intervals

Definitions
- Statistic
  - Any measurable function of the sample
  - Rule of thumb : if you can compute it exactly once given data, it is measurable.
- Estimator of theta
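  - Any statistic whose expression does not depend on theta (only on the data)
- Conditions for a weakly (resp. strongly) consistent estimator
- Conditions for asymptotic normality
  - An estimator is itself a r.v., and its distribution can be approximated by a normal distribution.
  - The variance of the limiting normal distribution of √n (estimator - theta) is called the asymptotic variance
Bias of an estimator
Risk (or quadratic risk)
- The workflow is to compute the variance and the bias, and then obtain the risk from them
- I think it means the same thing as the MSE, but perhaps the two terms should be kept distinct?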
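The "variance and bias, then risk" workflow rests on the following decomposition, written here for an estimator denoted \hat{\theta}_n (the notation is an assumption, since the notes do not fix one).

```latex
R(\hat{\theta}_n)
  = \mathbb{E}\big[(\hat{\theta}_n - \theta)^2\big]
  = \operatorname{Var}(\hat{\theta}_n) + \big(\mathbb{E}[\hat{\theta}_n] - \theta\big)^2
  = \text{variance} + \text{bias}^2
```

The cross term vanishes because E[θ̂_n − E[θ̂_n]] = 0.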
Confidence intervals(C.I.)
- confidence interval of level 1 - alpha for theta
  - any random interval whose boundaries do not depend on theta
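  - an interval that contains the true value theta with probability at least 1 - alpha
- C.I. of asymptotic level 1 - alpha for theta
  - any random interval whose boundaries do not depend on theta
  - an interval that satisfies the condition above in the limit as the sample size n goes to infinity
A confidence interval for the kiss example
- The case where the observations follow Ber(p)
- Start from the CLT: approximate the standardized estimator (sample average) by the standard normal distribution
- The normal approximation alone does not give a usable C.I., because the resulting interval still depends on the parameter (here, the true value p)
- Obtain the C.I. in one of the following three ways
  1. Solution 1. Conservative bound
  2. Solution 2. Solving the (quadratic) equation for p
    - In practice this is computed numerically rather than with the quadratic formula
  3. Solution 3. plug-in
    - By Slutsky, plug in the estimator in place of the true value p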
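The three solutions can be sketched side by side for Ber(p). The inputs p_hat = 0.6 and n = 100 are hypothetical, and the function name is made up. Solution 2 solves (p − p_hat)² ≤ q² p(1 − p) / n as a quadratic inequality in p, where q is the standard normal quantile of order 1 − alpha/2.

```python
import math
from statistics import NormalDist


def bernoulli_confidence_intervals(p_hat, n, alpha=0.05):
    """Return the (conservative, solved, plug-in) asymptotic C.I.s for p."""
    q = NormalDist().inv_cdf(1.0 - alpha / 2.0)

    # Solution 1: conservative bound, using p * (1 - p) <= 1/4.
    half = q / (2.0 * math.sqrt(n))
    conservative = (p_hat - half, p_hat + half)

    # Solution 2: (1 + q^2/n) p^2 - (2 p_hat + q^2/n) p + p_hat^2 <= 0, solved for p.
    a = 1.0 + q * q / n
    b = -(2.0 * p_hat + q * q / n)
    c = p_hat * p_hat
    root = math.sqrt(b * b - 4.0 * a * c)
    solved = ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a))

    # Solution 3: plug-in, replacing p by p_hat in the variance (justified by Slutsky).
    half = q * math.sqrt(p_hat * (1.0 - p_hat) / n)
    plug_in = (p_hat - half, p_hat + half)

    return conservative, solved, plug_in


conservative, solved, plug_in = bernoulli_confidence_intervals(p_hat=0.6, n=100)
print(conservative, solved, plug_in)
```

The conservative interval is never narrower than the plug-in one, since p_hat(1 − p_hat) ≤ 1/4.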

What you noticed

Choosing an appropriate distribution is the first step of statistical modeling
For a discrete random variable, it also helps to look at the support: is it finite or infinite?

Lec5: Delta Method and Confidence Intervals

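Review of C.I.
- Given a 95% and a 98% interval, the 98% interval is not necessarily the wider one
- Two C.I.s of the same 50% level can have different widths. Given the shape of the normal density, the interval is shortest when its midpoint sits at the center of the normal distribution
- An interval whose level holds as n → ∞ is called an asymptotic confidence interval (so the level need not hold for, say, n = 1)
- How should one read the statement that [0.34, 0.57] is a 95% confidence interval?
  - The probability that the unknown parameter p lies in this interval is either 0 or 1, not 0.95.
  - Be careful with a realized C.I.
  - Even so, [0.34, 0.57] is still called a 95% C.I.
  - It is merely the realization, as a deterministic interval, of a random C.I. with 1 - alpha = 0.95
Model of waiting times at Kendall station on the Red Line (T) (delta method)
- Measure the time between train arrivals (i.e. the waiting time until the next train)
- Model each of these waiting times
- Assume the following
  - Mutually independent
  - Exponential distribution with parameter lambda
- Under these assumptions, estimate lambda
- lack of memory
  - why would I use exponential?
    - It's a very common distribution for inter-arrival times
    - the main reason is "lack of memory"
- As the expectation of the exponential distribution (1/lambda) shows, applying the LLN and then the CLT to the sample average is not enough: the sample average alone does not estimate lambda
- This is where the delta method comes in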

delta method
- this is important
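- Suppose a sequence of r.v.s Zn converges in distribution to a normal around theta, i.e. √n (Zn - theta) converges in distribution to N(0, σ²)
  - Such a sequence is said to be asymptotically normal around theta
- Next, take a function g that is continuously differentiable at theta
- Then the transformed sequence g(Zn) is also asymptotically normal, around g(theta)
- The delta method is derived with a Taylor expansion
- For the exponential distribution, the estimator is the reciprocal of the sample average. Estimating with it uses the delta method in addition to the LLN and the CLT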
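Written out, the delta method and its application to the exponential model look as follows. The symbols Z_n, σ² and \bar{X}_n are notation assumed here for the sequence, its asymptotic variance, and the sample average.

```latex
% Delta method: if \sqrt{n}(Z_n - \theta) \to N(0, \sigma^2) in distribution and
% g is continuously differentiable at \theta, then
\sqrt{n}\,\big(g(Z_n) - g(\theta)\big) \xrightarrow{(d)} \mathcal{N}\big(0,\ g'(\theta)^2 \sigma^2\big)

% Exponential case: X_i i.i.d. Exp(\lambda), E[X_i] = 1/\lambda, Var(X_i) = 1/\lambda^2, so by the CLT
\sqrt{n}\,\Big(\bar{X}_n - \tfrac{1}{\lambda}\Big) \xrightarrow{(d)} \mathcal{N}\Big(0,\ \tfrac{1}{\lambda^2}\Big)

% With g(x) = 1/x we have g'(1/\lambda) = -\lambda^2, hence
\sqrt{n}\,\Big(\tfrac{1}{\bar{X}_n} - \lambda\Big) \xrightarrow{(d)} \mathcal{N}\Big(0,\ \lambda^4 \cdot \tfrac{1}{\lambda^2}\Big) = \mathcal{N}\big(0,\ \lambda^2\big)
```

So the estimator 1 / X̄_n is asymptotically normal around lambda with asymptotic variance lambda².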
frequentist interpretation
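- Over many repetitions of the experiment, the C.I. contains the true value lambda in about 95% of the repetitions
- The outcomes look like 1111011101111.. (1 = the C.I. contained lambda, 0 = it did not)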
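The frequentist interpretation can be simulated. This is a sketch under assumptions: exponential waiting times with a true lambda of 2.0, samples of size n = 200, a plug-in asymptotic C.I. lambda_hat ± q · lambda_hat / √n (lambda_hat = 1 / sample average), and a fixed seed. All of these numbers and names are made up for illustration.

```python
import math
import random
from statistics import NormalDist


def plug_in_interval(xs, alpha=0.05):
    """Asymptotic C.I. for lambda: lambda_hat +/- q * lambda_hat / sqrt(n)."""
    n = len(xs)
    lambda_hat = n / sum(xs)  # reciprocal of the sample average
    q = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half = q * lambda_hat / math.sqrt(n)
    return lambda_hat - half, lambda_hat + half


def coverage_indicators(true_lambda, n, repetitions, seed=0):
    """For each repeated experiment, record 1 if the realized C.I. contains true_lambda."""
    rng = random.Random(seed)
    hits = []
    for _ in range(repetitions):
        xs = [rng.expovariate(true_lambda) for _ in range(n)]
        low, high = plug_in_interval(xs)
        hits.append(1 if low <= true_lambda <= high else 0)
    return hits


hits = coverage_indicators(true_lambda=2.0, n=200, repetitions=2000)
print("".join(str(h) for h in hits[:30]))
print(sum(hits) / len(hits))
```

Each realized interval either contains lambda or it does not; the 95% refers to the long-run fraction of ones.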

What you noticed

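A random interval that depends on the parameter is not actually a C.I.
The interval between the endpoints computed numerically (realizations) with one of the three solutions is the C.I.
So the first thing to settle is whether a C.I. is the random interval or its realization as a deterministic interval
From HW2: a sum of independent normal random variables is again normal
The Gaussian distribution appears in HW2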
