1. Linear regression

1.1 Preliminaries: linear algebra & matrix calculus
  • Vectors: $a = (a_1, \dots, a_n)^\top \in \mathbb{R}^n$ (vector space)
  • Inner product: $\langle a, b \rangle = a^\top b = \sum_{i=1}^n a_i b_i$ (Euclidean space)
  • Norm: $\|a\| = \sqrt{\langle a, a \rangle} = \bigl(\sum_{i=1}^n a_i^2\bigr)^{1/2}$

  • Distance: $d(a, b) = \|a - b\|$
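
These definitions translate directly into code; a minimal NumPy sketch (the vectors are arbitrary illustrative values):

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([4.0, 0.0, -1.0])

    inner = a @ b                  # <a, b> = sum_i a_i b_i = 1.0
    norm_a = np.sqrt(a @ a)        # ||a|| = sqrt(<a, a>) = sqrt(14)
    dist = np.linalg.norm(a - b)   # d(a, b) = ||a - b|| = sqrt(29)

    assert np.isclose(norm_a, np.linalg.norm(a))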

  • Matrix: If $A \in \mathbb{R}^{n \times m}$ and $B \in \mathbb{R}^{m \times k}$, we can define the product $AB \in \mathbb{R}^{n \times k}$, with $(AB)_{ij} = \sum_{l=1}^m A_{il} B_{lj}$.

  • Square matrix ($n = m$) → determinant $\det(A)$, with $\det(AB) = \det(A)\det(B)$ → trace $\operatorname{tr}(A) = \sum_{i=1}^n A_{ii}$, with $\operatorname{tr}(AB) = \operatorname{tr}(BA)$.
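
A quick numerical sanity check of these identities (a sketch with random matrices; the sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4))
    B = rng.standard_normal((4, 4))

    # det(AB) = det(A) det(B)
    assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))
    # tr(AB) = tr(BA)
    assert np.isclose(np.trace(A @ B), np.trace(B @ A))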

  • Symmetric matrix ($A = A^\top$): there exists an orthogonal matrix $U$ (satisfying $U^\top U = U U^\top = I$) such that $A = U \Lambda U^\top$, where $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$ [spectral decomposition].

    (i) The $\lambda_1, \dots, \lambda_n$ are called the eigenvalues of $A$.

    (ii) → $A$ is called positive definite (p.d.) if $a^\top A a > 0$ for all nonzero $a \in \mathbb{R}^n$ (equivalently, the $\lambda_i$ are all positive).

    → $A$ is called positive semi-definite (p.s.d.) if $a^\top A a \ge 0$ for all $a \in \mathbb{R}^n$ (equivalently, $\lambda_i \ge 0$ for all $i$).
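
The spectral decomposition and the eigenvalue characterization of positive definiteness can be checked with np.linalg.eigh; a minimal sketch (the matrix $A$ below is symmetric and p.d. by construction):

    import numpy as np

    rng = np.random.default_rng(1)
    C = rng.standard_normal((5, 5))
    A = C @ C.T + np.eye(5)        # symmetric and p.d. by construction

    lam, U = np.linalg.eigh(A)     # A = U diag(lam) U^T, U orthogonal
    assert np.allclose(U @ np.diag(lam) @ U.T, A)   # spectral decomposition
    assert np.allclose(U.T @ U, np.eye(5))          # U^T U = I
    assert np.all(lam > 0)         # all eigenvalues positive <=> p.d.

    a = rng.standard_normal(5)     # the quadratic-form characterization
    assert a @ A @ a > 0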

  • Inverse of a partitioned matrix.

    • Let $M = \begin{pmatrix} E & F \\ G & H \end{pmatrix}$. Suppose that $H$ is invertible. Then
$$M = \begin{pmatrix} I & F H^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} M/H & 0 \\ 0 & H \end{pmatrix} \begin{pmatrix} I & 0 \\ H^{-1} G & I \end{pmatrix},$$

where $M/H := E - F H^{-1} G$ is called the Schur complement of $M$ with respect to (w.r.t.) $H$.

Taking the inverse on both sides, we obtain that
$$M^{-1} = \begin{pmatrix} (M/H)^{-1} & -(M/H)^{-1} F H^{-1} \\ -H^{-1} G (M/H)^{-1} & H^{-1} + H^{-1} G (M/H)^{-1} F H^{-1} \end{pmatrix}.$$

  • Similarly, if $E$ is invertible, then
$$M^{-1} = \begin{pmatrix} E^{-1} + E^{-1} F (M/E)^{-1} G E^{-1} & -E^{-1} F (M/E)^{-1} \\ -(M/E)^{-1} G E^{-1} & (M/E)^{-1} \end{pmatrix},$$

where $M/E := H - G E^{-1} F$ is the Schur complement of $M$ w.r.t. $E$.

If both $E$ and $H$ are invertible, then comparing the top-left blocks of the two expressions for $M^{-1}$, we have
$$(E - F H^{-1} G)^{-1} = E^{-1} + E^{-1} F (H - G E^{-1} F)^{-1} G E^{-1}.$$

In particular, let $E = A$, $F = u$, $G = v^\top$, and $H = -1$ (a $1 \times 1$ block), where $u, v \in \mathbb{R}^n$. Then
$$(A + u v^\top)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u}$$

(rank-one update of an inverse matrix; the Sherman-Morrison formula).
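
Both the partitioned-inverse formula and the rank-one update can be verified numerically; a minimal NumPy sketch (block sizes are arbitrary, and the diagonal shifts are added only to keep the random matrices safely invertible):

    import numpy as np

    rng = np.random.default_rng(2)
    E = rng.standard_normal((3, 3)) + 3 * np.eye(3)
    F = rng.standard_normal((3, 2))
    G = rng.standard_normal((2, 3))
    H = rng.standard_normal((2, 2)) + 3 * np.eye(2)

    M = np.block([[E, F], [G, H]])
    S = E - F @ np.linalg.inv(H) @ G              # Schur complement M/H
    # Top-left block of M^{-1} equals (M/H)^{-1}:
    assert np.allclose(np.linalg.inv(M)[:3, :3], np.linalg.inv(S))

    # Sherman-Morrison rank-one update:
    A = rng.standard_normal((4, 4)) + 4 * np.eye(4)
    u, v = rng.standard_normal(4), rng.standard_normal(4)
    Ainv = np.linalg.inv(A)
    update = Ainv - np.outer(Ainv @ u, v @ Ainv) / (1.0 + v @ Ainv @ u)
    assert np.allclose(update, np.linalg.inv(A + np.outer(u, v)))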


  • Matrix calculus: let $f : \mathbb{R}^n \to \mathbb{R}$ be differentiable, with gradient $\nabla f(x) = \bigl(\tfrac{\partial f}{\partial x_1}, \dots, \tfrac{\partial f}{\partial x_n}\bigr)^\top$. Then for $a \in \mathbb{R}^n$,
$$\nabla (a^\top x) = a.$$

We have that, for $A \in \mathbb{R}^{n \times n}$,
$$\nabla (x^\top A x) = (A + A^\top)x \quad (= 2Ax \text{ if } A = A^\top).$$
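
These two gradient identities can be verified by central finite differences; a minimal sketch (the helper num_grad and all values are illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 5
    a = rng.standard_normal(n)
    A = rng.standard_normal((n, n))
    x = rng.standard_normal(n)

    def num_grad(f, x, eps=1e-6):
        """Central finite-difference approximation of the gradient of f at x."""
        g = np.zeros_like(x)
        for i in range(len(x)):
            e = np.zeros_like(x)
            e[i] = eps
            g[i] = (f(x + e) - f(x - e)) / (2 * eps)
        return g

    assert np.allclose(num_grad(lambda z: a @ z, x), a, atol=1e-5)
    assert np.allclose(num_grad(lambda z: z @ A @ z, x), (A + A.T) @ x, atol=1e-5)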

1.2 Linear models and least squares estimators
  • Regression: the goal is to predict the output/response $y \in \mathbb{R}$ based on the input/feature $x = (x_1, \dots, x_p)^\top \in \mathbb{R}^p$.

  • A linear model assumes that
$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon, \qquad (1)$$

where $\beta_0, \beta_1, \dots, \beta_p$ are unknown parameters and $\varepsilon$ is an error term.

  • Let $\beta = (\beta_0, \beta_1, \dots, \beta_p)^\top$ and redefine $x = (1, x_1, \dots, x_p)^\top$ (absorbing the intercept). Then model (1) can be written as $y = x^\top \beta + \varepsilon$.
  • Given an estimate $\hat\beta$, our prediction for $y$ at $x$ is $\hat y = x^\top \hat\beta$.
  • We can define a loss function $L(\hat y, y)$, which measures the penalty paid for predicting $\hat y$ when the true value is $y$; e.g., the squared loss $L(\hat y, y) = (\hat y - y)^2$.
  • Idea: find $\beta$ such that the expected loss $R(\beta) = \mathbb{E}\bigl[(y - x^\top \beta)^2\bigr]$ is as small as possible.
  • Now suppose we are given a labeled training set $\{(x_i, y_i)\}_{i=1}^{n}$, sampled i.i.d. from the joint distribution of $(x, y)$.

  • By the law of large numbers (LLN), $R(\beta)$ can be approximated by the empirical loss
$$\hat R(\beta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2.$$

  • Let $y = (y_1, \dots, y_n)^\top \in \mathbb{R}^n$ and $X = (x_1, \dots, x_n)^\top \in \mathbb{R}^{n \times (p+1)}$ (the design matrix, whose $i$-th row is $x_i^\top$).

Then
$$\hat R(\beta) = \frac{1}{n} \|y - X\beta\|^2$$
(checked numerically in the sketch after this list).

  • In other words, our estimator $\hat\beta$ should be such that $X\hat\beta$ is closest to $y$. That is,
$$\hat\beta \in \arg\min_{\beta \in \mathbb{R}^{p+1}} \|y - X\beta\|^2.$$
  • Such a $\hat\beta$ best explains the data and is called a least squares estimator (LSE).
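
The equality between the averaged sum of squared errors and its matrix form, used above, is easy to confirm on simulated data; a minimal sketch (all dimensions and noise levels are arbitrary):

    import numpy as np

    rng = np.random.default_rng(4)
    n, p = 50, 3
    X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])  # design matrix
    beta = rng.standard_normal(p + 1)
    y = X @ beta + 0.1 * rng.standard_normal(n)

    loss_sum = sum((y[i] - X[i] @ beta) ** 2 for i in range(n)) / n
    loss_mat = np.linalg.norm(y - X @ beta) ** 2 / n
    assert np.isclose(loss_sum, loss_mat)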

Theorem 1.

(i) $\hat\beta$ is an LSE iff $X^\top X \hat\beta = X^\top y$ (the normal equations).

(ii) If $X$ has full column rank, then the LSE is unique and is given by $\hat\beta = (X^\top X)^{-1} X^\top y$.

Part (ii) follows directly from part (i): if $X$ has full column rank, then $X^\top X$ is invertible, so the normal equations have the unique solution $\hat\beta = (X^\top X)^{-1} X^\top y$.

First proof of part (i): minimize $f(\beta) = \|y - X\beta\|^2 = y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X \beta$. By the matrix-calculus facts in Section 1.1,
$$\nabla f(\beta) = -2X^\top y + 2X^\top X \beta = -2X^\top (y - X\beta),$$
so $\nabla f(\hat\beta) = 0$ iff $X^\top X \hat\beta = X^\top y$. Since $f$ is convex, the global minimizers are exactly the stationary points, which proves (i).
Second proof of part (i):

(Sufficiency $\Leftarrow$) Let $\hat\beta$ satisfy $X^\top X \hat\beta = X^\top y$. For any $\beta$, writing $y - X\beta = (y - X\hat\beta) + X(\hat\beta - \beta)$,
$$\|y - X\beta\|^2 = \|y - X\hat\beta\|^2 + \|X(\hat\beta - \beta)\|^2 + 2(\hat\beta - \beta)^\top X^\top (y - X\hat\beta) \ge \|y - X\hat\beta\|^2,$$

as $X^\top (y - X\hat\beta) = X^\top y - X^\top X \hat\beta = 0$. Hence $\hat\beta$ is an LSE.

(Necessity $\Rightarrow$) Let $\tilde\beta$ be another LSE, that is, it also minimizes $\|y - X\beta\|^2$. Then $\|y - X\tilde\beta\|^2 = \|y - X\hat\beta\|^2$, and the display above (with $\beta = \tilde\beta$) gives $\|X(\hat\beta - \tilde\beta)\|^2 = 0$, implying that $X\tilde\beta = X\hat\beta$, and thus $X^\top X \tilde\beta = X^\top X \hat\beta = X^\top y$, i.e., $\tilde\beta$ also satisfies the normal equations.
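
To illustrate Theorem 1, here is a minimal NumPy sketch that computes the LSE by solving the normal equations and compares it with np.linalg.lstsq (the data are simulated; the true coefficients are arbitrary illustrative values):

    import numpy as np

    rng = np.random.default_rng(5)
    n, p = 200, 3
    X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])  # full column rank
    beta_true = np.array([1.0, -2.0, 0.5, 3.0])
    y = X @ beta_true + 0.5 * rng.standard_normal(n)

    # Part (ii): X has full column rank, so the LSE solves X^T X beta = X^T y.
    beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

    # The same LSE via a numerically more stable routine.
    beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
    assert np.allclose(beta_ne, beta_ls)

    # Part (i): the residual satisfies the normal equations X^T (y - X beta) = 0.
    assert np.allclose(X.T @ (y - X @ beta_ne), 0.0, atol=1e-8)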