1. Linear regression

1.1 Preliminaries: linear algebra & matrix calculus
  • Vectors: $a = (a_1, \dots, a_n)^\top \in \mathbb{R}^n$ (vector space)
  • Inner product: $\langle a, b \rangle = a^\top b = \sum_{i=1}^n a_i b_i$ (Euclidean space)
  • Norm: $\|a\| = \sqrt{\langle a, a \rangle} = \bigl(\sum_{i=1}^n a_i^2\bigr)^{1/2}$

  • Distance: $d(a, b) = \|a - b\|$
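
These definitions translate directly into code; a minimal NumPy sketch (the vectors are arbitrary illustrative values):

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([4.0, 0.0, -1.0])

    inner = a @ b                  # <a, b> = sum_i a_i b_i = 1.0
    norm_a = np.sqrt(a @ a)        # ||a|| = sqrt(<a, a>) = sqrt(14)
    dist = np.linalg.norm(a - b)   # d(a, b) = ||a - b|| = sqrt(29)

    assert np.isclose(norm_a, np.linalg.norm(a))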

  • Matrix: If $A \in \mathbb{R}^{n \times m}$ and $B \in \mathbb{R}^{m \times k}$, we can define the product $AB \in \mathbb{R}^{n \times k}$, with $(AB)_{ij} = \sum_{l=1}^m A_{il} B_{lj}$.

  • Square matrix ($n = m$) → determinant $\det(A)$, with $\det(AB) = \det(A)\det(B)$ → trace $\operatorname{tr}(A) = \sum_{i=1}^n A_{ii}$, with $\operatorname{tr}(AB) = \operatorname{tr}(BA)$.
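
A quick numerical sanity check of these identities (a sketch with random matrices; the sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4))
    B = rng.standard_normal((4, 4))

    # det(AB) = det(A) det(B)
    assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))
    # tr(AB) = tr(BA)
    assert np.isclose(np.trace(A @ B), np.trace(B @ A))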

  • Symmetric matrix ($A = A^\top$): there exists an orthogonal matrix $U$ (satisfying $U^\top U = U U^\top = I$) such that $A = U \Lambda U^\top$, where $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$ [spectral decomposition].

    (i) The $\lambda_1, \dots, \lambda_n$ are called the eigenvalues of $A$.

    (ii) → $A$ is called positive definite (p.d.) if $a^\top A a > 0$ for all nonzero $a \in \mathbb{R}^n$ (equivalently, the $\lambda_i$ are all positive).

    → $A$ is called positive semi-definite (p.s.d.) if $a^\top A a \ge 0$ for all $a \in \mathbb{R}^n$ (equivalently, $\lambda_i \ge 0$ for all $i$).
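
The spectral decomposition and the eigenvalue characterization of positive definiteness can be checked with np.linalg.eigh; a minimal sketch (the matrix $A$ below is symmetric and p.d. by construction):

    import numpy as np

    rng = np.random.default_rng(1)
    C = rng.standard_normal((5, 5))
    A = C @ C.T + np.eye(5)        # symmetric and p.d. by construction

    lam, U = np.linalg.eigh(A)     # A = U diag(lam) U^T, U orthogonal
    assert np.allclose(U @ np.diag(lam) @ U.T, A)   # spectral decomposition
    assert np.allclose(U.T @ U, np.eye(5))          # U^T U = I
    assert np.all(lam > 0)         # all eigenvalues positive <=> p.d.

    a = rng.standard_normal(5)     # the quadratic-form characterization
    assert a @ A @ a > 0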

  • Inverse of a partitioned matrix.

    • Let $M = \begin{pmatrix} E & F \\ G & H \end{pmatrix}$. Suppose that $H$ is invertible. Then
$$M = \begin{pmatrix} I & F H^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} M/H & 0 \\ 0 & H \end{pmatrix} \begin{pmatrix} I & 0 \\ H^{-1} G & I \end{pmatrix},$$

where $M/H := E - F H^{-1} G$ is called the Schur complement of $M$ with respect to (w.r.t.) $H$.

Taking the inverse on both sides, we obtain that
$$M^{-1} = \begin{pmatrix} (M/H)^{-1} & -(M/H)^{-1} F H^{-1} \\ -H^{-1} G (M/H)^{-1} & H^{-1} + H^{-1} G (M/H)^{-1} F H^{-1} \end{pmatrix}.$$

  • Similarly, if $E$ is invertible, then
$$M^{-1} = \begin{pmatrix} E^{-1} + E^{-1} F (M/E)^{-1} G E^{-1} & -E^{-1} F (M/E)^{-1} \\ -(M/E)^{-1} G E^{-1} & (M/E)^{-1} \end{pmatrix},$$

where $M/E := H - G E^{-1} F$ is the Schur complement of $M$ w.r.t. $E$.

If both $E$ and $H$ are invertible, then comparing the top-left blocks of the two expressions for $M^{-1}$, we have
$$(E - F H^{-1} G)^{-1} = E^{-1} + E^{-1} F (H - G E^{-1} F)^{-1} G E^{-1}.$$

In particular, let $E = A$, $F = u$, $G = v^\top$, and $H = -1$ (a $1 \times 1$ block), where $u, v \in \mathbb{R}^n$. Then
$$(A + u v^\top)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u}$$

(rank-one update of an inverse matrix; the Sherman-Morrison formula).
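
Both the partitioned-inverse formula and the rank-one update can be verified numerically; a minimal NumPy sketch (block sizes are arbitrary, and the diagonal shifts are added only to keep the random matrices safely invertible):

    import numpy as np

    rng = np.random.default_rng(2)
    E = rng.standard_normal((3, 3)) + 3 * np.eye(3)
    F = rng.standard_normal((3, 2))
    G = rng.standard_normal((2, 3))
    H = rng.standard_normal((2, 2)) + 3 * np.eye(2)

    M = np.block([[E, F], [G, H]])
    S = E - F @ np.linalg.inv(H) @ G              # Schur complement M/H
    # Top-left block of M^{-1} equals (M/H)^{-1}:
    assert np.allclose(np.linalg.inv(M)[:3, :3], np.linalg.inv(S))

    # Sherman-Morrison rank-one update:
    A = rng.standard_normal((4, 4)) + 4 * np.eye(4)
    u, v = rng.standard_normal(4), rng.standard_normal(4)
    Ainv = np.linalg.inv(A)
    update = Ainv - np.outer(Ainv @ u, v @ Ainv) / (1.0 + v @ Ainv @ u)
    assert np.allclose(update, np.linalg.inv(A + np.outer(u, v)))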


  • Matrix calculus: let $f : \mathbb{R}^n \to \mathbb{R}$ be differentiable, with gradient $\nabla f(x) = \bigl(\tfrac{\partial f}{\partial x_1}, \dots, \tfrac{\partial f}{\partial x_n}\bigr)^\top$. Then for $a \in \mathbb{R}^n$,
$$\nabla (a^\top x) = a.$$

We have that, for $A \in \mathbb{R}^{n \times n}$,
$$\nabla (x^\top A x) = (A + A^\top)x \quad (= 2Ax \text{ if } A = A^\top).$$
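
These two gradient identities can be verified by central finite differences; a minimal sketch (the helper num_grad and all values are illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 5
    a = rng.standard_normal(n)
    A = rng.standard_normal((n, n))
    x = rng.standard_normal(n)

    def num_grad(f, x, eps=1e-6):
        """Central finite-difference approximation of the gradient of f at x."""
        g = np.zeros_like(x)
        for i in range(len(x)):
            e = np.zeros_like(x)
            e[i] = eps
            g[i] = (f(x + e) - f(x - e)) / (2 * eps)
        return g

    assert np.allclose(num_grad(lambda z: a @ z, x), a, atol=1e-5)
    assert np.allclose(num_grad(lambda z: z @ A @ z, x), (A + A.T) @ x, atol=1e-5)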

1.2 Linear models and least squares estimators
  • Regression: the goal is to predict the output/response $y \in \mathbb{R}$ based on the input/feature $x = (x_1, \dots, x_p)^\top \in \mathbb{R}^p$.

  • A linear model assumes that
$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon, \qquad (1)$$

where $\beta_0, \beta_1, \dots, \beta_p$ are unknown parameters and $\varepsilon$ is an error term.

  • Let $\beta = (\beta_0, \beta_1, \dots, \beta_p)^\top$ and redefine $x = (1, x_1, \dots, x_p)^\top$ (absorbing the intercept). Then model (1) can be written as $y = x^\top \beta + \varepsilon$.
  • Given an estimate $\hat\beta$, our prediction for $y$ at $x$ is $\hat y = x^\top \hat\beta$.
  • We can define a loss function $L(\hat y, y)$, which measures the penalty paid for predicting $\hat y$ when the true value is $y$; e.g., the squared loss $L(\hat y, y) = (\hat y - y)^2$.
  • Idea: find $\beta$ such that the expected loss $R(\beta) = \mathbb{E}\bigl[(y - x^\top \beta)^2\bigr]$ is as small as possible.
  • Now suppose we are given a labeled training set $\{(x_i, y_i)\}_{i=1}^{n}$, sampled i.i.d. from the joint distribution of $(x, y)$.

  • By the law of large numbers (LLN), $R(\beta)$ can be approximated by the empirical loss
$$\hat R(\beta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2.$$

  • Let $y = (y_1, \dots, y_n)^\top \in \mathbb{R}^n$ and $X = (x_1, \dots, x_n)^\top \in \mathbb{R}^{n \times (p+1)}$ (the design matrix, whose $i$-th row is $x_i^\top$).

Then
$$\hat R(\beta) = \frac{1}{n} \|y - X\beta\|^2$$
(checked numerically in the sketch after this list).

  • In other words, our estimator $\hat\beta$ should be such that $X\hat\beta$ is closest to $y$. That is,
$$\hat\beta \in \arg\min_{\beta \in \mathbb{R}^{p+1}} \|y - X\beta\|^2.$$
  • Such a $\hat\beta$ best explains the data and is called a least squares estimator (LSE).
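
The equality between the averaged sum of squared errors and its matrix form, used above, is easy to confirm on simulated data; a minimal sketch (all dimensions and noise levels are arbitrary):

    import numpy as np

    rng = np.random.default_rng(4)
    n, p = 50, 3
    X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])  # design matrix
    beta = rng.standard_normal(p + 1)
    y = X @ beta + 0.1 * rng.standard_normal(n)

    loss_sum = sum((y[i] - X[i] @ beta) ** 2 for i in range(n)) / n
    loss_mat = np.linalg.norm(y - X @ beta) ** 2 / n
    assert np.isclose(loss_sum, loss_mat)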

Theorem 1.

(i) $\hat\beta$ is an LSE iff $X^\top X \hat\beta = X^\top y$ (the normal equations).

(ii) If $X$ has full column rank, then the LSE is unique and is given by $\hat\beta = (X^\top X)^{-1} X^\top y$.

Part (ii) follows directly from part (i): if $X$ has full column rank, then $X^\top X$ is invertible, so the normal equations have the unique solution $\hat\beta = (X^\top X)^{-1} X^\top y$.

First proof of part (i): minimize $f(\beta) = \|y - X\beta\|^2 = y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X \beta$. By the matrix-calculus facts in Section 1.1,
$$\nabla f(\beta) = -2X^\top y + 2X^\top X \beta = -2X^\top (y - X\beta),$$
so $\nabla f(\hat\beta) = 0$ iff $X^\top X \hat\beta = X^\top y$. Since $f$ is convex, the global minimizers are exactly the stationary points, which proves (i).
Second proof of part (i):

(Sufficiency $\Leftarrow$) Let $\hat\beta$ satisfy $X^\top X \hat\beta = X^\top y$. For any $\beta$, writing $y - X\beta = (y - X\hat\beta) + X(\hat\beta - \beta)$,
$$\|y - X\beta\|^2 = \|y - X\hat\beta\|^2 + \|X(\hat\beta - \beta)\|^2 + 2(\hat\beta - \beta)^\top X^\top (y - X\hat\beta) \ge \|y - X\hat\beta\|^2,$$

as $X^\top (y - X\hat\beta) = X^\top y - X^\top X \hat\beta = 0$. Hence $\hat\beta$ is an LSE.

(Necessity $\Rightarrow$) Let $\tilde\beta$ be another LSE, that is, it also minimizes $\|y - X\beta\|^2$. Then $\|y - X\tilde\beta\|^2 = \|y - X\hat\beta\|^2$, and the display above (with $\beta = \tilde\beta$) gives $\|X(\hat\beta - \tilde\beta)\|^2 = 0$, implying that $X\tilde\beta = X\hat\beta$, and thus $X^\top X \tilde\beta = X^\top X \hat\beta = X^\top y$, i.e., $\tilde\beta$ also satisfies the normal equations.
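
To illustrate Theorem 1, here is a minimal NumPy sketch that computes the LSE by solving the normal equations and compares it with np.linalg.lstsq (the data are simulated; the true coefficients are arbitrary illustrative values):

    import numpy as np

    rng = np.random.default_rng(5)
    n, p = 200, 3
    X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])  # full column rank
    beta_true = np.array([1.0, -2.0, 0.5, 3.0])
    y = X @ beta_true + 0.5 * rng.standard_normal(n)

    # Part (ii): X has full column rank, so the LSE solves X^T X beta = X^T y.
    beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

    # The same LSE via a numerically more stable routine.
    beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
    assert np.allclose(beta_ne, beta_ls)

    # Part (i): the residual satisfies the normal equations X^T (y - X beta) = 0.
    assert np.allclose(X.T @ (y - X @ beta_ne), 0.0, atol=1e-8)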