# Linear Algebra: Singular Value Decomposition

In this section we will develop one of the most powerful ideas in linear algebra: the singular value decomposition. The first step on this journey is the polar decomposition.

## Polar decomposition

The Gram matrix $A^\top A$ of a square matrix $A$ is a useful tool for understanding the behavior of $A$. Let's define the matrix $\sqrt{A^\top A}$ to be $V\sqrt{\Lambda}V^\top$, where $V\Lambda V^\top$ is the diagonalization of $A^\top A$ and $\sqrt{\Lambda}$ is the matrix obtained by taking the square root of the diagonal entries of $\Lambda$. Then $\sqrt{A^\top A}$ is symmetric and positive semidefinite and satisfies

$$\sqrt{A^\top A}\,\sqrt{A^\top A} = A^\top A,$$

as befits the notation $\sqrt{A^\top A}$.

The matrix $\sqrt{A^\top A}$ is simpler to understand than $A$ because it is symmetric and positive semidefinite, yet it transforms space very similarly to $A$: if $\mathbf{x} \in \mathbb{R}^n$, then

$$\left|\sqrt{A^\top A}\,\mathbf{x}\right|^2 = \mathbf{x}^\top A^\top A\,\mathbf{x} = |A\mathbf{x}|^2.$$

In other words, for all $\mathbf{x}$, the images of $\mathbf{x}$ under $A$ and under $\sqrt{A^\top A}$ have equal norm. If two points are the same distance from the origin, then one may be mapped to the other by a rotation about the origin.
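As a quick numerical sanity check, we can construct $\sqrt{A^\top A}$ from the eigendecomposition of the Gram matrix and confirm both identities above. (The matrix here is an arbitrary example, not one from the text.)

```python
import numpy as np

A = np.array([[2.0, 1.0], [0.0, 3.0]])  # arbitrary example matrix

# Diagonalize the Gram matrix A^T A and take square roots of its eigenvalues
G = A.T @ A
eigvals, V = np.linalg.eigh(G)  # eigh: G is symmetric
sqrtG = V @ np.diag(np.sqrt(eigvals)) @ V.T

# sqrt(A^T A) squared recovers the Gram matrix
assert np.allclose(sqrtG @ sqrtG, G)

# |A x| = |sqrt(A^T A) x| for any x
x = np.array([0.3, -1.2])
assert np.isclose(np.linalg.norm(A @ x), np.linalg.norm(sqrtG @ x))
```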

Therefore, for each $\mathbf{x} \in \mathbb{R}^n$, there is an orthogonal transformation from the range of $\sqrt{A^\top A}$ to the range of $A$ which sends $\sqrt{A^\top A}\,\mathbf{x}$ to $A\mathbf{x}$.

It can be shown that the orthogonal transformation mapping $\sqrt{A^\top A}\,\mathbf{x}$ to $A\mathbf{x}$ is the same transformation for every $\mathbf{x}$. Furthermore, even if the range of $\sqrt{A^\top A}$ is not all of $\mathbb{R}^n$, we can extend this orthogonal transformation to an orthogonal transformation on $\mathbb{R}^n$ by arbitrarily mapping vectors in a basis of the orthogonal complement of the range of $\sqrt{A^\top A}$ to the orthogonal complement of the range of $A$. In this way, we obtain the polar decomposition:

Theorem (Polar Decomposition)
For any $n \times n$ matrix $A$, there exists an orthogonal matrix $R$ such that

$$A = R\sqrt{A^\top A}.$$
This representation is useful because it expresses an arbitrary square matrix as a product of matrices whose properties are easier to understand (the orthogonal matrix $R$ because it is distance- and angle-preserving, and the positive semidefinite matrix $\sqrt{A^\top A}$ because it is orthogonally diagonalizable, by the Spectral Theorem).
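To make the theorem concrete for an invertible matrix, we can compute $R = A\left(\sqrt{A^\top A}\right)^{-1}$ directly and check its properties. This is a sketch with an arbitrarily chosen matrix, not a robust algorithm:

```python
import numpy as np

# Arbitrary invertible example matrix (for illustration only)
A = np.array([[1.0, 2.0], [-1.0, 1.0]])

# P = sqrt(A^T A), built from the spectral decomposition of A^T A
eigvals, V = np.linalg.eigh(A.T @ A)
P = V @ np.diag(np.sqrt(eigvals)) @ V.T

# The orthogonal factor is then R = A P^{-1}
R = A @ np.linalg.inv(P)

assert np.allclose(R.T @ R, np.eye(2))  # R is orthogonal
assert np.allclose(R @ P, A)            # A = R * sqrt(A^T A)
```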

Exercise
Let's explore a fast method of computing a polar decomposition $A = RP$. This method actually works by calculating $R$ and then recovering $P$ as $R^{-1}A$ (since this is computationally faster than calculating the matrix square root directly). We call $R$ the orthogonal part of $A$ and $P$ the symmetric part of $A$.

We set $R_0 = A$ and define the iteration

$$R_{k+1} = \frac{R_k + (R_k^\top)^{-1}}{2}.$$

Let's see why this converges to $R$.

1. Writing $R_k = R P_k$, where $R$ is the orthogonal part of $A$ (so that $P_0 = \sqrt{A^\top A}$), and using the equation $A = R\sqrt{A^\top A}$, show that

   $$R_{k+1} = R\,\frac{P_k + P_k^{-1}}{2}.$$

   Use the prior step to explain why the $R_k$'s all have the same orthogonal part and have symmetric parts converging to the identity matrix.

   Hint: consider the eigendecompositions of the symmetric parts. You may assume that the sequence defined by $x_{k+1} = \frac{1}{2}\left(x_k + \frac{1}{x_k}\right)$ converges to 1 regardless of the starting value $x_0 > 0$.

2. Write some code to apply this algorithm to the matrix

   $$A = \begin{bmatrix} 1 & 3 & 4 \\ 7 & -2 & 5 \\ -3 & 4 & 11 \end{bmatrix}$$

```python
import numpy as np
A = np.array([[1, 3, 4], [7, -2, 5], [-3, 4, 11]])
```

```julia
A = [1 3 4; 7 -2 5; -3 4 11]
```

and confirm that the resulting matrices $R$ and $P$ satisfy $R^\top R = I$ and $P^2 = A^\top A$.

Solution. Since both transposition and inversion reverse the order of matrix products, we get

$$(R_k^\top)^{-1} = \left((R P_k)^\top\right)^{-1} = (R^\top)^{-1}(P_k^\top)^{-1}.$$

Since $R$ is orthogonal, $(R^\top)^{-1} = R$, as $R^\top = R^{-1}$. Since $P_k$ is symmetric, $(P_k^\top)^{-1} = P_k^{-1}$. So this is equal to $R P_k^{-1}$, and therefore

$$R_{k+1} = \frac{R P_k + R P_k^{-1}}{2} = R\,\frac{P_k + P_k^{-1}}{2}.$$

We see that $R_0$ and $R_1$ have the same orthogonal part, and repeating the calculation shows that all the $R_k$'s have the same orthogonal part. As for the symmetric parts $P_k$, we see that

$$P_{k+1} = \frac{P_k + P_k^{-1}}{2}.$$

Let's see why this averaging process converges to the identity matrix. Symmetric matrices are diagonalizable by the Spectral Theorem, so suppose $P_0$ diagonalizes as $P_0 = V\Lambda V^{-1}$. Then

$$P_1 = \frac{V\Lambda V^{-1} + V\Lambda^{-1}V^{-1}}{2} = V\,\frac{\Lambda + \Lambda^{-1}}{2}\,V^{-1}.$$

Thus the $P_k$'s converge to the matrix $V\Lambda_\infty V^{-1}$, where $\Lambda_\infty$ is the diagonal matrix whose $i$th diagonal entry is the limit obtained when you start with the $i$th diagonal entry of $\Lambda$ and repeatedly apply the function $x \mapsto \frac{1}{2}\left(x + \frac{1}{x}\right)$. By the fact about this iteration given in the problem statement, we conclude that $\Lambda_\infty$ is the identity matrix, and hence so is $V\Lambda_\infty V^{-1}$. Therefore, the limit of $R_k = R P_k$ as $k \to \infty$ is equal to $R$.

For example:

```python
import numpy as np

def polar(A, n):
    R = A
    for i in range(n):
        R = (R + np.linalg.inv(R.T))/2
    return R, np.linalg.solve(R, A)

I = np.eye(3)
A = np.array([[1, 3, 4], [7, -2, 5], [-3, 4, 11]])
R, P = polar(A, 100)
R.T @ R - I, P @ P - A.T @ A
```
```julia
using LinearAlgebra

function polar(A, n)
    R = A
    for i = 1:n
        R = (R + inv(R'))/2
    end
    R, R \ A
end

A = [1 3 4; 7 -2 5; -3 4 11]
R, P = polar(A, 100)
R'*R - I, P^2 - A'*A
```

Both of the matrices returned on the last line have entries which are extremely close to zero, confirming that $R^\top R = I$ and $P^2 = A^\top A$ up to roundoff error.

Exercise
Show that the product of two matrices with orthonormal columns has orthonormal columns.

Solution. If $U^\top U = I$ and $V^\top V = I$, then $(UV)^\top(UV) = V^\top U^\top U V = V^\top V = I$.
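A quick numerical illustration, using the Q factors of two QR decompositions as examples of matrices with orthonormal columns (the shapes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((5, 3)))  # 5x3 with orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((3, 2)))  # 3x2 with orthonormal columns

W = U @ V  # 5x2 product
assert np.allclose(W.T @ W, np.eye(2))  # columns of W are orthonormal
```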

## The singular value decomposition

The polar decomposition tells us that any square matrix is almost the same as some symmetric matrix, and the spectral theorem tells us that a symmetric matrix is almost the same as a simple scaling along the coordinate axes. (In both cases, the phrase "almost the same" disguises a composition with an orthogonal transformation.) We should be able to combine these ideas to conclude that any square matrix is basically the same as a simple scaling along the coordinate axes!

Let's be more precise. Suppose that $A$ is a square matrix. The polar decomposition tells us that

$$A = R\sqrt{A^\top A}$$

for some orthogonal matrix $R$. The Spectral Theorem tells us that $\sqrt{A^\top A} = V\Sigma V^\top$ for some orthogonal matrix $V$ and a diagonal matrix $\Sigma$ with nonnegative diagonal entries.

Combining these equations, we get

$$A = RV\Sigma V^\top.$$

Since a product of orthogonal matrices is orthogonal, we can define $U = RV$ and obtain the singular value decomposition (SVD) of $A$:

$$A = U\Sigma V^\top,$$

where $U$ and $V$ are orthogonal matrices.
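We can mirror this derivation numerically with NumPy's np.linalg.svd, which returns $U$, the diagonal entries of $\Sigma$, and $V^\top$ (the example matrix is arbitrary):

```python
import numpy as np

A = np.array([[1.0, 3.0], [2.0, -1.0]])
U, s, Vt = np.linalg.svd(A)  # s holds the diagonal of Σ

assert np.allclose(U @ np.diag(s) @ Vt, A)   # A = U Σ Vᵀ
assert np.allclose(U.T @ U, np.eye(2))       # U is orthogonal
assert np.allclose(Vt @ Vt.T, np.eye(2))     # V is orthogonal
```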

We can visualize the decomposition geometrically by making a figure like the one shown below, which illustrates the successive effects of each map in the composition $U\Sigma V^\top$. If we draw grid lines on the second plot (just before $\Sigma$ is applied) and propagate those grid lines to the other plots by applying the indicated maps, then we endow the domain and range of $A$ with orthogonal sets of gridlines, with $A$ mapping one to the other.

We can extend the singular value decomposition to rectangular matrices (that is, matrices which are not necessarily square) by adding rows or columns of zeros to a rectangular matrix to get a square matrix, applying the SVD to that square matrix, and then trimming the resulting $\Sigma$ matrix as well as either $U$ or $V$ (depending on which dimension of $A$ is smaller). We get a decomposition of the form $A = U\Sigma V^\top$, where $U$ is an $m\times m$ orthogonal matrix, $V$ is an $n\times n$ orthogonal matrix, and $\Sigma$ is an $m\times n$ rectangular diagonal matrix.
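For a rectangular matrix, np.linalg.svd with full_matrices=True returns the square orthogonal factors, and the rectangular diagonal $\Sigma$ can be assembled by padding with zeros. A minimal sketch, using an arbitrary 2×3 example:

```python
import numpy as np

A = np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]])  # a 2x3 example
U, s, Vt = np.linalg.svd(A, full_matrices=True)   # U is 2x2, Vt is 3x3

S = np.zeros((2, 3))      # rectangular diagonal Σ
S[:2, :2] = np.diag(s)    # s has min(2, 3) = 2 entries

assert np.allclose(U @ S @ Vt, A)
```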

Theorem (Singular value decomposition)
Suppose that $A$ is an $m\times n$ matrix. Then there exist orthogonal matrices $U$ and $V$ and a rectangular diagonal matrix $\Sigma$ such that

$$A = U\Sigma V^\top.$$

We call $U\Sigma V^\top$ a singular value decomposition (or SVD) of $A$.

The diagonal entries of $\Sigma$, which are the square roots of the eigenvalues of $A^\top A$, are called the singular values of $A$. The columns of $U$ are called left singular vectors, and the columns of $V$ are called right singular vectors.

Looking at the bottom half of the SVD figure, we see that the singular values of $A$ are the lengths of the semi-axes of the ellipsoid in $\mathbb{R}^m$ obtained as the image under $A$ of the unit ball in $\mathbb{R}^n$. Moreover, the directions of these axes are the columns of $U$, since they are the images under $U$ of the standard basis vectors. We will see an important application of this feature of the SVD in the probability chapter when we discuss principal component analysis.
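Numerically, the relation $A\mathbf{v}_i = \sigma_i\mathbf{u}_i$ (which follows from $AV = U\Sigma$) exhibits these semi-axes directly, and the singular values match the square roots of the eigenvalues of $A^\top A$ (arbitrary example matrix):

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
U, s, Vt = np.linalg.svd(A)

# Each right singular vector maps to σ_i times the corresponding left one,
# so the image of the unit ball is an ellipse with semi-axis σ_i along u_i
for i in range(2):
    assert np.allclose(A @ Vt[i, :], s[i] * U[:, i])

# Singular values are the square roots of the eigenvalues of AᵀA
eig = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
assert np.allclose(s, np.sqrt(eig))
```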

As an example of how the singular value decomposition can be used to understand the structure of a linear transformation, we introduce the Moore-Penrose pseudoinverse $A^+$ of an $m\times n$ matrix $A$. We define $A^+$ to be $V\Sigma^+ U^\top$, where $\Sigma^+$ is the matrix obtained by inverting each nonzero element of $\Sigma$ and transposing the result. The pseudoinverse is a Swiss Army knife for solving the linear system $A\mathbf{x} = \mathbf{b}$:

• If $A$ is square and invertible, then $A^+ = A^{-1}$.
• If $A\mathbf{x} = \mathbf{b}$ has no solution, then $A^+\mathbf{b}$ is the value of $\mathbf{x}$ which minimizes $|A\mathbf{x} - \mathbf{b}|^2$ (in other words, the closest thing to a solution you can get).
• If $A\mathbf{x} = \mathbf{b}$ has multiple solutions, then $A^+\mathbf{b}$ is the solution with minimal norm.
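All three behaviors can be checked with np.linalg.pinv; the small systems below are hypothetical examples chosen to trigger each case:

```python
import numpy as np

# Square invertible: pinv coincides with inv
A = np.array([[2.0, 1.0], [1.0, 3.0]])
assert np.allclose(np.linalg.pinv(A), np.linalg.inv(A))

# Overdetermined (no solution): pinv gives the least-squares solution
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 1.0, 0.0])
x = np.linalg.pinv(A) @ b
lstsq_x, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, lstsq_x)

# Underdetermined (many solutions): pinv gives the minimal-norm one
A = np.array([[1.0, 1.0]])
b = np.array([2.0])
x = np.linalg.pinv(A) @ b
assert np.allclose(A @ x, b)
assert np.allclose(x, np.array([1.0, 1.0]))  # min-norm solution splits evenly
```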

Exercise
Show that the given matrix $A$ has the given SVD $A = U\Sigma V^\top$. Find its Moore-Penrose pseudoinverse.

Solution. Let $U$, $\Sigma$, and $V$ denote the three matrices given in the problem statement.

We need to check that $U$ and $V$ are orthogonal, and that $U\Sigma V^\top = A$. We can verify that $U$ and $V$ are orthogonal by showing that their columns are orthogonal unit vectors. Equivalently, we may compute the products $U^\top U$ and $V^\top V$ and observe that they are identity matrices. Similarly, the equation $U\Sigma V^\top = A$ can be verified by hand or on a computer.

The formula for the Moore-Penrose pseudoinverse is

$$A^+ = V\Sigma^+ U^\top.$$

The matrix $\Sigma^+$ is obtained by inverting the nonzero elements on the diagonal of $\Sigma$ and then transposing the resulting matrix.

With a little calculation, we arrive at

## SVD and linear dependence

Linear dependence is numerically fragile: if the columns of a matrix (with more rows than columns) are linearly dependent, then perturbing the entries slightly by adding tiny independent random numbers is almost certain to result in a matrix with linearly independent columns. However, intuition suggests that subverting the principles of linear algebra in this way is not going to solve any real-world problems that emerge from linear dependence relationships among the columns of a matrix.

This intuition is accurate, and it highlights the utility of having a generalization of the idea of linear independence which can quantify how close a list of vectors is to having linear dependence relationships, rather than remaining within the confines of the binary labels "linearly dependent" or "linearly independent". The singular value decomposition provides such a tool.

Exercise
Define a matrix with 100 rows and 5 columns, and do it in such a way that two of the five columns are nearly equal to some linear combination of the other three. Calculate the singular values of the matrix, and make a conjecture about how the number of approximate linear dependencies could have been detected from the list of singular values.


Solution. We see that two of the singular values are much smaller than the other three.

```python
import numpy as np
import matplotlib.pyplot as plt

A = np.random.standard_normal((100,5))
A[:,3] = A[:,2] + A[:,1] + 1e-2*np.random.standard_normal(100)
A[:,4] = A[:,1] - A[:,0] + 1e-2*np.random.standard_normal(100)
plt.bar(range(5), np.linalg.svd(A, compute_uv=False))
```
```julia
using Plots, LinearAlgebra
A = randn(100, 5)
A[:,4] = A[:,3] + A[:,2] + 1e-2*randn(100)
A[:,5] = A[:,2] - A[:,1] + 1e-2*randn(100)
bar(1:5, svdvals(A), label = "singular values")
```

We conjecture that the number of very small singular values indicates how many columns would need to be removed to obtain a matrix which does not have approximate linear dependence relationships among its columns.

In fact, the idea developed in this exercise is used by the NumPy function np.linalg.matrix_rank to calculate the rank of a matrix. Because of the roundoff errors associated with representing real numbers in memory on a computer, most matrices with float entries technically have full rank. Thus np.linalg.matrix_rank computes the singular value decomposition and returns the number of singular values of the matrix which are larger than a given threshold. The threshold is adjustable, but one common setting is machine epsilon times the largest singular value of the matrix times the largest dimension of the matrix.

The Julia function rank works the same way: it computes the singular values and counts those larger than an adjustable threshold.
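To illustrate (a hypothetical example: the last column is built as an exact combination of the first two, then perturbed by visible noise), the rank function counts singular values above the threshold, while the smallest singular value still betrays the near-dependence:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 5))
A[:, 4] = A[:, 0] + A[:, 1]   # exact linear dependence

# The tiny singular value from roundoff falls below the default threshold
assert np.linalg.matrix_rank(A) == 4

B = A + 1e-3 * rng.standard_normal(A.shape)  # noise restores full rank
assert np.linalg.matrix_rank(B) == 5

# ...but one singular value of B is still far smaller than the rest
s = np.linalg.svd(B, compute_uv=False)
assert s[-1] < 1e-2 * s[0]
```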

## SVD and image compression

We close this section with a computational exercise illustrating another widely applicable feature of the singular value decomposition.

Exercise

• Show that if $\mathbf{u}_1, \ldots, \mathbf{u}_n$ are the columns of $U$, $\mathbf{v}_1, \ldots, \mathbf{v}_n$ are the columns of $V$, and $\sigma_1, \ldots, \sigma_n$ are the diagonal entries of $\Sigma$, then $A = \sigma_1\mathbf{u}_1\mathbf{v}_1^\top + \cdots + \sigma_n\mathbf{u}_n\mathbf{v}_n^\top$.

• The equation $A = \sigma_1\mathbf{u}_1\mathbf{v}_1^\top + \cdots + \sigma_n\mathbf{u}_n\mathbf{v}_n^\top$ is useful for compression, because terms with sufficiently small singular value factors can be dropped and the remaining vectors and singular values can be stored using less space. Suppose that $A$ is an $m\times n$ matrix. How many entries does $A$ have, and how many entries do $\mathbf{u}_1$, $\mathbf{u}_2$, $\mathbf{u}_3$, $\mathbf{v}_1$, $\mathbf{v}_2$, $\mathbf{v}_3$ have in total?

• The Python code below creates a matrix $A$ with pixel values for the image shown. How many nonzero singular values does $A$ have? Explain how you can tell just from looking at the picture.

figure: img(src="/content/linear-algebra/images/zero.svg")

```python
import numpy as np
import matplotlib.pyplot as plt

m = 80
n = 100
a = m // 8
b = m // 4

def pixel(i, j):
    if (a <= i <= b or m-b <= i <= m-a) and a <= j <= n-a:
        return 0
    elif (a <= j <= b or n-b <= j <= n-a) and a <= i <= m-a:
        return 0
    return 1

# rows indexed by i, columns by j, so A is m x n
A = np.array([[pixel(i, j) for j in range(1, n+1)] for i in range(1, m+1)])

U, Σ, Vt = np.linalg.svd(A)

plt.matshow(A)
```
```julia
using LinearAlgebra, Plots

m = 80
n = 100
a = m ÷ 8
b = m ÷ 4

function pixel(i, j)
    if (a ≤ i ≤ b || m-b ≤ i ≤ m-a) && a ≤ j ≤ n - a
        0
    elseif (a ≤ j ≤ b || n-b ≤ j ≤ n-a) && a ≤ i ≤ m - a
        0
    else
        1
    end
end

A = [pixel(i,j) for i=1:m, j=1:n]

U, Σ, V = svd(A)
heatmap(A)
```
• Now add some noise to the image:

```python
B = A + 0.05*np.random.standard_normal((m,n))
```

```julia
B = A + 0.05 * randn(m, n)
```

Display this new matrix $B$, and also find the matrix obtained by keeping only the first three terms of $\sigma_1\mathbf{u}_1\mathbf{v}_1^\top + \cdots + \sigma_n\mathbf{u}_n\mathbf{v}_n^\top$ for this matrix $B$. Which looks more like the original image $A$: (i) $B$, or (ii) the three-term approximation of $B$?

Hint: you can achieve this computationally either by setting some singular values to 0 or by indexing the matrices $U$, $\Sigma$, and $V$ appropriately. Also, you will need the function np.diag (in Python) or diagm (in Julia) to generate a diagonal matrix from the vector of singular values returned by svd.


Solution.

• Let $B = \sigma_1\mathbf{u}_1\mathbf{v}_1^\top + \cdots + \sigma_n\mathbf{u}_n\mathbf{v}_n^\top$. We need to show that $A = B$. We will do this by first showing that $A\mathbf{e}_j = B\mathbf{e}_j$ for all $1 \le j \le n$.

Now, $A\mathbf{e}_j = U\Sigma V^\top\mathbf{e}_j$ for all $1 \le j \le n$. By definition of the matrix-vector product, $U\Sigma V^\top\mathbf{e}_j$ is a linear combination of the columns of $U$ with weights given by the entries of $\Sigma V^\top\mathbf{e}_j$. Since $\Sigma$ is diagonal, it is not hard to see that the $i$th entry of $\Sigma V^\top\mathbf{e}_j$ is $\sigma_i$ times the $i$th entry of $V^\top\mathbf{e}_j$. Using the definition of the matrix-vector product again, we find that the $i$th entry of $V^\top\mathbf{e}_j$ is the dot product of the $i$th row of $V^\top$ and $\mathbf{e}_j$. But the $i$th row of $V^\top$ is $\mathbf{v}_i^\top$ by definition, and thus this entry equals $\mathbf{v}_i^\top\mathbf{e}_j$. Therefore,

$$A\mathbf{e}_j = \sum_{i=1}^n \sigma_i\left(\mathbf{v}_i^\top\mathbf{e}_j\right)\mathbf{u}_i = \sum_{i=1}^n \sigma_i\,\mathbf{u}_i\left(\mathbf{v}_i^\top\mathbf{e}_j\right),$$

where $\mathbf{v}_i^\top\mathbf{e}_j$ is being treated as a $1\times 1$ matrix for all $1 \le i \le n$.

By linearity of matrix multiplication,

$$A\mathbf{e}_j = \left(\sum_{i=1}^n \sigma_i\,\mathbf{u}_i\mathbf{v}_i^\top\right)\mathbf{e}_j = B\mathbf{e}_j,$$

and thus $A\mathbf{e}_j = B\mathbf{e}_j$ for all $1 \le j \le n$. Since $A\mathbf{e}_j = B\mathbf{e}_j$ for all $j$, it follows that $AM = BM$ for any $n\times n$ matrix $M$. In particular, if $M$ is the identity matrix, we have $A = B$,

as required.

• $A$ has $mn$ entries, while $\mathbf{u}_1$, $\mathbf{u}_2$, $\mathbf{u}_3$, $\mathbf{v}_1$, $\mathbf{v}_2$, $\mathbf{v}_3$ combined have $3(m+n)$ entries.

• It can be seen from the picture that $A$ has three kinds of columns: one whose components are all dark, another whose components are light in the middle, and the other whose components are dark on the outside and in the middle with strips of light in between. These columns are clearly linearly independent, and thus $A$ has rank 3. Therefore, $A$ has three nonzero singular values.

We can select only the first three terms by suitably indexing the vectors, as follows:

```python
U, Σ, Vt = np.linalg.svd(B)
plt.matshow(U[:,:3] @ np.diag(Σ[:3]) @ Vt[:3,:])
```
```julia
U, Σ, V = svd(B)
heatmap(U[:,1:3] * diagm(Σ[1:3]) * V'[1:3,:])
```