# Numerical Computation: Optimization

Optimization is an area of math with especially plausible real-world applicability, because folks understand that we generally want things to be good. If we can capture a meaningful notion of goodness in the form of a mathematical function, then optimization methods offer to tell us how to make things as good as possible.

Machine learning relies heavily on this pattern. We usually don't have any direct way to select a particular model for a given data set, but we do have general classes of models which are often useful. We can choose a specific model from a class by capturing each model's failure to fit the data points in a function called the *loss function*. We then appeal to general optimization methods to minimize the loss.
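As a concrete (hypothetical) illustration of this pattern, suppose the model class is the set of lines $y = mx + b$; the squared-error loss below measures how badly a particular line fits the data:

```julia
# Hypothetical example: the model class is all lines y = m*x + b, and the
# loss of a particular (m, b) is the sum of squared vertical misses.
loss(m, b, xs, ys) = sum((y - (m*x + b))^2 for (x, y) in zip(xs, ys))

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]     # these points lie exactly on the line y = 2x
loss(2.0, 0.0, xs, ys)   # the true line incurs zero loss
loss(1.0, 1.0, xs, ys)   # any other line incurs positive loss
```

Minimizing this loss over $(m, b)$ selects the line of best fit; the same recipe works for any model class and any differentiable loss.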

Gradient descent is an approach to finding the minimum of a differentiable function $f$ from $\mathbb{R}^n$ to $\mathbb{R}$. The basic idea is to begin with some initial guess $\mathbf{x}_0$ and repeatedly step in the direction of $-\nabla f(\mathbf{x})$, since that is $f$'s direction of maximum decrease from a given point.

We can choose how large each step should be and when to stop. A common way to determine step size is to fix a learning rate $\epsilon$ and set $\mathbf{x}_{k+1} = \mathbf{x}_k - \epsilon \nabla f(\mathbf{x}_k)$. Note that the size of the step naturally gets smaller as we approach a local minimum, since the norm of the gradient goes to zero there. One way to choose when to terminate the algorithm is to set a threshold for the norm of the gradient of $f$.
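For instance (a minimal sketch, with the objective $f(x) = x^2$ chosen here purely for illustration), the update rule looks like this in code:

```julia
# Gradient descent on f(x) = x², where f′(x) = 2x and the minimum is at x = 0.
# Each update is xₖ₊₁ = xₖ - ϵ f′(xₖ).
function descend(x₀, ϵ, nsteps)
    x = x₀
    for _ in 1:nsteps
        x -= ϵ * 2x        # here f′(x) = 2x
    end
    x
end

descend(1.0, 0.1, 100)     # each iterate is 0.8 times the previous one; ≈ 0
```

Note how the steps shrink automatically: as $x$ approaches the minimum, the derivative (and hence the step) goes to zero.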

Gradient descent is fundamentally local: it is not guaranteed to find the global minimum since the search can get stuck in a local minimum. This point plays a significant role in deep learning, because loss functions in deep learning do not have unique minima.
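To see this locality concretely, here is a sketch (the double-well function below is chosen for illustration, not taken from the text) in which two starting points lead to different minima:

```julia
# f(x) = x⁴ - 2x² + x/2 has a global minimum near x ≈ -1.06 and a
# shallower local minimum near x ≈ 0.93; gradient descent finds whichever
# minimum lies in the starting point's basin.
f′(x) = 4x^3 - 4x + 0.5

function descend_from(x₀; ϵ=0.01, nsteps=10_000)
    x = x₀
    for _ in 1:nsteps
        x -= ϵ * f′(x)
    end
    x
end

descend_from(-2.0)   # ≈ -1.06, the global minimum
descend_from(2.0)    # ≈  0.93, stuck in the local minimum
```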

Exercise
Consider the function $f(t) = e^{-t^2/4}(t^4 - 2t^3 - t^2 + 3t - 1)$. Implement the gradient descent algorithm for finding the minimum of this function. To take derivatives, you can define a derivative function like `df(x) = ForwardDiff.derivative(f,x)`.
• For a small fixed learning rate $\epsilon$, which starting values $x_0$ have the property that $x_k$ is close to the global minimum of $f$ when $k$ is large?
• Is there a starting value $x_0$ between the first two local maxima of $f$ and a learning rate $\epsilon$ such that the gradient descent algorithm does not reach the global minimum of $f$? Use the graph of $f$ for intuition.

```julia
using ForwardDiff
```


Solution. The following is an implementation of gradient descent:

```julia
using LinearAlgebra, ForwardDiff

function gradientdescent(f, x₀, ϵ, threshold)
    df(x) = ForwardDiff.derivative(f, x)
    x = x₀
    while abs(df(x)) > threshold
        x = x - ϵ*df(x)
    end
    x
end

f(t) = exp(-t^2/4) * (t^4 - 2t^3 - t^2 + 3t - 1)
```

Trying various values of $x_0$, and looking at the graph, we conjecture that the global minimum is reached when the starting value is between the first two points where $f$ has a local maximum. Between the second of those local maxima and the next one, the algorithm leads us to the local minimum between them. Outside the interval from the first local maximum to the last, the sequence of iterates appears to head off to $-\infty$ or $+\infty$.

Skipping over the global minimum to the local one requires choosing $\epsilon$ large enough that the first jump skips over the local maximum separating the two minima. A little experimentation turns up values of $x_0$ and $\epsilon$ that work (among many other possibilities).

There are many gradient-descent-based optimization algorithms that improve on plain gradient descent in various ways. In this section, we'll discuss a few of them at a high level, so that you have a sense of what's going on when you select a method in calls to optimization library functions.

One issue with gradient descent is that near a local minimum where the graph of the objective function is valley-like (steep curvature in some directions and shallow curvature in others), gradient descent moves the iterates quickly to the valley floor but then only slowly along it toward the minimum. This can be mitigated by accounting for information about the objective function's second derivative; such methods are called Newton methods. Also popular are quasi-Newton methods, which avoid the expensive computation of the objective function's second derivative while still making use of second-order information by approximating it using the gradients that have to be evaluated anyway. BFGS (Broyden–Fletcher–Goldfarb–Shanno) and L-BFGS (a low-memory variant of BFGS) are common quasi-Newton methods.
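As a sketch of the core idea (one-dimensional, with hand-computed derivatives, and not how a library would implement it), a Newton step rescales the gradient by the curvature:

```julia
# Newton's method for minimization in one dimension: x ← x - f′(x)/f″(x).
# Illustrative objective: f(x) = x⁴ - 6x², so f′(x) = 4x³ - 12x and
# f″(x) = 12x² - 12; the minima are at x = ±√3.
function newton_min(f′, f″, x₀; nsteps=20)
    x = x₀
    for _ in 1:nsteps
        x -= f′(x) / f″(x)   # curvature-rescaled step
    end
    x
end

newton_min(x -> 4x^3 - 12x, x -> 12x^2 - 12, 2.0)   # ≈ √3 ≈ 1.732
```

Quasi-Newton methods like BFGS replace the exact $f''$ here with an approximation built up from successive gradient evaluations.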

Common optimizers in deep learning include Nesterov accelerated gradient, which uses the idea of momentum: each step is taken based on a weighted average of the gradients at the current and preceding iterates. RMSProp directly addresses the valley problem by boosting the step size in coordinates with respect to which the partial derivative is small (and tempering the step size in coordinates with large derivative). Adam combines RMSProp with momentum.
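A minimal sketch of the momentum idea (illustrative only; library implementations of Nesterov acceleration and Adam differ in details):

```julia
# Gradient descent with momentum: v is an exponentially weighted average of
# past gradients, so persistent directions accelerate while oscillating
# directions damp out.
function momentum_descend(f′, x₀; ϵ=0.1, β=0.9, nsteps=200)
    x, v = x₀, 0.0
    for _ in 1:nsteps
        v = β*v + (1 - β)*f′(x)   # running average of gradients
        x -= ϵ*v
    end
    x
end

momentum_descend(x -> 2x, 1.0)    # minimizes f(x) = x²; result ≈ 0
```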

Exercise
(a) By inspection, find the minimum of the Rosenbrock function

$$f(x,y) = (1-x)^2 + 100(y - x^2)^2.$$
(b) Use the code cell below to find how many iterations the GradientDescent algorithm runs for on the Rosenbrock function, starting from the origin.

(c) Change GradientDescent to BFGS. Did it use fewer iterations? Was it more accurate?

```julia
using Optim

rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
optimize(rosenbrock, zeros(2), GradientDescent())
```

Solution. Based on the report printed when the optimize function runs, GradientDescent fails: it takes the maximum-allowed 1000 iterations and gets only somewhat close to the minimum. BFGS converges in 16 iterations, with only 53 function calls and 53 gradient calls, and it gets the answer exactly right.

## Convexity

A subset of $\mathbb{R}^n$ is convex if it contains every line segment connecting any two points in the set (for example, triangles, disks, and regular polygons are convex, while a star shape or an "L" shape is not).

A function from $\mathbb{R}^n$ to $\mathbb{R}$ is convex if a line segment connecting any two points on its graph lies on or above the graph. For example, a function with a bowl-shaped graph (such as $f(x,y) = x^2 + y^2$) is convex, while a function with a saddle-shaped graph (such as $f(x,y) = x^2 - y^2$) is not convex. A convex function is strictly convex if a line segment connecting any two points on its graph touches the graph only at the endpoints. You can check whether a smooth function is convex by checking whether its Hessian is positive semidefinite everywhere. If the Hessian is positive definite everywhere, then the function is also strictly convex.
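These checks are easy to carry out numerically. The following sketch (reusing the ForwardDiff package from earlier in this section, with the bowl and saddle functions as examples) evaluates each Hessian at a point:

```julia
using ForwardDiff, LinearAlgebra

bowl(v)   = v[1]^2 + v[2]^2   # Hessian [2 0; 0 2]: positive definite
saddle(v) = v[1]^2 - v[2]^2   # Hessian [2 0; 0 -2]: indefinite

isposdef(ForwardDiff.hessian(bowl,   [1.0, 2.0]))   # true
isposdef(ForwardDiff.hessian(saddle, [1.0, 2.0]))   # false
```

For these quadratics the Hessian is constant, so checking one point suffices; in general, convexity requires the condition to hold everywhere.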

A convex optimization problem is a problem of the form "find the minimum value of $f(\mathbf{x})$ subject to $g(\mathbf{x}) \le 0$", where $f$ and $g$ are convex. Compared to general optimization, convex optimization is particularly well-behaved:

Theorem
If $f$ is convex and the set on which it is minimized is convex, then any local minimum of $f$ is also a global minimum of $f$. Furthermore, if $f$ is strictly convex, then $f$ has at most one local minimum.

Convex optimization problems play an important role in applied math and data science, because (1) many optimization problems of interest can be expressed in the form of a convex optimization problem, and (2) specialized, fast numerical methods are available for such problems.

Example
Use the Julia package JuMP with the Ipopt solver to find the minimum of the function $x^2 + 2y^2$ on the half-plane $x + y \ge 1$.

Solution. The JuMP model involves instantiating a Model object and then (1) creating variables, (2) adding constraints, (3) adding an objective function, (4) solving the problem, and (5) retrieving the values from the variable objects.

```julia
using JuMP, Ipopt

model = Model(with_optimizer(Ipopt.Optimizer, print_level=0))
@variable(model, x)
@variable(model, y)
@constraint(model, x + y ≥ 1)
@objective(model, Min, x^2 + 2y^2)
optimize!(model)
JuMP.value(x), JuMP.value(y)
```

The last line returns (0.6666, 0.3333). You can confirm using the method of Lagrange multipliers that the correct answer is $(x, y) = \left(\tfrac{2}{3}, \tfrac{1}{3}\right)$.
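Spelled out, the Lagrange multiplier check goes as follows. The minimum must occur on the boundary line $x + y = 1$, since the unconstrained minimum $(0,0)$ is infeasible, so we require the gradients of objective and constraint to be parallel:

$$\nabla(x^2 + 2y^2) = \lambda\,\nabla(x + y) \quad\Longrightarrow\quad 2x = \lambda, \quad 4y = \lambda \quad\Longrightarrow\quad x = 2y.$$

Substituting $x = 2y$ into the constraint $x + y = 1$ gives $y = \tfrac{1}{3}$ and $x = \tfrac{2}{3}$, matching the numerical output.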

Exercise
Use JuMP to find the line of best fit for the points $(1,2)$, $(2,5)$, and $(4,5)$. In other words, find the values $m$ and $b$ such that the sum of squared vertical distances from these three points to the line $y = mx + b$ is minimized.

Solution. We can follow the same approach, but we don't have to use any constraints:

```julia
using JuMP, Ipopt

A = [1 2; 2 5; 4 5]
model = Model(with_optimizer(Ipopt.Optimizer, print_level=0))
@variable(model, m)
@variable(model, b)
@objective(model, Min, sum([(y - m*x - b)^2 for (x, y) in eachrow(A)]))
optimize!(model)
JuMP.value(m), JuMP.value(b)
```
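As a sanity check (not part of the exercise), the same coefficients fall out of the closed-form least-squares solution, which in Julia is a backslash solve:

```julia
# Least squares directly: solve [x 1] * [m; b] ≈ y in the least-squares
# sense using the backslash operator.
A = [1 2; 2 5; 4 5]
xs, ys = A[:, 1], A[:, 2]
m_ls, b_ls = [xs ones(3)] \ ys   # m = 6/7 ≈ 0.857, b = 2.0
```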