Gradient descent seems to fail


I implemented a gradient descent algorithm in Octave to minimize a cost function, in order to obtain a hypothesis for judging whether an image has good quality. The idea is loosely based on the algorithm from the machine learning class by Andrew Ng.

I have 880 values in "y", ranging from 0.5 to about 12, and 880 values in "X", ranging from 50 to 300, that should predict the image's quality.

Sadly the algorithm seems to fail: after some iterations the values for theta blow up, and theta0 and theta1 become "NaN". My linear regression curve also has strange values...

Here is the code for the gradient descent algorithm (theta = zeros(2, 1), alpha = 0.01, iterations = 1500):

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)

m = length(y); % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters

    tmp_j1 = 0; % reset the accumulators every iteration
    tmp_j2 = 0;

    for i = 1:m
        tmp_j1 = tmp_j1 + ((theta(1,1) + theta(2,1)*X(i,2)) - y(i));
    end

    for i = 1:m
        tmp_j2 = tmp_j2 + (((theta(1,1) + theta(2,1)*X(i,2)) - y(i)) * X(i,2));
    end

    tmp1 = theta(1,1) - (alpha * ((1/m) * tmp_j1));
    tmp2 = theta(2,1) - (alpha * ((1/m) * tmp_j2));

    theta(1,1) = tmp1; % simultaneous update of both thetas
    theta(2,1) = tmp2;

    % ============================================================

    % Save the cost J in every iteration
    J_history(iter) = computeCost(X, y, theta);

end
end

And here is the computation of the cost function:

function J = computeCost(X, y, theta)

m = length(y); % number of training examples
J = 0;
tmp = 0;
for i = 1:m
    tmp = tmp + (theta(1,1) + theta(2,1)*X(i,2) - y(i))^2; % squared error of example i
end
J = (1/(2*m)) * tmp;
end

I think that your computeCost function is wrong. I attended Ng's class last year, and I have the following (vectorized) implementation:

m = length(y);
J = 0;
predictions = X * theta;
sqrErrors = (predictions-y).^2;

J = 1/(2*m) * sum(sqrErrors);

The rest of the implementation seems fine to me, although you could vectorize it as well:

theta_1 = theta(1) - alpha * (1/m) * sum((X*theta-y).*X(:,1));
theta_2 = theta(2) - alpha * (1/m) * sum((X*theta-y).*X(:,2));

Afterwards you set the temporary thetas (here called theta_1 and theta_2) back to the "real" theta, so that both are updated simultaneously.

Generally, vectorized code is preferable to loops: it is easier to read and to debug, and it scales better.
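Putting the vectorized cost and the vectorized update together, a complete replacement for the loop version might look like this (a sketch under the course's conventions: X is m x 2 with a leading column of ones, theta is a 2 x 1 vector; the function name is mine):

```octave
% Sketch of a fully vectorized gradient descent, same interface as in the question.
function [theta, J_history] = gradientDescentVec(X, y, theta, alpha, num_iters)
  m = length(y);
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    grad  = (1/m) * (X' * (X * theta - y)); % 2 x 1 gradient over all examples at once
    theta = theta - alpha * grad;           % simultaneous update of both thetas
    J_history(iter) = 1/(2*m) * sum((X * theta - y).^2); % cost after the update
  end
end
```

Note that the gradient expression handles any number of features unchanged, which is the scalability advantage mentioned above.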

I vectorized the theta update; maybe it will help somebody:

theta = theta - (alpha/m *  (X * theta-y)' * X)';

If you are OK with using a least-squares cost function, then you could try using the normal equation instead of gradient descent. It's much simpler (only one line) and computationally faster.

Here is the normal equation:

theta = (X^T * X)^(-1) * X^T * y

And in Octave form:

theta = (pinv(X' * X )) * X' * y

Here is a tutorial that explains how to use the normal equation:
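As a quick sanity check, with made-up data in the same range as in the question, the normal equation recovers the line exactly (the x and y values below are hypothetical):

```octave
x = [50; 100; 150; 200; 250; 300];   % feature range as in the question
y = 3 + 0.5 * x;                     % hypothetical noise-free line
X = [ones(length(x), 1), x];         % prepend the intercept column
theta = pinv(X' * X) * X' * y;       % normal equation: no alpha, no iterations
% theta is approximately [3; 0.5]
```

It also works regardless of feature scaling, which is one more reason it is attractive with a 50-300 feature range.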

While not as scalable as a vectorized version, a loop-based computation of gradient descent should produce the same results. In the example above, the most probable cause of gradient descent failing to compute the correct theta is the value of alpha.

With a verified set of cost and gradient descent functions and data similar to the set described in the question, theta ends up with NaN values after just a few iterations when alpha = 0.01. However, when set to alpha = 0.000001, gradient descent works as expected, even after 100 iterations.
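An alternative to shrinking alpha that drastically is feature scaling: once a feature in the 50-300 range is normalized to zero mean and unit variance, alpha = 0.01 converges fine. A sketch with hypothetical data in the ranges from the question:

```octave
% Hypothetical data shaped like the question's: x in [50, 300], y roughly in [0.5, 12.5]
x = linspace(50, 300, 100)';
y = 0.04 * x + 0.5;

mu = mean(x);  sigma = std(x);
x_scaled = (x - mu) / sigma;            % mean normalization
X = [ones(length(x), 1), x_scaled];

theta = zeros(2, 1);
alpha = 0.01;  m = length(y);
for iter = 1:1500
  theta = theta - (alpha/m) * (X' * (X * theta - y)); % vectorized update
end
% theta stays finite; the same loop on unscaled X diverges to NaN with alpha = 0.01
```

Remember that predictions must then use the same mu and sigma to scale new inputs.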

  • I guess you skipped the vectorisation lecture - same as me ;)
  • thanks ;) (I attended the course last year too and wanted to map the solution on a current problem;), so your answer should be okay ;))
  • BTW, maybe I am forgetting, but shouldn't the 1st one be theta_1 = theta(1) - alpha * (1/m) * sum(X*theta - y)?
  • Thanks, I used a loop instead of vectorisation and had same problem, what a life saver! Agreed looks smarter
  • and vectorized form is also scalable across multiple features
  • This was tremendously helpful - the verbosity of the matrix math made all the difference for me.
  • This one is the right answer. Another way to write this: theta = theta - alpha / m * (X' * (X * theta - y)); It's better to use vectorization when possible.
  • ah excellent, I knew there had to be a way but I couldn't get the math right on paper ;)
  • For those copy-and-pasting: this is the correct update only for a linear activation function, not for sigmoids and all the other stuff.
  • Why isn't it like this: theta = theta - (alpha/m) * sum(((X * theta) - y)' * X);? The gradient descent equation contains a summation.
  • You only want to sum across the values for each theta, not across all the theta results. X' * (X * theta - y) needs to end up as a 2x1 vector; wrapping it in sum would collapse it to a 1x1 scalar, which would ruin the matrix algebra.
  • The question is about Gradient descent. Normal equation can also be used.