Just finished a online extension course on Machine Learning via Stanford Engineering school.
Excellent course.
Linear RegressionWe learned about simple linear regression - fitting a line/curve to a 2D set of data points. This was then expanded to include 3 or more dimensions. This is useful for prediction and extrapolation where 2 or more variables (dimensions) represent correlated features (height and weight for example.) In a more complex scenario, we focused on predicting house prices base on various features such as square footage, number of rooms, number of floors, etc. Some important learning from this was the need to normalize/scale the features so they have similar ranges: Xs = (Xi-mean(X))/range(X).
Linear regression fits a straight line, but by applying increasing powers to each feature variable, (x1, x2^2, x3^3, etc), and passing the results into the gradient descent algorithm, you can fit a polynomial curve. Or take logarithms, roots, etc., to fit other types of curves.
The algorithm which finds the best fit is called gradient descent. We form a hypothesis formula to predict our desired outcome (in this case price): h(X) = p1*x1 + p2*x2 + p3*x3...
The trick is to find the values of the vector P (p1,p2,p3,...) which gives the best prediction. We determine how good/bad the prediction is by comparing the training data X with the predicted data Y in a cost function J = mean((X-Y)^2) - so the best fit would equate to the lowest cost function results.
Gradient descent uses the derivative of the cost function which calculates the rate of change of the cost function with respect to each parameter in P. p1 = p1 - a*derivative(J,p1)
'a' here is the learning parameter which decides how big a step along the cost curve to take. The parameter is updated. As it reaches the minmum point of the cost curve, the derivative essentially becomes 0 (zero slope), so it no longer changes. At this point, the parameter has been found for that minimum.
The algorithm converges then on the parameter values, which, when fed into the hypothesis equation above - should represent the best fit to the training data.
Logistic RegressionLinear regression is good for prediction of real valued numbers, but in many cases, we want to use a set of features to classify something true or false. For example, whether a tumor is malignant/benign or whether an email is spam or not. Linear regression does not reliably work in these cases because the resulting variable is integral (0 or 1 in these cases) - not a real number.
In this case we need a hypothesis function that examines variables bounded between 0 and 1. We use a "sigmoid" function to bound the values h(x) = 1 / (1 + e^(theta*x)) where "theta" is a parameter similar to linear regression. This function has values between 0 and 1 - essentially representing the probability that, given theta and x, the result y is true or false.
With this, we can create a cost function that compares the difference between the predicted values y and the training data. We calculate the derivative of this function and again use gradient descent to find the optimal parameters (thetas).
Neural NetworksNeural networks are a possibly much more efficient way to find the optimal parameters for a classification problem - such as the ones we solve with Logistic Regression. Especially ones with large number of classes. In logistic regression, our output classes are usually yes/no (k=2), or perhaps a prediction of the letter grades students will get based on certain habits they exhibit, a multi-class definition k>2 where k is less than 10.
Neural networks would extend to something like predicting letters in an alphabet based on the pixels (each one being a variable) in an image. This has a k=26 set of classes and for an image of 20*20 gray scale pixels, 400 variables (and thus 400 parameters to calculate).
Logistic regression would take a lot of time and would examine many non-optimal parameter values as it searched. Neural networks have been found to more quickly converge to optimal parameters by calculating the parameter values in stages (or "layers") that feed outputs from an input layer to one or more inner layers, that feed an output layer.
We're still working with a hypothesis function and cost function, and we use numerical method to minimize the cost function - which essentially is the neural network. The cost function uses the neural network algorithm to determine a cost, along with an array of gradients (derivatives) for each variable. These outputs are fed into the minimization function to calculate the optimal parameter weights for the neural network layers.
Once this "training" is completed, we have a set of weights for each layer of the network. The prediction calculation then boils down to a series of matrix multiplications for each layer to calculate the output values - which is very speedy.
In all the above cases, we use a training set of data with known inputs/outputs. The hypothesis function is written in a way that pairs each variable (feature) with a parameter that weights the variable in some way. The comparison between the known predictions and the hypothesized predictions is the "cost" function. In all cases, our goal is to minimize the difference or cost by choosing the best weights (parameters) to use in the hypothesis function.