Linear Regression with TensorFlow
Linear regression falls into the broad category of supervised learning and is a simple and commonly used Machine Learning (ML) algorithm. As such, it is a good starting point to illustrate how TensorFlow can be used in ML applications. In this post, we will first summarize the cornerstones of linear regression and, later on, we will walk through its TensorFlow implementation.
Linear Regression: Theory
Linear regression models the relation between independent and dependent variables by a linear equation. In this post, we will consider the simple case of a one-dimensional problem.
Suppose, then, that we have a certain number of experimental measurements of a phenomenon available. As an example, we consider observations reporting the birth rate as a function of the poverty level, see figure below:
In particular, the above figure reports, on the x-axis, the poverty level for each of the 50 states of the USA plus the District of Columbia. The poverty level refers to the year 2000 and is measured as the percentage of each state's population living in households with incomes below the federally defined poverty level. The y-axis reports the birth rate, for the year 2002, per 1000 females 15 to 17 years old. From the figure, it can be seen that the link between birth rate and poverty level can be approximated as linear. In other words, the relation between the dependent variable y (the birth rate) and the independent variable x (the poverty level) can be written as

y = m x + b

where m is the slope and b the intercept with the y-axis. Once the above linear approximation is established, the line can be used to forecast. Said differently, to guess the birth rate corresponding to a poverty level x not appearing in the scatter plot, it suffices to evaluate the parameters m and b of the linear relation from the available data and apply the above formula.
The parameters m and b are found by approximating the scatter plot by a line. To do so, a measure of the goodness of the approximation must be devised and m and b must be found as the best solutions according to such a measure.
There are many measures of the goodness of our prediction, the most popular one being the mean squared error (MSE)

MSE = (1/N) Σ_n (f_n - y_n)^2

where N is the number of experimental measurements in the scatter plot (51 in our example), the f_n's are the experimental observations and the y_n's are the values returned by the model, namely,

y_n = m x_n + b

where the x_n's are the observed poverty levels. Functions like the MSE detailed above are called loss functions, cost functions or objective functions. In this post, the values of m and b are found as those minimizing the MSE loss function.
The search for the "optimal" m and b can be carried out in practice by an iteration loop that, starting from an initial guess, performs two main operations:
- measure the goodness of the fit based on the MSE;
- adjust the unknown parameters m and b.
The operations in the loop are repeated until the MSE "looks good".
More in detail, the adjustment, or update, of the unknown parameters can be operated by methods based on the computation of the gradient of the MSE functional. Among the various existing gradient-based methods, in the following we will use a simple one, known as gradient descent, which is very often employed in artificial intelligence applications. Denoting by p = (m, b) the unknowns vector, the gradient descent method updates the unknowns according to the following rule

p_(k+1) = p_k - α ∇MSE(p_k)

where p_k is the unknowns vector at the current step, p_(k+1) is the updated unknowns vector, ∇MSE(p_k) is the current gradient value and α is the so-called learning rate. The learning rate is a user-chosen parameter representing how much we move in the unknowns space along the direction opposite to the gradient: it should be chosen small enough to guarantee convergence, yet large enough to observe significant changes in the functional value at each iteration.
The iterations can be quit according to different stopping criteria. For example:
- the algorithm is terminated once a specified number of iterations is reached;
- the algorithm is terminated once the MSE falls below a specified maximum value;
- the algorithm is terminated if the MSE does not significantly decrease from one iteration to the next; for example, if the difference between two successive MSEs is less than 0.001, the algorithm is stopped.
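As an illustration of the loop above, here is a minimal plain-NumPy sketch of gradient descent with a stopping criterion; the dataset, the learning rate and the 1e-6 threshold are illustrative choices, not taken from the original code:

```python
import numpy as np

# Illustrative 1D dataset (not the poverty/birth-rate data)
x = np.array([1.0, 2.0, 3.0, 4.0])
f = np.array([2.1, 3.9, 6.2, 8.1])   # observations, roughly f = 2 * x

m, b = 0.0, 0.0   # initial guess for the unknowns
alpha = 0.01      # learning rate
prevMSE = np.inf

for _ in range(5000):                  # stop after a maximum number of iterations
    y = m * x + b                      # model predictions
    mse = np.mean((f - y) ** 2)        # measure the goodness of the fit
    if prevMSE - mse < 1e-6:           # stop if the MSE no longer decreases
        break
    prevMSE = mse
    # Gradient of the MSE with respect to m and b
    gradM = -2.0 * np.mean((f - y) * x)
    gradB = -2.0 * np.mean(f - y)
    m -= alpha * gradM                 # gradient-descent update
    b -= alpha * gradB
```

For this small dataset the loop converges to the least-squares line, with m close to 2 and b close to 0.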
In the case when the observations are multi-dimensional instead of one-dimensional, the experimental observations become vectors f_n, while the model becomes

y = A x + b

where y is the vector of dependent variables, A is a coefficient matrix, b is the offset vector and x is the vector of independent variables.
In the next subsection, we will see how it is possible to put theory into practice by using TensorFlow 2.x.
Linear Regression: Practice
Turning to the code, the first operations performed are the imports.
The TensorFlow library is imported as tf, the NumPy library as np and the pyplot module of the Matplotlib library as plt. The NumPy library is used to manage arrays and for random number generation, while the Matplotlib library is used for the final plot.
The next operation is to provide a short name to the random module of the NumPy library:
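A minimal sketch of these first operations; the shorthand name rng is an assumption, since the original snippet is not embedded here:

```python
import tensorflow as tf          # TensorFlow, imported as tf
import numpy as np               # NumPy, imported as np
import matplotlib.pyplot as plt  # pyplot module of Matplotlib, imported as plt

rng = np.random                  # short name for NumPy's random module
```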
Afterwards, we define the simulation parameters, namely, the learning rate alpha, the number of iterations of the gradient descent numIter and skipIter. The idea is to output the simulation state, in terms of iteration number, cost function and current values of the unknowns, every skipIter iterations:
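For instance, with illustrative values (the original post's exact settings may differ):

```python
# Illustrative simulation parameters
alpha = 0.001     # learning rate
numIter = 1000    # number of gradient-descent iterations
skipIter = 100    # print the simulation state every skipIter iterations
```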
At this point, we need to define the training dataset, namely, the couples of "poverty level" X and "birth rate" Y values:
The above data have been taken from:
J.M. Utts, R.F. Heckard, Mind on Statistics, Fifth Ed., Cengage Learning, Stamford, CT, 2015
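Since the original gist is not embedded here, a placeholder definition may look as follows; the values below are illustrative only, the actual 51 couples being available in the repository linked at the end of the post:

```python
import numpy as np

# Placeholder values, NOT the actual Utts & Heckard dataset
X = np.array([10.3, 12.6, 14.5, 16.1, 18.0], dtype=np.float32)  # poverty levels
Y = np.array([22.1, 26.3, 31.5, 35.0, 39.2], dtype=np.float32)  # birth rates
```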
Two TensorFlow variables, m and b, are then defined and initialized to random values drawn from a Gaussian distribution with unit variance. These two variables store the current values of the unknowns:
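A possible sketch of this step, with rng standing for np.random as above:

```python
import numpy as np
import tensorflow as tf

rng = np.random

# Unknowns initialized from a unit-variance Gaussian distribution
m = tf.Variable(rng.randn(), dtype=tf.float32)
b = tf.Variable(rng.randn(), dtype=tf.float32)
```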
It is now necessary to define two functions. The first one represents the linear model y = m x + b, while the second one the MSE functional defined above:
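A minimal sketch of the two functions; the names linearModel and meanSquaredError are hypothetical, since the original snippet is not embedded here:

```python
import tensorflow as tf

def linearModel(x, m, b):
    # Linear model y = m * x + b
    return m * x + b

def meanSquaredError(yPred, yTrue):
    # MSE = mean over n of (f_n - y_n)^2
    return tf.reduce_mean(tf.square(yTrue - yPred))
```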
Following the definition of the two functions specifying the linear regression method, the optimizer is chosen and set to be the stochastic gradient descent with a learning rate equal to alpha:
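A sketch of this setting (the value of alpha is illustrative):

```python
import tensorflow as tf

alpha = 0.001  # learning rate (illustrative value)

# Plain SGD without momentum; with full-batch gradients it reduces
# to classical gradient descent
optimizer = tf.optimizers.SGD(learning_rate=alpha)
```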
Without going into the details, let us just mention that, since the gradient is evaluated on the whole dataset at each step, with the setting in the above code snippet optimizer particularizes into the classical gradient descent.
Let us now show the optimization loop:
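Since the original snippet is hosted externally, here is a self-contained sketch of what such a loop may look like; the data values, parameter settings and the function name optimizationStep are illustrative assumptions:

```python
import numpy as np
import tensorflow as tf

rng = np.random
alpha, numIter, skipIter = 0.001, 1000, 100   # illustrative settings

# Placeholder data; the actual 51 couples are in the linked repository
X = np.array([10.3, 12.6, 14.5, 16.1, 18.0], dtype=np.float32)
Y = np.array([22.1, 26.3, 31.5, 35.0, 39.2], dtype=np.float32)

m = tf.Variable(rng.randn(), dtype=tf.float32)
b = tf.Variable(rng.randn(), dtype=tf.float32)
optimizer = tf.optimizers.SGD(learning_rate=alpha)

def optimizationStep():
    # Generic optimization step: forward evaluation, gradients, update
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(Y - (m * X + b)))
    gradients = tape.gradient(loss, [m, b])
    optimizer.apply_gradients(zip(gradients, [m, b]))
    return loss

for k in range(numIter):
    loss = optimizationStep()
    if (k + 1) % skipIter == 0:
        # Visualization of the current optimization state
        print(f"iter {k + 1}: MSE = {float(loss):.4f}, "
              f"m = {float(m):.4f}, b = {float(b):.4f}")
```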
As can be seen, it is made up of two parts. The first represents the generic optimization step optimizationStep(). The second deals with the visualization, every skipIter iterations, of the current optimization result corresponding to the current values of the estimates Y.
The optimization step is provided by the following function:
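A possible sketch of such a function, with placeholder data and an illustrative learning rate:

```python
import numpy as np
import tensorflow as tf

# Placeholder data and hypothetical names, for illustration only
X = np.array([10.3, 12.6, 14.5, 16.1, 18.0], dtype=np.float32)
Y = np.array([22.1, 26.3, 31.5, 35.0, 39.2], dtype=np.float32)
m = tf.Variable(np.random.randn(), dtype=tf.float32)
b = tf.Variable(np.random.randn(), dtype=tf.float32)
optimizer = tf.optimizers.SGD(learning_rate=0.001)

def optimizationStep():
    # The tape records the operations of the forward evaluation of the loss...
    with tf.GradientTape() as tape:
        yPred = m * X + b
        loss = tf.reduce_mean(tf.square(Y - yPred))
    # ...so that it can later return d(loss)/dm and d(loss)/db
    gradients = tape.gradient(loss, [m, b])
    # The gradient information drives the update of m and b
    optimizer.apply_gradients(zip(gradients, [m, b]))
    return loss
```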
It should be noticed that the optimization step also accounts for the gradient calculation by automatic differentiation. In particular, the with tf.GradientTape() construct serves to record all the operations performed during the forward evaluation of the loss functional. The recorded tape can then return the derivatives of the loss function with respect to the weight m and the bias b. This gradient information is finally passed to optimizer.apply_gradients, which performs the update of the unknowns.
The result of the optimization is illustrated in the following figure:
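Such a figure, reporting the data scatter plot together with the fitted line, can be produced with a snippet along these lines; the data and the fitted values of m and b below are placeholders:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data and example trained values of m and b, for illustration
X = np.array([10.3, 12.6, 14.5, 16.1, 18.0], dtype=np.float32)
Y = np.array([22.1, 26.3, 31.5, 35.0, 39.2], dtype=np.float32)
mFit, bFit = 2.27, -1.57

plt.scatter(X, Y, label="Observations")                # experimental data
plt.plot(X, mFit * X + bFit, "r", label="Fitted line") # regression line
plt.xlabel("Poverty level [%]")
plt.ylabel("Birth rate per 1000 females aged 15-17")
plt.legend()
plt.show()
```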
The full code is available at our GitHub page: https://github.com/vitalitylearning2021/Machine-Learning/tree/main/Linear_Regression .