Linear Regression with TensorFlow

Linear regression falls into the broad category of supervised learning and is a simple, commonly used Machine Learning (ML) algorithm. As such, it is a good starting point to illustrate how TensorFlow can be used in ML applications. In this post, we will first summarize the cornerstones of linear regression and then walk through its TensorFlow implementation.

Linear Regression: Theory

Linear regression models the relation between independent and dependent variables by a linear equation. In this post, we will consider the simple case of a one-dimensional problem.

Let us thus suppose that a certain number of experimental measurements of some phenomenon are available. As an example, we consider observations reporting the birth rate as a function of the poverty level, see figure below:

Observation of birth rates against poverty level.

In particular, the above figure reports, on the x-axis, the poverty level for each of the 50 states of the USA plus the District of Columbia. The poverty level has been evaluated for the year 2000 and measured as the percentage of each state’s population living in households with incomes below the federally defined poverty level. Moreover, the figure reports, on the y-axis, the birth rate, relative to the year 2002, per 1000 females 15 to 17 years old. From the figure, it can be seen that the link between birth rate and poverty level can be approximated as linear. In other words, the relation between the dependent variable y and the independent variable x can be modeled as

y = m x + b

where m is the slope and b the intercept with the y-axis. Once the above linear approximation is established, the line can be used to forecast. Put differently, whenever we wish to estimate the birth rate corresponding to a poverty level x not appearing in the scatter plot, and once the parameters m and b of the linear relation have been evaluated from the available data, the above formula provides the estimate.

The parameters m and b are found by approximating the scatter plot by a line. To do so, a measure of the goodness of the approximation must be devised and m and b must be found as the best solutions according to such a measure.

There are many measures of the goodness of our prediction, the most popular one being the mean squared error (MSE)

MSE = (1/N) Σ_{n=1..N} (y_n − f_n)²

where N is the number of experimental measurements in the scatter plot (51 in our example), the f_n’s are the experimental observations and the y_n’s are the values returned by the model, namely,

y_n = m x_n + b

where the x_n’s are the observed poverty levels. Functions like the MSE function detailed above are called loss functions or objective functions. The values of m and b are found in this post as those minimizing the MSE cost function.

The search for the “optimal” m and b can be practically carried out by an iterative loop that, starting from an initial guess, performs two main operations:

  • measure the goodness of the fit based on the MSE;
  • adjust the unknown parameters m and b.

The operations in the loop are repeated until the MSE “looks good”.

More in detail, the adjustment, or update, of the unknown parameters can be operated by using methods based on the computation of the gradient of the MSE functional. Among the various existing gradient-based methods, in the following we will use a simple one, known as gradient descent, which is very often employed in artificial intelligence approaches. On denoting by

p = (m, b)

the unknowns vector, the gradient-descent method updates the unknowns according to the following rule

p_{k+1} = p_k − α ∇MSE(p_k)

where p_k is the unknowns vector at the current step, p_{k+1} is the updated unknowns vector, ∇MSE(p_k) is the current gradient value and α is the so-called learning rate. The learning rate is a user-chosen parameter: it represents how far we move in the unknowns space along the direction opposite to the gradient, and it should be chosen small enough for the iterations to converge, yet large enough to observe significant changes in the functional value.

The iterations can be quit according to different stopping criteria. For example:

  • the algorithm is terminated once a specified number of iterations is reached;
  • the algorithm is terminated once the MSE falls below a specified threshold;
  • the algorithm is terminated if the MSE essentially stops decreasing; for example, if the difference between two successive MSEs is less than 0.001, then the algorithm is stopped.
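To make the update rule and the last stopping criterion concrete, here is a minimal plain-NumPy sketch; the data values, the learning rate and all variable names are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical one-dimensional observations (x, f)
x = np.array([1.0, 2.0, 3.0, 4.0])
f = np.array([2.1, 3.9, 6.2, 7.8])

m, b = 0.0, 0.0   # initial guess for the unknowns
alpha = 0.01      # learning rate (illustrative value)
prevMSE = np.inf

for _ in range(10000):
    y = m * x + b
    mse = np.mean((y - f) ** 2)
    # stop when the decrease between two successive MSEs is small
    if prevMSE - mse < 0.001:
        break
    prevMSE = mse
    # gradient of the MSE with respect to m and b
    gradM = 2.0 * np.mean((y - f) * x)
    gradB = 2.0 * np.mean(y - f)
    # gradient-descent update: move opposite to the gradient
    m -= alpha * gradM
    b -= alpha * gradB
```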

In the case when the observations are multi-dimensional instead of one-dimensional, the experimental observations become vectors f_n, while the model becomes

y = A x + b

where y is the vector of dependent variables, A is a coefficient matrix, b is the offset vector and x is the vector of independent variables.

In the next section, we will see how to put this theory into practice using TensorFlow 2.x.

Linear Regression: Practice

Turning to the code, the first operations performed are the imports:
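The import section can be sketched as follows:

```python
# TensorFlow for the model variables, autodiff and the optimizer
import tensorflow as tf
# NumPy for array handling and random number generation
import numpy as np
# Matplotlib's pyplot module for the final plot
import matplotlib.pyplot as plt
```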

The TensorFlow library is imported as tf, the NumPy library as np and the pyplot module of the Matplotlib library as plt. The NumPy library is used to manage arrays and for random number generation, while the Matplotlib library is used for the final plot.

The next operation is to provide a short name for the random module of the NumPy library:
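For instance, assuming the alias is called rng (the post does not show the chosen name, so this is an assumption):

```python
import numpy as np

# Short alias for NumPy's random module
rng = np.random
```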

Afterwards, we define the simulation parameters, namely, the learning rate alpha, the number of iterations of the gradient descent numIter, and skipIter. The idea is to output the simulation state (iteration number, cost function and current values of the unknowns) every skipIter iterations:
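A possible definition of the parameters; the specific numerical values are assumptions, chosen only for illustration:

```python
alpha    = 0.001  # learning rate of the gradient descent
numIter  = 2000   # total number of gradient-descent iterations
skipIter = 200    # report the simulation state every skipIter iterations
```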

At this point, we need to define the training dataset, namely, the (“poverty level” X, “birth rate” Y) pairs:
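A sketch of the dataset definition; the numbers below are placeholders of a plausible scale, NOT the actual values, which should be taken from the reference cited next:

```python
import numpy as np

# Placeholder (poverty level, birth rate) pairs, for illustration only
X = np.array([ 5.3,  8.0, 10.1, 12.5, 15.5, 17.0, 20.1], dtype=np.float32)
Y = np.array([14.0, 18.5, 21.0, 26.0, 31.5, 33.0, 39.0], dtype=np.float32)
```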

The above data have been taken from:

J.M. Utts, R.F. Heckard, Mind on Statistics, Fifth Ed., Cengage Learning, Stamford, CT, 2015

Two TensorFlow variables, m and b, are then defined and initialized to random values drawn from a Gaussian distribution with unit variance. These two variables store the current values of the unknowns:
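A minimal sketch of the variable definitions:

```python
import numpy as np
import tensorflow as tf

# Unknowns initialized from a standard normal distribution (unit variance)
m = tf.Variable(np.random.randn(), dtype=tf.float32, name="slope")
b = tf.Variable(np.random.randn(), dtype=tf.float32, name="intercept")
```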

It is now necessary to define two functions: the first one represents the linear model, while the second one represents the MSE functional.
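A minimal sketch of the two functions, assuming m and b are the TensorFlow variables defined above; the names linearModel and MSE are assumptions:

```python
import tensorflow as tf

m = tf.Variable(0.5)  # assumed defined as above (random values in the post)
b = tf.Variable(0.0)

def linearModel(x):
    # linear model y = m * x + b
    return m * x + b

def MSE(yPred, yTrue):
    # mean squared error between model predictions and observations
    return tf.reduce_mean(tf.square(yPred - yTrue))
```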

Following the definition of the two functions specifying the linear regression method, the optimizer is chosen and set to be the stochastic gradient descent with a learning rate equal to alpha:
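Assuming the learning rate alpha defined among the simulation parameters, the optimizer choice can be sketched as:

```python
import tensorflow as tf

alpha = 0.001  # learning rate, assumed defined earlier
# Stochastic gradient descent optimizer
optimizer = tf.optimizers.SGD(learning_rate=alpha)
```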

Without going into the details, let us just mention that, with the setting in the above code snippet (the full dataset is used at every step), the stochastic gradient descent optimizer reduces to the classical gradient descent.

Let us now show the optimization loop:
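Putting the pieces together, the loop may be sketched as follows; all names and data values are the assumed, illustrative ones used in the previous sketches, not the post’s actual code:

```python
import numpy as np
import tensorflow as tf

# Assumed definitions from the previous sketches (illustrative values)
X = np.array([ 5.3,  8.0, 10.1, 12.5, 15.5, 17.0, 20.1], dtype=np.float32)
Y = np.array([14.0, 18.5, 21.0, 26.0, 31.5, 33.0, 39.0], dtype=np.float32)
alpha, numIter, skipIter = 0.001, 2000, 200
m = tf.Variable(np.random.randn(), dtype=tf.float32)
b = tf.Variable(np.random.randn(), dtype=tf.float32)
optimizer = tf.optimizers.SGD(learning_rate=alpha)

def linearModel(x):
    return m * x + b

def MSE(yPred, yTrue):
    return tf.reduce_mean(tf.square(yPred - yTrue))

def optimizationStep():
    # forward evaluation recorded on the tape, then one gradient update
    with tf.GradientTape() as g:
        loss = MSE(linearModel(X), Y)
    gradients = g.gradient(loss, [m, b])
    optimizer.apply_gradients(zip(gradients, [m, b]))

initialLoss = float(MSE(linearModel(X), Y))
for k in range(1, numIter + 1):
    optimizationStep()
    if k % skipIter == 0:
        # visualization of the current optimization state
        print(f"iter {k}: MSE = {float(MSE(linearModel(X), Y)):.4f}, "
              f"m = {float(m):.4f}, b = {float(b):.4f}")
```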

As can be seen, it is made up of two parts. The first is the generic optimization step optimizationStep(). The second deals with the visualization, every skipIter iterations, of the current optimization result corresponding to the current values of the estimates Y. The optimization step is provided by the function
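A sketch of such a function, in isolation and under the same naming assumptions as above:

```python
import numpy as np
import tensorflow as tf

# Assumed to have been defined earlier (illustrative values)
X = np.array([5.3, 10.1, 15.5, 20.1], dtype=np.float32)
Y = np.array([14.0, 21.0, 31.5, 39.0], dtype=np.float32)
m = tf.Variable(np.random.randn(), dtype=tf.float32)
b = tf.Variable(np.random.randn(), dtype=tf.float32)
optimizer = tf.optimizers.SGD(learning_rate=0.001)

def optimizationStep():
    # the tape records the operations of the forward evaluation of the loss...
    with tf.GradientTape() as g:
        loss = tf.reduce_mean(tf.square(m * X + b - Y))
    # ...so that it can return the derivatives with respect to m and b
    gradients = g.gradient(loss, [m, b])
    # one gradient-descent update of the unknowns
    optimizer.apply_gradients(zip(gradients, [m, b]))
```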

It should be noticed that the optimization step also accounts for the gradient calculation by automatic differentiation. In particular, the with tf.GradientTape() construct records all the operations performed during the forward evaluation of the functional. In this way, the tape can later return the derivatives of the loss function with respect to the weight and the bias. This information is passed to optimizer.apply_gradients, which updates the unknowns using the gradient information.

The result of the optimization is illustrated in the following figure:

Linear regression results on the observation of birth rates against poverty level.

The full code is available at our GitHub page.


