De-mystifying Linear Regression Part 2


Introduction

In our last post, we introduced simple linear regression and showed how to fit a model using the least squares method, working from a purely mathematical perspective. In this post, we will take the model from that post and use it to make predictions. Then we will discuss some pros and cons of linear regression. The assumptions of a linear model, diagnosing the fit of a model, and fitting the model computationally will be left to future blog posts.

Recap

In the first post in this blog series, we considered a hypothetical heating and cooling repair company and used its data on hours spent on the job to model the cost of a job based on the number of hours it requires. We expect our client to plug in a job hours estimate and get out an estimated cost. The actual cost could vary based on labor and parts, which is a discussion for a different time, but for the purpose of this conversation, we are just going to use job hours.

As a reminder, here is a plot of the data:

[Figure: scatter plot of job cost versus job hours]

We then fit a simple linear regression model using least squares, which gives the equation:

$\hat{y} = 12.6179 + 98.0551x$

When we plot this line over the data, we get:

[Figure: the fitted regression line plotted over the data]

The steps to fit this line are discussed in the last post, but we can see that the line appears to fit the data pretty well. Our model statistics would confirm this; evaluating models will be the subject of a different post.
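For readers who want to reproduce the fit in software, here is a minimal sketch of the least-squares calculation in Python. The original dataset is not reproduced in this post, so the (hours, cost) values below are hypothetical numbers chosen to roughly follow the fitted line; swapping in the company's real data would recover the coefficients above.

```python
import numpy as np

# Hypothetical (job hours, job cost) observations standing in for the
# company's records; the real dataset is not reproduced in this post.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
cost = np.array([115.0, 205.0, 310.0, 398.0, 500.0, 603.0, 695.0, 790.0])

# Least-squares estimates for simple linear regression:
#   beta1 = S_xy / S_xx  and  beta0 = y_bar - beta1 * x_bar
x_bar, y_bar = hours.mean(), cost.mean()
beta1 = np.sum((hours - x_bar) * (cost - y_bar)) / np.sum((hours - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

print(f"y_hat = {beta0:.4f} + {beta1:.4f} * x")
```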

Making Predictions

Usually in simple linear regression, the model would be evaluated before predictions are made; however, this writing is just meant to provide an overview of simple linear regression and to show the predictive nature of linear regression models.

Now the company has their model built, so the first thing they want to do is take it for a test spin and estimate some job costs.

Using the linear regression function, making a prediction is as simple as plugging in x. Recall our equation is:

$\hat{y} = 12.6179 + 98.0551x$

So if we wanted to predict the cost of a 4.5 hour job, we would plug in 4.5 for x:

$\hat{y} = 12.6179 + 98.0551(4.5)$

Which gives:

453.87, rounding to two decimal places.

Since our cost for a 4 hour job is around $400 and the cost for a 5 hour job is around $500, this estimate seems very logical.
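As a quick sketch, here is the same substitution in Python, hard-coding the coefficients from the fitted equation above (predict_cost is just an illustrative helper name, not part of any library):

```python
# Coefficients taken from the fitted equation in this post.
BETA0 = 12.6179
BETA1 = 98.0551331

def predict_cost(job_hours: float) -> float:
    """Predicted cost of a job given its estimated hours."""
    return BETA0 + BETA1 * job_hours

print(round(predict_cost(4.5), 2))  # -> 453.87
```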

Plotting this estimate on the graph in red gives:

Pic3.png

Using our gut judgment, that value seems very plausible. It's probably pretty unlikely that this exact value will be the actual cost, and many people would use a prediction interval to show the range of plausible predictions for this value. However, in real life, an actual number is required to estimate a response. The point prediction falls at the center of that prediction interval, giving a generalized value, almost like taking the average of the range of possible prediction values.
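For reference, the standard prediction interval for a new observation at $x_0$ in simple linear regression is

$$\hat{y}_0 \pm t_{\alpha/2,\,n-2}\; s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}},$$

where $s$ is the residual standard error and $S_{xx} = \sum_i (x_i - \bar{x})^2$. Computing it requires diagnostics we have deferred to a later post, so we will stick with the point prediction here.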

What about values that are outside of the range of data we have? What if the business wants to know how much a 9 hour job or a 12 hour job would cost? We can run these values through our model as the x variable in the same fashion. This gives:

The estimated cost for a 9 hour job is 895.1141

The estimated cost for a 12 hour job is 1189.279
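Using the same hypothetical predict_cost helper from the sketch above:

```python
for h in (9, 12):
    print(f"{h} hour job: {predict_cost(h):.2f}")
# -> 9 hour job: 895.11
# -> 12 hour job: 1189.28
```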

These estimates sound plausible according to our model, but there is a theoretical question here. Do we really expect the same data trend to continue for jobs that long? For example, both jobs exceed 8 hours, so wouldn't there be additional costs for a job spanning two days? These questions are really unanswerable without data, but industry experience can be used to supplement the lack of data. If the experts in this situation think the trend would remain consistent, then this extrapolation would be fine. Many statisticians would argue against extrapolating a model, but if the extrapolated pattern can be trusted, there is nothing wrong here. However, if the company is unsure whether the pattern would hold, the data should not be extrapolated, and the model should only be used as a guideline, if at all.

Breaking Down the Equations and Results

Recall that our fitted model is:

$\hat{y} = 12.6179 + 98.0551x$

In this situation we can translate our coefficients into actual values that mean something. First, since $\beta_0$ represents the cost when no work is done, we can think of it as the fixed-cost portion of the model. This means a job costs about $12.62 regardless of the work done. This would cover pieces of the business not dependent on the job, such as the company-owned building.

Next, the $\beta_1$ coefficient, which is 98.0551331, represents the rate of change, the completely variable portion of the cost. This number means a typical service job costs around $98 an hour, including parts and labor. This piece of the cost depends on the length of the job, which is why it is the variable piece.
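To make the fixed/variable split concrete, the earlier 4.5 hour prediction decomposes as

$$\underbrace{12.6179}_{\text{fixed cost}} \;+\; \underbrace{98.0551331 \times 4.5}_{\text{variable cost}\;=\;441.2481} \;\approx\; 453.87.$$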

It is easy to see why this information is useful to our business. While this is a simplified example of how a business can make use of a linear model, it is an effective method. The business does not need to know the exact cost of parts and labor; it just needs to provide an estimate of hours.

Pros and Cons of Simple Linear Regression

There are a lot of reasons why linear regression is a good model to use in predictive analytics and machine learning.

1. The model is interpretable - With the simple example, the model and the coefficients make sense. They are easy to understand, easy to explain, and straightforward to use for prediction. This makes linear regression one of the best models to use, even though more accurate options may exist.

2. The model has low variance - This means that as data points are replaced and the training data changes, the model will only shift slightly. This gives a consistent result from model to model, with a small amount of change between each fit. Low variance is a desired quality in machine learning.

3. Linear regression can be expanded to use multiple parameters - Multiple linear regression takes multiple input parameters and can produce an even more accurate estimate. In our example, there are probably other input parameters that affect the cost of a job, so once captured they can be used to predict a more accurate cost. Note that some relationships are quadratic or cubic; these can also be fit using linear regression, as sketched below.
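As a hypothetical sketch of that last point: a quadratic model is still linear in its coefficients, so ordinary least squares can fit it once an $x^2$ term is added. numpy's polyfit does exactly this:

```python
import numpy as np

# Hypothetical data following a quadratic trend plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 + 3 * x + 0.5 * x**2 + rng.normal(scale=2.0, size=x.size)

# A degree-2 polynomial fit is still linear regression: the model stays
# linear in its coefficients even though the fitted curve bends in x.
c2, c1, c0 = np.polyfit(x, y, deg=2)  # highest-degree coefficient first
print(f"y_hat = {c0:.2f} + {c1:.2f} * x + {c2:.2f} * x^2")
```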

While there are some pros to linear regression, it also has its shortcomings:

1. It is inflexible - Linear regression fits a straight line to a dataset. Very few real world relationships are purely linear. There are usually more complicated relationships between variables, with curves and shifts in the data that cannot be modeled with a simple linear regression function.

2. It exhibits high bias - Bias is a statistical property that indicates how much error is introduced by trying to model a complex situation with a simple model. It is the average of the differences between the actual values and the predicted values, which should come out to 0 in a perfect model. A simple linear model may exhibit low variance, but it will produce high bias when it cannot accommodate patterns that emerge in the data. Ideally, both variance and bias should be minimized. The high bias results from the inflexibility of the model.
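Written as a formula, the average-error notion of bias described above is

$$\text{bias} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right),$$

which is most informative when evaluated on new data, since least-squares residuals sum to exactly zero on the data used to fit the model.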

Conclusion

Simple linear regression is a great tool. It is often used within other machine learning models and neural networks. The machine learning community has downplayed its importance, but if the model fits and predicts well, then use it. A simple model that can be explained is often more valuable than a complex model that cannot be explained. For more complicated situations, multiple linear regression can be valuable as well. What we did not cover in this writing was how to determine whether the model is adequate. We will discuss this in a different post on linear regression model adequacy. We will also discuss multiple linear regression in a future post.