Regression: Crash Course Statistics #32


“I'm Adriene Hill, and welcome back to Crash Course Statistics. There's something to be said for flexibility. It allows you to adapt to new circumstances, like a Transformer is a truck, but it can also be an awesome fighting robot. Today we'll introduce you to one of the most flexible statistical tools: the General Linear Model, or GLM. The GLM will allow us to create many different models to help describe the world. The first we'll talk about is the regression model.

[Intro]

General Linear Models say that your data can be explained by two things: your model, and some error. First, the model. It usually takes the form y = mx + b, or rather y = b + mx in most cases.

Say I want to predict the number of trick-or-treaters I'll get this Halloween by using enrollment numbers from the local middle school. I have to make sure I have enough candy on hand.

I expect a baseline of 25 trick-or-treaters, and then for every middle school student, I'll increase the number of trick-or-treaters I expect by 0.01. So this would be my model: trick-or-treaters = 25 + 0.01 × middle school students. There were about 1,000 middle school students nearby last year, so based on my model, I predicted that I'd get 35 trick-or-treaters.

But reality doesn't always match predictions. When Halloween came around, I got 42, which means that the error in this case was 7. Now, error doesn't mean that something's wrong per se.
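The arithmetic of this example can be written out as a small sketch. The numbers (25, 0.01, 1,000 and 42) come from the transcript; the function and variable names are made up for illustration.

```python
# Model from the transcript: a baseline of 25 plus 0.01 per middle school student.
BASELINE = 25
PER_STUDENT = 0.01


def predict_trick_or_treaters(students):
    """Model part of 'data = model + error': baseline + slope * input."""
    return BASELINE + PER_STUDENT * students


predicted = predict_trick_or_treaters(1000)  # about 1,000 students nearby
observed = 42                                # how many kids actually showed up
error = observed - predicted                 # deviation of the data from the model

print(f"predicted={predicted:.0f}, error={error:.0f}")
```

The error here is simply the observed value minus the model's prediction, which is the same quantity the transcript later calls a residual.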

We call it error because it's a deviation from our model. So the data isn't wrong. The model is. And these errors can come from many sources, like variables we didn't account for in our model, including the candy-crazed kindergartners from the elementary school, or just random variation.

Models allow us to make inferences, whether it's the number of kids on my doorstep at Halloween or the number of credit card frauds committed in a year. General Linear Models take the information that data give us and portion it out into two major parts: information that can be accounted for by our model, and information that can't be. There's many types of GLMs.

One is linear regression, which can also provide a prediction for our data. But instead of predicting our data using a categorical variable like we do in a t-test, we use a continuous one. For example, we can predict the number of likes a trending YouTube video gets based on the number of comments that it has. Here the number of comments would be our input variable and the number of likes our output variable. Our model will look something like this: likes = b + m × comments.

The first thing we want to do is plot our data from 100 videos. This allows us to check whether we think that the data is best fit by a straight line, and look for outliers. Those are points that are really extreme compared to the rest of our data. These two points look pretty far away from our data, so we need to decide how to handle them. We covered outliers in a previous episode, and the same rules apply here. We're trying to catch data that doesn't belong. Since we can't always tell when that happened, we set a criteria for what an outlier is and stick to it.

One reason that we're concerned with outliers in regression is that values that are really far away from the rest of our data can have an undue influence on the regression line. Without this extreme point, our line would look like this. But with it, like this. That's a lot of difference for one little point.
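To make the pull of a single extreme point concrete, here is a minimal sketch that fits a least-squares line with and without one outlier. The data are invented for illustration (they are not the video's 100 videos), and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Invented (comments, likes) pairs that sit exactly on likes = 50 + 7 * comments.
comments = [10, 20, 30, 40, 50]
likes = [120, 190, 260, 330, 400]

_, slope_clean = fit_line(comments, likes)

# Same data plus one extreme point far above the rest.
_, slope_with_outlier = fit_line(comments + [60], likes + [5000])

print(f"slope without outlier: {slope_clean:.1f}")
print(f"slope with outlier:    {slope_with_outlier:.1f}")
```

One added point out of six is enough to change the slope many times over, which is why the criteria for handling outliers should be fixed before looking at the fit.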


There's a lot of different ways to decide, but in this case, we're gonna leave them in. One of the assumptions that we make when using linear regression is that the relationship is linear. So if there's some other shape our data takes, we may want to look into some other models. This plot looks linear.

So we'll go ahead and fit our regression model. Usually a computer is going to do this part for us, but we want to show you how this line fits. A regression line is the straight line that's as close as possible to all the data points at once. That means that it's the one straight line that minimizes the sum of the squared distance of each point to the line.

The blue line is our regression line. Its equation looks like this: likes = 9104 + 6.468 × comments. This number, the y-intercept, tells us how many likes we'd expect a trending video with zero comments to have. Often the intercept might not make much sense. In this model, it's possible that you could have a video with 0 comments, but a video with 0 comments and 9,104 likes does seem to conflict with our experience on YouTube.

The slope, aka the coefficient, tells us how much our likes are determined by the number of comments. Our coefficient here is about 6.5, which means that on average, an increase in 1 comment is associated with an increase of about 6.5 likes.
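The fitting step that "a computer is going to do for us" can be sketched in a few lines. This is one standard way to compute the least-squares intercept and slope, with distance measured vertically from each point to the line. The data are invented, and the helper names `fit_line` and `sum_of_squared_distances` are assumptions for this sketch, not anything from the video.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def sum_of_squared_distances(xs, ys, intercept, slope):
    """Total squared vertical distance from the points to a candidate line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Invented (comments, likes) pairs for illustration, not the video's data.
comments = [12, 25, 31, 48, 60, 77]
likes = [150, 240, 250, 390, 430, 560]

intercept, slope = fit_line(comments, likes)
best = sum_of_squared_distances(comments, likes, intercept, slope)

# Nudging either number away from the fitted value makes the total larger.
nudged_slope = sum_of_squared_distances(comments, likes, intercept, slope + 0.5)
nudged_intercept = sum_of_squared_distances(comments, likes, intercept + 10, slope)

print(f"intercept={intercept:.1f}, slope={slope:.2f}")
print(best < nudged_slope, best < nudged_intercept)
```

The intercept is the predicted number of likes at zero comments, and the slope is the predicted change in likes for one additional comment, matching the interpretation given above.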

But there's another part of the General Linear Model: the error. Before we go any further, let's take a look at these errors, also called residuals.
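A residual is the observed value minus the value the fitted line predicts. The sketch below computes them for a small invented data set; the names and numbers are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Invented (comments, likes) pairs for illustration, not the video's data.
comments = [12, 25, 31, 48, 60, 77]
likes = [150, 240, 250, 390, 430, 560]

intercept, slope = fit_line(comments, likes)
predicted = [intercept + slope * x for x in comments]

# Residual = observed likes - predicted likes, one per video.
residuals = [y - y_hat for y, y_hat in zip(likes, predicted)]

for x, r in zip(comments, residuals):
    print(f"comments={x:3d}  residual={r:+.1f}")
```

Plotting these residuals against the number of comments gives the residual plot discussed next. For a least-squares line with an intercept, the residuals sum to zero up to rounding, so what matters is how they are spread, not their average.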

The residual plot looks like this, and we can tell a lot by looking at its shape. We want a pretty evenly spaced cloud of residuals. Ideally, we don't want them to be extreme in some areas and close to 0 in others. It's especially concerning if you can see a weird pattern in your residuals like this, which would indicate that the error of your predictions is dependent on how big your predictor variable value is. That would be like if our YouTube model was pretty accurate at predicting the number of likes for videos with very few comments, but was wildly inaccurate on videos with a lot of comments.

So now that we've looked at this error, this is where statistical tests come in. There are actually two common ways to do a null hypothesis significance test on a regression coefficient. Today we'll cover the F-test. The F-test, like the t-test, helps us quantify how well we think our data fit a distribution, like the null distribution. Remember, the general form of many test statistics is this. But I'm going to make one small tweak to the wording of our general formula to help us understand F-tests a little better.

The null hypothesis here is that there's no relationship between the number of comments on a trending YouTube video and the number of likes. If that were true, we'd expect a kind of blob-y, amorphous-cloud-looking scatter plot and a regression line with a slope of 0.

It would mean that the number of comments wouldn't help us predict the number of likes. We'd just predict the mean number of likes no matter how many comments there were.

Back to our actual data. This blue line is our observed model, and the red is the model we'd expect if the null hypothesis were true. Let's add some notation so it's easier to read our formulas. Y-hat looks like this, and it represents the predicted value for our outcome variable. Here, it's the predicted number of likes. Y-bar looks like this, and it represents the mean value of likes in this sample.
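On-screen formulas don't survive in a transcript, so here is one common way to write the two symbols. It reuses b and m from the y = b + mx form given at the start; the index i (one video) and n (the number of videos in the sample) are notation assumed for this sketch.

```latex
\hat{y}_i = b + m\,x_i
\qquad\qquad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
```

Under the null hypothesis the slope m is 0, so the null model predicts the same value, \bar{y}, for every video.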

Taking the squared difference between each data point and the mean line tells us the total variation in our data set. This might look similar to how we calculated variance, because it is. Variance is just this sum of squared deviations, called the Sum of Squares Total, divided by N. And we want to know how much of that total variation is accounted for by our regression model, and how much is just error. That would allow us to follow the General Linear Model framework and explain our data with two things: the model's prediction, and error.

We can look at the difference between our observed slope coefficient, 6.468, and the one we'd expect if there were no relationship, 0, for each point. And we'll start here with this point. The green line represents the difference between our observed model, which is the blue line, and the model that would occur if the null were true, which is the red line. And we can do this for every point in the data set. We want negative differences and positive differences to count equally, so we square each difference so that they're all positive. Then we add them all up to get part of the numerator of our F-statistic.


The numerator has a special name in statistics. It's called the Sums of Squares for Regression, or SSR for short. Like the name suggests, this is the sum of the squared distances between our regression model and the null model. Now we just need a measure of average variation. We already found a measure of the total variation in our sample data, the Total Sums of Squares, and we calculated the variation that's explained by our model.

The other portion of the variation should then represent the error, the variation of data points around our model, shown here in orange. The sum of these squared distances are called the Sums of Squares for Error (SSE). If data points are close to the regression line, then our model is pretty good at predicting outcome values like likes on trending YouTube videos, and so our SSE will be small. If the data are far from the regression line, then our model isn't too good at predicting outcome values and our SSE is going to be big.

Alright, so now we have all the pieces of our puzzle: Total Sums of Squares, Sums of Squares for Regression, and Sums of Squares for Error. Total Sums of Squares represents all the information that we have from our data on YouTube likes. Sums of Squares for Regression represents the proportion of that information that we can explain using the model we created. And Sums of Squares for Error represents the leftover information, the portion of Total Sums of Squares that the model can't explain. So the Total Sums of Squares is the sum of SSR and SSE.

Now we've followed the General Linear Model framework and taken our data and portioned it into two categories: regression model, and error. And now that we have the SSE, our measurement of error, we can finally start to fill in the bottom of our F-statistic. But we're not quite done yet. The last and final step to getting our F-statistic is to divide each sums of squares by their respective degrees of freedom.
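The three sums of squares described above can be computed directly, which also lets us check that the total splits into a model part and an error part. This is a minimal sketch on invented data; the variable names `sst`, `ssr` and `sse` are shorthand chosen here.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Invented (comments, likes) pairs for illustration, not the video's data.
comments = [12, 25, 31, 48, 60, 77]
likes = [150, 240, 250, 390, 430, 560]

intercept, slope = fit_line(comments, likes)
y_bar = sum(likes) / len(likes)                       # null model: always predict the mean
predicted = [intercept + slope * x for x in comments]  # observed model

sst = sum((y - y_bar) ** 2 for y in likes)                        # data vs. mean
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)            # model vs. mean
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(likes, predicted)) # data vs. model

print(f"SST={sst:.1f}  SSR={ssr:.1f}  SSE={sse:.1f}  SSR+SSE={ssr + sse:.1f}")
```

The identity SST = SSR + SSE holds for a least-squares line that includes an intercept; it is not guaranteed for an arbitrary line drawn through the data.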

Remember, degrees of freedom represent the amount of independent information that we have. The Sums of Squares for Error has n, the sample size, minus 2 degrees of freedom. We had 100 pieces of independent information from our data, and we used 1 to calculate the y-intercept and 1 to calculate the regression coefficient. So the Sums of Squares for Error has 98 degrees of freedom. The Sums of Squares for Regression has one degree of freedom, because we're using one piece of independent information to estimate our coefficient, our slope.

We have to divide each sums of squares by its degrees of freedom because we want to weight each one appropriately. More degrees of freedom mean more information. It's like how you wouldn't be surprised that Katie Mack, who has a PhD in astrophysics, can explain more about the planets than someone taking a high school physics class. Of course she can. She has way more information. Similarly, we want to make sure to scale the sums of squares based on the amount of independent information each have.

So we're finally left with this. And using an F-distribution, we can find our p-value: the probability that we'd get an F-statistic as big or bigger than 59.613. Our p-value is super tiny. It's about 0.000 000 000 000 99. With an alpha level of 0.05, we reject the null that there is no relationship between likes and YouTube comments on trending videos.
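Putting the pieces together, the F-statistic for a regression with one predictor is the regression sum of squares divided by its 1 degree of freedom, over the error sum of squares divided by its n − 2 degrees of freedom. The sketch below uses invented sums of squares chosen to give round numbers; they are not the video's values, and `f_statistic` is a hypothetical helper name.

```python
def f_statistic(ssr, sse, n):
    """F = (SSR / df_regression) / (SSE / df_error) for one predictor."""
    df_regression = 1   # one slope estimated
    df_error = n - 2    # n points minus the intercept and the slope
    mean_square_regression = ssr / df_regression
    mean_square_error = sse / df_error
    return mean_square_regression / mean_square_error


# Invented sums of squares for a sample of 100 videos (98 error degrees of freedom).
f_value = f_statistic(ssr=450.0, sse=882.0, n=100)
print(f"F = {f_value:.1f}")
```

Turning an F-statistic into a p-value requires the upper tail of an F-distribution with (1, n − 2) degrees of freedom. If SciPy is available, `scipy.stats.f.sf(f_value, 1, 98)` is one way to get it.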

So we reject that the true coefficient for the relationship between likes and comments on YouTube is 0. The F-statistic allows us to directly compare the amount of variation that our model can and cannot explain. When our model explains a lot of variation, we consider it statistically significant. And it turns out, if we did a t-test on this coefficient, we'd get the exact same p-value.
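The link between the two tests can be checked numerically: with a single predictor, the t-statistic for the slope (the slope divided by its standard error), once squared, equals the F-statistic. This is a minimal sketch on invented data. The standard-error formula used is the usual one for simple linear regression, and all names are chosen for illustration.

```python
import math

# Invented (comments, likes) pairs for illustration, not the video's data.
comments = [12, 25, 31, 48, 60, 77]
likes = [150, 240, 250, 390, 430, 560]

n = len(comments)
x_bar = sum(comments) / n
y_bar = sum(likes) / n
sxx = sum((x - x_bar) ** 2 for x in comments)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(comments, likes))

slope = sxy / sxx
intercept = y_bar - slope * x_bar
predicted = [intercept + slope * x for x in comments]

ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(likes, predicted))
mean_square_error = sse / (n - 2)

f_stat = (ssr / 1) / mean_square_error

# Standard error of the slope in simple linear regression.
slope_se = math.sqrt(mean_square_error / sxx)
t_stat = slope / slope_se

print(f"t^2 = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
```

The two printed numbers agree up to floating-point rounding.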

That's because these two methods of hypothesis testing are equivalent. In fact, if you square our t-statistic, you'll get our F-statistic. And we're going to talk more about why F-tests are important later.

Regression is a really useful tool to understand. Scientists, economists, and political scientists use it to make discoveries and communicate those discoveries to the public. Regression can be used to model the relationship between increased taxes on cigarettes and the average number of cigarettes people buy, or to show the relationship between peak heart rate during exercise and blood pressure.

Not that we're able to use regression alone to determine if it causes changes. But more abstractly, we learned today about the General Linear Model framework. What happens in life can be explained by two things: what we know about how the world works, and error, or deviations from that model. Like say you budgeted $30 for gas and only ended up needing $28 last week.

The reality deviated from your guess, and now you get to go to the Blend Den again. Or just how angry your roommate is that you left dishes in the sink can be explained by how many days you left them out, with a little wiggle room for error depending on how your roommate's day was.”

