LOS 8.i Multiple Regression and ANOVA

archived_user · Jul 21, 2019

Harrogath wrote:
R2 is the relevant measure for explanatory power. On the other hand, R2 adjusted is meant for comparing models with different numbers of independent variables.

The latter is the penalized version of the former. It penalizes a model that is too complex relative to the sample size and relative to the reduction in error variation it achieves.
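For reference, these are the usual textbook definitions being discussed, with n observations and k independent variables (the notation here is assumed, not taken from the posts above):

```latex
R^2 = 1 - \frac{SSE}{SST},
\qquad
\bar{R}^2 = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)}
          = 1 - \left(1 - R^2\right)\frac{n-1}{n-k-1}
```

The factor (n-1)/(n-k-1) is the penalty: it grows as k grows for a fixed n, and it approaches 1 as n grows for a fixed k.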

Harrogath wrote: In a single regression (whatever the number of variables) we look at R2. If I want to compare two or more models, then I look at R2 adjusted. I see you are misrepresenting the relevance of R2 or thinking that an adjusted measure is superior by definition. Remember that R2 adjusted is derived from R2 and will always be lower than R2, no matter what.

I’m not misleading anyone by introducing the work and guidance of well-trained practicing statisticians. Being outside of your scope doesn’t make it incorrect or misleading.

Harrogath wrote: Also, you would never see a big difference between R2 and R2 adjusted in a parsimonious model with a good sample size.

No one is arguing this is the case. The adjustment in R-squared penalizes for a complicated model relative to the sample size (read that as lots of terms with small sample); so this encourages parsimony.

Harrogath wrote: If you are talking about models built at the edge of the assumptions, then you may be right. Also, I don’t know what counts for you as a big difference between R2 and R2 adjusted. As you saw in my simulation, going from 3 to 50 variables, R2 drops 15 percentage points when adjusted. That difference is bad.

I have literally seen absolute differences of 5-20 percentage points with only a few variables.
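To put numbers on how large the gap can be, here is a minimal sketch that applies the standard adjusted R-squared formula. The (n, k, R2) inputs are hypothetical values chosen for illustration; they do not come from either poster's data.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared for n observations and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Hypothetical inputs, for illustration only
for n, k, r2 in [(20, 4, 0.50), (30, 5, 0.40), (500, 4, 0.50)]:
    adj = adjusted_r2(r2, n, k)
    print(f"n={n:>3} k={k} R2={r2:.2f} adjusted={adj:.3f} gap={r2 - adj:.3f}")
```

With these inputs the gap is about 13 points at n=20, 12.5 points at n=30, and under half a point at n=500, so a handful of variables is enough to produce a double-digit gap when the sample is small and R-squared is moderate.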

Harrogath wrote: Again, I don’t know why you assume R2 adjusted is better than R2 because it penalizes for “junk”. This is not true, R2 captures junk through SSE. Junk is detected when T-cal’s are not statistically significant, when F-cal is not statistically significant, etc.

R-squared doesn’t penalize for junk; R-squared never decreases (and the sum of squared errors never increases) as you add more terms to the model, irrespective of their statistical utility. This follows directly from how the model is fit: by minimizing the sum of squared errors, that is, the sum of (y - yhat)^2, where yhat is b0 + b1X1 + b2X2 + … + bkXk. The more terms in that, the smaller (or at worst equal) the minimized sum of squared errors will be, because the larger model can always reproduce the smaller one by setting the extra coefficients to zero. That increases R-squared. Junk can increase R-squared, but not necessarily the adjusted R-squared.
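The point about nested least-squares fits can be checked directly. The following is a minimal sketch using NumPy; the data-generating model, the seed, and the helper name `r2_and_adjusted` are all assumptions made for illustration.

```python
import numpy as np


def r2_and_adjusted(X, y):
    """Fit OLS with an intercept; return (R2, adjusted R2)."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sse = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - sse / sst
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return r2, adj


rng = np.random.default_rng(0)
n = 30
X = rng.normal(size=(n, 2))  # two "true" predictors (assumed model)
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=2.0, size=n)
junk = rng.normal(size=(n, 1))  # unrelated to y by construction

r2_base, adj_base = r2_and_adjusted(X, y)
r2_junk, adj_junk = r2_and_adjusted(np.column_stack([X, junk]), y)
print(f"without junk: R2={r2_base:.4f} adjusted={adj_base:.4f}")
print(f"with junk:    R2={r2_junk:.4f} adjusted={adj_junk:.4f}")
```

R-squared with the junk column is guaranteed to be at least as large as without it. The adjusted value can move either way; it rises only when the added term's t-statistic exceeds 1 in absolute value.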

Harrogath wrote: If you introduce junk into your model, both R2 and R2 adjusted will be lower.

This just simply isn’t true. You need to review how R-squared is calculated.

Harrogath wrote:
I was talking about the scenario of R2, not other measures. Otherwise, you would be right.

You explicitly said that R-squared was the only way to tell.

Harrogath wrote: Yes, yes. My principal motivation to criticize your comment is that S2000 was right, but instead, you added some other explanations that may be considered misleading, so wanted to clarify.

Again, being unfamiliar with the subject doesn’t mean I’m misleading. It means you have some reading to do before continuing to reply based on personal feeling on the subject. I’m speaking strictly from 1) formal education 2) advice and discussion with real statisticians 3) personal experience and observation 4) not from my feeling on the subject.

Harrogath wrote: Just apply the below formulas:
I share with you my little model, which you can replicate in Excel. Sorry, it seems I typed SSE = 5,060 in the forum when in fact I was running the model with 5,400. Sorry for that. Fixed above.

I’m not asking for the formula; that’s not in contention. You have shown us that you didn’t actually conduct a simulation. A good way to do this would be to generate, say, a set of X,Y data (with two true X variables) that is known to follow some regression model you specify for the simulation. Add some noise to Y so the relationship isn’t deterministic. Then generate a bunch of other, unrelated x-variables with random values (simulating independent variables that are junk). Calculate r-squared and adjusted r-squared from the multivariable model of Y on X1 and X2, since we know this is the real relationship set in our model. Then, adding the junk x-variables one at a time, refit and calculate the new values of r-squared and adjusted r-squared. (Ideally you would run this 5000 times at a range of sample sizes to see what tends to happen.) That is a simulation study. What you have done is create some numbers that don’t account for fitting a new model with the junk terms, which reduces the SSE and produces new r-squared values (which is what would actually happen).
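The study described above could be sketched roughly as follows. This is one possible setup, not a definitive design: the true coefficients, the noise scale, the number of junk variables, the sample sizes, and the 500 repetitions (fewer than the suggested 5000, to keep it quick) are all assumptions.

```python
import numpy as np


def fit_r2(X, y):
    """OLS with intercept; return (R2, adjusted R2)."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return r2, adj


def simulate(n, n_junk=5, reps=500, seed=0):
    """Mean R2 and adjusted R2 as junk regressors are added one at a time."""
    rng = np.random.default_rng(seed)
    r2 = np.zeros((reps, n_junk + 1))
    adj = np.zeros((reps, n_junk + 1))
    for r in range(reps):
        # Two true predictors plus noise (assumed data-generating model)
        X_true = rng.normal(size=(n, 2))
        noise = rng.normal(scale=3.0, size=n)
        y = 1.0 + 2.0 * X_true[:, 0] - 1.5 * X_true[:, 1] + noise
        # Junk predictors: independent of y by construction
        junk = rng.normal(size=(n, n_junk))
        for j in range(n_junk + 1):
            X = np.column_stack([X_true, junk[:, :j]])  # refit with j junk terms
            r2[r, j], adj[r, j] = fit_r2(X, y)
    return r2.mean(axis=0), adj.mean(axis=0)


for n in (15, 30, 200):
    mean_r2, mean_adj = simulate(n)
    print(f"n={n}")
    print("  mean R2      :", np.round(mean_r2, 3))
    print("  mean adjusted:", np.round(mean_adj, 3))
```

Each printed row runs from zero junk variables to `n_junk`. Because every replication refits the model, the mean R-squared cannot fall from left to right, which is the behavior a fixed-SSE spreadsheet calculation does not capture.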

Harrogath wrote:
Well, we are in a finance forum, and we suppose you do too. However, I would accept disciplines like medicine and other researches could use a lot of variables without falling in the field of increasing variance of errors in the search of “good fit”.
At least in economics and finance, the data is not infinite and a parsimonious model will always be preferred. 9 variables for an econ model is a crime. Sorry.
That is the problem: if you are talking about a model at the edge of the assumptions, or even with violated assumptions, then your comments are correct. Otherwise they are not. A sample size of 30 is the minimum possible.

You make so many statements that are emotional and based on feelings, it seems. No one is talking about assumptions, so you’re definitely missing the point with this. Also, you’re ignoring that with less data, R-squared adjusted might be more relevant than with more observations, but you’re pointing out that limited sample size is a concern. You’re staring at food in front of you while saying you’re hungry!

Harrogath wrote: Not narrow, you are just working in fields different from finance. You are always fighting against the CFA program because the curriculum does not teach regression in detail for other disciplines. Sorry, but 99% of financial analysts or investment managers in the entire world will never run a regression estimate in their lives. At most, they will interpret an ANOVA table, if that. CFAI has done its job well enough.

I fight against them because even for finance and econ they do a terrible job. A good econometrics book demonstrates that. They are often flat out incorrect. Fun fact: every ANOVA table has an underlying regression model, so again, poor job by the CFA Institute in demonstrating the equivalent cases. A regression equation likely has far more utility than an ANOVA table, but either way, because ANOVA is a common special case of regression, these criticisms still hold.
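The ANOVA and regression equivalence mentioned here can be illustrated with a small sketch: a one-way ANOVA F-statistic computed from sums of squares, and the overall F-statistic from a regression of the same data on dummy variables. The three groups, their means, and the seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
# Three hypothetical groups of 10 observations each
groups = [rng.normal(loc=m, scale=1.0, size=10) for m in (0.0, 0.5, 1.5)]
y = np.concatenate(groups)
n, g = y.size, len(groups)
grand = y.mean()

# Classical one-way ANOVA: between-group and within-group sums of squares
ss_between = sum(len(v) * (v.mean() - grand) ** 2 for v in groups)
ss_within = sum(((v - v.mean()) ** 2).sum() for v in groups)
f_anova = (ss_between / (g - 1)) / (ss_within / (n - g))

# Same data as a regression on an intercept plus g-1 dummy variables
labels = np.repeat(np.arange(g), 10)
dummies = (labels[:, None] == np.arange(1, g)).astype(float)
design = np.column_stack([np.ones(n), dummies])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
resid = y - design @ beta
sse = resid @ resid                      # equals ss_within
ssr = ((y - grand) ** 2).sum() - sse     # equals ss_between
f_reg = (ssr / (g - 1)) / (sse / (n - g))

print(f"ANOVA F = {f_anova:.6f}, regression F = {f_reg:.6f}")
```

The two F-statistics agree up to floating-point error, because the fitted values of the dummy-variable regression are just the group means.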

Harrogath wrote: Should I consider your selective set of approved books? You may publish the list somewhere.
See above. Yes, the only variable controlled was “k”. My intention was to demonstrate that R2 and R2 adjusted are not much different in a parsimonious model with a reasonable sample size.

My point is to pick a book written by someone with a PhD in stats, rather than the CFA curriculum as your reference text. Almost any would be better than the CFAI book.
Claiming “this is just for finance” doesn’t make it correct when it’s flat out wrong.
I guess I didn’t do a good job of avoiding further discussion. I will ask that before you fire back another post based on feeling on the subject that you actually do some reading on it.
Having an understanding of how a model is fit (minimizing SSE, usually) and how the usual R-squared is calculated will pretty clearly contradict many of your points. This understanding and reading will also allow you to look past your own feeling on the subject.
Your argument is basically that spoons are only used for eating ice cream because cereal isn’t necessarily tastier and because you prefer ice cream to cereal.
Now, I’ll be good and leave it at that.
P.S. If you are genuinely interested in some book suggestions I’ll happily post a few.

archived_user · Jul 21, 2019

I don’t have much time for this, so instead of replying to each of your sentences I would rather pitch one idea:
In the process of modeling a variable there is uncertainty about things like the appropriate model to use and the best set of independent variables. Good research would mean running a trial-and-error algorithm until we get the best possible model from a statistical point of view that is also economically sound (perhaps this doesn’t apply to other disciplines). While the algorithm is running, R2 is the relevant measure. R2 adj may be an appropriate measure when comparing models with different numbers of variables; however, since we tend to comply with the model assumptions (enough data points, parsimonious building, well-behaved errors, etc.), the difference between R2 adj and R2 will never be critical.
In the scenario of exotic models using exotic data or too many proxy variables, a model’s calculated statistics can show tricky differences, as you may have encountered in your practice; however, we shouldn’t generalize from this.
Indeed, R2 never decreases; it will just increase minimally when adding new variables. However, as I said above, in the search for a parsimonious model and under the trial-and-error algorithm, we must replace variables with new ones in the expectation of a better result. This is why I said R2 captures junk variables better than R2 adjusted. If you look at the R2 adj formula you can see it is a shi.tty adjustment. That is where my comment about the “simulation” I made of R2 vs R2 adj comes in. They will never differ much in a parsimonious model with a good data set, and that is all I wanted to demonstrate. Your proposed simulation is indeed easy to perform but laborious, too much for a simple post. If someday I get good data I will run that simulation and share the results with you. The main sensitive controls would be changing the size of the data (keeping representativeness correct) and the number of variables added without removing any of the previous ones, versus changing the size of the data and replacing existing variables (keeping to the parsimony rule). What does your intuition say about that?
I can assure you that I’m not talking from feelings just because I use words like “would” or “believe”. They are just soft words so as not to sound too aggressive.
Also, I’m not a specialist in the subject, but I can tell you that you may be overcomplicating things for yourself. I will say (again) that the CFAI books are not that bad for their target audience: financial analysts. I do know the curriculum has some theory mistakes and leaves out some definitions and explanations, but that was probably done on purpose because of the scope and length. If you look at the curriculum in detail (at each level) you will realize that nobody will ever become a specialist in any subject just by reading some 300-page books; that is total nonsense. I tell you this because I can see in you some of the hate that torments some people when they are disappointed about something. I don’t expect much from getting the CFA charter; it is just a certification. I will change my own career and life by myself, learning and practicing everything, not by relying on an education program.
Cheers.
