Posted on 4:36 PM by 4 8 15 16 23 42 and filed under

Last semester I once again had a project for my Statistical Inference course: performing a multiple regression analysis on a dataset. In this post, I will share my notes on finding the data, the steps of the study, and software selection.

1.        Finding the Data:

At the beginning I thought it was a great opportunity to investigate global warming. In regression analysis, you try to explain one variable as a function of others and make inferences. So you have several predictor variables (the ones you use to predict the response variable) and one response variable (the one you are interested in).
Coming back to the global warming example, my initial idea was to predict temperature from the CO2 level, the deforestation level, etc. Why didn’t I do it? Maybe “searching for hours for these data and not finding what I wanted” is the reason.
The challenging part of finding data was that each variable comes from a different source. You have to collect official data from different sources, since in most cases they are not published together. I could find temperature data and I was close to finding CO2 data, but there were 2 more problems:

1.       The data needs to be organized / Time-Consuming + Time-Consuming + Time-Consuming
2.       I needed more predictor variables (10 would be enough, but I had only 2), since in the analysis you decide whether to keep or omit some variables.

So I gave up on my global warming idea. As a subject it looked pretty interesting, but given the problems above, finding and organizing the data seemed too challenging. So what did I do? I found a sample dataset.
Collecting and organizing the data yourself is not really the purpose of regression analysis. In most cases you are supposed to apply regression techniques to a dataset, whatever that dataset is. It turns out there are statistics websites that offer datasets for exactly this purpose. The great part is that they offer many datasets, and you can pick the one you are interested in.
http://lib.stat.cmu.edu/ is an example of such a website (the one where I found my dataset). The datasets there are not made up to give nice results; they are taken from past studies or research papers.
http://bus.utk.edu/Stat/DataMining/files.htm : This site also provides datasets which you may use for regression. I got the impression that the SENIC dataset is a very famous one and could be used without hesitation.
You may look at these websites also:

2.        Steps of Multiple Linear Regression Analysis


Test the Significance of Regression: It’s a basic F test. You run the test to check whether the regression really explains a significant amount of the variance in the response variable.
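
Here is a minimal sketch of that F test, assuming an ordinary least squares fit with an intercept. Python with NumPy is used only for illustration (it is not one of the packages discussed in this post), the data is synthetic, and the function name `regression_f_statistic` is made up for the example.

```python
import numpy as np

def regression_f_statistic(X, y):
    """Return (F, df_regression, df_error) for the overall regression F test."""
    n, p = X.shape
    # Add an intercept column: y = b0 + b1*x1 + ... + bp*xp + error
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    ssr = np.sum((fitted - y.mean()) ** 2)  # variation explained by the regression
    sse = np.sum((y - fitted) ** 2)         # variation left in the residuals
    msr = ssr / p                           # mean square regression
    mse = sse / (n - p - 1)                 # mean square error
    return msr / mse, p, n - p - 1

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=1.0, size=100)

F, df1, df2 = regression_f_statistic(X, y)
print(f"F = {F:.2f} on ({df1}, {df2}) degrees of freedom")
```

A large F means the predictors together explain much more variance than you would expect by chance. To turn it into a p-value you compare it with the F distribution with those degrees of freedom, for example with `scipy.stats.f.sf(F, df1, df2)` if SciPy is installed.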

Diagnostics and Remedial Measures Based on All Predictor Variables: Start with the full model, which includes all of your predictor variables. By looking at residual plots you check the items below, and if any assumption is violated you try to correct it with transformations.
·         Multicollinearity Between Predictor Variables
·         Linearity of the Regression Function
·         Constant Variance of Error Terms
·         Independence of Residuals
·         Normality of Residuals
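
As a rough sketch of two of the checks above, the snippet below computes variance inflation factors (a common way to quantify multicollinearity, not necessarily the one used in my project) and the residuals and fitted values you would put on a residual plot. The data is synthetic, with `x2` deliberately built as a near copy of `x1`, and the helper names `fit_ols` and `vif` are made up.

```python
import numpy as np

def fit_ols(X, y):
    """Least squares fit with intercept; returns (coefficients, fitted values)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, design @ coef

def vif(X):
    """Variance inflation factor of each predictor: 1 / (1 - R^2_j),
    where R^2_j comes from regressing predictor j on the other predictors."""
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        _, fitted = fit_ols(others, X[:, j])
        ss_res = np.sum((X[:, j] - fitted) ** 2)
        ss_tot = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        factors.append(ss_tot / ss_res)  # same as 1 / (1 - R^2_j)
    return np.array(factors)

# Synthetic data, for illustration only
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
y = 1.0 + 2.0 * x1 + 0.5 * x3 + rng.normal(size=200)

vifs = vif(X)
_, fitted = fit_ols(X, y)
residuals = y - fitted  # plot these against `fitted` to judge linearity and constant variance

print("VIF per predictor:", np.round(vifs, 1))
```

A common rule of thumb treats a VIF above 10 as a sign of serious multicollinearity. For the other items you would plot `residuals` against `fitted` (a funnel shape suggests non-constant variance, a curve suggests non-linearity) and look at a normal probability plot of the residuals.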

Selection of Significant Predictor Variables: After you have made sure that the full model satisfies all the regression assumptions, you pick, from the whole set of available variables, the ones which help you explain a significant amount of the variance in Y (the response variable). You can use Stepwise, Forward Selection or Backward Elimination to decide which variables to include in your model.
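
To give an idea of how Backward Elimination works, here is a simplified sketch: fit the model, drop the predictor with the smallest absolute t statistic, and repeat until every remaining predictor passes a cutoff. The cutoff of 2.0 (roughly the 5% level for large samples) is my assumption, and real packages usually work with p-values or partial F tests instead. The data is synthetic and the function names are made up.

```python
import numpy as np

def t_statistics(X, y):
    """t statistic of each slope coefficient in a least squares fit with intercept."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    mse = residuals @ residuals / (n - p - 1)
    cov = mse * np.linalg.inv(design.T @ design)  # covariance of the estimates
    return coef[1:] / np.sqrt(np.diag(cov)[1:])   # skip the intercept

def backward_elimination(X, y, names, t_cutoff=2.0):
    """Repeatedly drop the weakest predictor until all have |t| >= t_cutoff."""
    kept = list(range(X.shape[1]))
    while kept:
        t = np.abs(t_statistics(X[:, kept], y))
        weakest = int(np.argmin(t))
        if t[weakest] >= t_cutoff:
            break              # every remaining predictor passes the cutoff
        kept.pop(weakest)      # remove the weakest predictor and refit
    return [names[i] for i in kept]

# Synthetic data: only x0 and x2 actually influence y
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=200)
names = ["x0", "x1", "x2", "x3", "x4"]

selected = backward_elimination(X, y, names)
print("Selected predictors:", selected)
```

Forward Selection goes the other way (start empty, add the strongest candidate each round), and Stepwise combines both by re-checking already selected variables after each addition.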


Diagnostics and Remedial Measures Based on Selected Predictor Variables
·         Linearity of the Regression Function
·         Removing Outliers
·         Constant Variance of Error Terms
·         Normality of Error Terms
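
For the outlier item, one possible approach (an assumption on my side, there are several others) is to flag observations whose studentized residual is large in absolute value. The sketch below uses synthetic data with one planted outlier; the cutoff of 3 and the function name are illustrative choices.

```python
import numpy as np

def studentized_residuals(X, y):
    """Internally studentized residuals: e_i / sqrt(MSE * (1 - h_ii))."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    xtx_inv = np.linalg.inv(design.T @ design)
    coef = xtx_inv @ design.T @ y
    residuals = y - design @ coef
    # h_ii: diagonal of the hat matrix design @ xtx_inv @ design.T
    leverage = np.sum((design @ xtx_inv) * design, axis=1)
    mse = residuals @ residuals / (n - p - 1)
    return residuals / np.sqrt(mse * (1.0 - leverage))

# Synthetic data, for illustration only
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
y = 0.5 + 1.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=100)
y[10] += 15.0  # planted outlier at index 10

r = studentized_residuals(X, y)
flagged = np.flatnonzero(np.abs(r) > 3.0)
print("Flagged observations:", flagged)
```

A flagged point is only a candidate: you should check whether it is a data entry error or a genuine observation before removing it.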


Validation of the Model: You initially set aside some of the observations and do not include them in the regression. Let’s say you calculated an appropriate regression line with 400 observations; then you look at how well that line fits the 100 observations you set aside.
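
The 400 / 100 idea can be sketched like this, assuming a random split and comparing the mean squared error on the fitted part with the mean squared prediction error on the held-out part. The data is synthetic and all variable names are made up for the example.

```python
import numpy as np

def design_matrix(X):
    """Add an intercept column to the predictors."""
    return np.column_stack([np.ones(len(X)), X])

# Synthetic data, for illustration only: 500 observations, 2 predictors
rng = np.random.default_rng(3)
n_total, n_train = 500, 400
X = rng.normal(size=(n_total, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n_total)

# Randomly pick 400 observations for fitting and keep 100 for validation
order = rng.permutation(n_total)
train, test = order[:n_train], order[n_train:]

# Fit on the 400 training observations only
coef, *_ = np.linalg.lstsq(design_matrix(X[train]), y[train], rcond=None)

# Error on the data used for fitting vs. error on the held-out data
mse_train = np.mean((y[train] - design_matrix(X[train]) @ coef) ** 2)
mspr = np.mean((y[test] - design_matrix(X[test]) @ coef) ** 2)

print(f"MSE on the 400 fitted observations:    {mse_train:.3f}")
print(f"MSPR on the 100 held-out observations: {mspr:.3f}")
```

If the prediction error on the held-out observations is close to the error on the fitted ones, the model generalizes reasonably well; a much larger value suggests the model is overfitted to the 400 observations.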

3.       Which Software to Use:

Well, you can use many statistical software packages for regression, but I guess SPSS and Minitab are the most popular ones. I started with SPSS, thinking that it’s the more popular package and that knowing SPSS would be a plus because it’s widely used. At the beginning it looks simple, and it is indeed, but for some simple tasks SPSS is just a waste of time.
For example, say you want to produce all the residual plots and copy them to MS Word, Excel etc. In SPSS you have to copy each graph manually, and it takes time. Editing graphs or plotting multiple graphs in SPSS is not easy either.
So I looked at Minitab, and the result: Minitab rules! You can plot variables against each other very easily and save them all as .jpg files. Plus, it seems to me that Minitab gives you more options in regression, so I would definitely suggest using Minitab. Also, a trial version of Minitab is available for 30 days (way enough for a regression project :) ).

I hope this blog post helps you in your analysis.

End of Course-Related Blog Post Series!!
{ at least for a while}

Arda