Improving a Linear Regression Model for Categorical Data

User 4603 | 4/11/2016, 9:31:21 AM

I want some help applying a linear regression model.

I have a data file with 12k training examples and 2k test examples.

Each training example has 2 features. One feature has float values and the other is categorical, with around 1400 unique categories. I am getting a very bad RMSE (16000) and correlation coefficient (84%).

Kindly help me in deciding what to do next to improve my model.


User 91 | 4/11/2016, 4:54:09 PM

It's extremely hard to say exactly why the RMSE is bad without more information. It could be one of many things:

- You may need better features. Understanding the domain problem deeply helps you engineer better ones.
- You may have some bad data that is confusing the regression model. Analyzing where the errors are high and where they are low can help in this situation.
- You may be overfitting (i.e., the training error is low and the test error is high). Cross-validation can help here.
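To make the overfitting check concrete, here is a minimal sketch of k-fold cross-validation in plain Python that compares training RMSE with held-out RMSE. It is not the asker's actual setup: the row layout `(x, category, y)`, the function names, the fallback intercept for unseen categories, and the synthetic data are all assumptions made for illustration. The model is a shared slope on the float feature plus one intercept per category, which should match what ordinary least squares gives after one-hot encoding the category.

```python
import math
import random
from collections import defaultdict


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def fit(rows):
    """Least squares with a shared slope on x and one intercept per category.

    rows: list of (x, category, y) tuples (hypothetical layout).
    """
    groups = defaultdict(list)
    for x, c, y in rows:
        groups[c].append((x, y))
    # Per-category means of x and y.
    means = {c: (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
             for c, g in groups.items()}
    # Slope from data centered within each category.
    sxy = sum((x - means[c][0]) * (y - means[c][1]) for x, c, y in rows)
    sxx = sum((x - means[c][0]) ** 2 for x, c, _ in rows)
    slope = sxy / sxx if sxx > 0 else 0.0
    intercepts = {c: my - slope * mx for c, (mx, my) in means.items()}
    # Assumption: categories unseen in training get the average intercept.
    default = sum(y - slope * x for x, _, y in rows) / len(rows)
    return slope, intercepts, default


def predict(model, rows):
    slope, intercepts, default = model
    return [slope * x + intercepts.get(c, default) for x, c, _ in rows]


def cross_validate(rows, k=5, seed=0):
    """Return (mean training RMSE, mean validation RMSE) over k folds."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    train_scores, val_scores = [], []
    for i in range(k):
        val = folds[i]
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        model = fit(train)
        train_scores.append(rmse([y for _, _, y in train], predict(model, train)))
        val_scores.append(rmse([y for _, _, y in val], predict(model, val)))
    return sum(train_scores) / k, sum(val_scores) / k


if __name__ == "__main__":
    # Synthetic data: many categories, few rows each, and no real category
    # effect, so the per-category intercepts can only fit noise.
    rng = random.Random(1)
    data = []
    for _ in range(600):
        x = rng.uniform(0, 10)
        data.append((x, rng.randrange(200), 3.0 * x + rng.gauss(0, 5)))
    train_rmse, val_rmse = cross_validate(data)
    print("train RMSE:", round(train_rmse, 2), "validation RMSE:", round(val_rmse, 2))
```

If the validation RMSE comes out much higher than the training RMSE, that points toward overfitting, which is plausible when about 1400 categories are spread over 12k rows. If both are similarly high, better features or cleaner data are more likely to help.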