Monday, November 11, 2013

Regressing \(\ln(Y)\) instead of \(Y\)

TLDR: If you have an estimate for \(Z\), you can't just take \(e^{\text{estimate}}\) to estimate \(e^Z\)

A bias correction factor of \(e^{\hat\sigma^2/2}\) has to be applied to the "common sense" estimator \(e^{\hat{E(Z)}}\) to correctly estimate \(Y=e^Z\). The right estimate is \(E(Y)=E(e^Z)\ \hat =\ e^{\hat \sigma^2/2}e^{\hat{E(Z)}}\).


Let's take an attribute \(Y\) which has a lognormal distribution - e.g. Income, Spend, Revenue etc. Since \(\ln(Y)\sim N(\mu,\sigma^2)\), we may choose to model \(\ln(Y)=Z\) instead of \(Y\), and aspire to get a better estimate of \(Y\) from our estimate of \(\ln(Y)\).

Suppose we model \(Z=\ln(Y)\) instead of \(Y\), so that \(Y=e^Z\). We estimate \(E(Z)=\mu\ \hat=\ \hat \mu= f(X)\) based on independent variables \(X\). (Read the symbol \(\hat =\) as "estimated as".)

Given that \(\hat \mu\) estimates \(E(Z)\), a common-sense option to estimate \(E(Y)\) might seem to be \(e^{\hat \mu}\), since \(Y=e^Z\).

But this will not give the best results - simply because \(E(Y)=E(e^Z)\ne e^{E(Z)}\).

\(E(Y)=e^{\mu+\sigma^2/2}\), where \(\sigma^2\) is the variance of the error \(Z-\hat Z\) - and hence a good estimate of \(E(Y)\) would  be \(E(Y)\ \hat=\ e^{\hat \mu+\hat \sigma^2/2}\).
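To see the size of the effect, here is a minimal simulation sketch in Python (with arbitrarily chosen \(\mu\) and \(\sigma\)):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.0                  # illustrative parameters of Z = ln(Y)
z = rng.normal(mu, sigma, size=100_000)
y = np.exp(z)                         # Y is lognormal

naive     = np.exp(z.mean())                        # e^muhat: biased low
corrected = np.exp(z.mean() + z.var(ddof=1) / 2)    # e^(muhat + sigmahat^2/2)

print(y.mean())      # close to e^(mu + sigma^2/2) = e^2.5 ~ 12.18
print(naive)         # ~ e^2 ~ 7.39, clearly understates E(Y)
print(corrected)     # close to y.mean()
```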

Estimating \(\sigma^2\)

We are used to estimating \(E(Z)\hat=\hat \mu\), which is just the familiar regression estimate \(\sum \hat \beta_i X_i\). We will now need to estimate \(\hat\sigma^2\) too, to get an accurate point estimate of \(Y=e^Z\).

OLS


If you are running an Ordinary Least Squares regression, an unbiased estimate for \(\sigma^2\) is \(\frac{SSE}{n-k}\) where \(n\)=#observations, and \(k\)=#parameters in the model.

Most statistical packages report these - and if not, you can calculate it as \(\sum (Z-\hat Z)^2/(n-k)\). SAS reports all of these if you use PROC REG; in fact, \(\hat \sigma\) is already reported there as "Root MSE", so you can directly take \(\text{Root MSE}^2\) as an estimate of \(\sigma^2\).
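For example, a minimal sketch in Python/numpy, assuming a design matrix X that already includes the intercept column and a log-response vector z:

```python
import numpy as np

def sigma2_hat_ols(X, z):
    """Unbiased estimate of sigma^2 from an OLS fit of z = ln(y) on X.

    X : (n, k) design matrix, intercept column included.
    z : (n,)  vector of log responses.
    """
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)   # OLS coefficients
    resid = z - X @ beta                           # Z - Zhat
    n, k = X.shape
    return resid @ resid / (n - k)                 # SSE / (n - k) = Root MSE squared
```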

Other Regression Frameworks (Machine Learning - RandomForest, NN, KNN, etc.)


A generic way of estimating \(\sigma^2\) is to borrow the assumption of homoscedasticity from OLS - i.e. that \(\sigma^2\) does not vary from observation to observation.

Under this assumption, the law of large numbers can be used to show that \(\sum (Z-\hat Z)^2/n\) converges in probability to \(\sigma^2\), and hence it remains a good estimator - even if it may not be unbiased for small \(n\).

If the number of parameters in the model is known, then it is recommended to use \(\sum (Z-\hat Z)^2/(n-k)\), mimicking the OLS estimator - it will correct for the bias to some extent, although for large \(n\) the difference between \(1/(n-k)\) and \(1/n\) is small.
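A sketch of the full recipe with a machine-learning regressor - here scikit-learn's RandomForestRegressor on synthetic data, purely as an illustration; the residual-based \(\hat\sigma^2\) and the correction work the same way for any model:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# synthetic lognormal example: Z = ln(Y) = 1 + 2*x + Normal(0, 0.5) noise
n = 5_000
X = rng.uniform(0.0, 1.0, size=(n, 1))
z = 1.0 + 2.0 * X[:, 0] + rng.normal(0.0, 0.5, size=n)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, z)
z_hat = model.predict(X)

# generic residual-based estimate; ideally use held-out (or out-of-bag) residuals
# so that an over-fitted model does not understate sigma^2
sigma2_hat = np.mean((z - z_hat) ** 2)              # sum (Z - Zhat)^2 / n

y_hat_naive     = np.exp(z_hat)                     # biased low on average
y_hat_corrected = np.exp(z_hat + sigma2_hat / 2)    # bias-corrected estimate of E(Y | X)
```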

 

Proof

If \(Z=\ln(Y)\sim N(\mu,\sigma^2)\), then \(E(Y)=E(e^Z)=e^{\mu+\sigma^2/2}\).

Citing the mean of the lognormal distribution on Wikipedia may work as "proof" in most cases. Just for completeness, a full mathematical proof is also given below.

\[E(e^Z)=\int_{-\infty}^{\infty}{e^z\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(z-\mu)^2}{2\sigma^2}}}\,dz=\int_{-\infty}^{\infty}{\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{\color{#3366FF}{(z-\mu)^2-2\sigma^2 z}}{2\sigma^2}}}\,dz\]
 \[\begin{array}{rcl}
\color{#3366FF}{(z-\mu)^2-2\sigma^2 z}&=&z^2-2\mu z + \mu^2-2\sigma^2z\\
&=&z^2-2(\mu+\sigma^2) z + \mu^2\\
&=&\left(z-(\mu+\sigma^2)\right)^2 + \mu^2-(\mu+\sigma^2)^2\\
&=&\left(z-(\mu+\sigma^2)\right)^2 - 2\mu\sigma^2-\sigma^4\\
&=&\color{green}{\left(z-(\mu+\sigma^2)\right)^2} - \color{red}{2\sigma^2(\mu+\sigma^2/2)}\\
\end{array}\]
\[\begin{array}{rcl}
E(e^Z)&=&\int_{-\infty}^{\infty}{\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{\color{green}{\left(z-(\mu+\sigma^2)\right)^2} - \color{red}{2\sigma^2(\mu+\sigma^2/2)}}{2\sigma^2}}}\,dz\\
&=&\color{red}{e^{\mu+\sigma^2/2}}\int_{-\infty}^{\infty}{\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{\left(z-(\mu+\sigma^2)\right)^2}{2\sigma^2}}}\,dz\\
&=&\color{red}{e^{\mu+\sigma^2/2}}\\
\end{array}\]
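As a quick numerical sanity check of the identity (arbitrary \(\mu\) and \(\sigma\), Monte Carlo):

```python
import numpy as np

mu, sigma = 0.7, 1.3
z = np.random.default_rng(1).normal(mu, sigma, size=2_000_000)

print(np.exp(z).mean())            # Monte Carlo estimate of E(e^Z)
print(np.exp(mu + sigma**2 / 2))   # closed form e^(mu + sigma^2/2); the two agree
```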

Friday, August 16, 2013

Random Forest: Per prediction confidence

Random Forest

Random Forest is a machine learning technique to automatically classify data into their most likely category based on a training sample. For example, you may want to categorize a movie as good or bad, a drug as effective or ineffective for a given patient, or a bitmap image as one of the digits 0-9.

To perform this feat, RF first 'looks' at a sample with known categories and 'learns' from it. This is called the 'training' process.

Since it uses training data to learn, the method is called supervised learning, as opposed to unsupervised learning, where data is grouped into its most likely clusters without any hint of what those clusters may be from previously available data.

Some other supervised learning techniques are Logistic Regression, Neural Networks, Support Vector Machines, and Gradient Boosting Machines.

A Random Forest, as the name indicates, generates a collection of trees - viz. Decision Trees - at the end of the learning process. Each tree by itself is a predictor. To categorize a new observation, it is run through all the trees, and the results are aggregated.



Testing Random Forest

How well does it work?

To test it, I created a dataset where I color a point \((x,y)\) Green if \(y>\sin(x)\), and Red otherwise.

Note that conventional methods like Logistic Regression, \(P(\text{green})=\left({1+e^{-(\alpha+\beta_1 x + \beta_2 y)}}\right)^{-1}\), would fail on this data, since their decision boundary is a straight line in the \((x,y)\) plane and cannot capture the \(y=\sin(x)\) boundary.

Test Function

Then I created samples of different sizes consisting of \((x,y)\) points and applied the red/green labeling to produce training data for the RF.
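Something along these lines - a sketch in Python/numpy; the box limits and sample sizes here are just illustrative choices:

```python
import numpy as np

def make_sample(n, rng):
    """n points (x, y) in a box, labeled 1 (Green) if y > sin(x), else 0 (Red)."""
    x = rng.uniform(-2 * np.pi, 2 * np.pi, size=n)
    y = rng.uniform(-2.0, 2.0, size=n)
    label = (y > np.sin(x)).astype(int)
    return np.column_stack([x, y]), label

rng = np.random.default_rng(0)
samples = {n: make_sample(n, rng) for n in (100, 1_000, 10_000)}   # different sample sizes
```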

Training Samples


As I was interested to see how the quality of the prediction varies with the amount of data, I fed in the above samples separately. The number of trees that RF generates, which up to a point increases the predictive power, was also varied. It was RF's job to come up with the function just by looking at these points.
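Continuing the sketch above, the training/prediction step might look like this (using scikit-learn's RandomForestClassifier as one possible implementation):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# grid of points on which to visualise the learned function
gx, gy = np.meshgrid(np.linspace(-2 * np.pi, 2 * np.pi, 200),
                     np.linspace(-2.0, 2.0, 200))
grid = np.column_stack([gx.ravel(), gy.ravel()])

fits = {}
for n, (X, label) in samples.items():          # samples from the sketch above
    for n_trees in (50, 1_000):                # vary the number of trees
        rf = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X, label)
        fits[(n, n_trees)] = rf.predict_proba(grid)[:, 1]   # P(Green) at each grid point
```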

Guess made by Random Forest


Looking at these pictures, it is apparent that RF recreated the function very accurately, even with as few as 100 observations - a sample from which it may actually be difficult for a person to guess the function.

For our simple data, the incremental gain by increasing the number of trees from 50 to 1000 is small.

Confidence

RF has a natural candidate that can be used to estimate how confident it is about each prediction. If there are \(n\) trees, then for each observation \(k\) we can simply compute the variance of the per-tree predictions, \(v_k=\frac{1}{n}\sum_{i=1}^n\left(\hat p_{ik}-\bar p_k\right)^2\), where \(\hat p_{ik}\) is the prediction of tree \(i\) for observation \(k\) and \(\bar p_k=\frac{1}{n}\sum_{i=1}^n \hat p_{ik}\).
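Continuing the scikit-learn sketch, this per-point spread can be read directly off the individual trees (using the last fitted forest rf and the evaluation grid from above):

```python
import numpy as np

# predicted P(Green) from each individual tree, shape (n_trees, n_grid_points)
per_tree = np.stack([tree.predict_proba(grid)[:, 1] for tree in rf.estimators_])

mean_p = per_tree.mean(axis=0)     # the forest's aggregated prediction
std_p  = per_tree.std(axis=0)      # spread across trees = per-point uncertainty

# one possible "relative" measure (coefficient of variation); the exact
# normalisation used for the plot below is a matter of choice
rel_std = std_p / (mean_p + 1e-9)
```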

A sense of the confidence can already be seen by looking at the white regions in the charts above. Below is a plot of the relative standard deviation, which brings it out more clearly.

Relative Uncertainty of Random Forest

Darker areas correspond to regions of higher uncertainty.

Observe that the most uncertain regions are at the class boundary, as one would expect, since it is at the boundary that you will be most unsure about what color a point will have.

Also as the size of the training sample is increased, uncertainty recedes more and more towards the boundary of the classes.

Conclusion

For each data point, the variance of the outcomes across all the trees can be used to estimate the confidence in its prediction. This is especially valuable for estimating continuous attributes (though the simple example above demonstrates it on discrete outcomes), yet it is generally overlooked. In a scenario where a bad decision is costly, the confidence can be used to filter the data before taking a decision.