Opinions 5: Regressing \(\ln(Y)\) instead of \(Y\)

TLDR: If you have an estimate for \(Z\), you can't just take \(e^{estimate}\) to estimate \(e^Z\)

A bias correction factor of \(e^{\hat\sigma^2/2}\) has to be applied on the "common sense" estimator \(e^{\hat{E(Z)}}\), to correctly estimate \(Y=e^Z\). The right estimate is \(Y=e^Z\ \hat =\ e^{\hat \sigma^2/2}e^{\hat{E(Z)}}\).

Let's take an attribute \(Y\) which has a lognormal distribution - e.g. Income, Spend, Revenue etc. Since \(\ln(Y)\sim N(\mu,\sigma^2)\), we may choose to model \(\ln(Y)=Z\) instead of \(Y\), and aspire to get a better estimate of \(Y\) from our estimate of \(\ln(Y)\).

Suppose we model \(Z=\ln(Y)\) instead of \(Y\), so that \(Y=e^Z\). We estimate \(E(Z)=\mu\ \hat=\ \hat \mu= f(X)\) based on independent variables \(X\). (Read the symbol \(\hat =\) as "estimated as".)

Given \(\hat \mu\) estimates \(E(Z)\), a common-sense option to estimate \(E(Y)\) might seem to be \(e^\hat \mu\), since \(Y=e^Z\).

But this will not give the best results - simply because \(E(Y)=E(e^Z)\ne e^{E(Z)}\).

\(E(Y)=e^{\mu+\sigma^2/2}\), where \(\sigma^2\) is the variance of the error \(Z-\hat Z\) - and hence a good estimate of \(E(Y)\) would be \(E(Y)\ \hat=\ e^{\hat \mu+\hat \sigma^2/2}\).

Estimating \(\sigma^2\)

We are used to estimating \(E(Z)\hat=\hat \mu\), which is the just the familiar regression estimate \(\sum \hat \beta_i X_i\). We will need to estimate \(\hat\sigma^2\) now too, to get an accurate point estimate of \(Y=e^Z\).

OLS

If you are running an Ordinary Least Squares regression, an unbiased estimate for \(\sigma^2\) is \(\frac{SSE}{n-k}\) where \(n\)=#observations, and \(k\)=#parameters in the model.

Most statistical packages report these - and if not, you can calculate it as \(\sum (Z-\hat Z)^2/(n-k)\). SAS reports all these if you use PROC REG, in fact, in SAS \(\hat \sigma\) is already reported as "Root MSE", and you can directly take \(\text{Root MSE}^2\) as an estimate of \(\sigma^2\).

Other Regression Frameworks (Machine Learning - RandomForest, NN, KNN, etc.)

A generic way of estimating the \(\sigma^2\) is to borrow the assumption of homoscedasticity from OLS - i.e. that the \(\sigma^2\) does not vary from person to person.

Under this assumption, CLT can be used to show that \(\sum (Z-\hat Z)^2/n\) will converge in probability to \(\sigma^2\), and hence remains a good estimator - even if it may not be unbiased for small \(n\).

If number of parameters in the model is known, then it is recommended to use \(\sum (Z-\hat Z)^2/(n-k)\), mimicking the OLS estimator - it will correct for the bias to some extent, although for large \(n\), the difference between \(1/(n-k)\) and \(1/n\) will be small.

Proof

If \(Z=\ln(Y)\sim N(\mu,\sigma^2)\), then \(E(Y)=E(e^Z)=e^{\mu+\sigma^2/2}\).

Citing the mean of lognormal distribution in Wikipedia may work as "proof" in most cases. Just for completeness, a full mathematical proof is also given below.

\[E(e^Z)=\int_{-\infty}^{\infty}{e^z\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(z-\mu)^2}{2\sigma^2}}}\,dz=\int_{-\infty}^{\infty}{\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{\color{#3366FF}{(z-\mu)^2-2\sigma^2 z}}{2\sigma^2}}}\,dz\]
\[\begin{array}{rcl}
\color{#3366FF}{(z-\mu)^2-2\sigma^2 z}&=&z^2-2\mu z + \mu^2-2\sigma^2z\\
&=&z^2-2(\mu+\sigma^2) z + \mu^2\\
&=&\left(z-(\mu+\sigma^2)\right)^2 + \mu^2-(\mu+\sigma^2)^2\\
&=&\left(z-(\mu+\sigma^2)\right)^2 - 2\mu\sigma^2-\sigma^4\\
&=&\color{green}{\left(z-(\mu+\sigma^2)\right)^2} - \color{red}{2\sigma^2(\mu+\sigma^2/2)}\\
\end{array}\]
\[\begin{array}{rcl}
E(e^Z)&=&\int_{-\infty}^{\infty}{\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{\color{green}{\left(z-(\mu+\sigma^2)\right)^2} - \color{red}{2\sigma^2(\mu+\sigma^2/2)}}{2\sigma^2}}}\,dz\\
&=&\color{red}{e^{\mu+\sigma^2/2}}\int_{-\infty}^{\infty}{\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{\left(z-(\mu+\sigma^2)\right)^2}{2\sigma^2}}}\,dz\\
&=&\color{red}{e^{\mu+\sigma^2/2}}\\
\end{array}\]

1 comment:

AnonymousMarch 26, 2019 at 12:39 PM
Augmented Reality in Education
AR and Education — In this era of digitalization, educators know that the learning process should be all about creativity and interaction. Augmented Reality alters one’s ongoing perception and makes it more meaningful by adding digital contents to it. And UniteAR using its innovative technology adds extra digital contents which will give students a wider understanding of topics.

Special kids and AR — As far as special kids are considered Augmented Reality is one of the most excellent ways to educate them. For instance, children who have ADHD and Autism may need special guidance in education as the normal educational techniques may not work for them. UniteAR helps them to learn abstract and difficult concepts with the help of AR technology; as a result learning is made simple.

https://www.unitear.com/https://www.unitear.com/

Monday, November 11, 2013

Regressing \(\ln(Y)\) instead of \(Y\)

Estimating \(\sigma^2\)

OLS

Other Regression Frameworks (Machine Learning - RandomForest, NN, KNN, etc.)

Proof

1 comment: