Monday, November 11, 2013

Regressing \(\ln(Y)\) instead of \(Y\)

TLDR: If you have an estimate for \(Z\), you can't just take \(e^{\text{estimate}}\) to estimate \(e^Z\).

A bias correction factor of \(e^{\hat\sigma^2/2}\) has to be applied to the "common sense" estimator \(e^{\hat{E(Z)}}\) to correctly estimate \(Y=e^Z\). The right estimate is \(Y=e^Z\ \hat =\ e^{\hat \sigma^2/2}e^{\hat{E(Z)}}\).


Let's take an attribute \(Y\) which has a lognormal distribution - e.g. Income, Spend, Revenue etc. Since \(\ln(Y)\sim N(\mu,\sigma^2)\), we may choose to model \(\ln(Y)=Z\) instead of \(Y\), and aspire to get a better estimate of \(Y\) from our estimate of \(\ln(Y)\).

Suppose we model \(Z=\ln(Y)\) instead of \(Y\), so that \(Y=e^Z\). We estimate \(E(Z)=\mu\ \hat=\ \hat \mu= f(X)\) based on independent variables \(X\). (Read the symbol \(\hat =\) as "estimated as".)

Given \(\hat \mu\) estimates \(E(Z)\), a common-sense option to estimate \(E(Y)\) might seem to be \(e^{\hat \mu}\), since \(Y=e^Z\).

But this will not give the best results - simply because \(E(Y)=E(e^Z)\ne e^{E(Z)}\).

\(E(Y)=e^{\mu+\sigma^2/2}\), where \(\sigma^2\) is the variance of the error \(Z-\hat Z\) - and hence a good estimate of \(E(Y)\) would  be \(E(Y)\ \hat=\ e^{\hat \mu+\hat \sigma^2/2}\).

Estimating \(\sigma^2\)

We are used to estimating \(E(Z)\hat=\hat \mu\), which is just the familiar regression estimate \(\sum \hat \beta_i X_i\). We will now also need to estimate \(\hat\sigma^2\), to get an accurate point estimate of \(Y=e^Z\).

OLS


If you are running an Ordinary Least Squares regression, an unbiased estimate for \(\sigma^2\) is \(\frac{SSE}{n-k}\) where \(n\)=#observations, and \(k\)=#parameters in the model.

Most statistical packages report this - and if not, you can calculate it as \(\sum (Z-\hat Z)^2/(n-k)\). SAS reports all of these if you use PROC REG; in fact, in SAS \(\hat \sigma\) is already reported as "Root MSE", so you can directly take \(\text{Root MSE}^2\) as an estimate of \(\sigma^2\).
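
As an illustration, here is a minimal sketch in Python (using only numpy; the simulated data and variable names are my own, not tied to any particular package) of fitting \(Z=\ln(Y)\) by OLS, estimating \(\sigma^2\) as \(SSE/(n-k)\), and applying the correction:

import numpy as np

rng = np.random.default_rng(0)

# Toy lognormal data: Z = ln(Y) = 1.0 + 0.5*x + Normal(0, 0.8) noise
n = 1000
x = rng.uniform(0, 10, n)
z = 1.0 + 0.5 * x + rng.normal(0.0, 0.8, n)
y = np.exp(z)

# OLS fit of Z on [1, x]
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, z, rcond=None)[0]
z_hat = X @ beta                                  # estimate of E(Z)

k = X.shape[1]                                    # number of parameters
sigma2_hat = np.sum((z - z_hat) ** 2) / (n - k)   # SSE / (n - k)

y_naive = np.exp(z_hat)                           # common-sense back-transform
y_corrected = np.exp(z_hat + sigma2_hat / 2)      # bias-corrected estimate of E(Y)

print(y.mean(), y_naive.mean(), y_corrected.mean())

On data like this, the naive back-transform should come in noticeably low on average, while the corrected estimate should track the sample mean of \(Y\) much more closely.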

Other Regression Frameworks (Machine Learning - RandomForest, NN, KNN, etc.)


A generic way of estimating \(\sigma^2\) is to borrow the assumption of homoscedasticity from OLS - i.e. that \(\sigma^2\) does not vary from person to person.

Under this assumption, the law of large numbers can be used to show that \(\sum (Z-\hat Z)^2/n\) converges in probability to \(\sigma^2\), and hence it remains a good estimator - even if it may not be unbiased for small \(n\).

If the number of parameters in the model is known, it is recommended to use \(\sum (Z-\hat Z)^2/(n-k)\), mimicking the OLS estimator - it will correct for the bias to some extent, although for large \(n\) the difference between \(1/(n-k)\) and \(1/n\) will be small.
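
The same correction carries over unchanged to a machine-learning model. Here is a sketch using scikit-learn's RandomForestRegressor on the same kind of simulated data as above (my own illustration; note that in-sample residuals of a flexible model tend to understate \(\sigma^2\), so the out-of-bag residuals are used instead):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 1000).reshape(-1, 1)
z = 1.0 + 0.5 * x[:, 0] + rng.normal(0.0, 0.8, 1000)   # Z = ln(Y)

rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
rf.fit(x, z)

# Out-of-bag predictions avoid the optimism of in-sample residuals
sigma2_hat = np.mean((z - rf.oob_prediction_) ** 2)    # homoscedastic estimate of sigma^2

# Bias-corrected point estimates of Y = e^Z for new observations
x_new = np.array([[2.5], [7.0]])
y_hat = np.exp(rf.predict(x_new) + sigma2_hat / 2)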

 

Proof

If \(Z=\ln(Y)\sim N(\mu,\sigma^2)\), then \(E(Y)=E(e^Z)=e^{\mu+\sigma^2/2}\).

Citing the mean of the lognormal distribution on Wikipedia may work as "proof" in most cases. Just for completeness, a full mathematical proof is also given below.

\[E(e^Z)=\int_{-\infty}^{\infty}{e^z\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(z-\mu)^2}{2\sigma^2}}}\,dz=\int_{-\infty}^{\infty}{\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{\color{#3366FF}{(z-\mu)^2-2\sigma^2 z}}{2\sigma^2}}}\,dz\]
 \[\begin{array}{rcl}
\color{#3366FF}{(z-\mu)^2-2\sigma^2 z}&=&z^2-2\mu z + \mu^2-2\sigma^2z\\
&=&z^2-2(\mu+\sigma^2) z + \mu^2\\
&=&\left(z-(\mu+\sigma^2)\right)^2 + \mu^2-(\mu+\sigma^2)^2\\
&=&\left(z-(\mu+\sigma^2)\right)^2 - 2\mu\sigma^2-\sigma^4\\
&=&\color{green}{\left(z-(\mu+\sigma^2)\right)^2} - \color{red}{2\sigma^2(\mu+\sigma^2/2)}\\
\end{array}\]
\[\begin{array}{rcl}
E(e^Z)&=&\int_{-\infty}^{\infty}{\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{\color{green}{\left(z-(\mu+\sigma^2)\right)^2} - \color{red}{2\sigma^2(\mu+\sigma^2/2)}}{2\sigma^2}}}\,dz\\
&=&\color{red}{e^{\mu+\sigma^2/2}}\int_{-\infty}^{\infty}{\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{\left(z-(\mu+\sigma^2)\right)^2}{2\sigma^2}}}\,dz\\
&=&\color{red}{e^{\mu+\sigma^2/2}}\\
\end{array}\]
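
If you prefer a numerical sanity check to the algebra, a quick Monte Carlo sketch (purely illustrative) confirms the same identity:

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.2, 0.7

z = rng.normal(mu, sigma, 1_000_000)
print(np.exp(z).mean())              # sample mean of e^Z
print(np.exp(mu + sigma**2 / 2))     # e^(mu + sigma^2/2) - should match closely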

Friday, August 16, 2013

Random Forest: Per prediction confidence

Random Forest

Random Forest is a machine learning technique to automatically classify data into their most likely category based on a training sample. For example, you may want to categorize a movie as good or bad, a drug as effective or ineffective on a given patient, a bitmap image into the digits 0-9.

To perform this feat, RF first 'looks' at a sample with known categories, and 'learns' from it. This is called the 'training' process.

Since it uses training data to learn, the method is called supervised learning, as opposed to unsupervised learning, where data is grouped into its most likely clusters without any hint of what those clusters may be from previously available data.

Some other supervised learning techniques are Logistic Regression, Neural Networks, Support Vector Machines, and Gradient Boosting Machines.

A Random Forest, as the name indicates, generates a collection of trees - viz. Decision Trees - at the end of the learning process. Each tree by itself is a predictor. To categorize a new observation, it is run through all the trees, and the results are aggregated.



Testing Random Forest

How well does it work?

To test it, I created a dataset where I color a point \((x,y)\) Green if \(y>\sin(x)\), and Red otherwise.

Note that conventional methods like Logistic Regression, \(P(green)=\left({1+e^{-(\alpha+\beta_1 x + \beta_2 y)}}\right)^{-1}\), would fail on this data, as they assume a linear decision boundary.

Test Function

Then I created samples of different sizes consisting of \((x,y)\) points, applied the red/green labeling for training the RF.

Training Samples


As I was interested in seeing how the quality of the prediction varies with the amount of data, I fed in the above samples separately. The number of trees that RF generates, which up to a point increases the predictive power, was also varied. It was RF's job to come up with the function just by looking at these points.
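
For reference, the whole experiment can be set up in a few lines of Python with scikit-learn. This is a reconstruction of the setup rather than the exact code behind the plots - the coordinate ranges, sample sizes and tree counts below are placeholders:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def make_sample(n):
    # Random (x, y) points, labelled 1 (Green) if y > sin(x), else 0 (Red)
    pts = np.column_stack([rng.uniform(-5, 5, n), rng.uniform(-2, 2, n)])
    labels = (pts[:, 1] > np.sin(pts[:, 0])).astype(int)
    return pts, labels

forests = {}
for n in (100, 1000, 10000):              # varying the training sample size
    for n_trees in (50, 1000):            # varying the number of trees
        pts, labels = make_sample(n)
        rf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
        forests[(n, n_trees)] = rf.fit(pts, labels)

# rf.predict_proba(grid)[:, 1] evaluated on a dense grid of points gives the maps below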

Guess made by Random Forest


Looking at these pictures, it is apparent that RF recreated the function very accurately, even with as few as 100 observations - from which it may actually be difficult for a person to guess the function.

For our simple data, the incremental gain by increasing the number of trees from 50 to 1000 is small.

Confidence

RF has a natural candidate that can be used to estimate how confident it is in each prediction. If there are \(n\) trees, we can simply compute, for each observation \(k\), the variance of the per-tree predictions \(v_k=\frac{1}{n}\sum_{i=1}^n\left(\hat p_{k,i}-\bar p_k\right)^2\), where \(\hat p_{k,i}\) is the prediction of tree \(i\) for observation \(k\) and \(\bar p_k=\frac{1}{n}\sum_{i=1}^n \hat p_{k,i}\).
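
With scikit-learn, for instance, the individual trees of a fitted forest are exposed through its estimators_ attribute, so this per-prediction spread takes only a couple of lines (a sketch assuming rf and a dense grid of \((x,y)\) points from the setup sketched earlier):

import numpy as np

# One row per tree: each tree's probability of "Green" for every grid point
per_tree = np.array([tree.predict_proba(grid)[:, 1] for tree in rf.estimators_])

mean_pred = per_tree.mean(axis=0)    # the usual aggregated RF prediction
std_pred = per_tree.std(axis=0)      # per-prediction spread across trees - the uncertainty map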

A sense of the confidence can be seen by looking at the white region in the charts above. Below is a plot of the relative standard deviation which brings it out clearly.

Relative Uncertainty of Random Forest

Darker areas correspond to regions of higher uncertainty.

Observe that the most uncertain regions are at the class boundary, as one would expect, since that is where you will be most unsure about what color a point will have.

Also as the size of the training sample is increased, uncertainty recedes more and more towards the boundary of the classes.

Conclusion

For each data point, the variance of the outcomes across all the trees can be used to estimate the confidence in its prediction. It is especially valuable when estimating continuous attributes (though the simple example above demonstrates it on discrete outcomes), but it is generally overlooked. In a scenario where a bad decision is costly, the confidence can be used to filter the data before taking a decision.

Monday, November 19, 2012

We need a race of superhumans to prevent annihilation

In today's world, it is no longer possible for one human to understand the entire gamut of technology available to us. The technology we created has surpassed us. This happened roughly over the last 100 years. That this happened is not surprising.

Technology is bound to surpass us. We evolve at a snail's pace, and are essentially the same creatures that we were thousands of years ago. We evolved for 3.5 billion years, give or take. Technology, however, started to evolve externally, through our minds, since we first had language. And its 'genes' are thoughts and the documentation of science - which are far more malleable and lead to a far more guided 'evolution'. This caused technology to very rapidly outpace the biological, Darwinian evolution of our brain. The only thing that differentiates us today from the savage, aggressive, primitive apes that we were at the dawn of humanity is technology.

Darwinian evolution, which will improve us eventually, simply cannot cope with the pace at which knowledge evolves. As a result, certain qualities - for example, impulsive aggressiveness - which once helped us have now become a very real threat to our own survival.

I fear that if we do not evolve, we will not be able to handle the power technology offers us with the responsibility such power demands. The most likely scenario of humanity's end today is through nuclear detonations - which, in today's world, are a press of a button away for some people. I hope that we learn to control our primitive impulses before that happens. The scary thought is that, given enough time and no change, the laws of probability dictate that it is one day bound to happen.

The only way I see for us (humanity) to survive is if we learn to engineer the genes of future generations, improving them by favoring rational thought over aggressive impulses. I hope we can resist the itch to press that red button till we make that final technological leap.

(The only other alternative outcome I see is machines being eventually given the intelligence to take over - which can lead to a habitable world of intelligent beings, but probably ones not remotely like humans.)

TL;DR: We need to create a biologically engineered race of superhumans with the right mind to prevent our annihilation.

Saturday, August 25, 2012

Mayabono Biharini Horini - Lyrics & Translation


This is an awesome song by Tagore, which also featured recently in the movie Bedroom with some interesting rock background music.

Translating the quality of the words and the language is like translating Shakespeare - and is beyond my ability. But I cannot stand by - and feel I should do whatever I can to at least carry the meaning to non-native speakers. So here's the English translation, which is more of an explanation to help understand the song, and not to be taken as a song that can be sung.

-------------------------------------


Maya bono biharini horini
Gohono sopono soncharini -
Keno tare dhoribare kori pon?
Okaron?

    Wandering doe of the enchanted forest
    Who roams within the deepest (of my) dreams -
    Why do I want to capture her?
    Is there a reason?

Thak thak nijo mone dure te.
Ami shudhu banshuriro sure te -
Porosho koribo or prano-mon,
Okaron

    Let her be by herself afar.
    I'll only through the melody of my flute -
    Touch her life and soul,
    For no reason.

Chomokibe phagunero pobone
Poshibe akash bani shrobone.
Chitto akul hobe anukhon,
Okaron

    (She) will startle in the spring wind
    Will surrender as the heavens speak.
    Will grow restless now and then,
    For no reason.

Dur hote ami tare saadhibo
Gopone biroho dore bandhibo
Bandhono bihono sei - je badhon,
Okaron

    I will yearn for her from afar,
    I will secretly fasten her with the rope of separation,
    A tie without a tie.
    For no reason.

Saturday, August 18, 2012

Google Maps 6.10.0 update breaks feature

I upgraded to Maps 6.10.0 for Android today, and with much sadness I noticed that yet another feature is gone from the app. You can no longer search for places which you have starred. In previous versions, you could type in the description of a starred location, and it would pop up at the top of your list. The picture below shows an example of what I mean -



If you click the first entry, which corresponds to a place I marked with a star myself, it will immediately take you to that place on the map. Showing up as the first entry when searched is really the expected behavior, as explained in Google's support page -



Now this happens only if you have a bad Internet connection - in other words, assuming you are connected to the Internet, starred places will almost never pop up anywhere in the list.

In fact this feature has been broken since the 6.9.0 update. Another notable miss is the ability to have your places sorted by distance and direction in the My Places list - now they are only sorted by the date on which you starred them.

This is unacceptable to me, and hence I reverted to 6.8.0. The sad thing is that Google has a long history of turning a blind eye to feedback - so I do not even expect a fix. For now it is okay, because I do not need the new features added - which are mostly UI enhancements and other additions not very useful to me. I will just cross my fingers and hope that the last good version, 6.8.0, will keep on working if they continue to add more 'features' - and that I will not have to choose between this and something really useful in the future.

Tuesday, June 12, 2012

Password hashing with bookmarklet


This entry is to explain the concept of password hashing, and empower you with such a tool to have strong and unique passwords per site without having to remember them.

Many users tend to maintain a few passwords which they reuse across multiple websites. Even if the passwords are strong, this has a problem. With so many logins maintained by different website operators, it is inevitable that some of them will eventually get compromised. And the chances are also high that the passwords on the hacked sites will have been stored in cleartext.

What that amounts to is that if one of your passwords gets stolen, you cannot be sure that the hacker won't get access to the other sites you use.

This is where tools like password hashing come in.

A password hasher takes a (possibly weak) password and creates a strong password that depends on your password and the domain of the site you are going to log in to.

So for example, if you want to log into LinkedIn and type 'ABC' as your password, you click the hasher - and voila - the password gets transformed into something like 'Q%JhLHTJ@4'. Next you want to log in to Facebook and type in 'ABC' again, click, and now you get '=ygDD-R46G' - because the generated password also depends on the domain.

You need to remember only one password - 'ABC' in this case - which got transformed into something much stronger, and unique per domain - alleviating a lot of security issues. If LinkedIn passwords get stolen, you can be sure that there's no way the hacker will get access to your Facebook account, even though you effectively remembered one password. Also, depending on the number of characters you choose, the generated password can be very strong. For example, both the generated passwords above would take roughly 78 thousand years to crack, even assuming a speed of 1 million guesses per second - which is a very high rate of attack for an online server to cope with.

What if someone else also uses the same weak password 'ABC' for LinkedIn? They will get a different generated password, provided you have customized your password hasher with a strong master password.
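
To make the idea concrete, here is a minimal sketch in Python of how such a hasher can work - an HMAC of the site's domain, keyed by your master secret plus the typed password, encoded into printable characters. This is only an illustration of the concept, not the actual algorithm used by Superpass.

import base64
import hashlib
import hmac

def site_password(typed_password, domain, master_secret, length=10):
    # Key the HMAC with the master secret plus whatever you typed
    key = (master_secret + typed_password).encode("utf-8")
    digest = hmac.new(key, domain.encode("utf-8"), hashlib.sha256).digest()
    # Base64 gives printable characters; truncate to the desired length
    return base64.b64encode(digest).decode("ascii")[:length]

print(site_password("ABC", "linkedin.com", "my-master-secret"))
print(site_password("ABC", "facebook.com", "my-master-secret"))   # different domain, different password

A real hasher would typically also mix in digits and symbols explicitly so that the output satisfies each site's password rules, but the principle is the same.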

You can get your customized password hasher bookmarklet from here - Superpass Password Hasher.

Tuesday, March 20, 2012

Galaxy S2 ICS 4.0.3 Review

Recently I upgraded my Galaxy S2 to Ice Cream Sandwich 4.0.3 from Gingerbread 2.3.6. I grabbed the DXLP7 firmware from xda and upgraded through Odin. It was painless - even though I started the process with only 42% battery remaining. The upgrade was complete in about 5 minutes, with all my previous apps and settings intact.

One piece of advice: the upgrade does lose your home screen settings. It will help if you take screenshots of your phone before upgrading.

Overall I was pleasantly surprised with the brand-new OS; it almost gave me the feeling of having a new phone. There are numerous small upgrades which make life easier. Though all is not as great as it could be, it is definitely recommended.

Here's my review of the new experience.

Positive -
1. Video/Voice chat in GTalk is finally here on the Galaxy S2, officially. The interface is also revamped, as it is the new ICS app.
2. Inbuilt data usage meter/limiter which is very well done indeed. This may well put 3G Watchdog out of business.
3. Folders have small icons of apps inside them.
4. Upload/Download arrows have been added for Wifi connection as well!
5. Improved contacts sync API enables third-party apps to integrate better - also, you now see, as icons in the contact list, which apps have synced with a contact. (However, all is not good on the contacts front - read on.)
6. Much better browser. The only thing missing is the ability to reflow text on the fly when zoomed. Also, you can now install Chrome Beta, which is exclusive to ICS.
7. A host of Developer Options which give you a very good glimpse of what is happening within the phone. Some, like the warning when an app takes up CPU cycles for a long time, or the CPU usage overlay, are useful even to non-developer power users.
8. Face unlock - gimmick, good to have nonetheless.
9. Ability to show a message on the lock screen - useful if your phone is lost and some good soul comes across it :)
10. Rotation animations - Like the Galaxy Tab, now when you change the orientation, whatever is on your screen rotates with a sleek animation.
11. GMail interface is revamped with the new ICS GMail App.


Negative -
1. Contacts & Phone (dialer) are now separated. There is no quick way to move to the phone dial pad from contacts anymore! In my mind, this is the biggest and only dumb decision made for ICS.
2. Call recording libraries not implemented yet. There's word that even the call recording giant Total Recall cannot record calls on ICS. If call recording is something you need, do not upgrade.
Update - a workaround exists to solve this issue.
3. Contacts now has all (unwanted) Gmail addresses. Also, I used Social Hub to sync Facebook contacts since it provided pictures, but it is not supported anymore - I now use the Facebook app to sync contacts, and the picture sync does not seem to be working properly.
Update - you can use this free app to sync contacts with hi-res (or optionally low-res) pictures.
4. Camera focus and shutter sounds are no longer linked to the System volume (or any other volume); both now either play at max volume (if the phone is not on mute) or don't sound at all (if it is).
5. Google Talk now has only the option to show all friends sorted by name, or to show only 'popular' friends sorted by availability. There's no way to see all friends sorted by availability. Big step back.

Watchouts - it may seem some things are missing, but fear not - merely how to do them has changed.
1. Screenshot - Previously it was Home+Power. Now press Power+VolumeDown simultaneously and hold for a second or two.