Andrew Ng's Paper Was Wrong
Despite what everyone has told you, random splits don't always work.
Thanks to our sponsors who keep this newsletter free to everyone.
This week’s issue is brought to you by Brilliant.org, a platform that turns the idea of a course upside down. Most courses try to cram a lot of information into your brain. Brilliant does something different: bite-sized, interactive lessons that teach you from first principles. Check them out for free for 30 days!
In Machine Learning circles, few names ring a bell louder than Andrew Y. Ng's.
Even if he isn't the most popular, I don't think you'll find an educator who has changed more lives with his courses and contributions to the field.
And that’s why I had to write about this.
In 2017, Andrew led a team that published a paper showing a Deep Learning model to detect pneumonia: “CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning.”
Eleven days later, they had to publish a correction.
The team made a huge mistake.
To build a classification model, the team annotated every image showing pneumonia as positive and labeled all other pictures as negative. Their dataset contained 112,120 images from 30,805 unique patients, and they randomly split it into 80% training and 20% validation.
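To make the setup concrete, here is a minimal sketch of that kind of split. The DataFrame and its columns (image, patient_id, pneumonia) are hypothetical stand-ins, not the team's actual code:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset: several images per patient.
df = pd.DataFrame({
    "image": [f"img_{i:03d}.png" for i in range(6)],
    "patient_id": ["p1", "p1", "p1", "p2", "p2", "p3"],
    "pneumonia": [1, 1, 1, 0, 0, 0],
})

# A naive random split over individual images: 80% training, 20% validation.
# Nothing stops two images from the same patient from landing in different sets.
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)
```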
And that split was the problem.
Notice that they had roughly four times as many images as patients. That means at least some patients contributed more than one image to the dataset.
That wouldn’t be an issue, except when using a random split.
Imagine that one patient had a scar from a previous surgery. That scar would be visible in every X-ray of that patient, and the random split could send some of those images to the training set and others to the validation set.
Do you see where I’m going with this?
The model may use the scar to label every image from the same patient the same way. Assuming this patient had pneumonia, the model might conclude that any validation image with the same scar would also belong to the positive class.
The team inadvertently created a leaky validation strategy!
Machine Learning models are sneaky little bastards that use any available shortcuts to optimize their evaluation metric. Randomly splitting the dataset gave the model exactly what it was looking for.
A dataset with correlations or groupings between individual samples is not a good candidate for a random split. If information leaks from the training data into the validation data, your validation score will look much better than it should.
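One quick sanity check you can run before training is to look for group overlap between the two sides of the split. Continuing with the toy df from the sketch above:

```python
# Patients whose images ended up on both sides of the split.
shared = set(train_df["patient_id"]) & set(val_df["patient_id"])

if shared:
    print(f"Leaky split: {len(shared)} patient(s) appear in both sets: {shared}")
```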
The team updated their paper a few days later with a different splitting strategy: they kept all images from the same patient in the same set. That removed the overlap between training and validation and fixed the issue.
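In scikit-learn terms, that fix amounts to splitting by groups instead of by rows. Here is a minimal sketch using GroupShuffleSplit on the same toy df; the patient_id column is my assumption, not the paper's code:

```python
from sklearn.model_selection import GroupShuffleSplit

# One 80/20 split that keeps every patient's images on a single side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(df, groups=df["patient_id"]))

train_df, val_df = df.iloc[train_idx], df.iloc[val_idx]

# No patient contributes images to both sets anymore.
assert not set(train_df["patient_id"]) & set(val_df["patient_id"])
```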
I took a couple of lessons from this story:
First, although a random split is one of the most popular techniques you'll learn, you must be careful with it. I've used random splits many times without thinking much about them. I now take a deep breath and spend more time understanding the data to avoid a leaky validation strategy.
The second lesson has been much more helpful: everyone makes mistakes, even those leading our field. Mistakes are part of the process.
What truly matters is what we learn from them.
200 Members!
Over the last two weeks, 200 people joined the Machine Learning School! I planned for 10, and I thought 50 was out of reach.
I was wrong.
Thank you, everyone, for the support!
I’ll leave you here with one of the slides we’ll see during the first cohort so that you can admire my drawing abilities.
And for those who read this far, here is a 20% coupon.