Linear Digressions

lineardigressions.com
Explorations in Machine Learning and Data Science
Heterogeneous Treatment Effects
Jan 20 • 17 min
When data scientists use a linear regression to look for causal relationships between a treatment and an outcome, what they’re usually finding is the so-called average treatment effect. In other words, on average, here’s what the treatment does in terms…
Pre-training language models for natural language processing problems
Jan 13 • 27 min
When you build a model for natural language processing (NLP), such as a recurrent neural network, it helps a ton if you’re not starting from zero. In other words, if you can draw upon other datasets for building your understanding of word meanings, and…
Facial Recognition, Society, and the Law
Jan 6 • 42 min
Facial recognition being used in everyday life seemed far-off not too long ago. Increasingly, it’s being used and advanced widely and with increasing speed, which means that our technical capabilities are starting to outpace (if they haven’t already) our…
Re-release: Word2Vec
Dec 30, 2018 • 17 min
Bringing you another old classic this week, as we gear up for 2019! See you next week with new content. Word2Vec is probably the go-to algorithm for vectorizing text data these days. Which makes sense, because it is wicked cool. Word2Vec has it all:…
Re - Release: The Cold Start Problem
Dec 23, 2018 • 15 min
We’re taking a break for the holidays, chilling with the dog and an eggnog (Katie) and the cat and some spiced cider (Ben). Here’s an episode from a while back for you to enjoy. See you again in 2019! You might sometimes find that it’s hard to get started…
Convex (and non-convex) Optimization
Dec 16, 2018 • 20 min
Convex optimization is one of the keys to data science, both because some problems straight-up call for optimization solutions and because popular algorithms like a gradient descent solution to ordinary least squares are supported by optimization…
The Normal Distribution and the Central Limit Theorem
Dec 9, 2018 • 27 min
When you think about it, it’s pretty amazing that we can draw conclusions about huge populations, even the whole world, based on datasets that are comparatively very small (a few thousand, or a few hundred, or even sometimes a few dozen). That’s the power…
Software 2.0
Dec 2, 2018 • 17 min
Neural nets are a way you can model a system, sure, but if you take a step back, squint, and tilt your head, they can also be called… software? Not in the sense that they’re written in code, but in the sense that the neural net itself operates under the…
Limitations of Deep Nets for Computer Vision
Nov 18, 2018 • 27 min
Deep neural nets have a deserved reputation as the best-in-breed solution for computer vision problems. But there are many aspects of human vision that we take for granted but where neural nets struggle—this episode covers an eye-opening paper that…
Building Data Science Teams
Nov 11, 2018 • 25 min
At many places, data scientists don’t work solo anymore—it’s a team sport. But data science teams aren’t simply teams of data scientists working together. Instead, they’re usually cross-functional teams with engineers, managers, data scientists, and…
Optimized Optimized Web Crawling
Nov 4, 2018 • 19 min
Last week’s episode, about methods for optimized web crawling logic, left off on a bit of a cliffhanger: the data scientists had found a solution to the problem, but it wasn’t something that the engineers (who own the search codebase, remember) liked very…
Optimized Web Crawling
Oct 28, 2018 • 21 min
Got a fun optimization problem for you this week! It’s a two-for-one: how do you optimize the web crawling logic of an operation like Google search so that the results are, on average, as up-to-date as possible, and how do you optimize your solution of…
Better Know a Distribution: The Poisson Distribution
Oct 21, 2018 • 31 min
The Poisson distribution is a probability distribution function used for events that happen in time or space. It’s super handy because it’s pretty simple to use and is applicable to tons of things; there are a lot of interesting processes that boil…
Searching for Datasets with Google
Oct 14, 2018 • 19 min
If you wanted to find a dataset of jokes, how would you do it? What about a dataset of podcast episodes? If your answer was “I’d try Google,” you might have been disappointed—Google is a great search engine for many types of web data, but it didn’t have…
It’s our fourth birthday
Oct 7, 2018 • 22 min
We started Linear Digressions 4 years ago… this isn’t a technical episode, just two buddies shooting the breeze about something we’ve somehow built together.
Gigantic Searches in Particle Physics
Sep 30, 2018 • 24 min
This week, we’re dusting off the ol’ particle physics PhD to bring you an episode about ambitious new model-agnostic searches for new particles happening at CERN. Traditionally, new particles have been discovered by “targeted searches,” where scientists…
Data Engineering
Sep 23, 2018 • 16 min
If you’re a data scientist, you know how important it is to keep your data orderly, clean, moving smoothly between different systems, well-documented… there’s a ton of work that goes into building and maintaining databases and data pipelines. This job,…
Text Analysis for Guessing the NYTimes Op-Ed Author
Sep 16, 2018 • 18 min
A very intriguing op-ed was published in the NY Times recently, in which the author (a senior official in the Trump White House) claimed to be a minor saboteur of sorts, acting with his or her colleagues to undermine some of Donald Trump’s worst instincts…
The Three Types of Data Scientists, and What They Actually Do
Sep 9, 2018 • 23 min
If you’ve been in data science for more than a year or two, chances are you’ve noticed changes in the field as it’s grown and matured. And if you’re newer to the field, you may feel like there’s a disconnect between lots of different stories about what…
Agile Development for Data Scientists, Part 2: Where Modifications Help
Aug 26, 2018 • 27 min
There’s just too much interesting stuff at the intersection of agile software development and data science for us to be able to cover it all in one episode, so this week we’re picking up where we left off last time. We’ll give a quick overview of agile…
Agile Development for Data Scientists, Part 1: The Good
Aug 19, 2018 • 25 min
If you’re a data scientist at a firm that does a lot of software building, chances are good that you’ve seen or heard engineers sometimes talking about “agile software development.” If you don’t work at a software firm, agile practices might be newer to…
Re - Release: How To Lose At Kaggle
Aug 12, 2018 • 17 min
We’ve got a classic for you this week as we take a week off for the dog days of summer. See you again next week! Competing in a machine learning competition on Kaggle is a kind of rite of passage for data scientists. Losing unexpectedly at the very end of…
Troubling Trends In Machine Learning Scholarship
Aug 5, 2018 • 29 min
There’s a lot of great machine learning papers coming out every day—and, if we’re being honest, some papers that are not as great as we’d wish. In some ways this is symptomatic of a field that’s growing really quickly, but it’s also an artifact of strange…
Can Fancy Running Shoes Cause You To Run Faster?
Jul 29, 2018 • 28 min
The stars aligned for me (Katie) this past weekend: I raced my first half-marathon in a long time and got to read a great article from the NY Times about a new running shoe that Nike claims can make its wearers run faster. Causal claims like this one are…
Compliance Bias
Jul 22, 2018 • 23 min
When you’re using an AB test to understand the effect of a treatment, there are a lot of assumptions about how the treatment (and control, for that matter) get applied. For example, it’s easy to think that everyone who was assigned to the treatment arm…
AI Winter
Jul 15, 2018 • 19 min
Artificial Intelligence has been widely lauded as a solution to almost any problem. But as we juxtapose the hype in the field against the real-world benefits we see, it raises the question: Are we coming up on an AI winter?
Rerelease: How to Find New Things to Learn
Jul 8, 2018 • 18 min
We like learning on vacation. And we’re on vacation, so we thought we’d re-air this episode about how to learn. Original Episode: https://lineardigressions.com/episodes/2017/5/14/how-to-find-new-things-to-learn Original Summary: If you’re anything like…
Rerelease: Space Codes
Jul 2, 2018 • 24 min
We’re on vacation on Mars, so we won’t be communicating with you all directly this week. Though, if we wanted to, we could probably use this episode to help get started. Original Episode: http://lineardigressions.com/episodes/2017/3/19/space-codes…
Rerelease: Anscombe’s Quartet
Jun 24, 2018 • 16 min
We’re on vacation, so we hope you enjoy this episode while we each sip cocktails on the beach. Original Episode: http://lineardigressions.com/episodes/2017/6/18/anscombes-quartet Original Summary: Anscombe’s Quartet is a set of four datasets that have the…
Rerelease: Hurricanes Produced
Jun 18, 2018 • 28 min
Now that hurricane season is upon us again (and we are on vacation), we thought a look back on our hurricane forecasting episode was prudent. Stay safe out there.
GDPR
Jun 10, 2018 • 18 min
By now, you have probably heard of GDPR, the EU’s new data privacy law. It’s the reason you’ve been getting so many emails about everyone’s updated privacy policy. In this episode, we talk about some of the potential ramifications of GDPR in the world of…
Git for Data Scientists
Jun 3, 2018 • 22 min
If you’re a data scientist, chances are good that you’ve heard of git, which is a system for version controlling code. Chances are also good that you’re not quite as up on git as you want to be—git has a strong following among software engineers but, in…
Analytics Maturity
May 20, 2018 • 19 min
Data science and analytics are hot topics in business these days, but for a lot of folks looking to bring data into their organization, it can be hard to know where to start and what it looks like when they’re succeeding. That was the motivation for…
SHAP: Shapley Values in Machine Learning
May 13, 2018 • 19 min
Shapley values in machine learning are an interesting and useful enough innovation that we figured hey, why not do a two-parter? Our last episode focused on explaining what Shapley values are: they define a way of assigning credit for outcomes across…
Game Theory for Model Interpretability: Shapley Values
May 6, 2018 • 27 min
As machine learning models get into the hands of more and more users, there’s an increasing expectation that black box isn’t good enough: users want to understand why the model made a given prediction, not just what the prediction itself is. This is…
AutoML
Apr 29, 2018 • 15 min
If you were a machine learning researcher or data scientist ten years ago, you might have spent a lot of time implementing individual algorithms like decision trees and neural networks by hand. If you were doing that work five years ago, the algorithms…
CPUs, GPUs, TPUs: Hardware for Deep Learning
Apr 22, 2018 • 12 min
A huge part of the ascent of deep learning in the last few years is related to advances in computer hardware that make it possible to do the computational heavy lifting required to build models with thousands or even millions of tunable parameters. This…
A Technical Introduction to Capsule Networks
Apr 15, 2018 • 31 min
Last episode we talked conceptually about capsule networks, the latest and greatest computer vision innovation to come out of Geoff Hinton’s lab. This week we’re getting a little more into the technical details, for those of you ready to have your mind…
A Conceptual Introduction to Capsule Networks
Apr 8, 2018 • 14 min
Convolutional nets are great for image classification… if this were 2016. But it’s 2018 and Canada’s greatest neural networker Geoff Hinton has some new ideas, namely capsule networks. Capsule nets are a completely new type of neural net architecture…
Convolutional Neural Nets
Apr 1, 2018 • 21 min
If you’ve done image recognition or computer vision tasks with a neural network, you’ve probably used a convolutional neural net. This episode is all about the architecture and implementation details of convolutional networks, and the tricks that make…
Google Flu Trends
Mar 25, 2018 • 12 min
It’s been a nasty flu season this year. So we were remembering a story from a few years back (but not covered yet on this podcast) about when Google tried to predict flu outbreaks faster than the Centers for Disease Control by monitoring searches and…
How to pick projects for a professional data science team
Mar 18, 2018 • 31 min
This week’s episode is for data scientists, sure, but also for data science managers and executives at companies with data science teams. These folks all think very differently about the same question: what should a data science team be working on? And…
Autoencoders
Mar 11, 2018 • 12 min
Autoencoders are neural nets that are optimized for creating outputs that… look like the inputs to the network. Turns out this is a not-too-shabby way to do unsupervised machine learning with neural nets.
When Private Data Isn’t Private Anymore
Mar 4, 2018 • 26 min
After all the back-patting around making data science datasets and code more openly available, we figured it was time to also dump a bucket of cold water on everyone’s heads and talk about the things that can go wrong when data and code are a little too…
What makes a machine learning algorithm “superhuman”?
Feb 25, 2018 • 34 min
A few weeks ago, we podcasted about a neural network that was being touted as “better than doctors” in diagnosing pneumonia from chest x-rays, and how the underlying dataset used to train the algorithm raised some serious questions. We’re back again this…
Open Data and Open Science
Feb 18, 2018 • 16 min
One interesting trend we’ve noted recently is the proliferation of papers, articles and blog posts about data science that don’t just tell the result—they include data and code that allow anyone to repeat the analysis. It’s far from universal (for a…
Defining the quality of a machine learning production system
Feb 11, 2018 • 20 min
Building a machine learning system and maintaining it in production are two very different things. Some folks over at Google wrote a paper that shares their thoughts around all the items you might want to test or check for your production ML system.…
Auto-generating websites with deep learning
Feb 4, 2018 • 19 min
We’ve already talked about neural nets in some detail (links below), and in particular we’ve been blown away by the way that image recognition from convolutional neural nets can be fed into recurrent neural nets that generate descriptions and captions of…
The Case for Learned Index Structures, Part 2: Hash Maps and Bloom Filters
Jan 28, 2018 • 20 min
Last week we started the story of how you could use a machine learning model in place of a data structure, and this week we wrap up with an exploration of Bloom Filters and Hash Maps. Just like last week, when we covered B-trees, we’ll walk through both…
The Case for Learned Index Structures, Part 1: B-Trees
Jan 21, 2018 • 18 min
Jeff Dean and his collaborators at Google are turning the machine learning world upside down (again) with a recent paper about how machine learning models can be used as surprisingly effective substitutes for classic data structures. In this first part of…
Challenges with Using Machine Learning to Classify Chest X-Rays
Jan 14, 2018 • 18 min
Another installment in our “machine learning might not be a silver bullet for solving medical problems” series. This week, we have a high-profile blog post that has been making the rounds for the last few weeks, in which a neural network trained to…
The Fourier Transform
Jan 7, 2018 • 15 min
The Fourier transform is one of the handiest tools in signal processing for dealing with periodic time series data. Using a Fourier transform, you can break apart a complex periodic function into a bunch of sine and cosine waves, and figure out what the…
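To make the "break it apart into sine and cosine waves" idea concrete, here is a minimal sketch of a discrete Fourier transform written directly from its definition. The test signal, its two frequencies (3 and 7 cycles per window) and the helper name `dft` are all made up for illustration; a real analysis would normally use an FFT library rather than this O(n²) loop.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform, straight from the definition."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


# Hypothetical periodic signal: a sine at 3 cycles per window plus a
# weaker cosine at 7 cycles per window.
n = 64
signal = [
    math.sin(2 * math.pi * 3 * t / n) + 0.5 * math.cos(2 * math.pi * 7 * t / n)
    for t in range(n)
]

spectrum = dft(signal)

# For a real-valued signal, the amplitude of the wave at frequency bin k
# (0 < k < n/2) is 2*|X_k|/n.
amplitudes = [2 * abs(c) / n for c in spectrum[: n // 2]]

# The two strongest bins recover the two waves that were mixed together.
strongest = sorted(range(n // 2), key=lambda k: amplitudes[k], reverse=True)[:2]
print(sorted(strongest), amplitudes[3], amplitudes[7])
```

Because both frequencies fit a whole number of cycles into the window, the amplitudes at bins 3 and 7 come out at roughly 1.0 and 0.5, and every other bin is essentially zero.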
Statistics of Beer
Jan 1, 2018 • 15 min
What better way to kick off a new year than with an episode on the statistics of brewing beer?
Re - Release: Random Kanye
Dec 24, 2017 • 9 min
We have a throwback episode for you today as we take the week off to enjoy the holidays. This week: what happens when you have a Markov chain that generates mashup Kanye West lyrics with Bible verses? Exactly what you think.
Debiasing Word Embeddings
Dec 17, 2017 • 18 min
When we covered the Word2Vec algorithm for embedding words, we mentioned parenthetically that the word embeddings it produces can sometimes be a little bit less than ideal—in particular, gender bias from our society can creep into the embeddings and give…
The Kernel Trick and Support Vector Machines
Dec 10, 2017 • 17 min
Picking up after last week’s episode about maximal margin classifiers, this week we’ll go into the kernel trick and how that (combined with maximal margin algorithms) gives us the much-vaunted support vector machine.
Maximal Margin Classifiers
Dec 3, 2017 • 14 min
Maximal margin classifiers are a way of thinking about supervised learning entirely in terms of the decision boundary between two classes, and defining that boundary in a way that maximizes the distance from any given point to the boundary. It’s a neat…
Re - Release: The Cocktail Party Problem
Nov 26, 2017 • 13 min
Grab a cocktail, put on your favorite karaoke track, and let’s talk some more about disentangling audio data!
Clustering with DBSCAN
Nov 19, 2017 • 16 min
DBSCAN is a density-based clustering algorithm for doing unsupervised learning. It’s pretty nifty: with just two parameters, you can specify “dense” regions in your data, and grow those regions out organically to find clusters. In particular, it can fit…
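The two parameters mentioned above are usually called `eps` (how close two points must be to count as neighbors) and `min_pts` (how many neighbors make a region "dense"). Below is a small pure-Python sketch of the idea, with toy 2-D points invented for illustration. It uses a brute-force neighbor search, so treat it as a teaching aid rather than a production implementation.

```python
import math
from collections import deque

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force search; a point counts as its own neighbor.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:  # grow the dense region outward
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep growing
                queue.extend(j_neighbors)
    return labels


# Toy data: two tight squares of points and one far-away outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Note that the number of clusters is never specified: it falls out of the density parameters, and the lone point at (50, 50) is labeled as noise instead of being forced into a cluster.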
The Kaggle Survey on Data Science
Nov 12, 2017 • 25 min
Want to know what’s going on in data science these days? There’s no better way than to analyze a survey with over 16,000 responses that was recently released by Kaggle. Kaggle asked practicing and aspiring data scientists about themselves, their tools, how…
Machine Learning: The High Interest Credit Card of Technical Debt
Nov 5, 2017 • 22 min
This week, we’ve got a fun paper by our friends at Google about the hidden costs of maintaining machine learning workflows. If you’ve worked in software before, you’re probably familiar with the idea of technical debt: the inefficiencies that crop…
Improving Upon a First-Draft Data Science Analysis
Oct 29, 2017 • 15 min
There are a lot of good resources out there for getting started with data science and machine learning, where you can walk through starting with a dataset and ending up with a model and set of predictions. Think something like the homework for your…
Survey Raking
Oct 22, 2017 • 17 min
It’s quite common for survey respondents not to be representative of the larger population from which they are drawn. But if you’re a researcher, you need to study the larger population using data from your survey respondents, so what should you do?…
Happy Hacktoberfest
Oct 15, 2017 • 15 min
It’s the middle of October, so you’ve already made two pull requests to open source repos, right? If you have no idea what we’re talking about, spend the next 20 minutes or so with us talking about the importance of open source software and how you can…
Re - Release: Kalman Runners
Oct 8, 2017 • 17 min
In honor of the Chicago marathon this weekend (and due in large part to Katie recovering from running in it…) we have a re-release of an episode about Kalman filters, which is part algorithm, part elaborate metaphor for figuring out, if you’re running a…
Neural Net Dropout
Oct 1, 2017 • 18 min
Neural networks are complex models with many parameters and can be prone to overfitting. There’s a surprisingly simple way to guard against this: randomly destroy connections between hidden units, also known as dropout. It seems counterintuitive that…
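As a rough illustration of the mechanism, the sketch below applies dropout to a list of hidden-unit activations. It assumes the common "inverted dropout" variant, which zeroes a unit's output (in effect cutting all of its outgoing connections for that training pass) and rescales the survivors so the expected activation is unchanged. The function name, the drop probability of 0.5 and the all-ones activations are illustrative choices, not details from the episode.

```python
import random


def dropout(activations, drop_prob, rng):
    """Inverted dropout: zero each unit with probability drop_prob and
    rescale the surviving units by 1 / (1 - drop_prob)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)      # fixed seed so the run is repeatable
hidden = [1.0] * 10_000     # hypothetical hidden-layer activations
dropped = dropout(hidden, drop_prob=0.5, rng=rng)

fraction_zeroed = dropped.count(0.0) / len(dropped)
average_activation = sum(dropped) / len(dropped)
print(fraction_zeroed, average_activation)
```

About half of the units are silenced on any given pass, yet the average activation stays close to 1.0, so later layers see inputs on the same scale during training and at prediction time (when dropout is switched off).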
Disciplined Data Science
Sep 24, 2017 • 29 min
As data science matures as a field, it’s becoming clearer what attributes a data science team needs to have to elevate their work to the next level. Most of our episodes are about the cool work being done by other people, but this one summarizes some…
Hurricane Forecasting
Sep 17, 2017 • 27 min
It’s been a busy hurricane season in the Southeastern United States, with millions of people making life-or-death decisions based on the forecasts around where the hurricanes will hit and with what intensity. In this episode we’ll deconstruct those…
Finding Spy Planes with Machine Learning
Sep 10, 2017 • 18 min
There are law enforcement surveillance aircraft circling over the United States every day, and in this episode, we’ll talk about how some folks at BuzzFeed used public data and machine learning to find them. The fun thing here, in our opinion, is the…
Data Provenance
Sep 3, 2017 • 22 min
Software engineers are familiar with the idea of versioning code, so you can go back later and revive a past state of the system. For data scientists who might want to reconstruct past models, though, it’s not just about keeping the modeling code. It’s…
Adversarial Examples
Aug 27, 2017 • 16 min
Even as we rely more and more on machine learning algorithms to help with everyday decision-making, we’re learning more and more about how they’re frighteningly easy to fool sometimes. Today we have a roundup of a few successful efforts to create robust…
Jupyter Notebooks
Aug 20, 2017 • 15 min
This week’s episode is just in time for JupyterCon in NYC, August 22-25… Jupyter notebooks are probably familiar to a lot of data nerds out there as a great open-source tool for exploring data, doing quick visualizations, and packaging code snippets with…
Curing Cancer with Machine Learning is Super Hard
Aug 13, 2017 • 19 min
Today, a dispatch on what can go wrong when machine learning hype outpaces reality: a high-profile partnership between IBM Watson and MD Anderson Cancer Center has recently hit the rocks as it turns out to be tougher than expected to cure cancer with…
KL Divergence
Aug 6, 2017 • 25 min
Kullback-Leibler divergence, or KL divergence, is a measure of information loss when you try to approximate one distribution with another distribution. It comes to us originally from information theory, but today underpins other, more…
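For discrete distributions the definition is short enough to write out directly: KL(p || q) is the sum over outcomes of p(x) · log(p(x) / q(x)). Here is a minimal sketch using the natural log, so the result is in nats. The two example distributions are invented for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("p and q must cover the same outcomes")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue          # 0 * log(0 / q) is taken to be 0
        if q_i == 0:
            return math.inf   # q rules out something p says can happen
        total += p_i * math.log(p_i / q_i)
    return total


true_dist = [0.5, 0.3, 0.2]          # hypothetical "real" distribution
uniform_guess = [1 / 3, 1 / 3, 1 / 3]  # a cruder approximation of it

print(kl_divergence(true_dist, uniform_guess))
print(kl_divergence(uniform_guess, true_dist))
```

The two printed numbers differ, which is a reminder that KL divergence is not symmetric: the information lost by approximating p with q is not the same as the loss from approximating q with p, so it is not a true distance.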
Sabermetrics
Jul 30, 2017 • 25 min
It’s moneyball time! SABR (the Society for American Baseball Research) is the world’s largest organization of statistics-minded baseball enthusiasts, who are constantly applying the craft of scientific analysis to trying to figure out who are the best…
What Data Scientists Can Learn from Software Engineers
Jul 23, 2017 • 23 min
We’re back again with friend of the pod Walt, former software engineer extraordinaire and current data scientist extraordinaire, to talk about some best practices from software engineering that are ready to jump the fence over to data science. If last…
Software Engineering to Data Science
Jul 16, 2017 • 19 min
Data scientists and software engineers often work side by side, building out and scaling technical products and services that are data-heavy but also require a lot of software engineering to build and maintain. In this episode, we’ll chat with a Friend of…
Re-Release: Fighting Cholera with Data, 1854
Jul 9, 2017 • 12 min
This episode was first released in November 2014. In the 1850s, there were a lot of things we didn’t know yet: how to create an airplane, how to split an atom, or how to control the spread of a common but deadly disease: cholera. When a cholera outbreak…
Re-Release: Data Mining Enron
Jul 2, 2017 • 32 min
This episode was first released in February 2015. In 2000, Enron was one of the largest companies in the world, praised far and wide for its innovations in energy distribution and many other markets. By 2002, it was apparent that many bad apples had…
Factorization Machines
Jun 25, 2017 • 19 min
What do you get when you cross a support vector machine with matrix factorization? You get a factorization machine, and a darn fine algorithm for recommendation engines.
Anscombe’s Quartet
Jun 18, 2017 • 15 min
Anscombe’s Quartet is a set of four datasets that have the same mean, variance and correlation but look very different. It’s easy to think that having a good set of summary statistics (like mean, variance and correlation) can tell you everything important…
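A quick way to see the effect is to compute the summary statistics for two of the four datasets. The values below are the first two datasets as they are commonly tabulated, typed in from memory, so they are worth checking against a trusted copy before reuse. The `pearson` helper is written out by hand to keep the sketch dependency-free.

```python
from statistics import mean, variance


def pearson(xs, ys):
    """Pearson correlation coefficient, computed from the definition."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)


# Datasets I and II share the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

for name, y in (("I", y1), ("II", y2)):
    print(name, round(mean(y), 2), round(variance(y), 2), round(pearson(x, y), 2))
```

Both rows print nearly identical numbers, yet a scatter plot of dataset I looks like a noisy straight line while dataset II traces a smooth curve, which is the whole point of the quartet.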
Traffic Metering Algorithms
Jun 11, 2017 • 18 min
Originally released June 2016. This episode is for all you (us) traffic nerds: we’re talking about the hidden structure underlying traffic on-ramp metering systems. These systems slow down the flow of traffic onto highways so that the highways don’t get…
Page Rank
Jun 4, 2017 • 19 min
The year: 1998. The size of the web: 150 million pages. The problem: information retrieval. How do you find the “best” web pages to return in response to a query? A graduate student named Larry Page had an idea for how it could be done better and created…
Fractional Dimensions
May 28, 2017 • 20 min
We chat about fractional dimensions, and what the actual heck those are.
Things You Learn When Building Models for Big Data
May 21, 2017 • 21 min
As more and more data gets collected seemingly every day, and data scientists use that data for modeling, the technical limits associated with machine learning on big datasets keep getting pushed back. This week is a first-hand case study in using…
How to Find New Things to Learn
May 14, 2017 • 17 min
If you’re anything like us, you a) always are curious to learn more about data science and machine learning and stuff, and b) are usually overwhelmed by how much content is out there (not all of it very digestible). We hope this podcast is a part of the…
Federated Learning
May 7, 2017 • 14 min
As machine learning makes its way into more and more mobile devices, an interesting question presents itself: how can we have an algorithm learn from training data that’s being supplied as users interact with the algorithm? In other words, how do we do…
Word2Vec
Apr 30, 2017 • 17 min
Word2Vec is probably the go-to algorithm for vectorizing text data these days. Which makes sense, because it is wicked cool. Word2Vec has it all: neural networks, skip-grams and bag-of-words implementations, a multiclass classifier that gets swapped out…
Feature Processing for Text Analytics
Apr 23, 2017 • 17 min
It seems like every day there’s more and more machine learning problems that involve learning on text data, but text itself makes for fairly lousy inputs to machine learning algorithms. That’s why there are text vectorization algorithms, which re-format…
Education Analytics
Apr 16, 2017 • 21 min
This week we’ll hop into the rapidly developing industry around predictive analytics for education. For many of the students who eventually drop out, data science is showing that there might be early warning signs that the student is in trouble—we’ll talk…
A Technical Deep Dive on Stanley, the First Self-Driving Car
Apr 9, 2017 • 40 min
In our follow-up episode to last week’s introduction to the first self-driving car, we will be doing a technical deep dive this week and talking about the most important systems for getting a car to drive itself 140 miles across the desert. Lidar? You…
An Introduction to Stanley, the First Self-Driving Car
Apr 2, 2017 • 13 min
In October 2005, 23 cars lined up in the desert for a 140 mile race. Not one of those cars had a driver. This was the DARPA grand challenge to see if anyone could build an autonomous vehicle capable of navigating a desert route (and if so, whose car could…
Feature Importance
Mar 26, 2017 • 20 min
Figuring out what features actually matter in a model is harder than you might first guess. When a human makes a decision, you can just ask them: why did you do that? But with machine learning models, not so much. That’s why we wanted to talk…
Space Codes!
Mar 19, 2017 • 23 min
It’s hard to get information to and from Mars. Mars is very far away, and expensive to get to, and the bandwidth for passing messages with Earth is not huge. The messages you do pass have to traverse millions of miles, which provides ample opportunity for…
Finding (and Studying) Wikipedia Trolls
Mar 12, 2017 • 15 min
You may be shocked to hear this, but sometimes, people on the internet can be mean. For some of us this is just a minor annoyance, but if you’re a maintainer or contributor of a large project like Wikipedia, abusive users can be a huge problem. Fighting…
A Sprint Through What’s New in Neural Networks
Mar 5, 2017 • 16 min
Advances in neural networks are moving fast enough that, even though it seems like we talk about them all the time around here, it also always seems like we’re barely keeping up. So this week we have another installment in our “neural nets: they so…
Stein’s Paradox
Feb 26, 2017 • 27 min
When you’re estimating something about some object that’s a member of a larger group of similar objects (say, the batting average of a baseball player, who belongs to a baseball team), how should you estimate it: use measurements of the individual, or get…
Empirical Bayes
Feb 19, 2017 • 18 min
Say you’re looking to use some Bayesian methods to estimate parameters of a system. You’ve got the normalization figured out, and the likelihood, but the prior… what should you use for a prior? Empirical Bayes has an elegant answer: look to your previous…
Endogenous Variables and Measuring Protest Effectiveness
Feb 12, 2017 • 16 min
Have you been out protesting lately, or watching the protests, and wondered how much effect they might have on lawmakers? It’s a tricky question to answer, since usually we need randomly distributed treatments (e.g. big protests) to understand causality,…
Calibrated Models
Feb 5, 2017 • 14 min
Remember last week, when we were talking about how great the ROC curve is for evaluating models? How things change… This week, we’re exploring calibrated risk models, because that’s a kind of model that seems like it would benefit from some nice ROC…
Rock the ROC Curve
Jan 29, 2017 • 15 min
This week: everybody’s favorite WWII-era classifier metric! But it’s not just for winning wars, it’s a fantastic go-to metric for all your classifier quality needs.
Ensemble Algorithms
Jan 22, 2017 • 13 min
If one machine learning model is good, are two models better? In a lot of cases, the answer is yes. If you build many ok models, and then bring them all together and use them in combination to make your final predictions, you’ve just created an ensemble…
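Here is a toy sketch of why combining ok models can help. It uses majority voting, which is only one of several ways to build an ensemble. The "true" rule and the three imperfect models are invented for illustration; each model is wrong on a different input, so the others outvote its mistake.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common prediction among the models."""
    return Counter(predictions).most_common(1)[0][0]


def truth(x):
    return x >= 5            # hypothetical ground-truth rule


# Three hypothetical "ok" models, each wrong on exactly one input in 0..9.
def model_a(x):
    return x >= 4            # wrong at x = 4


def model_b(x):
    return x >= 6            # wrong at x = 5


def model_c(x):
    return x >= 5 or x == 0  # wrong at x = 0


models = [model_a, model_b, model_c]


def ensemble(x):
    return majority_vote([m(x) for m in models])


def accuracy(predict):
    inputs = range(10)
    return sum(predict(x) == truth(x) for x in inputs) / len(inputs)


for predict in (*models, ensemble):
    print(predict.__name__, accuracy(predict))
```

Each individual model scores 0.9 and the ensemble scores 1.0. The gain depends on the models making different mistakes: three copies of the same model would vote the same way every time and gain nothing.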
How to evaluate a translation: BLEU scores
Jan 15, 2017 • 17 min
As anyone who’s encountered a badly translated text could tell you, not all translations are created equal. Some translations are smooth, fluent and sound like a poet wrote them; some are jerky, non-grammatical and awkward. When a machine is doing the…
Zero Shot Translation
Jan 8, 2017 • 25 min
Take Google-size data, the flexibility of a neural net, and all (well, most) of the languages of the world, and what you end up with is a pile of surprises. This episode is about some interesting features of Google’s new neural machine translation system,…
Google Neural Machine Translation
Jan 1, 2017 • 18 min
Recently, Google swapped out the backend for Google Translate, moving from a statistical phrase-based method to a recurrent neural network. This marks a big change in methodology: the tried-and-true statistical translation methods that have been in use…
Data and the Future of Medicine: Interview with Precision Medicine Initiative researcher Matt Might
Dec 25, 2016 • 34 min
Today we are delighted to bring you an interview with Matt Might, computer scientist and medical researcher extraordinaire and architect of President Obama’s Precision Medicine Initiative. As the Obama Administration winds down, we’re talking with Matt…
Special Crossover Episode: Partially Derivative interview with White House Data Scientist DJ Patil
Dec 18, 2016 • 46 min
We have the pleasure of bringing you a very special crossover episode this week: our friends at Partially Derivative (another great podcast about data science, you should check it out) recently interviewed White House Chief Data Scientist DJ Patil. We…
How to Lose at Kaggle
Dec 11, 2016 • 17 min
Competing in a machine learning competition on Kaggle is a kind of rite of passage for data scientists. Losing unexpectedly at the very end of the contest is also something that a lot of us have experienced. It’s not just bad luck: a very specific…
Attacking Discrimination in Machine Learning
Dec 4, 2016 • 23 min
Imagine there’s an important decision to be made about someone, like a bank deciding whether to extend a loan, or a school deciding to admit a student—unfortunately, we’re all too aware that discrimination can sneak into these situations (even when…
Recurrent Neural Nets
Nov 27, 2016 • 12 min
This week, we’re doing a crash course in recurrent neural networks—what the structural pieces are that make a neural net recurrent, how that structure helps RNNs solve certain time series problems, and the importance of forgetfulness in RNNs. Relevant…
Stealing a PIN with signal processing and machine learning
Nov 20, 2016 • 16 min
Want another reason to be paranoid when using the free coffee shop wifi? Allow us to introduce WindTalker, a system that cleverly combines a dose of signal processing with a dash of machine learning to (potentially) steal the PIN from your phone…
Neural Net Cryptography
Nov 13, 2016 • 16 min
Cryptography used to be the domain of information theorists and spies. There’s a new player now: neural networks. Given the task of communicating securely, neural networks are inventing new encryption methods that, as best we can tell, are unlike anything…
Deep Blue
Nov 6, 2016 • 20 min
In 1997, Deep Blue was the IBM algorithm/computer that did what no one, at the time, thought possible: it beat the world’s best chess player. It turns out, though, that one of the most important moves in the matchup, where Deep Blue psyched out its…
Organizing Google’s Datasets
Oct 30, 2016 • 15 min
If you’re a data scientist, there’s a good chance you’re used to working with a lot of data. But there’s a lot of data, and then there’s Google-scale amounts of data. Keeping all that data organized is a Google-sized task, and as it happens, they’ve built…
Fighting Cancer with Data Science: Followup
Oct 23, 2016 • 25 min
A few months ago, Katie started on a project for the Vice President’s Cancer Moonshot surrounding how data can be used to better fight cancer. The project is all wrapped up now, so we wanted to tell you about how that work went and what changes to cancer…
The 19-year-old determining the US election
Oct 16, 2016 • 12 min
Sick of the presidential election yet? We are too, but there’s still almost a month to go, so let’s just embrace it together. This week, we’ll talk about one of the presidential polls, which has been kind of an outlier for quite a while. This week, the NY…
How to Steal a Model
Oct 9, 2016 • 13 min
What does it mean to steal a model? It means someone (the thief, presumably) can re-create the predictions of the model without having access to the algorithm itself, or the training data. Sound far-fetched? It isn’t. If that person can ask for…
Regularization
Oct 2, 2016 • 17 min
Lots of data is usually seen as a good thing. And it is a good thing—except when it’s not. In a lot of fields, a problem arises when you have many, many features, especially if there’s a somewhat smaller number of cases to learn from; supervised machine…
The Cold Start Problem
Sep 25, 2016 • 15 min
You might sometimes find that it’s hard to get started doing something, but once you’re going, it gets easier. Turns out machine learning algorithms, and especially recommendation engines, feel the same way. The more they “know” about a user, like what…
Open Source Software for Data Science
Sep 19, 2016 • 20 min
If you work in tech, software or data science, there’s an excellent chance you use tools that are built upon open source software. This is software that’s built and distributed not for a profit, but because everyone benefits when we work together and…
Scikit + Optimization = Scikit-Optimize
Sep 11, 2016 • 15 min
We’re excited to welcome a guest, Tim Head, who is one of the maintainers of the scikit-optimize package. With all the talk about optimization lately, it felt appropriate to get in a few words with someone who’s out there making it happen for python.…
Two Cultures: Machine Learning and Statistics
Sep 4, 2016 • 17 min
It’s a funny thing to realize, but data science modeling is usually about either explainability, interpretation and understanding, or it’s about predictive accuracy. But usually not both—optimizing for one tends to compromise the other. Leo Breiman was…
Optimization Solutions
Aug 28, 2016 • 20 min
You’ve got an optimization problem to solve, and a less-than-forever amount of time in which to solve it. What do? Use a heuristic optimization algorithm, like a hill climber or simulated annealing—we cover both in this episode! Relevant link:…
Optimization Problems
Aug 21, 2016 • 17 min
If modeling is about predicting the unknown, optimization tries to answer the question of what to do, what decision to make, to get the best results out of a given situation. Sometimes that’s straightforward, but sometimes… not so much. What makes an…
Multi-level modeling for understanding DEADLY RADIOACTIVE GAS
Aug 14, 2016 • 23 min
Ok, this episode is only sort of about DEADLY RADIOACTIVE GAS. It’s mostly about multilevel modeling, which is a way of building models with data that has distinct, related subgroups within it. What are multilevel models used for? Elections (we can’t get…
How Polls Got Brexit “Wrong”
Aug 7, 2016 • 15 min
Continuing the discussion of how polls do (and sometimes don’t) tell us what to expect in upcoming elections—let’s take a concrete example from the recent past, shall we? The Brexit referendum was, by and large, expected to shake out for “remain”, but…
Election Forecasting
Jul 31, 2016 • 28 min
Not sure if you heard, but there’s an election going on right now. Polls, surveys, and projections abound, as far as the eye can see. How to make sense of it all? How are the projections made? Which are some good ones to follow? We’ll be your trusty guides…
Machine Learning for Genomics
Jul 24, 2016 • 20 min
Genomics data is some of the biggest #bigdata, and doing machine learning on it is unlocking new ways of thinking about evolution, genomic diseases like cancer, and what really makes each of us different from everyone else. This episode touches on some of…
Climate Modeling
Jul 17, 2016 • 19 min
Hot enough for you? Climate models suggest that it’s only going to get warmer in the coming years. This episode unpacks those models, so you understand how they work. A lot of the episodes we do are about fun studies we hear about, like “if you’re…
Reinforcement Learning Gone Wrong
Jul 10, 2016 • 28 min
Last week’s episode on artificial intelligence gets a huge payoff this week—we’ll explore a wonderful couple of papers about all the ways that artificial intelligence can go wrong. Malevolent actors? You bet. Collateral damage? Of course. Reward hacking?…
Reinforcement Learning for Artificial Intelligence
Jul 3, 2016 • 18 min
There’s a ton of excitement about reinforcement learning, a form of semi-supervised machine learning that underpins a lot of today’s cutting-edge artificial intelligence algorithms. Here’s a crash course in the algorithmic machinery behind AlphaGo, and…
Differential Privacy: how to study people without being weird and gross
Jun 26, 2016 • 18 min
Apple wants to study iPhone users’ activities and use it to improve performance. Google collects data on what people are doing online to try to improve their Chrome browser. Do you like the idea of this data being collected? Maybe not, if it’s being…
How the sausage gets made
Jun 19, 2016 • 29 min
Something a little different in this episode—we’ll be talking about the technical plumbing that gets our podcast from our brains to your ears. As it turns out, it’s a multi-step bucket brigade process of RSS feeds, links to downloads, and lots of…
SMOTE: makin’ yourself some fake minority data
Jun 12, 2016 • 14 min
Machine learning on imbalanced classes: surprisingly tricky. Many (most?) algorithms tend to just assign the majority class label to all the data and call it a day. SMOTE is an algorithm for manufacturing new minority class examples for yourself, to help…
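For the curious, here is a minimal sketch of the idea behind SMOTE: pick a minority-class point, pick one of its nearest minority neighbors, and place a new synthetic point somewhere on the line between them. The function name `smote_sketch`, the toy data, and the choice of `k` are all assumptions made for illustration; this is not a production implementation.

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Make n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbor = rng.choice(others[:k])
        # A random spot on the segment between base and neighbor.
        t = rng.random()
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, neighbor)))
    return synthetic

# Toy minority class (hypothetical data, two features per example).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=5)
print(new_points)
```

Because every synthetic point sits between two real minority examples, the new data stays inside the region the minority class already occupies.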
Conjoint Analysis: like AB testing, but on steroids
Jun 5, 2016 • 18 min
Conjoint analysis is like AB testing, but more bigger more better: instead of testing one or two things, you can test potentially dozens of options. Where might you use something like this? Well, if you wanted to design an entire hotel chain completely…
Traffic Metering Algorithms
May 29, 2016 • 17 min
This episode is for all you (us) traffic nerds—we’re talking about the hidden structure underlying traffic on-ramp metering systems. These systems slow down the flow of traffic onto highways so that the highways don’t get overloaded with cars and clog up.…
Um Detector 2: The Dynamic Time Warp
May 22, 2016 • 14 min
One tricky thing about working with time series data, like the audio data in our “um” detector (remember that? because we barely do…), is that sometimes events look really similar but one is a little bit stretched and squeezed relative to the other.…
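To make the "stretched and squeezed" idea concrete, here is a small sketch of the textbook dynamic time warping distance, using absolute difference as the per-sample cost. The function name `dtw_distance` and the toy signals are assumptions for illustration, not the um detector's actual code.

```python
def dtw_distance(a, b):
    """Classic dynamic-programming DTW distance between two sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i samples of a with the first j of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: advance a only, advance b only, advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

signal = [0, 1, 2, 1, 0]
stretched = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]  # same shape, played at half speed
shifted = [1, 2, 3, 2, 1]                   # same shape, different values

print(dtw_distance(signal, stretched))  # 0.0: warping absorbs the stretch
print(dtw_distance(signal, shifted))
```

The stretched copy comes out at distance zero because each sample can be matched to its duplicate, while a genuinely different signal still scores a positive distance.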
Inside a Data Analysis: Fraud Hunting at Enron
May 15, 2016 • 30 min
It’s storytime this week—the story, from beginning to end, of how Katie designed and built the main project for Udacity’s Intro to Machine Learning class, when she was developing the course. The project was to use email and financial data to hunt for…
What’s the biggest #bigdata?
May 8, 2016 • 25 min
Data science is often mentioned in the same breath as big data. But how big is big data? And who has the biggest big data? CERN? YouTube? … Something (or someone) else? Relevant link:…
Data Contamination
May 1, 2016 • 20 min
Supervised machine learning assumes that the features and labels used for building a classifier are isolated from each other—basically, that you can’t cheat by peeking. Turns out this can be easier said than done. In this episode, we’ll talk about the…
Model Interpretation (and Trust Issues)
Apr 24, 2016 • 16 min
Machine learning algorithms can be black boxes—inputs go in, outputs come out, and what happens in the middle is anybody’s guess. But understanding how a model arrives at an answer is critical for interpreting the model, and for knowing if it’s doing…
Updates! Political Science Fraud and AlphaGo
Apr 17, 2016 • 31 min
We’ve got updates for you about topics from past shows! First, the political science scandal of the year 2015 has a new chapter; we’ll remind you about the original story and then dive into what has happened since. Then, we’ve got an update on AlphaGo,…
Ecological Inference and Simpson’s Paradox
Apr 10, 2016 • 18 min
Simpson’s paradox is the data science equivalent of looking through one eye and seeing a very clear trend, and then looking through the other eye and seeing the very clear opposite trend. In one case, you see a trend one way in a group, but then breaking…
Discriminatory Algorithms
Apr 3, 2016 • 15 min
Sometimes when we say an algorithm discriminates, we mean it can tell the difference between two types of items. But in this episode, we’ll talk about another, more troublesome side to discrimination: algorithms can be… racist? Sexist? Ageist? Yes to all…
Recommendation Engines and Privacy
Mar 27, 2016 • 31 min
This episode started out as a discussion of recommendation engines, like Netflix uses to suggest movies. There’s still a lot of that in here. But a related topic, which is both interesting and important, is how to keep data private in the era of…
Neural nets play cops and robbers (AKA generative adversarial networks)
Mar 20, 2016 • 18 min
One neural net is creating counterfeit bills and passing them off to a second neural net, which is trying to distinguish the real money from the fakes. Result: two neural nets that are better than either one would have been without the competition.…
A Data Scientist’s View of the Fight against Cancer
Mar 13, 2016 • 19 min
In this episode, we’re taking many episodes’ worth of insights and unpacking an extremely complex and important question—in what ways are we winning the fight against cancer, where might that fight go in the coming decade, and how do we know when we’re…
Congress Bots and DeepDrumpf
Mar 10, 2016 • 20 min
Hey, sick of the election yet? Fear not, there are algorithms that can automagically generate political-ish speech so that we never need to be without an endless supply of Congressional speeches and Donald Trump twitticisms! Relevant links:…
Multi-Armed Bandits
Mar 6, 2016 • 11 min
Multi-armed bandits: how to take your randomized experiment and make it harder better faster stronger. Basically, a multi-armed bandit experiment allows you to optimize for both learning and making use of your knowledge at the same time. It’s what the…
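One simple way to balance learning against using what you know is the epsilon-greedy strategy, sketched below. It is only one of several bandit strategies and may not be the one discussed in the episode. The function name, the conversion rates, and the value of `epsilon` are all hypothetical.

```python
import random

def epsilon_greedy(true_rates, n_rounds=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit over arms with the given win rates."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms
    wins = [0] * n_arms
    for _ in range(n_rounds):
        if 0 in pulls or rng.random() < epsilon:
            # Explore: try a random arm (always, until every arm has been tried).
            arm = rng.randrange(n_arms)
        else:
            # Exploit: play the arm with the best observed win rate so far.
            arm = max(range(n_arms), key=lambda i: wins[i] / pulls[i])
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
    return pulls, wins

# Three hypothetical variants; the experiment does not know these rates up front.
pulls, wins = epsilon_greedy([0.1, 0.2, 0.6])
print(pulls, wins)
```

Unlike a fixed 50/50 split, most of the traffic ends up on whichever arm looks best, while the small `epsilon` share keeps checking the alternatives.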
Experiments and Messy, Tricky Causality
Mar 3, 2016 • 16 min
“People with a family history of heart disease are more likely to eat healthy foods, and have a high incidence of heart attacks.” Did the healthy food cause the heart attacks? Probably not. But establishing causal links is extremely tricky, and extremely…
Backpropagation
Feb 28, 2016 • 12 min
The reason that neural nets are taking over the world right now is that they can be efficiently trained with the backpropagation algorithm. In short, backprop allows you to adjust the weights of the neural net based on how good of a job the neural net…
Text Analysis on the State Of The Union
Feb 25, 2016 • 22 min
First up in this episode: a crash course in natural language processing, and important steps if you want to use machine learning techniques on text data. Then we’ll take that NLP know-how and talk about a really cool analysis of State of the Union text,…
Paradigms in Artificial Intelligence
Feb 21, 2016 • 17 min
Artificial intelligence includes a number of different strategies for how to make machines more intelligent, and often more human-like, in their ability to learn and solve problems. An ambitious group of researchers is working right now to classify all…
Survival Analysis
Feb 18, 2016 • 15 min
Survival analysis is all about studying how long until an event occurs—it’s used in marketing to study how long a customer stays with a service, in epidemiology to estimate the duration of survival of a patient with some illness, and in social science to…
Gravitational Waves
Feb 14, 2016 • 20 min
All aboard the gravitational waves bandwagon—with the first direct observation of gravitational waves announced this week, Katie’s dusting off her physics PhD for a very special gravity-related episode. Discussed in this episode: what are gravitational…
The Turing Test
Feb 11, 2016 • 15 min
Let’s imagine a future in which a truly intelligent computer program exists. How would it convince us (humanity) that it was intelligent? Alan Turing’s answer to this question, proposed over 60 years ago, is that the program could convince a human…
Item Response Theory: how smart ARE you?
Feb 7, 2016 • 11 min
Psychometrics is all about measuring the psychological characteristics of people; for example, scholastic aptitude. How is this done? Tests, of course! But there’s a chicken-and-egg problem here: you need to know both how hard a test is, and how smart the…
Go!
Feb 4, 2016 • 19 min
As you may have heard, a computer beat a world-class human player in Go last week. As recently as a year ago the prediction was that it would take a decade to get to this point, yet here we are, in 2016. We’ll talk about the history and strategy of…
Great Social Networks in History
Jan 31, 2016 • 12 min
The Medici were one of the great ruling families of Europe during the Renaissance. How did they come to rule? Not power, or money, or armies, but through the strength of their social network. And speaking of great historical social networks, analysis of…
How Much to Pay a Spy (and a lil’ more auctions)
Jan 29, 2016 • 16 min
A few small encores on auction theory, and then—how can you value a piece of information before you know what it is? Decision theory has some pointers. Some highly relevant information if you are trying to figure out how much to pay a spy. Relevant links:…
Sold! Auctions (Part 2)
Jan 24, 2016 • 17 min
The Google ads auction is a special kind of auction, one you might not know as well as the famous English auction (which we talked about in the last episode). But if it’s what Google uses to sell billions of dollars of ad space in real time, you know it…
Going Once, Going Twice: Auctions (Part 1)
Jan 21, 2016 • 12 min
The Google AdWords algorithm is (famously) an auction system for allocating a massive amount of online ad space in real time—with that fascinating use case in mind, this episode is part one in a two-part series all about auctions. We dive into the theory…
Chernoff Faces and Minard Maps
Jan 17, 2016 • 15 min
A data visualization extravaganza in this episode, as we discuss Chernoff faces (you: “faces? huh?” us: “oh just you wait”) and the greatest data visualization of all time, or at least the Napoleonic era. Relevant links:…
t-SNE: Reduce Your Dimensions, Keep Your Clusters
Jan 14, 2016 • 16 min
Ever tried to visualize a cluster of data points in 40 dimensions? Or even 4, for that matter? We prefer to stick to 2, or maybe 3 if we’re feeling well-caffeinated. The t-SNE algorithm is one of the best tools on the market for doing dimensionality…
The [Expletive Deleted] Problem
Jan 10, 2016 • 9 min
The town of [expletive deleted], England, is responsible for the clbuttic [expletive deleted] problem. This week on Linear Digressions: we try really hard not to swear too much. Related links: https://en.wikipedia.org/wiki/Scunthorpe_problem…
Unlabeled Supervised Learning—whaaa?
Jan 7, 2016 • 12 min
In order to do supervised learning, you need a labeled training dataset. Or do you…? Relevant links: http://www.cs.columbia.edu/~dplewis/candidacy/goldman00enhancing.pdf
Hacking Neural Nets
Jan 4, 2016 • 15 min
Machine learning: it can be fooled, just like you or me. Here’s one of our favorite examples, a study into hacking neural networks. Relevant links: http://arxiv.org/pdf/1412.1897v4.pdf
Zipf’s Law
Dec 31, 2015 • 11 min
Zipf’s law is related to the statistics of how word usage is distributed. As it turns out, this is also strikingly reminiscent of how income is distributed, and populations of cities, and bug reports in software, as well as tons of other phenomena that we…
Indie Announcement
Dec 30, 2015 • 1 min
We’ve gone indie! Which shouldn’t change anything about the podcast that you know and love, but we’re super excited to keep bringing you Linear Digressions as a fully independent podcast. Some links mentioned in the show:…
Portrait Beauty
Dec 27, 2015 • 11 min
It’s Da Vinci meets Skynet: what makes a portrait beautiful, according to a machine learning algorithm. Snap a selfie and give us a listen.
The Cocktail Party Problem
Dec 17, 2015 • 12 min
Grab a cocktail, put on your favorite karaoke track, and let’s talk some more about disentangling audio data!
A Criminally Short Introduction to Semi Supervised Learning
Dec 3, 2015 • 9 min
Because there are more interesting problems than there are labeled datasets, semi-supervised learning provides a framework for getting feedback from the environment as a proxy for labels of what’s “correct.” Of all the machine learning methodologies, it…
Thresholdout: Down with Overfitting
Nov 27, 2015 • 15 min
Overfitting to your training data can be avoided by evaluating your machine learning algorithm on a holdout test dataset, but what about overfitting to the test data? Turns out it can be done, easily, and you have to be very careful to avoid it. But an…
The State of Data Science
Nov 9, 2015 • 15 min
How many data scientists are there, where do they live, where do they work, what kind of tools do they use, and how do they describe themselves? RJMetrics wanted to know the answers to these questions, so they decided to find out and share their analysis…
Data Science for Making the World a Better Place
Nov 5, 2015 • 9 min
There’s a good chance that great data science is going on close to you, and that it’s going toward making your city, state, country, and planet a better place. Not all the data science questions being tackled out there are about finding the sleekest new…
Kalman Runners
Oct 28, 2015 • 14 min
The Kalman Filter is an algorithm for taking noisy measurements of dynamic systems and using them to get a better idea of the underlying dynamics than you could get from a simple extrapolation. If you’ve ever run a marathon, or been a nuclear missile, you…
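As a rough illustration, here is a one-dimensional Kalman filter that tracks a single slowly drifting quantity from noisy readings. The random-walk state model, the variance values, and the function name `kalman_1d` are simplifying assumptions for this sketch; a real tracker would usually carry a multi-dimensional state such as position and velocity.

```python
import random

def kalman_1d(measurements, meas_var, process_var, x0=0.0, p0=1.0):
    """Scalar Kalman filter: returns the running estimate after each measurement."""
    x, p = x0, p0  # current estimate and its variance
    estimates = []
    for z in measurements:
        # Predict: the state is assumed to persist, but uncertainty grows.
        p = p + process_var
        # Update: blend prediction and measurement, weighted by the Kalman gain.
        gain = p / (p + meas_var)
        x = x + gain * (z - x)
        p = (1 - gain) * p
        estimates.append(x)
    return estimates

# Hypothetical example: noisy readings of a true value of 5.0.
rng = random.Random(0)
readings = [5.0 + rng.gauss(0, 1.0) for _ in range(50)]
smoothed = kalman_1d(readings, meas_var=1.0, process_var=0.01)
print(round(smoothed[-1], 2))
```

The gain decides how much each new reading is trusted: noisy measurements (large `meas_var`) move the estimate only a little, while a large `process_var` lets the filter follow real changes more quickly.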
Neural Net Inception
Oct 22, 2015 • 15 min
When you sleep, the neural pathways in your brain take the “white noise” of your resting brain, mix in your experiences and imagination, and the result is dreams (that is a highly unscientific explanation, but you get the idea). What happens when neural…
Benford’s Law
Oct 15, 2015 • 17 min
Sometimes numbers are… weird. Benford’s Law is a favorite example of this for us—it’s a law that governs the distribution of the first digit in certain types of numbers. As it turns out, if you’re looking up the length of a river, the population of a…
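A quick way to see Benford's Law in action: the expected share of numbers whose first digit is d is log10(1 + 1/d), so a leading 1 shows up about 30% of the time. The sketch below compares that formula with the leading digits of the powers of 2, which are an illustrative dataset chosen here and not one taken from the episode.

```python
import math
from collections import Counter

def benford_expected(digit):
    """Benford's Law probability that the leading digit equals `digit` (1-9)."""
    return math.log10(1 + 1 / digit)

def leading_digit(n):
    """First decimal digit of a nonzero integer."""
    return int(str(abs(n))[0])

def leading_digit_freqs(values):
    """Observed share of each leading digit 1-9 among the nonzero values."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Illustrative data (an assumption): the first 1000 powers of 2.
powers_of_two = [2 ** k for k in range(1, 1001)]
observed = leading_digit_freqs(powers_of_two)
for d in range(1, 10):
    print(d, round(observed[d], 3), round(benford_expected(d), 3))
```

The nine expected probabilities sum to exactly 1, and the observed shares for the powers of 2 track them closely.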
Guinness
Oct 6, 2015 • 14 min
Not to oversell it, but the Student’s t-test has got to have the most interesting history of any statistical test. Which is saying a lot, right? Add some boozy statistical trivia to your arsenal in this episode.
PFun with P Values
Sep 1, 2015 • 17 min
Doing some science, and want to know if you might have found something? Or maybe you’ve just accomplished the scientific equivalent of going fishing and reeling in an old boot? Frequentist p-values can help you distinguish between “eh” and “oooh…
Watson
Aug 24, 2015 • 15 min
This machine learning algorithm beat the human champions at Jeopardy. What is… Watson?
Bayesian Psychics
Aug 17, 2015 • 11 min
Come get a little “out there” with us this week, as we use a meta-study of extrasensory perception (or ESP, often used in the same sentence as “psychics”) to chat about Bayesian vs. frequentist statistics.
Troll Detection
Aug 7, 2015 • 12 min
Ever found yourself wasting time reading online comments from trolls? Of course you have; we’ve all been there (it’s 4 AM but I can’t turn off the computer and go to sleep—someone on the internet is WRONG!). Now there’s a way to use machine learning to…
Yiddish Translation
Aug 2, 2015 • 12 min
Imagine a language that is mostly spoken rather than written, contains many words in other languages, and has relatively little written overlap with English. Now imagine writing a machine-learning-based translation system that can convert that language to…
Modeling Particles in Atomic Bombs
Jul 6, 2015 • 15 min
In a fun historical journey, Katie and Ben explore the history of the Manhattan Project, discuss the difficulties in modeling particle movement in atomic bombs with only punch-card computers and ingenuity, and eventually come to present-day uses of the…
Random Number Generation
Jun 19, 2015 • 10 min
Let’s talk about randomness! Although randomness is pervasive throughout the natural world, it’s surprisingly difficult to generate random numbers. And even if your numbers look random (but actually aren’t), it can have interesting consequences on the…
Electoral Insights (Part 2)
Jun 8, 2015 • 21 min
Following up on our last episode about how experiments can be performed in political science, now we explore a high-profile case of an experiment gone wrong. An extremely high-profile paper that was published in 2014, about how talking to people can…
Electoral Insights (Part 1)
Jun 5, 2015 • 9 min
The first of our two-parter discussing the recent electoral data fraud case. The results of the study in question were covered widely, including by This American Life (who later had to issue a retraction). Data science for election research involves…
Falsifying Data
Jun 1, 2015 • 17 min
In the first of a few episodes on fraud in election research, we’ll take a look at a case study from a previous Presidential election, where polling results were faked. What are some telltale signs that data fraud might be present in a dataset? We’ll…
Reporter Bot
May 20, 2015 • 11 min
There’s a big difference between a table of numbers or statistics, and the underlying story that a human might tell about how those numbers were generated. Think about a baseball game—the game stats and a newspaper story are describing the same thing, but…
Careers in Data Science
May 16, 2015 • 16 min
Let’s talk money. As a “hot” career right now, data science can pay pretty well. But for an individual person matched with a specific job or industry, how much should someone expect to make? Since Katie was on the job market lately, this was something…
That’s “Dr Katie” to You
May 14, 2015 • 3 min
Katie successfully defended her thesis! We celebrate her return, and talk a bit about what getting a PhD in Physics is like.
Neural Nets (Part 2)
May 11, 2015 • 10 min
In the last episode, we zipped through neural nets and got a quick idea of how they work and why they can be so powerful. Here’s the real payoff of that work: In this episode, we’ll talk about a brand-new pair of results, one from Stanford and one from…
Neural Nets (Part 1)
May 1, 2015 • 9 min
There is no known learning algorithm that is more flexible and powerful than the human brain. That’s quite inspirational, if you think about it: to level up machine learning, maybe we should be going back to biology and letting millions of years of…
Inferring Authorship (Part 2)
Apr 28, 2015 • 14 min
Now that we’re up to speed on the classic author ID problem (who wrote the unsigned Federalist Papers?), we move onto a couple more contemporary examples. First, J.K. Rowling was famously outed using computational linguistics (and Twitter) when she wrote…
Inferring Authorship (Part 1)
Apr 16, 2015 • 8 min
This episode is inspired by one of our projects for Intro to Machine Learning: given a writing sample, can you use machine learning to identify who wrote it? Turns out that the answer is yes, a person’s writing style is as distinctive as their vocal…
Statistical Mistakes and the Challenger Disaster
Apr 6, 2015 • 13 min
After the Challenger exploded in 1986, killing all seven astronauts aboard, an investigation into the cause was immediately launched. In the cold temperatures the night before the launch, the O-rings that seal off the fuel tanks from the rocket boosters…
Genetics and Um Detection (HMM Part 2)
Mar 25, 2015 • 14 min
In part two of our series on Hidden Markov Models (HMMs), we talk to Katie and special guest Francesco about more useful and novel applications of HMMs. We revisit Katie’s “Um Detector,” and hear about how HMMs are used in genetics research.
Introducing Hidden Markov Models (HMM Part 1)
Mar 24, 2015 • 14 min
Wikipedia says, “A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states.” What does that even mean? In part one of a special two-parter on HMMs, Katie,…
Monte Carlo For Physicists
Mar 12, 2015 • 8 min
This is another physics-centered podcast, about an ML-backed particle identification tool that we use to figure out what kind of particle caused a particular blob in the detector. But in this case, as in many cases, it looks hard at the outset to use ML…
Random Kanye
Mar 4, 2015 • 8 min
Ever feel like you could randomly assemble words from a certain vocabulary and make semi-coherent Kanye West lyrics? Or technical documentation, imitations of local newscasters, your politically outspoken uncle, etc.? Wonder no more, there’s a way to do…
Lie Detectors
Feb 25, 2015 • 9 min
Often machine learning discussions center around algorithms, or features, or datasets—this one centers around interpretation, and ethics. Suppose you could use a technology like fMRI to see what regions of a person’s brain are active when they ask…
The Enron Dataset
Feb 8, 2015 • 12 min
In 2000, Enron was one of the largest companies in the world, praised far and wide for its innovations in energy distribution and many other markets. By 2002, it was apparent that many bad apples had been cooking the books, and billions of dollars and…
Labels and Where To Find Them
Feb 3, 2015 • 13 min
Supervised classification is built on the backs of labeled datasets, but a good set of labels can be hard to find. Great data is everywhere, but the corresponding labels can sometimes be really tricky. Take a few examples we’ve already covered, like lie…
Um Detector 1
Jan 23, 2015 • 13 min
So, um… what about machine learning for audio applications? In the course of starting this podcast, we’ve edited out a lot of “um”s from our raw audio files. It’s gotten now to the point that, when we see the waveform in soundstudio, we can almost…
Better Facial Recognition with Fisherfaces
Jan 6, 2015 • 11 min
Now that we know about eigenfaces (if you don’t, listen to the previous episode), let’s talk about how it breaks down. Variations that are trivial to humans when identifying faces can really mess up computer-driven facial ID—expressions, lighting, and…
Facial Recognition with Eigenfaces
Jan 6, 2015 • 10 min
A true classic topic in ML: Facial recognition is very high-dimensional, meaning that each picture can have millions of pixels, each of which can be a single feature. It’s computationally expensive to deal with all these features, and invites overfitting…
Stats of World Series Streaks
Dec 16, 2014 • 12 min
Baseball is characterized by a high level of equality between teams; even the best teams might only have 55% win percentages (contrast this with college football, where teams go undefeated pretty regularly). In this regime, where 2 outcomes (Giants…
Computers Try to Tell Jokes
Nov 26, 2014 • 9 min
Computers are capable of many impressive feats, but making you laugh is usually not one of them. Or could it be? This episode will talk about a custom-built machine learning algorithm that searches through text and writes jokes based on what it finds. The…
How Outliers Helped Defeat Cholera
Nov 21, 2014 • 10 min
In the 1850s, there were a lot of things we didn’t know yet: how to create an airplane, how to split an atom, or how to control the spread of a common but deadly disease: cholera. When a cholera outbreak in London killed scores of people, a doctor named…
Hunting for the Higgs
Nov 15, 2014 • 10 min
Machine learning and particle physics go together like peanut butter and jelly—but this is a relatively new development. For many decades, physicists looked through their fairly large datasets using the laws of physics to guide their exploration; that…