Applying critical thinking to Data Science

Attention Primer

A gentle introduction to the very high-level idea of “attention” in machine learning, as it will play a major role in some upcoming episodes over the next few weeks.

Cross-lingual Short-text Matching

Modern messaging technology has facilitated a trend towards highly compact, short messages sent by users who can presume a great amount of context held between the communicating parties. The rules of grammar may be discarded and often visible…

ELMo

ELMo (Embeddings from Language Models) introduced the idea of deep contextualized word representations. It extends previous ideas like word2vec and GloVe. The ELMo model is a neural network able to map natural language into a vector space. This vector…

BLEU

Bilingual evaluation understudy (or BLEU) is a metric for evaluating the quality of machine translation using human translation as examples of acceptable quality results. This metric has become a widely used standard in the research literature. But is…

Simultaneous Translation at Baidu

While at NeurIPS 2018, Kyle chatted with Liang Huang about his work with Baidu research on simultaneous translation, which was demoed at the conference.

Human vs Machine Transcription

Machine transcription (the process of translating audio recordings of language to text) has come a long way in recent years. But how do the errors made during machine transcription compare to the errors made by a human transcriber? Find out in this…

seq2seq

A sequence to sequence (or seq2seq) model is a neural architecture used for translation (and other tasks) which consists of an encoder and a decoder. The encoder/decoder architecture has obvious promise for machine translation, and has been successfully…

Text Mining in R

Kyle interviews Julia Silge about her path into data science, her book Text Mining with R, and some of the ways in which she’s used natural language processing in projects both personal and professional.

Recurrent Relational Networks

One of the most challenging NLP tasks is natural language understanding and reasoning. How can we construct algorithms that are able to achieve human level understanding of text and be able to answer general questions about it? This is truly an open…

Text World and Word Embedding Lower Bounds

In the first half of this episode, Kyle speaks with Marc-Alexandre Côté and Wendy Tay about Text World. Text World is an engine that simulates text adventure games. Developers are encouraged to try out their reinforcement learning skills…

word2vec

Word2vec is an unsupervised machine learning model which is able to capture semantic information from the text it is trained on. The model is based on neural networks. Several large organizations like Google and Facebook have trained word embeddings…

Authorship Attribution

In a recent paper, Leveraging Discourse Information Effectively for Authorship Attribution, authors Su Wang, Elisa Ferracane, and Raymond J. Mooney describe a deep learning methodology for predicting which of a collection of authors was the author of a…

Very Large Corpora and Zipf’s Law

The earliest efforts to apply machine learning to natural language tended to convert every token (every word, more or less) into a unique feature. While techniques like stemming may have cut the number of unique tokens down, researchers always had to…

Semantic search at Github

Github is many things besides source control. It’s a social network, even though not everyone realizes it. It’s a vast repository of code. It’s a ticketing and project management system. And of course, it has search as well. In this episode, Kyle…

Let’s Talk About Natural Language Processing

This episode reboots our podcast with the theme of Natural Language Processing for the next few months. We begin with introductions of Yoshi and Linh Da and then get into a broad discussion about natural language processing: what it is, what some of…

Data Science Hiring Processes

Kyle shares a few thoughts on mistakes he has observed job applicants make, along with a few procedural insights that listeners at early stages in their careers might find valuable.

Holiday Reading - Epicac

Epicac by Kurt Vonnegut.

Drug Discovery with Machine Learning

In today’s episode, Kyle chats with Alexander Zhebrak, CTO of Insilico Medicine, Inc. Insilico self describes as artificial intelligence for drug discovery, biomarker development, and aging research. The conversation in this episode explores the ways…

Sign Language Recognition

At the NeurIPS 2018 conference, Stradigi AI premiered a training game which helps players learn American Sign Language. This episode brings the first of many interviews conducted at NeurIPS 2018. In this episode, Kyle interviews Chief…

Data Ethics

This week, Kyle interviews Scott Nestler on the topic of Data Ethics. Today, no ubiquitous, formal ethical protocol exists for data science, although some have been proposed. One example is the INFORMS Ethics Guidelines. Guidelines like…

Escaping the Rabbit Hole

Kyle interviews Mick West, author of Escaping the Rabbit Hole: How to Debunk Conspiracy Theories Using Facts, Logic, and Respect about the nature of conspiracy theories, the people that believe them, and how to help people escape the belief…

Theorem Provers

Fake news attempts to lead readers/listeners/viewers to conclusions that are not descriptions of reality. It does this most often by presenting false premises, but sometimes by presenting flawed logic. An argument is only sound and valid if the…

Automated Fact Checking

Fake news can be responded to with fact-checking. However, it’s easier to create fake news than to fact check it. Full Fact is the UK’s independent fact-checking organization. In this episode, Kyle interviews Mevan Babakar, head of…

Single Source of Truth

In mathematics, truth is universal. In data, truth lies in the where clause of the query. As large organizations have grown to rely on their data more significantly for decision making, a common problem is not being able to agree on what the…

Detecting Fast Radio Bursts with Deep Learning

Fast radio bursts are an astrophysical phenomenon first observed in 2007. While many observations have been made, science has yet to explain the mechanism for these events. This has led some to ask: could it be a form of extra-terrestrial…

Being Bayesian

This episode explores the root concept of what it is to be Bayesian: describing knowledge of a system probabilistically, having an appropriate prior probability, knowing how to weigh new evidence, and following Bayes’s rule to compute the revised…
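As a rough illustration of the update described above, here is a minimal Python sketch of Bayes’s rule for a yes/no hypothesis. The function name, parameter names, and the example numbers are assumptions made for illustration and do not come from the episode.

```python
def posterior(prior: float, p_evidence_if_true: float, p_evidence_if_false: float) -> float:
    """Revise a prior probability after seeing one piece of evidence (Bayes's rule)."""
    # Total probability of observing the evidence under both possibilities.
    evidence = p_evidence_if_true * prior + p_evidence_if_false * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("evidence has zero probability under this model")
    return p_evidence_if_true * prior / evidence


# Hypothetical numbers: a 1% prior, evidence seen 90% of the time when the
# hypothesis is true and 5% of the time when it is false.
revised = posterior(prior=0.01, p_evidence_if_true=0.9, p_evidence_if_false=0.05)
print(round(revised, 4))
```

Even fairly strong evidence leaves the revised probability modest here, because the prior was so low.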

Modeling Fake News

This is our interview with Dorje Brody about his recent paper with David Meier, How to model fake news. This paper uses the tools of communication theory and a sub-topic called filtering theory to describe the mathematical basis for an information…

The Louvain Method for Community Detection

Without getting into definitions, we have an intuitive sense of what a “community” is. The Louvain Method for Community Detection is one of the best known mathematical techniques designed to detect communities. This method requires typical graph data…

Cultural Cognition of Scientific Consensus

In this episode, our guest Dan Kahan discusses his research into how people consume and interpret science news. In an era of fake news, motivated reasoning, and alternative facts, important questions need to be asked about how people understand new…

False Discovery Rates

A false discovery rate (FDR) is a methodology that can be useful when struggling with the problem of multiple comparisons. In any experiment, if the experimenter checks more than one dependent variable, then they are making multiple comparisons….

Deep Fakes

Digital videos can be described as sequences of still images and associated audio. Audio is easy to fake. What about video? A video can easily be broken down into a sequence of still images replayed rapidly in sequence. In this context, videos are…

Fake News Midterm

In this episode, Kyle reviews what we’ve learned so far in our series on Fake News and talks briefly about where we’re going next.

Quality Score

Two weeks ago we discussed click through rates or CTRs and their usefulness and limits as a metric. Today, we discuss a related metric known as quality score. While that phrase has probably been used to mean dozens of different things in…

The Knowledge Illusion

Kyle interviews Steven Sloman, Professor in the school of Cognitive, Linguistic, and Psychological Sciences at Brown University. Steven is co-author of The Knowledge Illusion: Why We Never Think Alone and Causal Models: How People Think about the…

Click Through Rates

A Click Through Rate (CTR) is the proportion of clicks to impressions of some item of content shared online. This terminology is most commonly used in digital advertising but applies just as well to content websites might choose to feature on their…
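The definition above amounts to a single division. A minimal Python sketch follows; the function name and the choice to return 0.0 when there are no impressions are assumptions, not something the episode specifies.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return the proportion of impressions that resulted in a click."""
    if clicks < 0 or impressions < 0:
        raise ValueError("counts must be non-negative")
    if impressions == 0:
        # Assumption: treat "never shown" as a CTR of zero rather than an error.
        return 0.0
    return clicks / impressions


# Hypothetical example: 25 clicks out of 1,000 impressions.
print(click_through_rate(25, 1000))
```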

Algorithmic Detection of Fake News

The scale and frequency with which information can be distributed on social media make the problem of fake news a rapidly metastasizing issue. To do any content filtering or labeling demands an algorithmic solution. In today’s episode, Kyle…

Ant Intelligence

If you prepared a list of creatures regarded as highly intelligent, it’s unlikely ants would make the cut. This is expected, as on an individual level, ants do not generally display behavior that most humans would regard as intelligence. In fact, it…

Human Detection of Fake News

With publications such as “Prior exposure increases perceived accuracy of fake news”, “Lazy, not biased: Susceptibility to partisan fake news is better explained by lack of reasoning than by motivated reasoning”, and “The science of fake news”,…

Spam Filtering with Naive Bayes

Today’s spam filters are advanced data driven tools. They rely on a variety of techniques to effectively and often seamlessly filter out junk email from good email. Whitelists, blacklists, traffic analysis, network analysis, and a variety of other…

The Spread of Fake News

How does fake news get spread online? It’s not just a matter of manipulating search algorithms. The social platforms for sharing play a major role in the distribution of fake news. But how significant of an impact can there be? How significantly can…

Fake News

This episode kicks off our new theme of “Fake News” with guests Robert Sheaffer and Brad Schwartz. Fake news is a new label for an old idea. For our purposes, we will define fake news as information created to deliberately mislead while…

Dev Ops for Data Science

We revisit the 2018 Microsoft Build in this episode, focusing on the latest ideas in DevOps. Kyle interviews Cloud Developer Advocates Damien Brady, Paige Bailey, and Donovan Brown to talk about DevOps and data science and databases. For a data…

First Order Logic

Logic is a fundamental of mathematical systems. Its roots are the values true and false and its power is in what its rules allow you to prove. Propositional logic provides its user variables. This episode gets into First Order Logic, an extension…

Blind Spots in Reinforcement Learning

An intelligent agent trained in a simulated environment may be prone to making mistakes in the real world due to discrepancies between the training and real-world conditions. The areas where an agent makes mistakes are hard to find, known as…

Defending Against Adversarial Attacks

In this week’s episode, our host Kyle interviews Gokula Krishnan from ETH Zurich, about his recent contributions to defenses against adversarial attacks. The discussion centers around his latest paper, titled “Defending Against Adversarial Attacks…

Transfer Learning

On a long car ride, Linhda and Kyle record a short episode. This discussion is about transfer learning, a technique used in machine learning to leverage training from one domain to have a head start learning in another domain. Transfer learning has…

Medical Imaging Training Techniques

Medical imaging is a highly effective tool used by clinicians to diagnose a wide array of diseases and injuries. However, it often requires exceptionally trained specialists such as radiologists to interpret accurately. In this episode of Data…

Kalman Filters

Thanks to our sponsor Galvanize. A Kalman Filter is a technique for taking a sequence of observations about an object or variable and determining the most likely current state of that object. In this episode, we discuss it in the context of…

AI in Industry

There’s so much to discuss on the AI side, it’s hard to know where to begin. Luckily, Steve Guggenheimer, Microsoft’s corporate vice president of AI Business, and Carlos Pessoa, a software engineering manager for the company’s Cloud AI…

AI in Games

Today’s interview is with the authors of the textbook Artificial Intelligence and Games.

Game Theory

Thanks to our sponsor The Great Courses. This week’s episode is a short primer on game theory. For tickets to the free Data Skeptic meetup in Chicago on Tuesday, May 15 at the Mendoza College of Business (224 South Michigan Avenue, Suite 350),…

The Experimental Design of Paranormal Claims

In this episode of Data Skeptic, Kyle chats with Jerry Schwarz from the Independent Investigations Group (IIG)’s SF Bay Area chapter about testing claims of the paranormal. The IIG is a volunteer-based organization dedicated to…

Winograd Schema Challenge

Our guest this week, Hector Levesque, joins us to discuss an alternative way to measure a machine’s intelligence, called the Winograd Schema Challenge. The challenge was proposed as a possible alternative to the Turing test during the…

The Imitation Game

This week on Data Skeptic, we begin with a skit to introduce the topic of this show: The Imitation Game. We open with a scene in the distant future. The year is 2027, and a company called Shamony is announcing their new product, Ada, the most advanced…

Eugene Goostman

In this episode, Kyle shares his perspective on the chatbot Eugene Goostman which (some claim) “passed” the Turing Test. As a second topic Kyle also does an intro of the Winograd Schema Challenge.

The Theory of Formal Languages

In this episode, Kyle and Linhda discuss the theory of formal languages. Any language can (theoretically) be a formal language. The requirement is that the language can be rigorously described as a set of strings which are considered part of the…

The Loebner Prize

The Loebner Prize is a competition in the spirit of the Turing Test. Participants are welcome to submit conversational agent software to be judged by a panel of humans. This episode includes interviews with Charlie Maloney, a judge in the…

Chatbots

In this episode, Kyle chats with Vince from iv.ai and Heather Shapiro who works on the Microsoft Bot Framework. We solicit their advice on building a good chatbot both creatively and technically. Our sponsor today is Warby Parker.

The Master Algorithm

In this week’s episode, Kyle Polich interviews Pedro Domingos about his book, The Master Algorithm: How the quest for the ultimate learning machine will remake our world. In the book, Domingos describes what machine learning is doing for…

The No Free Lunch Theorems

What’s the best machine learning algorithm to use? I hear that XGBoost wins most of the Kaggle competitions that aren’t won with deep learning. Should I just use XGBoost all the time? That might work out most of the time in practice, but a proof…

ML at Sloan Kettering Cancer Center

For a long time, physicians have recognized that the tools they have aren’t powerful enough to treat complex diseases, like cancer. In addition to data science and models, clinicians also needed actual products — tools that physicians and…

Optimal Decision Making with POMDPs

In a previous episode, we discussed Markov Decision Processes or MDPs, a framework for decision making and planning. This episode explores the generalization Partially Observable MDPs (POMDPs) which are an incredibly general framework that describes…

AI Decision-Making

Making a decision is a complex task. Today’s guest Dongho Kim discusses how he and his team at Prowler have been building a platform that will be accessible by way of APIs and a set of pre-made scripts for autonomous decision making based on…

[MINI] Reinforcement Learning

In many real world situations, a person/agent doesn’t necessarily know their own objectives or the mechanics of the world they’re interacting with. However, if the agent receives rewards which are correlated with both their actions and the state…

Evolutionary Computation

In this week’s episode, Kyle is joined by Risto Miikkulainen, a professor of computer science and neuroscience at the University of Texas at Austin. They talk about evolutionary computation, its applications in deep learning, and how it’s inspired…

[MINI] Markov Decision Processes

Formally, an MDP is defined as the tuple containing states, actions, the transition function, and the reward function. This podcast examines each of these and presents them in the context of simple examples. Despite MDPs suffering from the …
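To make the four parts of that tuple concrete, here is a small Python sketch. The two-state weather example, the names `MDP` and `make_toy_mdp`, and all probabilities and rewards are invented for illustration and are not taken from the episode.

```python
from typing import Callable, NamedTuple, Sequence


class MDP(NamedTuple):
    states: Sequence[str]
    actions: Sequence[str]
    # transition(state, action, next_state) -> probability of reaching next_state
    transition: Callable[[str, str, str], float]
    # reward(state, action) -> immediate reward
    reward: Callable[[str, str], float]


def make_toy_mdp() -> MDP:
    """Build a hypothetical two-state MDP with made-up dynamics."""

    def transition(state: str, action: str, next_state: str) -> float:
        if action == "stay":
            return 1.0 if next_state == state else 0.0
        # "move" switches to the other state with probability 0.8.
        return 0.8 if next_state != state else 0.2

    def reward(state: str, action: str) -> float:
        return 1.0 if state == "sunny" else 0.0

    return MDP(("sunny", "rainy"), ("stay", "move"), transition, reward)


mdp = make_toy_mdp()
for s in mdp.states:
    for a in mdp.actions:
        # Each (state, action) pair defines a probability distribution over next states.
        total = sum(mdp.transition(s, a, s2) for s2 in mdp.states)
        print(s, a, total)
```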

Neuroscience Frontiers

Last week on Data Skeptic, we visited the Laboratory of Neuroimaging, or LONI, at USC and learned about their data-driven platform that enables scientists from all over the world to share, transform, store, manage and analyze their data to understand…

Neuroimaging and Big Data

Last year, Kyle had a chance to visit the Laboratory of Neuroimaging, or LONI, at USC, and learn about how some researchers are using data science to study the function of the brain. We’re going to be covering some of their work in two episodes on…

The Agent Model of Artificial Intelligence

In artificial intelligence, the term ‘agent’ is used to mean an autonomous, thinking agent with the ability to interact with their environment. An agent could be a person or a piece of software. In either case, we can describe aspects of the agent in…

Artificial Intelligence, a Podcast Approach

This episode kicks off the next theme on Data Skeptic: artificial intelligence. Kyle discusses what’s to come for the show in 2018, why this topic is relevant, and how we intend to cover it.

Holiday reading 2017

We break format from our regular programming today and bring you an excerpt from Max Tegmark’s book “Life 3.0”. The first chapter is a short story titled “The Tale of the Omega Team”. Audio excerpted courtesy of Penguin Random House Audio…

Complexity and Cryptography

This week, our host Kyle Polich is joined by guest Tim Henderson from Google to talk about the computational complexity foundations of modern cryptography and the complexity issues that underlie the field. A key question that arises during the…

Mercedes Benz Machine Learning Research

This episode features an interview with Rigel Smiroldo recorded at NIPS 2017 in Long Beach, California. We discuss data privacy, machine learning use cases, model deployment, and end-to-end machine learning.

[MINI] Parallel Algorithms

When computers became commodity hardware and storage became incredibly cheap, we entered the era of so-called “big” data. Most definitions of big data will include something about not being able to process all the data on a single machine. Distributed…

Quantum Computing

In this week’s episode, Scott Aaronson, a professor at the University of Texas at Austin, explains what a quantum computer is, various possible applications, the types of problems they are good at solving and much more. Kyle and Scott have a…

Azure Databricks

I sat down with Ali Ghodsi, CEO and founder of Databricks, and John Chirapurath, GM for Data Platform Marketing at Microsoft, to discuss the recent announcement of Azure Databricks. When I heard about the announcement, my first thoughts were…

[MINI] Exponential Time Algorithms

In this episode we discuss the complexity class of EXP-Time which contains algorithms which require $O(2^{p(n)})$ time to run. In other words, the worst case runtime is exponential in some polynomial of the input size. Problems in this…

P vs NP

In this week’s episode, host Kyle Polich interviews author Lance Fortnow about whether P will ever be equal to NP and solve all of life’s problems. Fortnow begins the discussion with the example question: Are there 100 people on Facebook who are all…

[MINI] Sudoku \in NP

Algorithms with similar runtimes are said to be in the same complexity class. That runtime is measured in how many steps an algorithm takes relative to the input size. The class P contains all algorithms which run in polynomial time (basically, a…

The Computational Complexity of Machine Learning

In this episode, Professor Michael Kearns from the University of Pennsylvania joins host Kyle Polich to talk about the computational complexity of machine learning, complexity in game theory, and algorithmic fairness. Michael’s doctoral thesis gave an…

[MINI] Turing Machines

TMs are a model of computation at the heart of algorithmic analysis. A Turing Machine has two components: an infinitely long piece of tape (memory) with re-writable squares, and a read/write head which is programmed to change its state as…

The Complexity of Learning Neural Networks

Over the past several years, we have seen many success stories in machine learning brought about by deep learning techniques. While the practical success of deep learning has been phenomenal, the formal guarantees have been lacking. Our current…

[MINI] Big Oh Analysis

How long an algorithm takes to run depends on many factors including implementation details and hardware. However, the formal analysis of algorithms focuses on how they will perform in the worst case as the input size grows. We refer to an…

Data science tools and other announcements from Ignite

In this episode, Microsoft’s Corporate Vice President for Cloud Artificial Intelligence, Joseph Sirosh, joins host Kyle Polich to share some of Microsoft’s latest and most exciting innovations in AI development platforms. Last month, Microsoft…

Generative AI for Content Creation

Last year, the film development and production company End Cue produced a short film, called Sunspring, that was entirely written by an artificial intelligence using neural networks. More specifically, it was authored by a recurrent neural…

[MINI] One Shot Learning

One Shot Learning is the class of machine learning procedures that focuses on learning something from a small number of examples. This is in contrast to “traditional” machine learning which typically requires a very large training set to build a…

Recommender Systems Live from FARCON 2017

Recommender systems play an important role in providing personalized content to online users. Yet, typical data mining techniques are not well suited for the unique challenges that recommender systems face. In this episode, host Kyle Polich joins Dr….

[MINI] Long Short Term Memory

A Long Short Term Memory (LSTM) is a neural unit, often used in a Recurrent Neural Network (RNN), which attempts to provide the network the capacity to store information for longer periods of time. An LSTM unit remembers values for either long or short time…

Zillow Zestimate

Zillow is a leading real estate information and home-related marketplace. We interviewed Andrew Martin, a data science Research Manager at Zillow, to learn more about how Zillow uses data science and big data to make real estate predictions.

Cardiologist Level Arrhythmia Detection with CNNs

Our guest Pranav Rajpurkar and his coauthors recently published Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks, a paper in which they demonstrate the use of Convolutional Neural Networks which outperform board…

[MINI] Recurrent Neural Networks

RNNs are a class of deep learning models designed to capture sequential behavior. An RNN trains a set of weights which depend not just on new input but also on the previous state of the neural network. This directed cycle allows the…

Project Common Voice

Thanks to our sponsor Springboard. In this week’s episode, guest Andre Natal from Mozilla joins our host, Kyle Polich, to discuss a couple exciting new developments in open source speech recognition systems, which include Project Common…

[MINI] Bayesian Belief Networks

A Bayesian Belief Network is an acyclic directed graph composed of nodes that represent random variables and edges that imply a conditional dependence between them. It’s an intuitive way of encoding your statistical knowledge about a system and is…

pix2code

In this episode, Tony Beltramelli of UIzard Technologies joins our host, Kyle Polich, to talk about the ideas behind his latest app that can transform graphic design into functioning code, as well as his previous work on spying with wearables.

[MINI] Conditional Independence

In statistics, two random variables might depend on one another (for example, interest rates and new home purchases). We call this conditional dependence. An important related concept exists called conditional independence. This phrase describes…

Estimating Sheep Pain with Facial Recognition

Animals can’t tell us when they’re experiencing pain, so we have to rely on other cues to help treat their discomfort. But it is often difficult to tell how much an animal is suffering. The sheep, for instance, is the most inscrutable of animals….

CosmosDB

This episode collects interviews from my recent trip to Microsoft Build where I had the opportunity to speak with Dharma Shukla and Syam Nair about the recently announced CosmosDB. CosmosDB is a globally consistent, distributed datastore that supports…

[MINI] The Vanishing Gradient

This episode discusses the vanishing gradient - a problem that arises when training deep neural networks in which nearly all the gradients are very close to zero by the time back-propagation has reached the first hidden layer. This makes learning…

Doctor AI

When faced with medical issues, would you want to be seen by a human or a machine? In this episode, guest Edward Choi, co-author of the study titled Doctor AI: Predicting Clinical Events via Recurrent Neural Network shares his thoughts. Edward presents…

[MINI] Activation Functions

In a neural network, the output value of a neuron is almost always transformed in some way using a function. A trivial choice would be a linear transformation which can only scale the data. However, other transformations, like a step function allow…

MS Build 2017

This episode recaps the Microsoft Build Conference. Kyle recently attended and shares some thoughts on cloud, databases, cognitive services, and artificial intelligence. The episode includes interviews with Rohan Kumar and David…

[MINI] Max-pooling

Max-pooling is a procedure in a neural network which has several benefits. It performs dimensionality reduction by taking a collection of neurons and reducing them to a single value for future layers to receive as input. It can also prevent…
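The reduction described above can be sketched in a few lines of plain Python. This is only an illustration over a nested list: the function name `max_pool_2d`, the non-overlapping 2x2 window, and the sample grid are assumptions, and real frameworks offer their own pooling layers.

```python
from typing import List


def max_pool_2d(grid: List[List[float]], size: int = 2) -> List[List[float]]:
    """Replace each non-overlapping size x size block with its largest value."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            max(grid[r + dr][c + dc] for dr in range(size) for dc in range(size))
            for c in range(0, cols - size + 1, size)
        ]
        for r in range(0, rows - size + 1, size)
    ]


# Hypothetical 4x4 layer of activations reduced to 2x2.
activations = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 9, 2],
    [3, 2, 4, 6],
]
print(max_pool_2d(activations))  # [[4, 5], [3, 9]]
```

Sixteen values become four, which is the dimensionality reduction the blurb mentions.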

Unsupervised Depth Perception

This episode is an interview with Tinghui Zhou. In the recent paper “Unsupervised Learning of Depth and Ego-motion from Video”, Tinghui and collaborators propose a deep learning architecture which is able to learn depth and pose information from…

[MINI] Convolutional Neural Networks

CNNs are characterized by their use of a group of neurons typically referred to as a filter or kernel. In image recognition, this kernel is repeated over the entire image. In this way, CNNs may achieve the property of translational…

Multi-Agent Diverse Generative Adversarial Networks

Despite the success of GANs in imaging, one of their major drawbacks is the problem of ‘mode collapse,’ where the generator learns to produce samples with extremely low variety. To address this issue, today’s guests Arnab Ghosh and Viveka Kulharia…

[MINI] Generative Adversarial Networks

GANs are an unsupervised learning method involving two neural networks iteratively competing. The discriminator is a typical learning system. It attempts to develop the ability to recognize members of a certain class, such as all photos which have…

Opinion Polls for Presidential Elections

Recently, we’ve seen opinion polls come under some skepticism. But is that skepticism truly justified? The recent Brexit referendum and US 2016 Presidential Election are examples where some claim the polls “got it wrong”. This…

OpenHouse

No reliable, complete database cataloging home sales data at a transaction level is available for the average person to access. For data scientists interested in studying this data, our hands are completely tied. Opportunities like testing sociological…

[MINI] GPU CPU

There’s more than one type of computer processor. The central processing unit (CPU) is typically what one means when they say “processor”. GPUs were introduced to be highly optimized for doing floating point computations in parallel. These types of…

[MINI] Backpropagation

Backpropagation is a common algorithm for training a neural network. It works by computing the gradient of the overall error with respect to each weight, and using stochastic gradient descent to iteratively fine tune the weights of the network….

Data Science at Patreon

In this week’s episode of Data Skeptic, host Kyle Polich talks with guest Maura Church, Patreon’s data science manager. Patreon is a fast-growing crowdfunding platform that allows artists and creators of all kinds to build their own…

[MINI] Feed Forward Neural Networks

In a feed forward neural network, neurons cannot form a cycle. In this episode, we explore how such a network would be able to represent three common logical operators: OR, AND, and XOR. The XOR operation is the…
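One way to picture this is with hand-picked weights rather than trained ones. The Python sketch below is an assumption-laden illustration, not the construction used in the episode: OR and AND each fit in a single neuron, while XOR here uses a hidden layer made of those two neurons.

```python
def step(z: float) -> int:
    """Fire (1) when the weighted input is positive, otherwise stay at 0."""
    return 1 if z > 0 else 0


def neuron_or(x1: int, x2: int) -> int:
    return step(x1 + x2 - 0.5)


def neuron_and(x1: int, x2: int) -> int:
    return step(x1 + x2 - 1.5)


def network_xor(x1: int, x2: int) -> int:
    # Hidden layer: one OR neuron and one AND neuron.
    h_or, h_and = neuron_or(x1, x2), neuron_and(x1, x2)
    # Output neuron fires when OR is on and AND is off.
    return step(h_or - h_and - 0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, neuron_or(a, b), neuron_and(a, b), network_xor(a, b))
```

Information only flows from inputs to hidden neurons to the output, so no cycle is formed.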

Reinventing Sponsored Search Auctions

In this Data Skeptic episode, Kyle is joined by guest Ruggiero Cavallo to discuss his latest efforts to mitigate the problems presented in this new world of online advertising. Working with his collaborators, Ruggiero reconsiders the search ad…

[MINI] The Perceptron

Today’s episode overviews the perceptron algorithm. This rather simple approach is characterized by a few particular features. It updates its weights after seeing every example, rather than as a batch. It uses a step function as an activation…
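A minimal Python sketch of those two features (a weight update after every single example, and a step function as the activation) might look like the following. The function names, the learning rate, the epoch count, and the AND training data are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def step(z: float) -> int:
    """Step activation: output 1 when the weighted sum is non-negative."""
    return 1 if z >= 0 else 0


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    return step(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_perceptron(
    examples: Sequence[Tuple[Sequence[float], int]],
    epochs: int = 20,
    learning_rate: float = 1.0,
) -> Tuple[List[float], float]:
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in examples:
            # Online update: adjust immediately after each example, not per batch.
            error = target - predict(weights, bias, x)
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


# Hypothetical, linearly separable training set: logical AND.
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_data)
print([predict(w, b, x) for x, _ in and_data])
```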

The Data Refuge Project

DataRefuge is a public collaborative, grassroots effort around the United States in which scientists, researchers, computer scientists, librarians and other volunteers are working to download, save, and re-upload government data. The DataRefuge…

[MINI] Automated Feature Engineering

If a CEO wants to know the state of their business, they ask their highest ranking executives. These executives, in turn, should know the state of the business through reports from their subordinates. This structure is roughly analogous to a process…

Big Data Tools and Trends

In this episode, I speak with Raghu Ramakrishnan, CTO for Data at Microsoft. We discuss services, tools, and developments in the big data sphere as well as the underlying needs that drove these innovations.

[MINI] Primer on Deep Learning

In this episode, we talk about a high-level description of deep learning. Kyle presents a simple game (pictured below), which is more of a puzzle really, to try and give Linh Da the basic concept. Thanks to our sponsor for…

Data Provenance and Reproducibility with Pachyderm

Versioning isn’t just for source code. Being able to track changes to data is critical for answering questions about data provenance, quality, and reproducibility. Daniel Whitenack joins me this week to talk about these concepts and share his work on…

[MINI] Logistic Regression on Audio Data

Logistic Regression is a popular classification algorithm. In this episode, we discuss how it can be used to determine if an audio clip represents one of two given speakers. It assumes an output variable (isLinhda) is a linear combination of available…

Studying Competition and Gender Through Chess

Prior work has shown that people’s response to competition is in part predicted by their gender. Understanding why and when this occurs is important in areas such as labor market outcomes. A well structured study is challenging due to numerous…

[MINI] Dropout

Deep learning can be prone to overfit a given problem. This is especially frustrating given how much time and computational resources are often required to converge. One technique for fighting overfitting is to use dropout. Dropout is the method of…

The Police Data and the Data Driven Justice Initiatives

In this episode I speak with Clarence Wardell and Kelly Jin about their mutual service as part of the White House’s Police Data Initiative and Data Driven Justice Initiative respectively. The Police Data Initiative was organized to use open data to…

The Library Problem

We close out 2016 with a discussion of a basic interview question which might get asked when applying for a data science job. Specifically, how a library might build a model to predict if a book will be returned late or not.

2016 Holiday Special

Today’s episode is a reading of Isaac Asimov’s Franchise. As mentioned on the show, this is just a work of fiction to be enjoyed and not in any way some obfuscated political statement. Enjoy, and happy holidays!

[MINI] Entropy

Classically, entropy is a measure of disorder in a system. From a statistical perspective, it is more useful to say it’s a measure of the unpredictability of the system. In this episode we discuss how information reduces the entropy in deciding…
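
To make "unpredictability" concrete, the sketch below computes Shannon entropy in bits for a discrete distribution. The coin examples and the function name are illustrative assumptions.

```python
import math


def shannon_entropy(probabilities):
    """Entropy in bits of a discrete distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


fair_coin = shannon_entropy([0.5, 0.5])    # hardest to predict
biased_coin = shannon_entropy([0.9, 0.1])  # easier to predict, lower entropy
certain = shannon_entropy([1.0])           # nothing left to learn
```

Learning that a coin is heavily biased is information, and it shows up as a lower entropy than the fair coin's.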

MS Connect Conference

Cloud services are now ubiquitous in data science and more broadly in technology as well. This week, I speak to Mark Souza, Tobias Ternström, and Corey Sanders about various aspects of data at scale. We discuss the embedding of R into SQL Server,…

Causal Impact

Today’s episode is all about Causal Impact, a technique for estimating the impact of a particular event on a time series. We talk to William Martin about his research into the impact releases have on apps, and we also chat with Karen Blakemore about a…

[MINI] The Bootstrap

The Bootstrap is a method of resampling a dataset to possibly refine its accuracy and produce useful metrics on the result. The bootstrap is a useful statistical technique and is leveraged in Bagging (bootstrap aggregation) algorithms such as Random…
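
One common use of resampling is a percentile interval for a mean. The sketch below is only an assumption-laden illustration: the commute-time data, the seed, and the function name are made up for the example.

```python
import random
import statistics


def bootstrap_mean_interval(data, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `data`."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=len(data))  # draw with replacement
        means.append(statistics.fmean(resample))
    means.sort()
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[int((1.0 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical sample: commute times in minutes.
commutes = [22, 25, 31, 28, 24, 35, 27, 29, 26, 30]
lower, upper = bootstrap_mean_interval(commutes)
```

Each resample is the same size as the original data, so the spread of the resampled means approximates how much the sample mean itself would vary.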

[MINI] Gini Coefficients

The Gini Coefficient (as it relates to decision trees) is one approach to determining the optimal decision to introduce which splits your dataset as part of a decision tree. To pick the right feature to split on, it considers the frequency of the…

Unstructured Data for Finance

Financial analysis techniques for studying numeric, well structured data are very mature. While using unstructured data in finance is not necessarily a new idea, the area is still very greenfield. On this episode, Delia Rusu shares her thoughts on the…

[MINI] AdaBoost

AdaBoost is a canonical example of the class of AnyBoost algorithms that create ensembles of weak learners. We discuss how a complex problem like predicting restaurant failure (which is surely caused by different problems in different situations)…

Stealing Models from the Cloud

Platform as a service is a growing trend in data science where services like fraud analysis and face detection can be provided via APIs. Such services turn the actual model into a black box to the consumer. But can the model be reverse engineered?…

[MINI] Calculating Feature Importance

For machine learning models created with the random forest algorithm, there is no obvious diagnostic to inform you which features are more important in the output of the model. Some straightforward but useful techniques exist revolving around removing…

NYC Bike Share Rebalancing

As cities provide bike sharing services, they must also plan for how to redistribute bicycles as they inevitably build up at more popular destination stations. In this episode, Hui Xiong talks about the solution he and his colleagues developed to …

[MINI] Random Forest

Random forest is a popular ensemble learning algorithm which leverages bagging both for sampling and feature selection. In this episode we make an analogy to the process of running a bookstore.

Election Predictions

Jo Hardin joins us this week to discuss the ASA’s Election Prediction Contest. This is a competition aimed at forecasting the results of the upcoming US presidential election competition. More details are available in Jo’s blog post found here. You…

[MINI] F1 Score

The F1 score is a model diagnostic that combines precision and recall to provide a singular evaluation for model comparison. In this episode we discuss how it applies to selecting an interior designer.
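
For reference, F1 is the harmonic mean of precision and recall. A small sketch follows; the confusion counts are invented for illustration.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, from raw confusion counts."""
    if true_positives == 0:
        return 0.0  # no correct positive predictions at all
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: precision is 0.8 and recall is 0.5.
score = f1_score(true_positives=8, false_positives=2, false_negatives=8)
```

The harmonic mean sits closer to the weaker of the two numbers, so a model cannot earn a high F1 by excelling at only one of them.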

Urban Congestion

Urban congestion affects every person living in a city of any reasonable size. Lewis Lehe joins us in this episode to share his work on downtown congestion pricing. We explore topics of how different pricing mechanisms affect congestion as well as how…

[MINI] Heteroskedasticity

Heteroskedasticity is a term used to describe a relationship between two variables which has unequal variance over the range. For example, the variance in the length of a cat’s tail almost certainly changes (grows) with age. On the other…

Music21

Our guest today is Michael Cuthbert, an associate professor of music at MIT and principal investigator of the Music21 project, which we focus our discussion on today. Music21 is a python library making analysis of music accessible and fun. It supports…

[MINI] Paxos

Paxos is a protocol for arriving at a consensus in a distributed computing system which accounts for unreliability of the nodes. We discuss how this might be used in the real world in the event of a massive disaster.

Trusting Machine Learning Models with LIME

Machine learning models are often criticized for being black boxes. If a human cannot determine why the model arrives at the decision it made, there’s good cause for skepticism. Classic inspection approaches to model interpretability are only useful…

[MINI] ANOVA

Analysis of variance is a method used to evaluate differences between two or more groups. It works by breaking down the total variance of the system into the between group variance and within group variance. We discuss this method in…

Machine Learning on Images with Noisy Human-centric Labels

When humans describe images, they have a reporting bias, in that they report only what they consider important. Thus, in addition to considering whether something is present in an image, one should consider whether it is also relevant to the image…

[MINI] Survival Analysis

Survival analysis techniques are useful for studying the longevity of groups of elements or individuals, taking into account time considerations and right censorship. This episode explores how survival analysis can describe marriages, in particular,…

Predictive Models on Random Data

This week is an insightful discussion with Claudia Perlich about some situations in machine learning where models can be built, perhaps by well-intentioned practitioners, to appear to be highly predictive despite being trained on random data. Our…

[MINI] Receiver Operating Characteristic (ROC) Curve

An ROC curve is a plot that compares the trade off of true positives and false positives of a binary classifier under different thresholds. The area under the curve (AUC) is useful in determining how discriminating a model is. Together, ROC and AUC…
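
A sketch of how the curve and its area can be computed by sweeping a threshold over classifier scores. The labels and scores are invented, and the trapezoidal rule is just one way to get the area.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [score >= threshold for score in scores]
        tp = sum(1 for y, p in zip(labels, predicted) if p and y == 1)
        fp = sum(1 for y, p in zip(labels, predicted) if p and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under (x, y) points that are sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical classifier output: 1 = positive class.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = area_under_curve(roc_points(labels, scores))
```

Here 8 of the 9 positive/negative pairs are ranked correctly, which is exactly what the area of 8/9 expresses.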

Multiple Comparisons and Conversion Optimization

I’m joined by Chris Stucchio this week to discuss how deliberate or uninformed statistical practitioners can derive spurious and arbitrary results via multiple comparisons. We discuss p-hacking and a variety of other important lessons and tips for…

[MINI] Leakage

If you’d like to make a good prediction, your best bet is to invent a time machine, visit the future, observe the value, and return to the past. For those without access to time travel technology, we need to avoid including information about the…

Predictive Policing

Kristian Lum (@KLdivergence) joins me this week to discuss her work at @hrdag on predictive policing. We also discuss Multiple Systems Estimation, a technique for inferring statistical information about a population from separate sources of…

[MINI] The CAP Theorem

Distributed computing cannot guarantee consistency, availability, and partition tolerance all at once. Most system architects need to think carefully about how they should appropriately balance the needs of their application across these competing objectives. Linh Da…

Detecting Terrorists with Facial Recognition?

A startup is claiming that they can detect terrorists purely through facial recognition. In this solo episode, Kyle explores the plausibility of these claims.

[MINI] Goodhart’s Law

Goodhart’s law states that “When a measure becomes a target, it ceases to be a good measure”. In this mini-episode we discuss how this affects SEO, call centers, and Scrum.

Data Science at eHarmony

I’m joined this week by Jon Morra, director of data science at eHarmony to discuss a variety of ways in which machine learning and data science are being applied to help connect people for successful long term relationships. Interesting open source…

[MINI] Stationarity and Differencing

Mystery shoppers and fruit cultivation help us discuss stationarity - a property of some time series that are invariant to time in several ways. Differencing is one approach that can often convert a non-stationary process into a stationary one….
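
A minimal sketch of differencing, assuming a made-up series with a steady upward trend. The function name and data are illustrative only.

```python
def difference(series, lag=1):
    """Replace each value with its change from `lag` steps earlier."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


# Hypothetical non-stationary series: the mean drifts upward over time.
trend = [3 * t + 10 for t in range(8)]
differenced = difference(trend)
```

The original values keep climbing, while the differenced series is flat: its mean no longer depends on time.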

Feather

I’m joined by Wes McKinney (@wesmckinn) and Hadley Wickham (@hadleywickham) on this episode to discuss their joint project Feather. Feather is a file format for storing data frames along with some metadata, to help with interoperability between…

[MINI] Bargaining

Bargaining is the process of two (or more) parties attempting to agree on the price for a transaction. Game theoretic approaches attempt to find two strategies from which neither party is motivated to deviate. These strategies are said to…

deepjazz

Deepjazz is a project from Ji-Sung Kim, a computer science student at Princeton University. It is built using Theano, Keras, music21, and Evan Chow’s project jazzml. Deepjazz is a computational music project that creates original jazz…

[MINI] Auto-correlative functions and correlograms

When working with time series data, there are a number of important diagnostics one should consider to help understand more about the data. The auto-correlative function, plotted as a correlogram, helps explain how a given observation relates to…

Early Identification of Violent Criminal Gang Members

This week I spoke with Elham Shaabani and Paulo Shakarian (@PauloShakASU) about their recent paper Early Identification of Violent Criminal Gang Members (also available on arXiv). In this paper, they use social network analysis techniques and machine…

[MINI] Fractional Factorial Design

A dinner party at Data Skeptic HQ helps teach the uses of fractional factorial design for studying 2-way interactions.

Machine Learning Done Wrong

Cheng-tao Chu (@chengtao_chu) joins us this week to discuss his perspective on common mistakes and pitfalls that are made when doing machine learning. This episode is filled with sage advice for beginners and intermediate users of machine learning,…

Potholes

Co-host Linh Da was in a biking accident after hitting a pothole. She sustained an injury that required stitches. This is the story of our quest to file a 311 complaint and track it through the City of Los Angeles’s open data portal. My guests this…

[MINI] The Elbow Method

Certain data mining algorithms (including k-means clustering and k-nearest neighbors) require a user defined parameter k. A user of these algorithms is required to select this value, which raises the questions: what is the “best” value of k that one…

Too Good to be True

Today on Data Skeptic, Lachlan Gunn joins us to discuss his recent paper Too Good to be True. This paper highlights a somewhat paradoxical / counterintuitive fact about how unanimity is unexpected in cases where perfect measurements cannot be taken….

[MINI] R-squared

How well does your model explain your data? R-squared is a useful statistic for answering this question. In this episode we explore how it applies to the problem of valuing a house. Aspects like the number of bedrooms go a long way in explaining why…

Models of Mental Simulation

Jessica Hamrick joins us this week to discuss her work studying mental simulation. Her research combines machine learning approaches with behavioral methods from cognitive science to help explain how people reason and predict…

[MINI] Multiple Regression

This episode is a discussion of multiple regression: the use of observations that are a vector of values to predict a response variable. For this episode, we consider how features of a home such as the number of bedrooms, number of bathrooms, and…

Scientific Studies of People’s Relationship to Music

Samuel Mehr joins us this week to share his perspective on why people are musical, where music comes from, and why it works the way it does. We discuss a number of empirical studies related to music and musical cognition, and dispel a few myths…

[MINI] k-d trees

This episode reviews the concept of k-d trees: an efficient data structure for holding multidimensional objects. Kyle gives Linhda a dictionary and asks her to look up words as a way of introducing the concept of binary search. We actually spend most…

Auditing Algorithms

Algorithms are pervasive in our society and make thousands of automated decisions on our behalf every day. The possibility of digital discrimination is a very real threat, and it is very plausible for discrimination to occur accidentally (i.e. outside…

[MINI] The Bonferroni Correction

Today’s episode begins by asking how many left handed employees we should expect to be at a company before anyone should claim left handedness discrimination. If not lefties, let’s consider eye color, hair color, favorite ska band, most recent grocery…

Detecting Pseudo-profound BS

A recent paper in the journal of Judgment and Decision Making titled On the reception and detection of pseudo-profound bullshit explores empirical questions around a reader’s ability to detect statements which may sound profound but are actually a…

[MINI] Gradient Descent

Today’s mini episode discusses the widely known optimization algorithm gradient descent in the context of hiking in a foggy hillside.
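
In the foggy-hillside picture, you can only feel the slope under your feet and take a step downhill. A sketch of that loop, assuming a simple one-dimensional bowl f(x) = (x - 3)^2 chosen purely for illustration:

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=200):
    """Repeatedly step against the local slope, starting from `start`."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x


# Hypothetical hillside: f(x) = (x - 3)**2, whose slope is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), start=-10.0)
```

The learning rate plays the role of stride length: too small and the hike takes forever, too large and you can overshoot the valley floor.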

Let’s Kill the Word Cloud

This episode is a discussion of data visualization and a proposed New Year’s resolution for Data Skeptic listeners. Let’s kill the word cloud.

2015 Holiday Special

Today’s episode is a reading of Isaac Asimov’s The Machine that Won the War. I can’t think of a story that’s more appropriate for Data Skeptic.

Wikipedia Revision Scoring as a Service

In this interview with Aaron Halfaker of the Wikimedia Foundation, we discuss his research and career related to the study of Wikipedia. In his paper The Rise and Decline of an open Collaboration Community, he highlights a trend in the declining rate…

[MINI] Term Frequency - Inverse Document Frequency

Today’s topic is term frequency inverse document frequency, which is a statistic for estimating the importance of words and phrases in a set of documents.
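
There are several variants of this statistic. The sketch below uses one common form (relative term frequency times the log of inverse document frequency) on three made-up documents, with whitespace tokenization as a simplifying assumption.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Score each term in each document; returns one dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical corpus.
scores = tf_idf(["the cat sat", "the dog barked", "the cat purred"])
```

A word like "the" that appears in every document scores zero, while a word unique to one document scores highest there.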

The Hunt for Vulcan

Early astronomers could see several of the planets with the naked eye. The invention of the telescope allowed for further understanding of our solar system. The work of Isaac Newton allowed later scientists to accurately predict Neptune, which was…

[MINI] The Accuracy Paradox

Today’s episode discusses the accuracy paradox. There are cases when one might prefer a less accurate model because it yields more predictive power or better captures the underlying causal factors describing the outcome variable you are interested in….

Neuroscience from a Data Scientist’s Perspective

… or should this have been called data science from a neuroscientist’s perspective? Either way, I’m sure you’ll enjoy this discussion with Laurie Skelly. Laurie earned a PhD in Integrative Neuroscience from the Department of Psychology at the…

[MINI] Bias Variance Tradeoff

A discussion of the expected number of cars at a stoplight frames today’s discussion of the bias variance tradeoff. The central idea of this concept relates to model complexity. A very simple model will likely generalize well from training to testing…

Big Data Doesn’t Exist

The recent opinion piece Big Data Doesn’t Exist on Tech Crunch by Slater Victoroff is an interesting discussion about the usefulness of data both big and small. Slater joins me this episode to discuss and expand on this discussion. Slater Victoroff…

[MINI] Covariance and Correlation

The degree to which two variables change together can be calculated in the form of their covariance. This value can be normalized to the correlation coefficient, which has the advantage of transforming it to a unitless measure strictly bounded between…
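
A sketch of both quantities using sample (n - 1) formulas. The data and the unit change are invented to show why the normalized version is convenient.

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Covariance rescaled by both standard deviations (Pearson's r)."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical measurements, then the same heights expressed in centimeters.
heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]
weights_kg = [52, 58, 66, 71, 80]
heights_cm = [h * 100 for h in heights_m]

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)
r_m = correlation(heights_m, weights_kg)
r_cm = correlation(heights_cm, weights_kg)
```

Switching units multiplies the covariance by 100, but the correlation coefficient stays the same.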

Bayesian A/B Testing

Today’s guest is Cameron Davidson-Pilon. Cameron has a masters degree in quantitative finance from the University of Waterloo. Think of it as statistics on stock markets. For the last two years he’s been the team lead of data science at Shopify. He’s…

[MINI] The Central Limit Theorem

The central limit theorem is an important statistical result which states that typically, the mean of a large enough set of independent trials is approximately normally distributed. This episode explores how this might be used to determine if an…

Accessible Technology

Today’s guest is Chris Hofstader (@gonz_blinko), an accessibility researcher and advocate, as well as an activist for causes such as improving access to information for blind and vision impaired people. His background in computer programming enabled…

[MINI] Multi-armed Bandit Problems

The multi-armed bandit problem is named with reference to slot machines (one armed bandits). Given the chance to play from a pool of slot machines, all with unknown payout frequencies, how can you maximize your reward? If you knew in advance which…

Shakespeare, Abiogenesis, and Exoplanets

Our episode this week begins with a correction. Back in episode 28 (Monkeys on Typewriters), Kyle made some bold claims about the probability that monkeys banging on typewriters might produce the entire works of Shakespeare by chance. The proof shown…

[MINI] Sample Sizes

There are several factors that are important to selecting an appropriate sample size and dealing with small samples. The most important questions are around representativeness - how well does your sample represent the total population and capture all…

The Model Complexity Myth

There’s an old adage which says you cannot fit a model which has more parameters than you have data. While this is often the case, it’s not a universal truth. Today’s guest Jake VanderPlas explains this topic in detail and provides some excellent…

[MINI] Distance Measures

There are many occasions in which one might want to know the distance or similarity between two things, for which the means of calculating that distance is not necessarily clear. The distance between two points in Euclidean space is generally…

ContentMine

ContentMine is a project which provides the tools and workflow to convert scientific literature into machine readable and machine interpretable data in order to facilitate better and more effective access to the accumulated knowledge of humankind….

[MINI] Structured and Unstructured Data

Today’s mini-episode explains the distinction between structured and unstructured data, and debates which of these categories best describe recipes.

Measuring the Influence of Fashion Designers

Yusan Lin shares her research on using data science to explore the fashion industry in this episode. She has applied techniques from data mining, natural language processing, and social network analysis to explore who are the innovators in the fashion…

[MINI] PageRank

PageRank is the algorithm most famous for being one of the original innovations that made Google stand out as a search engine. It was defined in the classic paper The Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Larry…

Data Science at Work in LA County

with Benjamin Uminsky

[MINI] k-Nearest Neighbors

This episode explores the k-nearest neighbors algorithm which is a supervised, non-parametric method that can be used for both classification and regression. The basic concept is that it leverages some distance function on your dataset to find the…
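
A minimal classification sketch, assuming Euclidean distance and a majority vote among the k closest labeled points. The points and labels are made up.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled points in two dimensions.
training = [
    ((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
    ((8, 8), "blue"), ((8, 9), "blue"), ((9, 8), "blue"),
]
near_red = knn_classify(training, (2, 2))
near_blue = knn_classify(training, (7, 7))
```

For regression, the vote would be replaced by an average of the neighbors' values.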

Crypto

How do people think rationally about small probability events? What is the optimal statistical process by which one can update their beliefs in light of new evidence? This episode of Data Skeptic explores questions like this as Kyle consults a cast of…

[MINI] MapReduce

This mini-episode is a high level explanation of the basic idea behind MapReduce, which is a fundamental concept in big data. The origin of the idea comes from a Google paper titled MapReduce: Simplified Data Processing on Large Clusters. This…

Genetically Engineered Food and Trends in Herbicide Usage

The Credible Hulk joins me in this episode to discuss a recent blog post he wrote about glyphosate and the data about how its introduction changed the historical usage trends of other herbicides. Links to all the sources and…

[MINI] The Curse of Dimensionality

More features are not always better! With an increasing number of features to consider, machine learning algorithms suffer from the curse of dimensionality, as they have a wider set and often sparser coverage of examples to consider. This episode…

Video Game Analytics

with Anders Drachen

[MINI] Anscombe’s Quartet

This mini-episode discusses Anscombe’s Quartet, a series of four datasets which are clearly very different but share some similar statistical properties with one another. For example, each of the four plots has the same mean and variance on both…

Proposing Annoyance Mining

A recent episode of the Skeptics’ Guide to the Universe included a slight rant by Dr. Novella and the rogues about a shortcoming in operating systems. This episode explores why such a (seemingly obvious) flaw might make sense from an engineering…

Preserving History at Cyark

with Elizabeth Lee

[MINI] A Critical Examination of a Study of Marriage by Political Affiliation

Linhda and Kyle review a New York Times article titled How Your Hometown Affects Your Chances of Marriage. This article explores research about what correlates with the likelihood of being married by age 26 by county. Kyle and LinhDa discuss some…

Detecting Cheating in Chess

with Dr. Kenneth Regan

[MINI] z-scores

This week’s episode discusses z-scores, also known as standard scores. This score describes the distance (in standard deviations) that an observation is away from the mean of the population. A closely related topic is the 68-95-99.7 rule which…
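
A short sketch of the calculation, treating the made-up list below as the whole population (so the population standard deviation is used).

```python
import statistics


def z_scores(values):
    """Distance of each value from the mean, in standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [(value - mean) / sd for value in values]


# Hypothetical population with mean 5 and standard deviation 2.
population = [2, 4, 4, 4, 5, 5, 7, 9]
scores = z_scores(population)
```

An observation of 9 sits two standard deviations above the mean, so its z-score is 2.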

Using Data to Help Those in Crisis

A DataKind project with Crisis Text Line and Pivotal for Good

The Ghost in the MP3

Have you ever wondered what is lost when you compress a song into an MP3? This week’s guest Ryan Maguire did more than that. He worked on software to isolate the sounds that are lost when you convert a lossless digital audio recording into a…

Data Fest 2015

This episode contains coverage of the 2015 Data Fest hosted at UCLA. Data Fest is an analysis competition that gives teams of students 48 hours to explore a new dataset and present novel findings. This year, data from Edmunds.com…

[MINI] Cornbread and Overdispersion

For our 50th episode we indulge a bit by cooking Linhda’s previously mentioned “healthy” cornbread. This leads to a discussion of the statistical topic of overdispersion in which the variance of some distribution is larger than what…

[MINI] Natural Language Processing

This episode overviews some of the fundamental concepts of natural language processing including stemming, n-grams, part of speech tagging, and the bag of words approach.

Computer-based Personality Judgments

Guest Youyou Wu discusses the work she and her collaborators did to measure the accuracy of computer based personality judgments. Using Facebook “like” data, they found that machine learning approaches could be used to estimate user’s self assessment…

[MINI] Markov Chain Monte Carlo

This episode explores how going wine tasting could teach us about using Markov chain Monte Carlo (MCMC).

[MINI] Markov Chains

This episode introduces the idea of a Markov Chain. A Markov Chain has a set of states describing a particular system, and a probability of moving from one state to another along every valid connected state. Markov Chains are memoryless, meaning they…
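
A sketch of a two-state chain: each state carries a probability of moving to every connected state, and the next step is drawn using only the current state. The weather states and probabilities are invented for illustration.

```python
import random

# Hypothetical transition probabilities; each row sums to 1.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def next_state(current, rng):
    """Draw the next state using only the current one (memorylessness)."""
    options = TRANSITIONS[current]
    return rng.choices(list(options), weights=list(options.values()))[0]


def simulate(start, n_steps, seed=0):
    """Walk the chain for n_steps and return the visited states."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(n_steps):
        path.append(next_state(path[-1], rng))
    return path


path = simulate("sunny", 10)
```

Note that `next_state` never looks at the earlier entries of `path`, which is what memoryless means in practice.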

Oceanography and Data Science

Nicole Goebel joins us this week to share her experiences in oceanography studying phytoplankton and other aspects of the ocean and how data plays a role in that science. We also discuss Thinkful where Nicole and I are both…

[MINI] Ordinary Least Squares Regression

This episode explores Ordinary Least Squares or OLS - a method for finding a good fit which describes a given dataset.
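
For a single predictor, the least squares line has a closed form. A sketch follows, with made-up points that lie exactly on y = 2x + 1 so the result is easy to check.

```python
def ols_fit(xs, ys):
    """Slope and intercept minimizing the sum of squared vertical errors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data generated from y = 2x + 1 with no noise.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
slope, intercept = ols_fit(xs, ys)
```

With noisy data the fitted line would no longer pass through every point, but it would still be the line with the smallest total squared error.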

NYC Speed Camera Analysis with Tim Schmeier

New York State approved the use of automated speed cameras within a specific range of schools. Tim Schmeier did an analysis of publicly available data related to these cameras as part of a project at the NYC Data Science Academy. Tim’s work…

[MINI] k-means clustering

The k-means clustering algorithm is an algorithm that computes a deterministic label for a given “k” number of clusters from an n-dimensional dataset. This mini-episode explores how Yoshi, our lilac crowned amazon’s biological processes might be…

Shadow Profiles on Social Networks

Emre Sarigol joins me this week to discuss his paper Online Privacy as a Collective Phenomenon. This paper studies data collected from social networks and how the sharing behaviors of individuals can unintentionally reveal private…

[MINI] The Chi-Squared Test

The χ² (Chi-Squared) test is a methodology for hypothesis testing. When one has categorical data, in the form of frequency counts or observations (e.g. Vegetarian, Pescetarian, and Omnivore), split into two or more categories (e.g. Male,…

Mapping Reddit Topics with Randy Olson

My guest this week is noteworthy AI researcher Randy Olson, who joins me to share his work creating the Reddit World Map - a visualization that illuminates clusters in the reddit community based on user behavior. Randy’s blog…

[MINI] Partially Observable State Spaces

When dealing with dynamic systems that are potentially undergoing constant change, it’s helpful to describe what “state” they are in. In many applications the manner in which the state changes from one to another is not completely predictable,…

Easily Fooling Deep Neural Networks

My guest this week is Anh Nguyen, a PhD student at the University of Wyoming working in the Evolving AI lab. The episode discusses the paper Deep Neural Networks are Easily Fooled [pdf] by Anh Nguyen, Jason Yosinski, and Jeff Clune. It…

[MINI] Data Provenance

This episode introduces a high level discussion on the topic of Data Provenance, with more MINI episodes to follow to get into specific topics. Thanks to listener Sara L who wrote in to point out the Data Skeptic Podcast has focused a lot…

Doubtful News, Geology, Investigating Paranormal Groups, and Thinking Scientifically with Sharon Hill

I had the chance to speak with well known Sharon Hill (@idoubtit) for the first episode of 2015. We discuss a number of interesting topics including the contributions Doubtful News makes to getting scientific and skeptical…

[MINI] Belief in Santa

In this quick holiday episode, we touch on how one would approach modeling the statistical distribution over the probability of belief in Santa Claus given age.

Economic Modeling and Prediction, Charitable Giving, and a Follow Up with Peter Backus

Economist Peter Backus joins me in this episode to discuss a few interesting topics. You may recall Linhda and I previously discussed his paper “The Girlfriend Equation” on a recent mini-episode. We start by touching base on this fun paper and get a…

[MINI] The Battle of the Sexes

Love and Data is the continued theme in this mini-episode as we discuss the game theory example of The Battle of the Sexes. In this textbook example, a couple must strategize about how to spend their Friday night. One partner prefers football games…

The Science of Online Data at Plenty of Fish with Thomas Levi

Can algorithms help you find love? Many happy couples successfully brought together via online dating websites show us that data science can help you find love. I’m joined this week by Thomas Levi, Senior Data Scientist at Plenty of Fish, to…

[MINI] The Girlfriend Equation

Economist Peter Backus put forward “The Girlfriend Equation” while working on his PhD - a probabilistic model attempting to estimate the likelihood of him finding a girlfriend. In this mini episode we explore the soundness of his model and also share…

The Secret and the Global Consciousness Project with Alex Boklin

I’m joined this week by Alex Boklin to explore the topic of magical thinking especially in the context of Rhonda Byrne’s “The Secret”, and the similarities it bears to The Global Consciousness Project (GCP). The GCP puts forward the hypothesis that…

[MINI] Monkeys on Typewriters

What is randomness? How can we determine if some results are randomly generated or not? Why are random numbers important to us in our everyday life? These topics and more are discussed in this mini-episode on random numbers. Many readers will be…

Mining the Social Web with Matthew Russell

This week’s episode explores the possibilities of extracting novel insights from the many great social web APIs available. Matthew Russell’s Mining the Social Web is a fantastic exploration of the tools and methods, and we explore a few…

[MINI] Is the Internet Secure?

This episode explores the basis of why we can trust encryption. Surprisingly, a discussion of looking up a word in the dictionary (binary search) and efficiently going wine tasting (the travelling salesman problem) help introduce computational…

Practicing and Communicating Data Science with Jeff Stanton

Jeff Stanton joins me in this episode to discuss his book An Introduction to Data Science, and some of the unique challenges and issues faced by someone doing applied data science. A challenge to any data scientist is making sure they have a…

[MINI] The T-Test

The t-test is this week’s mini-episode topic. The t-test is a statistical testing procedure used to determine if the means of two datasets differ by a statistically significant amount. We discuss how a wine manufacturer might apply a t-test to…

Data Myths with Karl Mamer

Black out baby boom | Subliminal messages | Beer and Diapers

Contest Announcement

The Data Skeptic Podcast is launching a contest- not one of chance, but one of skill. Listeners are encouraged to put their data science skills to good use, or if all else fails, guess! The contest works as follows. Below is some data about the…

[MINI] Selection Bias

A discussion about conducting US presidential election polls helps frame a conversation about selection bias.

[MINI] Confidence Intervals

Commute times and BBQ invites help frame a discussion about the statistical concept of confidence intervals.

[MINI] Value of Information

A discussion about getting ready in the morning, negotiating a used car purchase, and selecting the best AirBnB place to stay at helps frame a conversation about the decision theoretic principle known as the Value of Information equation.

Game Science Dice with Louis Zocchi

In this bonus episode, guest Louis Zocchi discusses his background in the gaming industry, specifically, how he became a manufacturer of dice designed to produce statistically uniform outcomes. During the show Louis mentioned a two part video…

Data Science at ZestFinance with Marick Sinay

Marick Sinay from ZestFinance is our guest this week. This episode explores how data science techniques are applied in the financial world, specifically in assessing creditworthiness.

[MINI] Decision Tree Learning

Linhda and Kyle talk about Decision Tree Learning in this miniepisode. Decision Tree Learning is the algorithmic process of trying to generate an optimal decision tree to properly classify or forecast some future unlabeled element based by…

Jackson Pollock Authentication Analysis with Kate Jones-Smith

Our guest this week is Hamilton physics professor Kate Jones-Smith who joins us to discuss the evidence for the claim that drip paintings of Jackson Pollock contain fractal patterns. This hypothesis originates in a paper by Taylor, Micolich,…

[MINI] Noise!!

Our topic for this week is “noise” as in signal vs. noise. This is not a signal processing discussion, but rather a brief introduction to how the word noise is used to describe how much information in a dataset is useless (as opposed to…

Guerrilla Skepticism on Wikipedia with Susan Gerbic

Our guest this week is Susan Gerbic. Susan is a skeptical activist involved in many activities; the one we focus on most in this episode is Guerrilla Skepticism on Wikipedia, an organization working to improve the content and citations of…

[MINI] Ant Colony Optimization

In this week’s mini episode, Linhda and Kyle discuss Ant Colony Optimization - a numerical / stochastic optimization technique which models its search after the process ants employ in using random walks to find a goal (food) and then leaving a…

Data in Healthcare IT with Shahid Shah

Our guest this week is Shahid Shah. Shahid is CEO at Netspective, and writes three blogs: Health Care Guy, Shahid Shah, and HitSphere - the Healthcare IT Supersite. During the program, Kyle recommended a talk from the 2014 MIT Sloan CIO Symposium…

[MINI] Cross Validation

This miniepisode discusses the technique called Cross Validation - a process by which one randomly divides up a dataset into numerous small partitions. Next, (typically) one is held out, and the rest are used to train some model. The hold out set can…
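The partition-and-hold-out idea can be sketched in a few lines of standard-library Python. This is a minimal k-fold illustration, not the procedure from the episode: the function name `k_fold_splits`, the fixed seed, and the toy dataset of ten integers are all assumptions made for the example.

```python
import random


def k_fold_splits(items, k, seed=0):
    """Yield (training, held_out) pairs, one per fold."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # random partitioning, reproducible
    folds = [shuffled[i::k] for i in range(k)]  # k roughly equal partitions
    for i in range(k):
        held_out = folds[i]
        # Everything outside the held-out fold is used for training.
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, held_out


# Toy dataset: ten items split into five folds.
splits = list(k_fold_splits(range(10), k=5))
for training, held_out in splits:
    print(f"train on {len(training)} items, evaluate on {held_out}")
```

Each item appears in exactly one held-out fold, so every data point is used for evaluation once and for training in the remaining rounds.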

Streetlight Outage and Crime Rate Analysis with Zach Seeskin

This episode features a discussion with statistics PhD student Zach Seeskin about a project he was involved in as part of the Eric and Wendy Schmidt Data Science for Social Good Summer Fellowship. The project involved exploring the…

[MINI] Experimental Design

This episode loosely explores the topic of Experimental Design including hypothesis testing, the importance of statistical tests, and an everyday and business example.

The Right (big data) Tool for the Job with Jay Shankar

In this week’s episode, we discuss applied solutions to big data problems with big data engineer Jay Shankar. The episode explores approaches and design philosophy to solving real world big data business problems, and the exploration of the wide…

[MINI] Bayesian Updating

In this minisode, we discuss Bayesian Updating - the process by which one can calculate how likely a hypothesis is to be true given one’s older / prior belief and all new evidence.
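A small sketch of a single update step, under assumed numbers: two competing hypotheses about a coin ("fair" or "biased" towards heads) and one observed head. The hypotheses, priors, and likelihoods are invented for illustration and do not come from the episode.

```python
def bayes_update(prior, likelihood):
    """Return the posterior P(H | E) for each hypothesis H.

    prior:      dict mapping hypothesis -> P(H)
    likelihood: dict mapping hypothesis -> P(E | H)
    """
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())  # P(E), the normalizing constant
    return {h: p / evidence for h, p in unnormalized.items()}


# Hypothetical prior beliefs and the probability of heads under each hypothesis.
prior = {"fair": 0.5, "biased": 0.5}
likelihood_of_heads = {"fair": 0.5, "biased": 0.8}

posterior = bayes_update(prior, likelihood_of_heads)
print(posterior)
```

The posterior from one observation can be passed back in as the prior for the next, which is what makes the process an "update".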

Personalized Medicine with Niki Athanasiadou

In the second full length episode of the podcast, we discuss the current state of personalized medicine and the advancements in genetics that have made it possible.

[MINI] p-values

In this mini, we discuss p-values and their use in hypothesis testing, in the context of a hypothetical experiment on plant flowering, and end with a reference to the Particle Fever documentary and how statistical significance played a role.

Advertising Attribution with Nathan Janos

A conversation with Convertro’s Nathan Janos about methodologies used to help advertisers understand the effect each of their marketing efforts (print, SEM, display, skywriting, etc.) has on their overall return.

[MINI] type i / type ii errors

In this first mini-episode of the Data Skeptic Podcast, we define and discuss type i and type ii errors (a.k.a. false positives and false negatives).
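To make the two error types concrete, the sketch below counts them for a toy set of yes/no predictions. The labels are made up for illustration; `True` stands for a positive case.

```python
def count_errors(actual, predicted):
    """Return (false_positives, false_negatives) for paired boolean labels."""
    # Type i error: predicted positive when the truth is negative.
    false_positives = sum(1 for a, p in zip(actual, predicted) if not a and p)
    # Type ii error: predicted negative when the truth is positive.
    false_negatives = sum(1 for a, p in zip(actual, predicted) if a and not p)
    return false_positives, false_negatives


# Hypothetical ground truth and predictions.
actual = [True, True, False, False, False]
predicted = [True, False, True, False, True]

fp, fn = count_errors(actual, predicted)
print(f"type i errors (false positives): {fp}")
print(f"type ii errors (false negatives): {fn}")
```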

Introduction

The Data Skeptic Podcast features conversations on topics related to data science, statistics, machine learning, artificial intelligence and the like, all from the perspective of applying critical thinking and the scientific method to evaluate the…