Data Engineering Podcast

www.dataengineeringpodcast.com
Weekly deep dives on data management with the engineers and entrepreneurs who are shaping the industry


Cutting Through The Noise And Focusing On The Fundamentals Of Data Engineering With The Data Janitor - Episode 151
Sep 21
Data engineering is a constantly growing and evolving discipline. There are always new tools, systems, and design patterns to learn, which leads to a great deal of confusion for newcomers. Daniel Molnar has dedicated his time to helping data professionals…
Distributed In Memory Processing And Streaming With Hazelcast - Episode 150
Sep 14 • 44 min
In memory computing provides significant performance benefits, but brings along challenges for managing failures and scaling up. Hazelcast is a platform for managing stateful in-memory storage and computation across a distributed cluster of commodity…
Simplify Your Data Architecture With The Presto Distributed SQL Engine - Episode 149
Sep 7 • 53 min
Databases are limited in scope to the information that they directly contain. For analytical use cases you often want to combine data across multiple sources and storage locations. This frequently requires cumbersome and time-consuming data integration.…
Building A Better Data Warehouse For The Cloud At Firebolt - Episode 148
Aug 31 • 65 min
Data warehouse technology has been around for decades and has gone through several generational shifts in that time. The current trends in data warehousing are oriented around cloud native architectures that take advantage of dynamic scaling and the…
Metadata Management And Integration At LinkedIn With DataHub - Episode 147
Aug 24 • 51 min
In order to scale the use of data across an organization there are a number of challenges related to discovery, governance, and integration that need to be solved. The key to those solutions is a robust and flexible metadata management system. LinkedIn…
Exploring The TileDB Universal Data Engine - Episode 146
Aug 17 • 65 min
Most databases are designed to work with textual data, with some special purpose engines that support domain specific formats. TileDB is a data engine that was built to support every type of data by using multi-dimensional arrays as the foundational…
Closing The Loop On Event Data Collection With Iteratively - Episode 145
Aug 10 • 59 min
Event based data is a rich source of information for analytics, unless none of the event structures are consistent. The team at Iteratively are building a platform to manage the end to end flow of collaboration around what events are needed, how to…
A Practical Introduction To Graph Data Applications - Episode 144
Aug 3 • 60 min
Finding connections between data and the entities that they represent is a complex problem. Graph data models and the applications built on top of them are perfect for representing relationships and finding emergent structures in your information. In this…
Build More Reliable Distributed Systems By Breaking Them With Jepsen - Episode 143
Jul 27 • 49 min
A majority of the scalable data processing platforms that we rely on are built as distributed systems. This brings with it a vast number of subtle ways that errors can creep in. Kyle Kingsbury created the Jepsen framework for testing the guarantees of…
Making Wind Energy More Efficient With Data At Turbit Systems - Episode 142
Jul 20 • 40 min
Wind energy is an important component of an ecologically friendly power system, but there are a number of variables that can affect the overall efficiency of the turbines. Michael Tegtmeier founded Turbit Systems to help operators of wind farms identify…
Open Source Production Grade Data Integration With Meltano - Episode 141
Jul 13
The first stage of every data pipeline is extracting the information from source systems. There are a number of platforms for managing data integration, but there is a notable lack of a robust and easy to use open source option. The Meltano project is…
DataOps For Streaming Systems With Lenses.io - Episode 140
Jul 6 • 45 min
There are an increasing number of use cases for real time data, and the systems to power them are becoming more mature. Once you have a streaming platform up and running you need a way to keep an eye on it, including observability, discovery, and…
Data Collection And Management For Teaching Machines To Hear At Audio Analytic - Episode 139
Jun 29 • 57 min
We have machines that can listen to and process human speech in a variety of languages, but dealing with unstructured sounds in our environment is a much greater challenge. The team at Audio Analytic are working to impart a sense of hearing to our myriad…
Bringing Business Analytics To End Users With GoodData - Episode 138
Jun 22 • 52 min
The majority of analytics platforms are focused on use internal to an organization by business stakeholders. As the availability of data increases and overall literacy in how to interpret it and take action improves there is a growing need to bring…
Accelerate Your Machine Learning With The StreamSQL Feature Store - Episode 137
Jun 15 • 46 min
Machine learning is a process driven by iteration and experimentation which requires fast and easy access to relevant features of the data being processed. In order to reduce friction in the process of developing and delivering models there has been a…
Data Management Trends From An Investor Perspective - Episode 136
Jun 8 • 54 min
The landscape of data management and processing is rapidly changing and evolving. There are certain foundational elements that have remained steady, but as the industry matures new trends emerge and gain prominence. In this episode Astasia Myers of…
Building A Data Lake For The Database Administrator At Upsolver - Episode 135
Jun 1 • 56 min
Data lakes offer a great deal of flexibility and the potential for reduced cost for your analytics, but they also introduce a great deal of complexity. What used to be entirely managed by the database engine is now a composition of multiple systems that…
Mapping The Customer Journey For B2B Companies At Dreamdata - Episode 134
May 25 • 46 min
Gaining a complete view of the customer journey is especially difficult in B2B companies. This is due to the number of different individuals involved and the myriad ways that they interface with the business. Dreamdata integrates data from the multitude…
Power Up Your PostgreSQL Analytics With Swarm64 - Episode 133
May 18 • 52 min
The PostgreSQL database is massively popular due to its flexibility and extensive ecosystem of extensions, but it is still not the first choice for high performance analytics. Swarm64 aims to change that by adding support for advanced hardware…
StreamNative Brings Streaming Data To The Cloud Native Landscape With Pulsar - Episode 132
May 11
There have been several generations of platforms for managing streaming data, each with their own strengths and weaknesses, and different areas of focus. Pulsar is one of the recent entrants which has quickly gained adoption and an impressive set of…
Enterprise Data Operations And Orchestration At Infoworks - Episode 131
May 4 • 45 min
Data management is hard at any scale, but working in the context of an enterprise organization adds even greater complexity. Infoworks is a platform built to provide a unified set of tooling for managing the full lifecycle of data in large businesses. By…
Taming Complexity In Your Data Driven Organization With DataOps - Episode 130
Apr 27 • 61 min
Data is a critical element to every role in an organization, which is also what makes managing it so challenging. With so many different opinions about which pieces of information are most important, how it needs to be accessed, and what to do with it,…
Building Real Time Applications On Streaming Data With Eventador - Episode 129
Apr 19 • 50 min
Modern applications frequently require access to real-time data, but building and maintaining the systems that make that possible is a complex and time consuming endeavor. Eventador is a managed platform designed to let you focus on using the data that…
Making Data Collection In Your Code Easy With Rookout - Episode 128
Apr 13 • 26 min
The software applications that we build for our businesses are a rich source of data, but accessing and extracting that data is often a slow and error-prone process. Rookout has built a platform to separate the data collection process from the lifecycle…
Building A Knowledge Graph Of Commercial Real Estate At Cherre - Episode 127
Apr 6 • 45 min
Knowledge graphs are a data resource that can answer questions beyond the scope of traditional data analytics. By organizing and storing data to emphasize the relationship between entities, we can discover the complex connections between multiple sources…
The Life Of A Non-Profit Data Professional - Episode 126
Mar 30 • 44 min
Building and maintaining a system that integrates and analyzes all of the data for your organization is a complex endeavor. Operating on a shoestring budget makes it even more challenging. In this episode Tyler Colby shares his experiences working as a…
Behind The Scenes Of The Linode Object Storage Service - Episode 125
Mar 23 • 35 min
There are a number of platforms available for object storage, including self-managed open source projects. But what goes on behind the scenes of the companies that run these systems at scale so you don’t have to? In this episode Will Smith shares the…
Building A New Foundation For CouchDB - Episode 124
Mar 16 • 55 min
CouchDB is a distributed document database built for scale and ease of operation. With a built-in synchronization protocol and an HTTP interface it has become popular as a backend for web and mobile applications. Created 15 years ago, it has accrued some…
Scaling Data Governance For Global Businesses With A Data Hub Architecture - Episode 123
Mar 9 • 54 min
Data governance is a complex endeavor, but scaling it to meet the needs of a complex or globally distributed organization requires a well considered and coherent strategy. In this episode Tim Ward describes an architecture that he has used successfully…
Easier Stream Processing On Kafka With ksqlDB - Episode 122
Mar 2 • 43 min
Building applications on top of unbounded event streams is a complex endeavor, requiring careful integration of multiple disparate systems that were engineered in isolation. The ksqlDB project was created to address this state of affairs by building a…
Shining A Light on Shadow IT In Data And Analytics - Episode 121
Feb 24 • 46 min
Misaligned priorities across business units can lead to tensions that drive members of the organization to build data and analytics projects without the guidance or support of engineering or IT staff. The availability of cloud platforms and managed…
Data Infrastructure Automation For Private SaaS At Snowplow - Episode 120
Feb 17 • 49 min
One of the biggest challenges in building reliable platforms for processing event pipelines is managing the underlying infrastructure. At Snowplow Analytics the complexity is compounded by the need to manage multiple instances of their platform across…
Data Modeling That Evolves With Your Business Using Data Vault - Episode 119
Feb 9 • 66 min
Designing the structure for your data warehouse is a complex and challenging process. As businesses deal with a growing number of sources and types of information that they need to integrate, they need a data modeling strategy that provides them with…
The Benefits And Challenges Of Building A Data Trust - Episode 118
Feb 3 • 56 min
Every business collects data in some fashion, but sometimes the true value of the collected information only comes when it is combined with other data sources. Data trusts are a legal framework for allowing businesses to collaboratively pool their data.…
Pay Down Technical Debt In Your Data Pipeline With Great Expectations - Episode 117
Jan 26 • 46 min
Data pipelines are complicated and business critical pieces of technical infrastructure. Unfortunately they are also complex and difficult to test, leading to a significant amount of technical debt which contributes to slower iteration cycles. In this…
Replatforming Production Dataflows - Episode 116
Jan 20 • 39 min
Building a reliable data platform is a never-ending task. Even if you have a process that works for you and your business there can be unexpected events that require a change in your platform architecture. In this episode the head of data for Mayvenn…
Planet Scale SQL For The New Generation Of Applications - Episode 115
Jan 13 • 61 min
The modern era of software development is identified by ubiquitous access to elastic infrastructure for computation and easy automation of deployment. This has led to a class of applications that can quickly scale to serve users worldwide. This requires a…
Change Data Capture For All Of Your Databases With Debezium - Episode 114
Jan 5 • 53 min
Databases are useful for inspecting the current state of your application, but inspecting the history of that data can get messy without a way to track changes as they happen. Debezium is an open source platform for reliable change data capture that you…
Building The DataDog Platform For Processing Timeseries Data At Massive Scale - Episode 113
Dec 30, 2019 • 45 min
DataDog is one of the most successful companies in the space of metrics and monitoring for servers and cloud infrastructure. In order to support their customers, they need to capture, process, and analyze massive amounts of timeseries data with a high…
Building The Materialize Engine For Interactive Streaming Analytics In SQL - Episode 112
Dec 22, 2019 • 48 min
Transactional databases used in applications are optimized for fast reads and writes with relatively simple queries on a small number of records. Data warehouses are optimized for batched writes and complex analytical queries. Between those use cases…
Solving Data Lineage Tracking And Data Discovery At WeWork - Episode 111
Dec 16, 2019 • 61 min
Building clean datasets with reliable and reproducible ingestion pipelines is completely useless if it’s not possible to find them and understand their provenance. The solution to discoverability and tracking of data lineage is to incorporate a metadata…
SnowflakeDB: The Data Warehouse Built For The Cloud - Episode 110
Dec 8, 2019 • 58 min
Data warehouses have gone through many transformations, from standard relational databases on powerful hardware, to column oriented storage engines, to the current generation of cloud-native analytical engines. SnowflakeDB has been leading the charge to…
Organizing And Empowering Data Engineers At Citadel - Episode 109
Dec 2, 2019 • 45 min
The financial industry has long been driven by data, requiring a mature and robust capacity for discovering and integrating valuable sources of information. Citadel is no exception, and in this episode Michael Watson and Robert Krzyzanowski share their…
Building A Real Time Event Data Warehouse For Sentry - Episode 108
Nov 26, 2019 • 61 min
The team at Sentry has built a platform for anyone in the world to send software errors and events. As they scaled the volume of customers and data they began running into the limitations of their initial architecture. To address the needs of their…
Escaping Analysis Paralysis For Your Data Platform With Data Virtualization - Episode 107
Nov 18, 2019 • 55 min
With the constant evolution of technology for data management it can seem impossible to make an informed decision about whether to build a data warehouse, or a data lake, or just leave your data wherever it currently rests. What’s worse is that any time…
Designing For Data Protection - Episode 106
Nov 11, 2019 • 51 min
The practice of data management is one that requires technical acumen, but there are also many policy and regulatory issues that inform and influence the design of our systems. With the introduction of legal frameworks such as the EU GDPR and California’s…
Automating Your Production Dataflows On Spark - Episode 105
Nov 4, 2019 • 48 min
As data engineers the health of our pipelines is our highest priority. Unfortunately, there are countless ways that our dataflows can break or degrade that have nothing to do with the business logic or data transformations that we write and maintain. Sean…
Build Maintainable And Testable Data Applications With Dagster - Episode 104
Oct 28, 2019 • 67 min
Despite the fact that businesses have relied on useful and accurate data to succeed for decades now, the state of the art for obtaining and maintaining that information still leaves much to be desired. In an effort to create a better abstraction for…
Data Orchestration For Hybrid Cloud Analytics - Episode 103
Oct 21, 2019 • 42 min
The scale and complexity of the systems that we build to satisfy business requirements are increasing as the available tools become more sophisticated. In order to bridge the gap between legacy infrastructure and evolving use cases it is necessary to…
Keeping Your Data Warehouse In Order With DataForm - Episode 102
Oct 14, 2019 • 47 min
Managing a data warehouse can be challenging, especially when trying to maintain a common set of patterns. Dataform is a platform that helps you apply engineering principles to your data transformations and table definitions, including unit testing SQL…
Fast Analytics On Semi-Structured And Structured Data In The Cloud - Episode 101
Oct 7, 2019 • 54 min
The process of exposing your data through a SQL interface has many possible pathways, each with their own complications and tradeoffs. One of the recent options is Rockset, a serverless platform for fast SQL analytics on semi-structured and structured…
Ship Faster With An Opinionated Data Pipeline Framework - Episode 100
Sep 30, 2019 • 35 min
Building an end-to-end pipeline for your machine learning projects is a complex task, made more difficult by the variety of ways that you can structure it. Kedro is a framework that provides an opinionated workflow that lets you focus on the parts that…
Open Source Object Storage For All Of Your Data - Episode 99
Sep 22, 2019 • 68 min
Object storage is quickly becoming the unifying layer for data intensive applications and analytics. Modern, cloud oriented data warehouses and data lakes both rely on the durability and ease of use that it provides. S3 from Amazon has quickly become the…
Navigating Boundless Data Streams With The Swim Kernel - Episode 98
Sep 18, 2019 • 57 min
The conventional approach to analytics involves collecting large amounts of data that can be cleaned, followed by a separate step for analysis and interpretation. Unfortunately this strategy is not viable for handling real-time, real-world use cases such…
Building A Reliable And Performant Router For Observability Data - Episode 97
Sep 9, 2019 • 55 min
The first stage in every data project is collecting information and routing it to a storage system for later analysis. For operational data this typically means collecting log messages and system metrics. Often a different tool is used for each class of…
Building A Community For Data Professionals at Data Council - Episode 96
Sep 2, 2019 • 52 min
Data professionals are working in a domain that is rapidly evolving. In order to stay current we need access to deeply technical presentations that aren’t burdened by extraneous marketing. To fulfill that need Pete Soderling and his team have been running…
Building Tools And Platforms For Data Analytics - Episode 95
Aug 26, 2019 • 48 min
Data engineers are responsible for building tools and platforms to power the workflows of other members of the business. Each group of users has their own set of requirements for the way that they access and interact with those platforms depending on the…
A High Performance Platform For The Full Big Data Lifecycle - Episode 94
Aug 19, 2019 • 73 min
Managing big data projects at scale is a perennial problem, with a wide variety of solutions that have evolved over the past 20 years. One of the early entrants that predates Hadoop and has since been open sourced is the HPCC (High Performance Computing…
Digging Into Data Replication At Fivetran - Episode 93
Aug 12, 2019 • 44 min
The extract and load pattern of data replication is the most commonly needed process in data engineering workflows. Because of the myriad sources and destinations that are available, it is also among the most difficult tasks that we encounter. Fivetran is…
Solving Data Discovery At Lyft - Episode 92
Aug 5, 2019 • 51 min
Data is only valuable if you use it for something, and the first step is knowing that it is available. As organizations grow and data sources proliferate it becomes difficult to keep track of everything, particularly for analysts and data scientists who…
Simplifying Data Integration Through Eventual Connectivity - Episode 91
Jul 28, 2019 • 53 min
The ETL pattern that has become commonplace for integrating data from multiple sources has proven useful, but complex to maintain. For a small number of sources it is a tractable problem, but as the overall complexity of the data ecosystem continues to…
Straining Your Data Lake Through A Data Mesh - Episode 90
Jul 22, 2019 • 64 min
The current trend in data management is to centralize the responsibilities of storing and curating the organization’s information to a data engineering team. This organizational pattern is reinforced by the architectural pattern of data lakes as a…
Data Labeling That You Can Feel Good About With CloudFactory - Episode 89
Jul 14, 2019 • 57 min
Successful machine learning and artificial intelligence projects require large volumes of data that is properly labelled. The challenge is that most data is not clean and well annotated, requiring a scalable data labeling process. Ideally this process can…
Scale Your Analytics On The Clickhouse Data Warehouse - Episode 88
Jul 8, 2019 • 71 min
The market for data warehouse platforms is large and varied, with options for every use case. ClickHouse is an open source, column-oriented database engine built for interactive analytics with linear scalability. In this episode Robert Hodges and…
Stress Testing Kafka And Cassandra For Real-Time Anomaly Detection - Episode 87
Jul 1, 2019 • 38 min
Anomaly detection is a capability that is useful in a variety of problem domains, including finance, internet of things, and systems monitoring. Scaling the volume of events that can be processed in real-time can be challenging, so Paul Brebner from…
The Workflow Engine For Data Engineers And Data Scientists - Episode 86
Jun 24, 2019 • 68 min
Building a data platform that works equally well for data engineering and data science is a task that requires familiarity with the needs of both roles. Data engineering platforms have a strong focus on stateful execution and tasks that are strictly…
Maintaining Your Data Lake At Scale With Spark - Episode 85
Jun 16, 2019 • 50 min
Building and maintaining a data lake is a choose your own adventure of tools, services, and evolving best practices. The flexibility and freedom that data lakes provide allow for generating significant value, but they can also lead to anti-patterns and…
Managing The Machine Learning Lifecycle - Episode 84
Jun 9, 2019 • 62 min
Building a machine learning model can be difficult, but that is only half of the battle. Having a perfect model is only useful if you are able to get it into production. In this episode Stepan Pushkarev, founder of Hydrosphere, explains why deploying and…
Evolving An ETL Pipeline For Better Productivity - Episode 83
Jun 4, 2019 • 62 min
Building an ETL pipeline can be a significant undertaking, and sometimes it needs to be rebuilt when a better option becomes available. In this episode Aaron Gibralter, director of engineering at Greenhouse, joins Raghu Murthy, founder and CEO of…
Data Lineage For Your Pipelines - Episode 82
May 26, 2019 • 49 min
Some problems in data are well defined and benefit from a ready-made set of tools. For everything else, there’s Pachyderm, the platform for data science that is built to scale. In this episode Joe Doliner, CEO and co-founder, explains how Pachyderm…
Build Your Data Analytics Like An Engineer With DBT - Episode 81
May 19, 2019 • 56 min
In recent years the traditional approach to building data warehouses has shifted from transforming records before loading, to transforming them afterwards. As a result, the tooling for those transformations needs to be reimagined. The data build tool…
Using FoundationDB As The Bedrock For Your Distributed Systems - Episode 80
May 6, 2019 • 66 min
The database market continues to expand, offering systems that are suited to virtually every use case. But what happens if you need something customized to your application? FoundationDB is a distributed key-value store that provides the primitives that…
Running Your Database On Kubernetes With KubeDB - Episode 79
Apr 28, 2019 • 50 min
Kubernetes is a driving force in the renaissance around deploying and running applications. However, managing the database layer is still a separate concern. The KubeDB project was created as a way of providing a simple mechanism for running your storage…
Unpacking Fauna: A Global Scale Cloud Native Database - Episode 78
Apr 22, 2019 • 53 min
One of the biggest challenges for any business trying to grow and reach customers globally is how to scale their data storage. FaunaDB is a cloud native database built by the engineers behind Twitter’s infrastructure and designed to serve the needs of…
Index Your Big Data With Pilosa For Faster Analytics - Episode 77
Apr 15, 2019 • 43 min
Database indexes are critical to ensure fast lookups of your data, but they are inherently tied to the database engine. Pilosa is rewriting that equation by providing a flexible, scalable, performant engine for building an index of your data to enable…
Serverless Data Pipelines On DataCoral - Episode 76
Apr 7, 2019 • 53 min
How much time do you spend maintaining your data pipeline? How much end user value does that provide? Raghu Murthy founded DataCoral as a way to abstract the low level details of ETL so that you can focus on the actual problem that you are trying to…
Why Analytics Projects Fail And What To Do About It - Episode 75
Mar 31, 2019 • 36 min
Analytics projects fail all the time, resulting in lost opportunities and wasted resources. There are a number of factors that contribute to that failure and not all of them are under our control. However, many of them are and as data engineers we can…
Building An Enterprise Data Fabric At CluedIn - Episode 74
Mar 25, 2019 • 57 min
Data integration is one of the most challenging aspects of any data platform, especially as the variety of data sources and formats grows. Enterprise organizations feel this acutely due to the silos that occur naturally across business units. The CluedIn…
A DataOps vs DevOps Cookoff In The Data Kitchen - Episode 73
Mar 18, 2019 • 54 min
Delivering a data analytics project on time and with accurate information is critical to the success of any business. DataOps is a set of practices to increase the probability of success by creating value early and often, and using feedback loops to keep…
Customer Analytics At Scale With Segment - Episode 72
Mar 4, 2019 • 47 min
Customer analytics is a problem domain that has given rise to its own industry. In order to gain a full understanding of what your users are doing and how best to serve them you may need to send data to multiple services, each with their own tracking code…
Deep Learning For Data Engineers - Episode 71
Feb 24, 2019 • 42 min
Deep learning is the latest class of technology that is gaining widespread interest. As data engineers we are responsible for building and managing the platforms that power these models. To help us understand what is involved, we are joined this week by…
The Alluxio Distributed Storage System - Episode 70
Feb 18, 2019 • 59 min
Distributed storage systems are the foundational layer of any big data stack. There are a variety of implementations which support different specialized use cases and come with associated tradeoffs. Alluxio is a distributed virtual filesystem which…
Building Machine Learning Projects In The Enterprise - Episode 69
Feb 11, 2019 • 48 min
Machine learning is a class of technologies that promise to revolutionize business. Unfortunately, it can be difficult to identify and execute on ways that it can be used in large companies. Kevin Dewalt founded Prolego to help Fortune 500 companies…
Cleaning And Curating Open Data For Archaeology - Episode 68
Feb 3, 2019 • 60 min
Archaeologists collect and create a variety of data as part of their research and exploration. Open Context is a platform for cleaning, curating, and sharing this data. In this episode Eric Kansa describes how they process, clean, and normalize the data…
Managing Database Access Control For Teams With strongDM - Episode 67
Jan 28, 2019 • 42 min
Controlling access to a database is a solved problem… right? It can be straightforward for small teams and a small number of storage engines, but once either or both of those start to scale then things quickly become complex and difficult to manage. After…
Building Enterprise Big Data Systems At LEGO - Episode 66
Jan 21, 2019 • 48 min
Building internal expertise around big data in a large organization is a major competitive advantage. However, it can be a difficult process due to compliance needs and the need to scale globally on day one. In this episode Jesper Søgaard and Keld…
TimescaleDB: The Timeseries Database Built For SQL And Scale - Episode 65
Jan 13, 2019 • 41 min
The past year has been an active one for the timeseries market. New products have been launched, more businesses have moved to streaming analytics, and the team at Timescale has been keeping busy. In this episode the TimescaleDB CEO Ajay Kulkarni and CTO…
Performing Fast Data Analytics Using Apache Kudu - Episode 64
Jan 6, 2019 • 50 min
The Hadoop platform is purpose built for processing large, slow moving data in long-running batch jobs. As the ecosystem around it has grown, so has the need for fast data analytics on fast moving data. To fill this need the Kudu project was created with…
Simplifying Continuous Data Processing Using Stream Native Storage In Pravega with Tom Kaitchuck - Episode 63
Dec 31, 2018 • 44 min
As more companies and organizations are working to gain a real-time view of their business, they are increasingly turning to stream processing technologies to fulfill that need. However, the storage requirements for continuous, unbounded streams of data…
Continuously Query Your Time-Series Data Using PipelineDB with Derek Nelson and Usman Masood - Episode 62
Dec 23, 2018 • 63 min
Processing high velocity time-series data in real-time is a complex challenge. The team at PipelineDB has built a continuous query engine that simplifies the task of computing aggregates across incoming streams of events. In this episode Derek Nelson and…
Advice On Scaling Your Data Pipeline Alongside Your Business with Christian Heinzmann - Episode 61
Dec 16, 2018 • 39 min
Every business needs a pipeline for their critical data, even if it is just pasting into a spreadsheet. As the organization grows and gains more customers, the requirements for that pipeline will change. In this episode Christian Heinzmann, Head of Data…
Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60
Dec 9, 2018 • 50 min
Apache Spark is a popular and widely used tool for a variety of data oriented projects. With the large array of capabilities, and the complexity of the underlying system, it can be difficult to understand how to get started using it. Jean Georges Perrin…
Apache Zookeeper As A Building Block For Distributed Systems with Patrick Hunt - Episode 59
Dec 2, 2018 • 54 min
Distributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather than re-implement the same capabilities every time, many projects build on top of Apache Zookeeper. In this episode…
Set Up Your Own Data-as-a-Service Platform On Dremio with Tomer Shiran - Episode 58
Nov 25, 2018 • 39 min
When your data lives in multiple locations, belonging to at least as many applications, it is exceedingly difficult to ask complex questions of it. The default way to manage this situation is by crafting pipelines that will extract the data from source…
Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57
Nov 18, 2018 • 48 min
Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode…
How Upsolver Is Building A Data Lake Platform In The Cloud with Yoni Iny - Episode 56
Nov 11, 2018 • 51 min
A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this…
Self Service Business Intelligence And Data Sharing Using Looker with Daniel Mintz - Episode 55
Nov 4, 2018 • 58 min
Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different…
Using Notebooks As The Unifying Layer For Data Roles At Netflix with Matthew Seal - Episode 54
Oct 28, 2018 • 40 min
Jupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. However, this can cause difficulties when trying to move the work of the data scientist into a more standard production…
Of Checklists, Ethics, and Data with Emily Miller and Peter Bull (Cross Post from Podcast.__init__) - Episode 53
Oct 21, 2018 • 45 min
As data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a…
Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52
Oct 14, 2018 • 53 min
With the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal specification, each project works slightly differently, which increases the difficulty of integration across systems. The…
Combining Transactional And Analytical Workloads On MemSQL with Nikita Shamgunov - Episode 51
Oct 9, 2018 • 56 min
One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed database built to support concurrent use by…
Building A Knowledge Graph From Public Data At Enigma With Chris Groskopf - Episode 50
Sep 30, 2018 • 52 min
There are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in aggregate is a time consuming and challenging process. The team at Enigma builds a knowledge graph for use in your…
A Primer On Enterprise Data Curation with Todd Walter - Episode 49
Sep 23, 2018 • 49 min
As your data needs scale across an organization the need for a carefully considered approach to collection, storage, organization, and access becomes increasingly critical. In this episode Todd Walter shares his considerable experience in data curation to…
Take Control Of Your Web Analytics Using Snowplow With Alexander Dean - Episode 48
Sep 16, 2018 • 47 min
Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform…
Keep Your Data And Query It Too Using Chaos Search with Thomas Hazel and Pete Cheslock - Episode 47
Sep 9, 2018 • 48 min
Elasticsearch is a powerful tool for storing and analyzing data, but when using it for logs and other time oriented information it can become problematic to keep all of your history. Chaos Search was started to make it easy for you to keep all of your…
An Agile Approach To Master Data Management with Mark Marinelli - Episode 46
Sep 3, 2018 • 47 min
With the proliferation of data sources to give a more comprehensive view of the information critical to your business it is even more important to have a canonical view of the entities that you care about. Is customer number 342 in your ERP the same as…
Protecting Your Data In Use At Enveil with Ellison Anne Williams - Episode 45
Aug 27, 2018 • 24 min
There are myriad reasons why data should be protected, and just as many ways to enforce it in transit or at rest. Unfortunately, there is still a weak point where attackers can gain access to your unencrypted information. In this episode Ellison Anne…
Graph Databases In Production At Scale Using DGraph with Manish Jain - Episode 44
Aug 19, 2018 • 42 min
The way that you store your data can have a huge impact on the ways that it can be practically used. For a substantial number of use cases, the optimal format for storing and querying that information is as a graph, however databases architected around…
Putting Airflow Into Production With James Meickle - Episode 43
Aug 12, 2018 • 48 min
The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted…
Taking A Tour Of PostgreSQL with Jonathan Katz - Episode 42
Aug 6, 2018 • 56 min
One of the longest running and most popular open source database projects is PostgreSQL. Because of its extensibility and a community focus on stability it has stayed relevant as the ecosystem of development environments and data requirements have changed…
Mobile Data Collection And Analysis Using Ona And Canopy With Peter Lubell-Doughtie - Episode 41
Jul 29, 2018 • 29 min
With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis…
Ceph: A Reliable And Scalable Distributed Filesystem with Sage Weil - Episode 40
Jul 15, 2018 • 48 min
When working with large volumes of data that you need to access in parallel across multiple instances you need a distributed filesystem that will scale with your workload. Even better is when that same system provides multiple paradigms for interacting…
Building Data Flows In Apache NiFi With Kevin Doran and Andy LoPresto - Episode 39
Jul 8, 2018 • 64 min
Data integration and routing is a constantly evolving problem and one that is fraught with edge cases and complicated requirements. The Apache NiFi project models this problem as a collection of data flows that are created through a self-service graphical…
Leveraging Human Intelligence For Better AI At Alegion With Cheryl Martin - Episode 38
Jul 2, 2018 • 46 min
Data is often messy or incomplete, requiring human intervention to make sense of it before being usable as input to machine learning projects. This is problematic when the volume scales beyond a handful of records. In this episode Dr. Cheryl Martin, Chief…
Package Management And Distribution For Your Data Using Quilt with Kevin Moore - Episode 37
Jun 24, 2018 • 41 min
Collaboration, distribution, and installation of software projects is largely a solved problem, but the same cannot be said of data. Every data team has a bespoke means of sharing data sets, versioning them, tracking related metadata and changes, and…
User Analytics In Depth At Heap with Dan Robinson - Episode 36
Jun 17, 2018 • 45 min
Web and mobile analytics are an important part of any business, and difficult to get right. The most frustrating part is when you realize that you haven’t been tracking a key interaction, having to write custom logic to add that event, and then waiting to…
CockroachDB In Depth with Peter Mattis - Episode 35
Jun 10, 2018 • 43 min
With the increased ease of gaining access to servers in data centers across the world has come the need for supporting globally distributed data storage. With the first wave of cloud era databases the ability to replicate information geographically came…
ArangoDB: Fast, Scalable, and Multi-Model Data Storage with Jan Steeman and Jan Stücke - Episode 34
Jun 3, 2018 • 40 min
Using a multi-model database in your applications can greatly reduce the amount of infrastructure and complexity required. ArangoDB is a storage engine that supports documents, key/value, and graph data formats, as well as being fast and scalable. In this…
The Alooma Data Pipeline With CTO Yair Weinberger - Episode 33
May 27, 2018 • 47 min
Building an ETL pipeline is a common need across businesses and industries. It’s easy to get one started but difficult to manage as new requirements are added and greater scalability becomes necessary. Rather than duplicating the efforts of other…
PrestoDB and Starburst Data with Kamil Bajda-Pawlikowski - Episode 32
May 20, 2018 • 42 min
Most businesses end up with data in a myriad of places with varying levels of structure. This makes it difficult to gain insights from across departments, projects, or people. Presto is a distributed SQL engine that allows you to tie all of your…
Brief Conversations From The Open Data Science Conference: Part 2 - Episode 31
May 13, 2018 • 26 min
The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Andy Eschbacher of Carto. He…
Brief Conversations From The Open Data Science Conference: Part 1 - Episode 30
May 6, 2018 • 32 min
The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Alan Anders, the CTO of…
Metabase Self Service Business Intelligence with Sameer Al-Sakran - Episode 29
Apr 29, 2018 • 44 min
Business Intelligence software is often cumbersome and requires specialized knowledge of the tools and data to be able to ask and answer questions about the state of the organization. Metabase is a tool built with the goal of making the act of discovering…
Octopai: Metadata Management for Better Business Intelligence with Amnon Drori - Episode 28
Apr 22, 2018 • 39 min
The information about how data is acquired and processed is often as important as the data itself. For this reason metadata management systems are built to track the journey of your business data to aid in analysis, presentation, and compliance. These…
Data Engineering Weekly with Joe Crobak - Episode 27
Apr 14, 2018 • 43 min
The rate of change in the data engineering industry is alternately exciting and exhausting. Joe Crobak found his way into the work of data management by accident as so many of us do. After being engrossed with researching the details of distributed…
Defining DataOps with Chris Bergh - Episode 26
Apr 8, 2018 • 54 min
Managing an analytics project can be difficult due to the number of systems involved and the need to ensure that new information can be delivered quickly and reliably. That challenge can be met by adopting practices and principles from lean manufacturing…
ThreatStack: Data Driven Cloud Security with Pete Cheslock and Patrick Cable - Episode 25
Apr 1, 2018 • 51 min
Cloud computing and ubiquitous virtualization have changed the ways that our applications are built and deployed. This new environment requires a new way of tracking and addressing the security of our systems. ThreatStack is a platform that collects all…
MarketStore: Managing Timeseries Financial Data with Hitoshi Harada and Christopher Ryan - Episode 24
Mar 25, 2018 • 33 min
The data that is used in financial markets is time oriented and multidimensional, which makes it difficult to manage in either relational or timeseries databases. To make this information more manageable the team at Alpaca built a new data store…
Stretching The Elastic Stack with Philipp Krenn - Episode 23
Mar 18, 2018 • 51 min
Search is a common requirement for applications of all varieties. Elasticsearch was built to make it easy to include search functionality in projects built in any language. From that foundation, the rest of the Elastic Stack has been built, expanding to…
Database Refactoring Patterns with Pramod Sadalage - Episode 22
Mar 12, 2018 • 49 min
As software lifecycles move faster, the database needs to be able to keep up. Practices such as version controlled migration scripts and iterative schema evolution provide the necessary mechanisms to ensure that your data layer is as agile as your…
The Future Data Economy with Roger Chen - Episode 21
Mar 4, 2018 • 42 min
Data is an increasingly sought after raw material for business in the modern economy. One of the factors driving this trend is the increase in applications for machine learning and AI which require large quantities of information to work from. As the…
Honeycomb Data Infrastructure with Sam Stokes - Episode 20
Feb 25, 2018 • 41 min
One of the sources of data that often gets overlooked is the systems that we use to run our businesses. This data is not used to directly provide value to customers or understand the functioning of the business, but it is still a critical component of a…
Data Teams with Will McGinnis - Episode 19
Feb 18, 2018 • 28 min
The responsibilities of a data scientist and a data engineer often overlap and occasionally come to cross purposes. Despite these challenges it is possible for the two roles to work together effectively and produce valuable business outcomes. In this…
TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18
Feb 11, 2018 • 62 min
As communications between machines become more commonplace the need to store the generated data in a time-oriented manner increases. The market for timeseries data stores has many contenders, but they are not all built to solve the same problems or to…
Pulsar: Fast And Scalable Messaging with Rajan Dhabalia and Matteo Merli - Episode 17
Feb 3, 2018 • 53 min
One of the critical components for modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream oriented systems such as Kafka have been rising in prominence. This…
Dat: Distributed Versioned Data Sharing with Danielle Robinson and Joe Hand - Episode 16
Jan 28, 2018 • 62 min
Sharing data across multiple computers, particularly when it is large and changing, is a difficult problem to solve. In order to provide a simpler way to distribute and version data sets among collaborators the Dat Project was created. In this episode…
Snorkel: Extracting Value From Dark Data with Alex Ratner - Episode 15
Jan 21, 2018 • 37 min
The majority of the conversation around machine learning and big data pertains to well-structured and cleaned data sets. Unfortunately, that is just a small percentage of the information that is available, so the rest of the sources of knowledge in a…
CRDTs and Distributed Consensus with Christopher Meiklejohn - Episode 14
Jan 14, 2018 • 45 min
As we scale our systems to handle larger volumes of data, geographically distributed users, and varied data sources the requirement to distribute the computational resources for managing that information becomes more pronounced. In order to ensure that…
Citus Data: Distributed PostgreSQL for Big Data with Ozgun Erdogan and Craig Kerstiens - Episode 13
Jan 7, 2018 • 46 min
PostgreSQL has become one of the most popular and widely used databases, and for good reason. The level of extensibility that it supports has allowed it to be used in virtually every environment. At Citus Data they have built an extension to support…
Wallaroo with Sean T. Allen - Episode 12
Dec 24, 2017 • 59 min
Data oriented applications that need to operate on large, fast-moving streams of information can be difficult to build and scale due to the need to manage their state. In this episode Sean T. Allen, VP of engineering for Wallaroo Labs, explains how…
SiriDB: Scalable Open Source Timeseries Database with Jeroen van der Heijden - Episode 11
Dec 17, 2017
Time series databases have long been the cornerstone of a robust metrics system, but the existing options are often difficult to manage in production. In this episode Jeroen van der Heijden explains his motivation for writing a new database, SiriDB, the…
Confluent Schema Registry with Ewen Cheslack-Postava - Episode 10
Dec 10, 2017 • 49 min
To process your data you need to know what shape it has, which is why schemas are important. When you are processing that data in multiple systems it can be difficult to ensure that they all have an accurate representation of that schema, which is why…
data.world with Bryon Jacob - Episode 9
Dec 2, 2017 • 46 min
We have tools and platforms for collaborating on software projects and linking them together, so wouldn’t it be nice to have the same capabilities for data? The team at data.world are working on building a platform to host and share data sets for public and…
Data Serialization Formats with Doug Cutting and Julien Le Dem - Episode 8
Nov 22, 2017 • 51 min
With the wealth of formats for sending and storing data it can be difficult to determine which one to use. In this episode Doug Cutting, creator of Avro, and Julien Le Dem, creator of Parquet, dig into the different classes of serialization formats, what…
Buzzfeed Data Infrastructure with Walter Menendez - Episode 7
Nov 14, 2017 • 43 min
Buzzfeed needs to be able to understand how its users are interacting with the myriad articles, videos, etc. that they are posting. This lets them produce new content that will continue to be well-received. To surface the insights that they need to grow…
Astronomer with Ry Walker - Episode 6
Aug 6, 2017 • 42 min
Building a data pipeline that is reliable and flexible is a difficult task, especially when you have a small team. Astronomer is a platform that lets you skip straight to processing your valuable business data. Ry Walker, the CEO of Astronomer, explains…
Rebuilding Yelp’s Data Pipeline with Justin Cunningham - Episode 5
Jun 17, 2017 • 42 min
Yelp needs to be able to consume and process all of the user interactions that happen in their platform in as close to real-time as possible. To achieve that goal they embarked on a journey to refactor their monolithic architecture to be more modular and…
ScyllaDB with Eyal Gutkind - Episode 4
Mar 18, 2017 • 35 min
If you like the features of Cassandra DB but wish it ran faster with fewer resources then ScyllaDB is the answer you have been looking for. In this episode Eyal Gutkind explains how Scylla was created and how it differentiates itself in the crowded…
Defining Data Engineering with Maxime Beauchemin - Episode 3
Mar 4, 2017 • 45 min
What exactly is data engineering? How has it evolved in recent years and where is it going? How do you get started in the field? In this episode, Maxime Beauchemin joins me to discuss these questions and more.
Dask with Matthew Rocklin - Episode 2
Jan 22, 2017 • 46 min
There is a vast constellation of tools and platforms for processing and analyzing your data. In this episode Matthew Rocklin talks about how Dask fills the gap between a task oriented workflow tool and an in memory processing framework, and how it brings…
Pachyderm with Daniel Whitenack - Episode 1
Jan 14, 2017 • 44 min
Do you wish that you could track the changes in your data the same way that you track the changes in your code? Pachyderm is a platform for building a data lake with a versioned file system. It also lets you use whatever languages you want to run your…
Introducing The Show - Episode 0
Jan 7, 2017 • 4 min
Are you looking for a podcast that discusses the tools, techniques, and culture of data engineering? Then you’ve come to the right spot!