Data Engineering Podcast

www.dataengineeringpodcast.com
Weekly deep dives on data management with the engineers and entrepreneurs who are shaping the industry
Building Machine Learning Projects In The Enterprise - Episode 69
Feb 11 • 48 min
Machine learning is a class of technologies that promise to revolutionize business. Unfortunately, it can be difficult to identify and execute on ways that it can be used in large companies. Kevin Dewalt founded Prolego to help Fortune 500 companies…
Cleaning And Curating Open Data For Archaeology - Episode 68
Feb 3 • 60 min
Archaeologists collect and create a variety of data as part of their research and exploration. Open Context is a platform for cleaning, curating, and sharing this data. In this episode Eric Kansa describes how they process, clean, and normalize the data…
Managing Database Access Control For Teams With strongDM - Episode 67
Jan 28 • 42 min
Controlling access to a database is a solved problem… right? It can be straightforward for small teams and a small number of storage engines, but once either or both of those start to scale then things quickly become complex and difficult to manage. After…
Building Enterprise Big Data Systems At LEGO - Episode 66
Jan 21 • 48 min
Building internal expertise around big data in a large organization is a major competitive advantage. However, it can be a difficult process due to compliance needs and the need to scale globally on day one. In this episode Jesper Søgaard and Keld…
TimescaleDB: The Timeseries Database Built For SQL And Scale - Episode 65
Jan 13 • 41 min
The past year has been an active one for the timeseries market. New products have been launched, more businesses have moved to streaming analytics, and the team at Timescale has been keeping busy. In this episode the TimescaleDB CEO Ajay Kulkarni and CTO…
Performing Fast Data Analytics Using Apache Kudu - Episode 64
Jan 6 • 50 min
The Hadoop platform is purpose built for processing large, slow moving data in long-running batch jobs. As the ecosystem around it has grown, so has the need for fast data analytics on fast moving data. To fill this need the Kudu project was created with…
Simplifying Continuous Data Processing Using Stream Native Storage In Pravega with Tom Kaitchuck - Episode 63
Dec 31, 2018 • 44 min
As more companies and organizations are working to gain a real-time view of their business, they are increasingly turning to stream processing technologies to fulfill that need. However, the storage requirements for continuous, unbounded streams of data…
Continuously Query Your Time-Series Data Using PipelineDB with Derek Nelson and Usman Masood - Episode 62
Dec 23, 2018 • 63 min
Processing high velocity time-series data in real-time is a complex challenge. The team at PipelineDB has built a continuous query engine that simplifies the task of computing aggregates across incoming streams of events. In this episode Derek Nelson and…
Advice On Scaling Your Data Pipeline Alongside Your Business with Christian Heinzmann - Episode 61
Dec 16, 2018 • 39 min
Every business needs a pipeline for their critical data, even if it is just pasting into a spreadsheet. As the organization grows and gains more customers, the requirements for that pipeline will change. In this episode Christian Heinzmann, Head of Data…
Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60
Dec 9, 2018 • 50 min
Apache Spark is a popular and widely used tool for a variety of data oriented projects. With the large array of capabilities, and the complexity of the underlying system, it can be difficult to understand how to get started using it. Jean Georges Perrin…
Apache Zookeeper As A Building Block For Distributed Systems with Patrick Hunt - Episode 59
Dec 2, 2018 • 54 min
Distributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather than re-implement the same capabilities every time, many projects build on top of Apache Zookeeper. In this episode…
Set Up Your Own Data-as-a-Service Platform On Dremio with Tomer Shiran - Episode 58
Nov 25, 2018 • 39 min
When your data lives in multiple locations, belonging to at least as many applications, it is exceedingly difficult to ask complex questions of it. The default way to manage this situation is by crafting pipelines that will extract the data from source…
Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57
Nov 18, 2018 • 48 min
Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode…
How Upsolver Is Building A Data Lake Platform In The Cloud with Yoni Iny - Episode 56
Nov 11, 2018 • 51 min
A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this…
Self Service Business Intelligence And Data Sharing Using Looker with Daniel Mintz - Episode 55
Nov 4, 2018 • 58 min
Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different…
Using Notebooks As The Unifying Layer For Data Roles At Netflix with Matthew Seal - Episode 54
Oct 28, 2018 • 40 min
Jupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. However, this can cause difficulties when trying to move the work of the data scientist into a more standard production…
Of Checklists, Ethics, and Data with Emily Miller and Peter Bull (Cross Post from Podcast.__init__) - Episode 53
Oct 21, 2018 • 45 min
As data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a…
Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52
Oct 14, 2018 • 53 min
With the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal specification, each project works slightly differently, which increases the difficulty of integration across systems. The…
Combining Transactional And Analytical Workloads On MemSQL with Nikita Shamgunov - Episode 51
Oct 9, 2018 • 56 min
One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed database built to support concurrent use by…
Building A Knowledge Graph From Public Data At Enigma With Chris Groskopf - Episode 50
Sep 30, 2018 • 52 min
There are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in aggregate is a time consuming and challenging process. The team at Enigma builds a knowledge graph for use in your…
A Primer On Enterprise Data Curation with Todd Walter - Episode 49
Sep 23, 2018 • 49 min
As your data needs scale across an organization the need for a carefully considered approach to collection, storage, organization, and access becomes increasingly critical. In this episode Todd Walter shares his considerable experience in data curation to…
Take Control Of Your Web Analytics Using Snowplow With Alexander Dean - Episode 48
Sep 16, 2018 • 47 min
Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform…
Keep Your Data And Query It Too Using Chaos Search with Thomas Hazel and Pete Cheslock - Episode 47
Sep 9, 2018 • 48 min
Elasticsearch is a powerful tool for storing and analyzing data, but when using it for logs and other time oriented information it can become problematic to keep all of your history. Chaos Search was started to make it easy for you to keep all of your…
An Agile Approach To Master Data Management with Mark Marinelli - Episode 46
Sep 3, 2018 • 47 min
With the proliferation of data sources to give a more comprehensive view of the information critical to your business it is even more important to have a canonical view of the entities that you care about. Is customer number 342 in your ERP the same as…
Protecting Your Data In Use At Enveil with Ellison Anne Williams - Episode 45
Aug 27, 2018 • 24 min
There are myriad reasons why data should be protected, and just as many ways to enforce it in transit or at rest. Unfortunately, there is still a weak point where attackers can gain access to your unencrypted information. In this episode Ellison Anne…
Graph Databases In Production At Scale Using DGraph with Manish Jain - Episode 44
Aug 19, 2018 • 42 min
The way that you store your data can have a huge impact on the ways that it can be practically used. For a substantial number of use cases, the optimal format for storing and querying that information is as a graph; however, databases architected around…
Putting Airflow Into Production With James Meickle - Episode 43
Aug 12, 2018 • 48 min
The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted…
Taking A Tour Of PostgreSQL with Jonathan Katz - Episode 42
Aug 6, 2018 • 56 min
One of the longest running and most popular open source database projects is PostgreSQL. Because of its extensibility and a community focus on stability it has stayed relevant as the ecosystem of development environments and data requirements have changed…
Mobile Data Collection And Analysis Using Ona And Canopy With Peter Lubell-Doughtie - Episode 41
Jul 29, 2018 • 29 min
With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis…
Ceph: A Reliable And Scalable Distributed Filesystem with Sage Weil - Episode 40
Jul 15, 2018 • 48 min
When working with large volumes of data that you need to access in parallel across multiple instances you need a distributed filesystem that will scale with your workload. Even better is when that same system provides multiple paradigms for interacting…
Building Data Flows In Apache NiFi With Kevin Doran and Andy LoPresto - Episode 39
Jul 8, 2018 • 64 min
Data integration and routing is a constantly evolving problem and one that is fraught with edge cases and complicated requirements. The Apache NiFi project models this problem as a collection of data flows that are created through a self-service graphical…
Leveraging Human Intelligence For Better AI At Alegion With Cheryl Martin - Episode 38
Jul 2, 2018 • 46 min
Data is often messy or incomplete, requiring human intervention to make sense of it before being usable as input to machine learning projects. This is problematic when the volume scales beyond a handful of records. In this episode Dr. Cheryl Martin, Chief…
Package Management And Distribution For Your Data Using Quilt with Kevin Moore - Episode 37
Jun 24, 2018 • 41 min
Collaboration, distribution, and installation of software projects is largely a solved problem, but the same cannot be said of data. Every data team has a bespoke means of sharing data sets, versioning them, tracking related metadata and changes, and…
User Analytics In Depth At Heap with Dan Robinson - Episode 36
Jun 17, 2018 • 45 min
Web and mobile analytics are an important part of any business, and difficult to get right. The most frustrating part is when you realize that you haven’t been tracking a key interaction, having to write custom logic to add that event, and then waiting to…
CockroachDB In Depth with Peter Mattis - Episode 35
Jun 10, 2018 • 43 min
With the increased ease of gaining access to servers in data centers across the world has come the need for supporting globally distributed data storage. With the first wave of cloud era databases the ability to replicate information geographically came…
ArangoDB: Fast, Scalable, and Multi-Model Data Storage with Jan Steeman and Jan Stücke - Episode 34
Jun 3, 2018 • 40 min
Using a multi-model database in your applications can greatly reduce the amount of infrastructure and complexity required. ArangoDB is a storage engine that supports documents, key/value, and graph data formats, as well as being fast and scalable. In this…
The Alooma Data Pipeline With CTO Yair Weinberger - Episode 33
May 27, 2018 • 47 min
Building an ETL pipeline is a common need across businesses and industries. It’s easy to get one started but difficult to manage as new requirements are added and greater scalability becomes necessary. Rather than duplicating the efforts of other…
PrestoDB and Starburst Data with Kamil Bajda-Pawlikowski - Episode 32
May 20, 2018 • 42 min
Most businesses end up with data in a myriad of places with varying levels of structure. This makes it difficult to gain insights from across departments, projects, or people. Presto is a distributed SQL engine that allows you to tie all of your…
Brief Conversations From The Open Data Science Conference: Part 2 - Episode 31
May 13, 2018 • 26 min
The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Andy Eschbacher of Carto. He…
Brief Conversations From The Open Data Science Conference: Part 1 - Episode 30
May 6, 2018 • 32 min
The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Alan Anders, the CTO of…
Metabase Self Service Business Intelligence with Sameer Al-Sakran - Episode 29
Apr 29, 2018 • 44 min
Business Intelligence software is often cumbersome and requires specialized knowledge of the tools and data to be able to ask and answer questions about the state of the organization. Metabase is a tool built with the goal of making the act of discovering…
Octopai: Metadata Management for Better Business Intelligence with Amnon Drori - Episode 28
Apr 22, 2018 • 39 min
The information about how data is acquired and processed is often as important as the data itself. For this reason metadata management systems are built to track the journey of your business data to aid in analysis, presentation, and compliance. These…
Data Engineering Weekly with Joe Crobak - Episode 27
Apr 14, 2018 • 43 min
The rate of change in the data engineering industry is alternately exciting and exhausting. Joe Crobak found his way into the work of data management by accident as so many of us do. After being engrossed with researching the details of distributed…
Defining DataOps with Chris Bergh - Episode 26
Apr 8, 2018 • 54 min
Managing an analytics project can be difficult due to the number of systems involved and the need to ensure that new information can be delivered quickly and reliably. That challenge can be met by adopting practices and principles from lean manufacturing…
ThreatStack: Data Driven Cloud Security with Pete Cheslock and Patrick Cable - Episode 25
Apr 1, 2018 • 51 min
Cloud computing and ubiquitous virtualization have changed the ways that our applications are built and deployed. This new environment requires a new way of tracking and addressing the security of our systems. ThreatStack is a platform that collects all…
MarketStore: Managing Timeseries Financial Data with Hitoshi Harada and Christopher Ryan - Episode 24
Mar 25, 2018 • 33 min
The data that is used in financial markets is time oriented and multidimensional, which makes it difficult to manage in either relational or timeseries databases. To make this information more manageable the team at Alpaca built a new data store…
Stretching The Elastic Stack with Philipp Krenn - Episode 23
Mar 18, 2018 • 51 min
Search is a common requirement for applications of all varieties. Elasticsearch was built to make it easy to include search functionality in projects built in any language. From that foundation, the rest of the Elastic Stack has been built, expanding to…
Database Refactoring Patterns with Pramod Sadalage - Episode 22
Mar 12, 2018 • 49 min
As software lifecycles move faster, the database needs to be able to keep up. Practices such as version controlled migration scripts and iterative schema evolution provide the necessary mechanisms to ensure that your data layer is as agile as your…
The Future Data Economy with Roger Chen - Episode 21
Mar 4, 2018 • 42 min
Data is an increasingly sought after raw material for business in the modern economy. One of the factors driving this trend is the increase in applications for machine learning and AI which require large quantities of information to work from. As the…
Honeycomb Data Infrastructure with Sam Stokes - Episode 20
Feb 25, 2018 • 41 min
One of the sources of data that often gets overlooked is the systems that we use to run our businesses. This data is not used to directly provide value to customers or understand the functioning of the business, but it is still a critical component of a…
Data Teams with Will McGinnis - Episode 19
Feb 18, 2018 • 28 min
The responsibilities of a data scientist and a data engineer often overlap and occasionally come to cross purposes. Despite these challenges it is possible for the two roles to work together effectively and produce valuable business outcomes. In this…
TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18
Feb 11, 2018 • 62 min
As communications between machines become more commonplace the need to store the generated data in a time-oriented manner increases. The market for timeseries data stores has many contenders, but they are not all built to solve the same problems or to…
Pulsar: Fast And Scalable Messaging with Rajan Dhabalia and Matteo Merli - Episode 17
Feb 3, 2018 • 53 min
One of the critical components for modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream oriented systems such as Kafka have been rising in prominence. This…
Dat: Distributed Versioned Data Sharing with Danielle Robinson and Joe Hand - Episode 16
Jan 28, 2018 • 62 min
Sharing data across multiple computers, particularly when it is large and changing, is a difficult problem to solve. In order to provide a simpler way to distribute and version data sets among collaborators the Dat Project was created. In this episode…
Snorkel: Extracting Value From Dark Data with Alex Ratner - Episode 15
Jan 21, 2018 • 37 min
The majority of the conversation around machine learning and big data pertains to well-structured and cleaned data sets. Unfortunately, that is just a small percentage of the information that is available, so the rest of the sources of knowledge in a…
CRDTs and Distributed Consensus with Christopher Meiklejohn - Episode 14
Jan 14, 2018 • 45 min
As we scale our systems to handle larger volumes of data, geographically distributed users, and varied data sources the requirement to distribute the computational resources for managing that information becomes more pronounced. In order to ensure that…
Citus Data: Distributed PostgreSQL for Big Data with Ozgun Erdogan and Craig Kerstiens - Episode 13
Jan 7, 2018 • 46 min
PostgreSQL has become one of the most popular and widely used databases, and for good reason. The level of extensibility that it supports has allowed it to be used in virtually every environment. At Citus Data they have built an extension to support…
Wallaroo with Sean T. Allen - Episode 12
Dec 24, 2017 • 59 min
Data oriented applications that need to operate on large, fast-moving streams of information can be difficult to build and scale due to the need to manage their state. In this episode Sean T. Allen, VP of engineering for Wallaroo Labs, explains how…
SiriDB: Scalable Open Source Timeseries Database with Jeroen van der Heijden - Episode 11
Dec 17, 2017
Time series databases have long been the cornerstone of a robust metrics system, but the existing options are often difficult to manage in production. In this episode Jeroen van der Heijden explains his motivation for writing a new database, SiriDB, the…
Confluent Schema Registry with Ewen Cheslack-Postava - Episode 10
Dec 10, 2017 • 49 min
To process your data you need to know what shape it has, which is why schemas are important. When you are processing that data in multiple systems it can be difficult to ensure that they all have an accurate representation of that schema, which is why…
data.world with Bryon Jacob - Episode 9
Dec 2, 2017 • 46 min
We have tools and platforms for collaborating on software projects and linking them together; wouldn’t it be nice to have the same capabilities for data? The team at data.world are working on building a platform to host and share data sets for public and…
Data Serialization Formats with Doug Cutting and Julien Le Dem - Episode 8
Nov 22, 2017 • 51 min
With the wealth of formats for sending and storing data it can be difficult to determine which one to use. In this episode Doug Cutting, creator of Avro, and Julien Le Dem, creator of Parquet, dig into the different classes of serialization formats, what…
Buzzfeed Data Infrastructure with Walter Menendez - Episode 7
Nov 14, 2017 • 43 min
Buzzfeed needs to be able to understand how its users are interacting with the myriad articles, videos, etc. that they are posting. This lets them produce new content that will continue to be well-received. To surface the insights that they need to grow…
Astronomer with Ry Walker - Episode 6
Aug 6, 2017 • 42 min
Building a data pipeline that is reliable and flexible is a difficult task, especially when you have a small team. Astronomer is a platform that lets you skip straight to processing your valuable business data. Ry Walker, the CEO of Astronomer, explains…
Rebuilding Yelp’s Data Pipeline with Justin Cunningham - Episode 5
Jun 17, 2017 • 42 min
Yelp needs to be able to consume and process all of the user interactions that happen in their platform in as close to real-time as possible. To achieve that goal they embarked on a journey to refactor their monolithic architecture to be more modular and…
ScyllaDB with Eyal Gutkind - Episode 4
Mar 18, 2017 • 35 min
If you like the features of Cassandra DB but wish it ran faster with fewer resources then ScyllaDB is the answer you have been looking for. In this episode Eyal Gutkind explains how Scylla was created and how it differentiates itself in the crowded…
Defining Data Engineering with Maxime Beauchemin - Episode 3
Mar 4, 2017 • 45 min
What exactly is data engineering? How has it evolved in recent years and where is it going? How do you get started in the field? In this episode, Maxime Beauchemin joins me to discuss these questions and more.
Dask with Matthew Rocklin - Episode 2
Jan 22, 2017 • 46 min
There is a vast constellation of tools and platforms for processing and analyzing your data. In this episode Matthew Rocklin talks about how Dask fills the gap between a task oriented workflow tool and an in memory processing framework, and how it brings…
Pachyderm with Daniel Whitenack - Episode 1
Jan 14, 2017 • 44 min
Do you wish that you could track the changes in your data the same way that you track the changes in your code? Pachyderm is a platform for building a data lake with a versioned file system. It also lets you use whatever languages you want to run your…
Introducing The Show - Episode 0
Jan 7, 2017 • 4 min
Are you looking for a podcast that discusses the tools, techniques, and culture of data engineering? Then you’ve come to the right spot!