The Grouparoo Blog


97 things every data engineer should know

Tagged in Company 
By Brian Leonard on 2021-10-07

Last month, we decided that we should all read a book and talk about it as a company. It was a fun experience and I think we made a good choice by picking 97 Things Every Data Engineer Should Know.

This is the first book I have read in this series, and I liked the format. It is made up of 97 short vignettes of 2-3 pages each. Together they give a nice overview of the breadth of topics relevant to data engineering, including data warehouses/lakes, pipelines, metadata, security, compliance, quality, and working with other teams.


Themes

I was drawn to the articles that speak to a theme in the data world that I am passionate about: how data pipelines and data team practices are evolving to be more like traditional product development.

Reproducible pipelines

  • Automate Your Infrastructure by Christiano Anderson
  • Data Pipeline Design Patterns for Reusability and Extensibility by Mukul Sood
  • Engineering Reproducible Data Science Projects by Dr. Tianhui Michael Li
  • The Three Rs of Data Engineering by Tobias Macey

Data testing and quality

  • Automate Your Pipeline Tests by Tom White
  • Data Quality for Data Engineers by Katharine Jarmul
  • Data Validation Is More Than Summary Statistics by Emily Riederer
  • The Six Words That Will Destroy Your Career by Bartosz Mikulski
  • Your Data Tests Failed! Now What? by Sam Bail, PhD

Agile development and product management

  • Caution: Data Science Projects Can Turn into the Emperor’s New Clothes by Shweta Katre
  • Cultivate Good Working Relationships with Data Consumers by Ido Shlomo
  • Demystify the Source and Illuminate the Data Pipeline by Meghan Kwartler
  • How to Build Your Data Platform like a Product by Barr Moses and Atul Gupte
  • Listen to Your Users—but Not Too Much by Amanda Tomlinson
  • Tech Should Take a Back Seat for Data Project Success by Andrew Stevenson
  • Ten Must-Ask Questions for Data-Engineering Projects by Haidar Hadi
  • What to Do When You Don’t Get Any Credit by Jesse Anderson
  • When to Talk and When to Listen by Steven Finkelstein

Feedback

We noticed a few things that could be improved.

The articles are in alphabetical order. I believe it would have been better if they had been grouped or had taken the reader on an arc of some sort. For example, grouping the ones about metadata, discoverability, and column naming might have made a lot of sense.

I read the old-fashioned hard copy, but people using the Kindle version told me that the author pictures came in random sizes. I assume each was the size the author sent in. They varied from very small to taking over the whole page, which made for a disjointed experience. I don't know how such things work, but I feel like LaTeX might be involved.

Notes

I took short notes at the top of each article and then copied them to a spreadsheet, like any good data engineer.

# | Title | Notes
1 | A (Book) Case for Eventual Consistency | Strong vs eventual consistency
2 | A/B and How to Be | Most are wrong. If it's working, be skeptical. Test system with A/A test.
3 | About the Storage Layer | Efficiency details for queries
4 | Analytics as the Secret Glue for Microservice Architectures | What to measure: company metrics, team metrics, experiment metrics
5 | Automate Your Infrastructure | DevOps is good
6 | Automate Your Pipeline Tests | Treating data engineering like software engineering. Open question: how to seed data in a staging environment?
7 | Be Intentional About the Batching Model in Your Data Pipelines | Different batching models. Could we do better for Grouparoo?
8 | Beware of Silver-Bullet Syndrome | Do not build your professional identity on a specific toolset. Be adaptable.
9 | Building a Career as a Data Engineer | Skills: experience with the software lifecycle, SQL, open source
10 | Business Dashboards for Data Pipelines | Dashboards and graphics help data quality
11 | Caution: Data Science Projects Can Turn into the Emperor’s New Clothes | Projects: iterate, provide visibility, env for rapid changes, share scripts
12 | Change Data Capture | Should Grouparoo use the WAL or other native CDC approaches? We handle the "_deleted" table approach already.
13 | Column Names as Contracts | Standardize column names to minimize confusion
14 | Consensual, Privacy-Aware Data Collection | At some point, does Grouparoo get properties noted as PII? What would it mean for a profile to opt out? What does that do?
15 | Cultivate Good Working Relationships with Data Consumers | Practice empathy
16 | Data Engineering != Spark | Data eng = Computation + Storage + Messaging + Coding + Architecture + Domain Knowledge + Use Cases
17 | Data Engineering for Autonomy and Rapid Innovation | Sounds like the case for ELT
18 | Data Engineering from a Data Scientist’s Perspective | Data engineering has gotten more complex recently
19 | Data Pipeline Design Patterns for Reusability and Extensibility | Software design patterns apply to data engineering
20 | Data Quality for Data Engineers | Implement common sense tests for data quality. What would that look like?
21 | Data Security for Data Engineers | Think about the security of data
22 | Data Validation Is More Than Summary Statistics | Quality testing requires context
23 | Data Warehouses Are the Past, Present, and Future | Warehouses keep evolving to meet users' needs
24 | Defining and Managing Messages in Log-Centric Architectures | Standardize message definitions in an evented system
25 | Demystify the Source and Illuminate the Data Pipeline | Learn more about the sources of your data
26 | Develop Communities, Not Just Code | Think about creating a data culture, not just a pipeline
27 | Effective Data Engineering in the Cloud World | There are lots of pieces to work with these days
28 | Embrace the Data Lake Architecture | Data lakes are scalable
29 | Embracing Data Silos | Maybe it's not always right to get your data into one place. If so, find a way to abstract the silos to have one way to access it all.
30 | Engineering Reproducible Data Science Projects | Follow engineering practices to have more dependable projects
31 | Five Best Practices for Stable Data Processing | Rollback on error, keep data consistent
32 | Focus on Maintainability and Break Up Those ETL Tasks | Do one step per transform to maintain simplicity
33 | Friends Don’t Let Friends Do Dual-Writes | Use CDC events to write once and then chain to dependencies.
34 | Fundamental Knowledge | Knowledge of fundamental concepts allows you to embrace change
35 | Getting the “Structured” Back into SQL | Tips on writing SQL.
36 | Give Data Products a Frontend with Latent Documentation | Document more to help everyone
37 | How Data Pipelines Evolve | Build ELT at mid-range and move to data lakes when you need scale
38 | How to Build Your Data Platform like a Product | PM your data with business. Increase visibility.
39 | How to Prevent a Data Mutiny | Key trends: modular architecture, declarative configuration, automated systems
40 | Know the Value per Byte of Your Data | Check if you are actually using your data
41 | Know Your Latencies | Key questions: How old is the data? How fast are queries? How many concurrent queries can we handle?
42 | Learn to Use a NoSQL Database, but Not like an RDBMS | Write answers to questions in NoSQL databases for fast access
43 | Let the Robots Enforce the Rules | Work with people to standardize and use code to enforce rules
44 | Listen to Your Users—but Not Too Much | Create a data team vision and strategy. Take requests and see how they fit into that.
45 | Low-Cost Sensors and the Quality of Data | Order redundant equipment
46 | Maintain Your Mechanical Sympathy | Sometimes it helps to understand underlying physics
47 | Metadata ≥ Data | Plan your data strategy early and make discovery easy
48 | Metadata Services as a Core Component of the Data Platform | Metadata helps discovery, security, and agility
49 | Mind the Gap: Your Data Lake Provides No ACID Guarantees | Lakes are not databases
50 | Modern Metadata for the Modern Data Stack | Metadata helps collaboration
51 | Most Data Problems Are Not Big Data Problems | Most problems are best solved with a relational database
52 | Moving from Software Engineering to Data Engineering | Switching from product eng to data eng can be fun and exciting
53 | Observability for Data Engineers | Pillars of observability: freshness, distribution, volume, schema, lineage. "Lineage" sounds useful for Grouparoo.
54 | Perfect Is the Enemy of Good | Make MVPs and iterate.
55 | Pipe Dreams | Kafka was good because it had replaying of messages.
56 | Preventing the Data Lake Abyss | Use data contracts and tools (Apache Avro or Google Protocol Buffers) to keep lakes under control
57 | Prioritizing User Experience in Messaging Systems | Realtime data messaging creates better experiences
58 | Privacy Is Your Problem | You can often still identify people even when PII is removed
59 | QA and All Its Sexiness | Testing and QA are good. There are two types: practical and logical.
60 | Seven Things Data Engineers Need to Watch Out for in ML Projects | Top issue: misunderstanding what a data attribute means.
61 | Six Dimensions for Picking an Analytical Data Warehouse | Think about scalability, how it's priced, maintenance, and speed.
62 | Small Files in a Big Data World | Having many small files on a system leads to wacky errors
63 | Streaming Is Different from Batch | You have to think about things differently when streaming instead of batching.
64 | Tardy Data | Consider adding a metadata column for storage: the arrival_time of the data, so you know when to "go back" and process it.
65 | Tech Should Take a Back Seat for Data Project Success | Focus on self-service and engaging business users to drive successful projects
66 | Ten Must-Ask Questions for Data-Engineering Projects | Understand project parameters before you code
67 | The Data Pipeline Is Not About Speed | Parallelization is now more important because of cloud horizontal scaling
68 | The Dos and Don’ts of Data Engineering | Do DataOps to make things more reliable and agile, less heroic.
69 | The End of ETL as We Know It | Use events from the product to notify data systems of changes.
70 | The Haiku Approach to Writing Software | Understand constraints, start strong, keep it simple, and be creative.
71 | The Hidden Cost of Data Input/Output | Storage choices impact performance.
72 | The Holy War Between Proprietary and Open Source Is a Lie | Use tools that are best for your project and stay out of cargo cults.
73 | The Implications of the CAP Theorem | Most common trade-off: Speed vs. consistency across nodes.
74 | The Importance of Data Lineage | Tracking lineage helps answer questions when things go wrong.
75 | The Many Meanings of Missingness | There are several reasons for a null value. It could be "correct" or an error.
76 | The Six Words That Will Destroy Your Career | You lose credibility when the data is wrong. Test and monitor to keep it right.
77 | The Three Invaluable Benefits of Open Source for Testing Data Quality | Use open source tools to maintain data quality
78 | The Three Rs of Data Engineering | Data needs to be reliable. Other engineers must be able to reproduce your results. Build repeatable infrastructure.
79 | The Two Types of Data Engineering and Data Engineers | Two types of data engineers: SQL (relational databases) and big data (Python, Hadoop)
80 | The Yin and Yang of Big Data Scalability | Complex systems have many knobs to be tuned to maximize throughput.
81 | Threading and Concurrency in Data Processing | You might hit OS limits when scaling servers
82 | Three Important Distributed Programming Concepts | Concepts: Map/Reduce (Spark, Hadoop), shared memory (Redis), message passing (Kafka)
83 | Time (Semantics) Won’t Wait | In event stream processing, there are tradeoffs between completeness and latency. Look into watermarks to control.
84 | Tools Don’t Matter, Patterns and Practices Do | Focus on concepts, not tools. Ask "why" questions about new concepts to learn.
85 | Total Opportunity Cost of Ownership | Going all in on a tool or paradigm might create problems as tech evolves.
86 | Understanding the Ways Different Data Domains Solve Problems | Data science, infra, and eng teams have different goals and mindsets that influence their approach.
87 | What Is a Data Engineer? Clue: We’re Data Science Enablers | Data engineers and scientists can work together to produce better results
88 | What Is a Data Mesh, and How Not to Mesh It Up | You can have a data lake and many pipelines used by different business domains.
89 | What Is Big Data? | Stay away from hype. Just get the job done.
90 | What to Do When You Don’t Get Any Credit | To get credit, talk in terms of business value, not technology
91 | When Our Data Science Team Didn’t Produce Value | Balance long-term solutions with short-term needs
92 | When to Avoid the Naive Approach | Storage format and schema are good to get right from the beginning.
93 | When to Be Cautious About Sharing Data | Maybe everyone shouldn't have access to data that requires expertise to interpret.
94 | When to Talk and When to Listen | Smaller scope helps get things shipped more quickly.
95 | Why Data Science Teams Need Generalists, Not Specialists | Specialization can slow things down. Lean towards full stack ownership.
96 | With Great Data Comes Great Responsibility | Consider ethics while building data pipelines.
97 | Your Data Tests Failed! Now What? | There are many possible reasons for a failed test.
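
Note 20 asks what common sense data quality tests would look like. One possible answer, as a minimal sketch in Python: the row shape (`id`, `email`, `created_at`) and the `check_rows` helper are hypothetical names made up for illustration, not something from the book or from Grouparoo. The idea is just to encode the checks a person would do by eye, such as unique ids, plausible emails, and no timestamps from the future.

```python
from datetime import datetime, timezone


def check_rows(rows, now=None):
    """Return a list of human-readable problems found in `rows`.

    Assumes each row is a dict with hypothetical keys `id`, `email`,
    and `created_at` (a timezone-aware datetime or None).
    """
    now = now or datetime.now(timezone.utc)
    problems = []
    seen_ids = set()

    for i, row in enumerate(rows):
        # Every row should have an id, and ids should not repeat.
        row_id = row.get("id")
        if row_id is None:
            problems.append(f"row {i}: missing id")
        elif row_id in seen_ids:
            problems.append(f"row {i}: duplicate id {row_id}")
        else:
            seen_ids.add(row_id)

        # A very loose plausibility check, not real email validation.
        email = row.get("email")
        if not email or "@" not in email:
            problems.append(f"row {i}: invalid email {email!r}")

        # Records should not claim to be created in the future.
        created_at = row.get("created_at")
        if created_at is not None and created_at > now:
            problems.append(f"row {i}: created_at is in the future")

    return problems
```

An empty list means the batch passed. A pipeline could fail the run, or just log and alert, when the list is not empty; which one is right depends on how much the downstream consumers trust the data.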


