12 Data Engineering Missteps of 2022 That Boosted My Skills
2022: A Year of Learning Through Data Engineering Mistakes
At the start of 2022, I believed that making errors was the hardest part of programming. By the year's end, I understood that the real mistake is failing to carry those lessons into future work.
While I have much to celebrate from this year, I want to highlight something often overlooked in discussions about the appeal of a data career: the numerous errors even seasoned professionals encounter.
Throughout 2022, I developed over 25 data pipelines, and I wrote extensively about data engineering, finding that immersion was the quickest way to build proficiency, confidence, and community.
In the spirit of openness, I present an unordered, non-comprehensive list of mistakes I made this year and the lessons I extracted from them.
1. Package Version Management
I wrote and reviewed numerous Python scripts this year. One common error when moving from merging to deployment is neglecting to verify the required package versions carefully. Thanks to insights from senior engineers, I now use pyenv to create a virtual environment for testing scripts. If deployment fails after a script runs cleanly in that environment, I can confidently rule out dependency mismatches and look for an issue on GCP's end.
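To make that check routine, a short script can compare installed packages against the pins before anything ships. Here is a minimal sketch, assuming a requirements.txt with exact pins (e.g., requests==2.28.1):

```python
# Compare installed package versions against pinned requirements.
# Assumes exact pins of the form "package==X.Y.Z".
from importlib.metadata import version, PackageNotFoundError


def check_pins(path: str = "requirements.txt") -> list[str]:
    """Return human-readable mismatches; an empty list means all good."""
    mismatches = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "==" not in line:
                continue  # skip comments and unpinned entries
            name, pinned = line.split("==", 1)
            try:
                installed = version(name)
            except PackageNotFoundError:
                mismatches.append(f"{name}: not installed (pinned {pinned})")
                continue
            if installed != pinned:
                mismatches.append(f"{name}: installed {installed}, pinned {pinned}")
    return mismatches


if __name__ == "__main__":
    for problem in check_pins():
        print(problem)
```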
2. GCP Cloud Functions Limitations
I faced challenges converting my personal project, a pipeline using the Mint API (actually a third-party package called mintapi), into a cloud function. Despite successfully connecting to Mint and updating a BigQuery table locally, I hit errors when attempting to create a cloud function in my GCP console. The cause was mintapi's reliance on Selenium, which generates and stores files in the working directory. Cloud functions run on a read-only filesystem where only the in-memory /tmp directory is writable, so the workload needed an App Engine app or an alternative deployment method.
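For reference, here is a minimal sketch of that write constraint. The function name and file contents are hypothetical; the point is that only /tmp accepts writes:

```python
# In a GCP cloud function, the filesystem is read-only except for /tmp,
# an in-memory mount that counts against the function's RAM.
import os
import tempfile


def handler(request):
    # open("output.csv", "w") here would raise OSError: the working
    # directory is read-only.
    path = os.path.join(tempfile.gettempdir(), "output.csv")  # /tmp/output.csv
    with open(path, "w") as f:
        f.write("col1,col2\n")
    return f"wrote {path}"
```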
3. PubSub Topic Naming Issues
This mistake was unexpected. While completing a pipeline that gathered data from a Google endpoint, I gave my main function and PubSub trigger the same name, one starting with "google." However, GCP does not allow a PubSub topic name to begin with "goog." Renaming the topic resolved the issue, and the script deployed successfully.
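A sketch of the constraint with the Python client (the project and topic IDs are hypothetical):

```python
# Pub/Sub topic IDs must not begin with "goog"; attempting to create
# one raises an InvalidArgument error.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()

bad = publisher.topic_path("my-project", "google-endpoint-trigger")  # rejected
good = publisher.topic_path("my-project", "endpoint-trigger")        # accepted

publisher.create_topic(request={"name": good})
```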
4. Permission Challenges
Starting a data engineering project can be tough, not for motivational reasons, but because you often lack the necessary permissions or credentials. While communication among stakeholders can be unpredictable, familiarizing yourself with your cloud provider's permission tiers will make troubleshooting authentication issues far less painful.
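One concrete way to troubleshoot is to ask the API which permissions your credentials actually hold. A sketch using the Resource Manager client (the project ID and permission list are hypothetical):

```python
# Check which of a set of permissions the active credentials hold on a
# project, via the testIamPermissions call.
from google.cloud import resourcemanager_v3

client = resourcemanager_v3.ProjectsClient()
response = client.test_iam_permissions(
    resource="projects/my-project",
    permissions=["bigquery.tables.create", "pubsub.topics.create"],
)
# response.permissions contains only the subset you are actually granted.
print(response.permissions)
```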
5. CI/CD GitHub Actions
Surprisingly, a significant portion of my role involves copying and pasting: many of my cloud functions start from an existing template, with a few lines modified before committing. However, as I progressed to orchestration involving components like Docker images and VMs, I found myself developing CI/CD pipelines almost from scratch. Google's updates to cloud function deployment via GitHub Actions also produced many errors that better proofreading would have caught. Always review your code as meticulously as you would an essay; you'll likely uncover mistakes as embarrassing as typos.
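For illustration, here is a minimal sketch of such a workflow, assuming the google-github-actions/deploy-cloud-functions action and a service account key stored in repo secrets. The function name, runtime, and action versions are placeholders; these actions evolve, so check the current docs rather than trusting a pasted version:

```yaml
# .github/workflows/deploy.yml -- deploy a cloud function on push to main
name: deploy-function
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: google-github-actions/auth@v1
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - uses: google-github-actions/deploy-cloud-functions@v1
        with:
          name: my-function       # hypothetical function name
          runtime: python310
          entry_point: main       # function inside main.py to invoke
```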
6. Misuse of SELECT DISTINCT
My co-editor on Learning SQL has discussed the pitfalls of using SELECT DISTINCT, particularly SELECT DISTINCT ON. This year, I faced issues when a query with SELECT DISTINCT returned incorrect row counts, necessitating revisions to my QA report for a stakeholder. Be cautious of over-filtering or misapplying commands in your queries.
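A sketch of the pitfall (table and column names are hypothetical; DISTINCT ON is Postgres syntax):

```sql
-- DISTINCT applies to the whole select list, not a single column.
-- Intended: one row per customer_id.
SELECT DISTINCT customer_id, order_date
FROM orders;
-- Actual: one row per (customer_id, order_date) pair, so counts are
-- inflated whenever a customer ordered on multiple dates.

-- Postgres DISTINCT ON keeps one row per customer, but the ORDER BY
-- silently decides WHICH row survives.
SELECT DISTINCT ON (customer_id) customer_id, order_date
FROM orders
ORDER BY customer_id, order_date DESC;
```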
7. Airflow Errors
While cloud platforms like GCP simplify Airflow installation and DAG creation, it's still easy to err while crafting a Python-based DAG file. My initial mistakes stemmed from misunderstanding task dependencies. Additionally, to prevent DAG failures, it's crucial to write a task that checks for data availability rather than assuming updates are automatic. I progressed to creating separate modules for Python functions instead of embedding them in the DAG file, which requires careful attention to naming conventions and directory structures.
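A minimal sketch of that pattern in an Airflow 2.x DAG, with an availability check gating the transform (the DAG and task names, and the check itself, are hypothetical):

```python
# A sensor task confirms data is available before the transform runs;
# the >> operator declares the dependency explicitly.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.python import PythonSensor


def data_is_ready() -> bool:
    # Replace with a real check, e.g. querying the source for new rows.
    return True


def transform() -> None:
    ...  # in my later pipelines, imported from a separate module


with DAG(
    dag_id="daily_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    wait_for_data = PythonSensor(
        task_id="wait_for_data",
        python_callable=data_is_ready,
        poke_interval=300,  # re-check every 5 minutes
    )
    run_transform = PythonOperator(
        task_id="run_transform",
        python_callable=transform,
    )

    wait_for_data >> run_transform
```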
8. Inconsistent Logging Practices
Throughout the year, I spent considerable time analyzing log entries while designing alert systems. I learned the intricacies of GCP’s logs explorer and the importance of establishing a Google logging client to display logs in the console. It's advisable to set logs at the INFO level to clarify which messages are significant. I've found that dynamic logging, which includes timestamps and operation responses, greatly assists in troubleshooting.
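A sketch of that setup, assuming the google-cloud-logging package (the message contents are hypothetical):

```python
# Route standard-library logging to GCP so INFO-level entries appear
# in Logs Explorer; timestamps are attached automatically.
import logging

import google.cloud.logging

client = google.cloud.logging.Client()
client.setup_logging(log_level=logging.INFO)

# Dynamic details -- which table, how many rows -- are what make the
# entry useful when troubleshooting later.
logging.info("load finished: table=%s rows=%d", "my_dataset.my_table", 1234)
```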
9. Overlooking Config Files
New developers often underestimate the importance of readability in production scripts. Cluttering a script with hardcoded string and integer variables makes maintenance difficult. Storing those values in a separate config file keeps the main script manageable and easier to update, and lets variables be changed safely without risking breakage elsewhere in the script.
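A minimal sketch of the pattern, with hypothetical values:

```python
# config.py -- every environment-specific value lives here
PROJECT_ID = "my-project"
DATASET = "analytics"
LOOKBACK_DAYS = 7

# main.py -- the pipeline logic stays free of hardcoded strings
import config

table = f"{config.PROJECT_ID}.{config.DATASET}.daily_metrics"
print(f"Loading {config.LOOKBACK_DAYS} days into {table}")
```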
10. Excessive Data Retrieval
I recently revisited an older pipeline that was timing out and realized I was redundantly querying data from months past, overwriting unchanged data in my table. By adjusting the process to retain unchanged data and only append new records, I improved efficiency. This change also made the process stateful, ensuring consistent output barring upstream changes.
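A sketch of the incremental approach with the BigQuery client (the table name and the fetch_from_source helper are hypothetical):

```python
# Find the latest date already loaded, fetch only newer records, and
# append them instead of re-querying and overwriting months of data.
from google.cloud import bigquery

client = bigquery.Client()
TABLE = "my-project.analytics.daily_metrics"

latest = next(iter(
    client.query(f"SELECT MAX(event_date) AS d FROM `{TABLE}`").result()
)).d

new_rows = fetch_from_source(since=latest)  # hypothetical source call

client.load_table_from_json(
    new_rows,
    TABLE,
    job_config=bigquery.LoadJobConfig(write_disposition="WRITE_APPEND"),
).result()
```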
11. Neglecting Quality Assurance
A vital skill in any data discipline is effective quality assurance (QA). As a novice engineer, many of my mistakes arose from rushed or nonexistent QA processes. QA is tedious but essential; it helps create robust pipelines and serves as a protective measure. A well-executed QA process reassures stakeholders and shields you from blame, ensuring transparency and alignment.
12. Docker: Missing Bash Commands
This might be the most humbling mistake I encountered. When developing a Dockerfile that executed multiple Python files, I failed to include "python" before each script name in the bash command. The scripts simply didn't run, despite passing all checks in GitHub. A valuable lesson in how much the smallest details matter.
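A sketch of the fix, with hypothetical script names:

```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt

# Broken: the shell treats the filenames as commands.
# CMD ["bash", "-c", "extract.py && load.py"]

# Fixed: invoke the interpreter explicitly for each script.
CMD ["bash", "-c", "python extract.py && python load.py"]
```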
Conclusion
My guiding principle this year has been "Progress, not perfection." When I joined my team in 2021, I was so focused on making an impact that I neglected to reflect on my growth as someone without a computer science background now working in a technical role. By sharing my mistakes instead of my successes, I hope to illustrate that, like programming, professional development is an iterative journey. Here’s to another year filled with lessons and insights.