Organizing code into functions and modules is essential for building maintainable data pipelines.
You learned functions in Core. Here's what's important for data engineering:
def clean_value(value: str, default: str = "") -> str:
    """Clean and normalize a string value.

    Args:
        value: The string to clean
        default: Value to return if input is empty

    Returns:
        Cleaned string, lowercase and stripped
    """
    if not value:
        return default
    return value.strip().lower()
Three things worth noticing in the example above:
- value and default are parameters in the function signature; the actual values you pass at the call site (clean_value(" HELLO ")) are called arguments. The parameter is the slot, the argument is what you put in it.
- default: str = "" declares a default argument value: if the caller omits the second argument, Python uses "". Watch out for the mutable-default gotcha (Knowledge Check #4 below): never use [], {}, or set() as a default.
- The triple-quoted string right after the def line is the docstring: Python stores it in __doc__, IDEs surface it on hover, and help(clean_value) prints it. It is the cheapest way to leave a note for the next reader (often future-you).
<aside> 💡 Always write docstrings. They help your future self and teammates.
</aside>
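The mutable-default gotcha mentioned above is easiest to see by running it. The sketch below is illustrative only: add_record_buggy and add_record are hypothetical names, and using None as a sentinel is one common workaround rather than the only one.

```python
from typing import Optional


def add_record_buggy(record: str, records: list = []) -> list:
    """Append to a default list that is created once, when def runs."""
    records.append(record)
    return records


def add_record(record: str, records: Optional[list] = None) -> list:
    """Append to a fresh list unless the caller supplies one."""
    if records is None:
        records = []  # a new list on every call
    records.append(record)
    return records


first = add_record_buggy("a")
second = add_record_buggy("b")  # silently reuses the list from the first call
print(first is second, second)  # True ['a', 'b']

print(add_record("a"), add_record("b"))  # ['a'] ['b']
print(add_record.__doc__)  # the docstring is stored on the function object
```

The default list in the buggy version is built once, when Python executes the def statement, so every call that omits records shares the same object.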
A module is simply a .py file. You can import functions from it.
# `utils.py`
def clean_value(value):
    return value.strip().lower()

# `main.py`
from utils import clean_value

print(clean_value(" HELLO "))
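There are two common ways to write an import. The sketch below uses the standard-library math module as a stand-in so that it runs without any extra files; the same two styles should apply to your own utils.py.

```python
import math  # style 1: import the module, then use qualified names
from math import sqrt  # style 2: import one name directly into this file

print(math.sqrt(16))  # 4.0
print(sqrt(16))  # 4.0

# Both names point at the same function object.
print(math.sqrt is sqrt)  # True
```

The qualified style keeps it obvious where a function comes from, which helps once a file imports from several modules. The direct style is shorter when you only need one or two names.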
The __name__ == "__main__" Pattern
This pattern relies on two dunder names (Python's convention for double-underscore identifiers) to let a file work both as a module and as a script:
# `utils.py`
def clean_value(value):
    return value.strip().lower()

if __name__ == "__main__":
    # Only runs when executed directly
    print(clean_value(" TEST "))
<aside> ⌨️ Hands on: Create utils.py with a function, then import it in main.py.
</aside>
<aside> Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=1&chapter=functions_and_modules&exercise=modules_demo&lang=python
</aside>
The dunder names in the pattern above have an interesting backstory.
<aside>
Curious Geek: The __name__ == "__main__" idiom
Python files are dual-purpose by design: a .py file can be imported as a library (from utils import clean_value) or run directly as a script (python utils.py). The __name__ variable is how Python tells the file which mode it is in: it is set to the module's name when the file is imported, and to the literal string "__main__" when the file is executed directly. The idiom dates back to Python's early releases in the 1990s and is characteristic of Python: many other languages instead have a dedicated main() function (C, Java, Go) or a separate "entry-point" declaration. Putting setup code under the guard means importers get the function definitions without triggering the file's smoke test or CLI.
</aside>
You will hit this pattern again every time you write a script that should also be importable: it is a cheap way to make a file dual-purpose.
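To watch the two values of __name__ side by side, a small sketch like the following can help. describe_mode is a hypothetical helper written for this illustration, and the standard-library json module stands in for any imported module.

```python
import json


def describe_mode(name: str) -> str:
    """Say whether a module name means 'run directly' or 'imported'."""
    return "script" if name == "__main__" else f"imported as {name}"


# An imported module sees its own name in __name__.
print(describe_mode(json.__name__))  # imported as json

if __name__ == "__main__":
    # This branch runs only when the file is executed directly.
    print(describe_mode(__name__))  # script
```

If you save this file and import it from another file instead of running it, the guarded block is skipped and only the first print runs.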
<aside>
Practice: The week's Practice chapter has two exercises that lean on functions: Ex 1 (the Temperature Logger: write a "production-ready" c_to_f() with a docstring and a default) and Ex 4 (Grade Processor: refactor a 50-line script into named functions). Both run in your venv in a few minutes each.
</aside>
<aside> Try it in the widget: Interactive Quiz: Functions and Modules
</aside>
https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_1_ch4_functions_modules_quiz&embed=1
Knowledge Check
- What does the if __name__ == "__main__": block allow a single file to do, and what is __name__ set to when the file is imported?
- Given utils.py with a function clean(), give two different ways to import and use it from main.py. When would you reach for each?
- Why is a mutable default argument (def add_record(records=[]):) a footgun?
Going further
- *args, **kwargs, and default arguments.
- How import resolves files and the __name__ == "__main__" idiom.
Next up: Type Hints for Clearer Code, where you turn the optional value: str annotations into a habit and meet the tooling (mypy, IDE inference) that uses them to catch bugs before your pipeline runs.
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0. https://hackyourfuture.net/
