Data pipelines are typically run from the command line, not by clicking buttons in an IDE. In this section, you'll learn professional habits for running Python scripts and handling command-line arguments.
# Run a Python script
python script.py
# Explicitly use Python 3.11
python3.11 script.py
# Run a module (useful for packages)
python -m mypackage.script
# Pass arguments to your script
python process.py input.csv output.json
# With named flags
python process.py --input data.csv --output result.json --verbose
<aside> 💡 When your virtual environment is activated, python will point to the correct version. Always activate your venv before running scripts!
</aside>
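Before reaching for a parsing library, it can help to see what Python actually receives from the command line. The following is a minimal sketch, not part of the course material: `describe_args` is a hypothetical helper name, and it only shows that everything typed after `python` arrives as a plain list of strings in `sys.argv`.

```python
import sys


def describe_args(argv: list[str]) -> str:
    """Summarise a raw argument list the way a script would see it."""
    # The first element is the script name; the rest are the user's arguments.
    script, *rest = argv
    return f"script={script} args={rest}"


if __name__ == "__main__":
    # sys.argv holds raw strings only: no types, no flags, no help text.
    print(describe_args(sys.argv))
```

Because every value is an unvalidated string, hand-parsing this list gets tedious quickly once flags and defaults are involved.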
The argparse Module
The argparse module is Python's built-in way to handle command-line arguments professionally.
<aside> 💡 Why use argparse?
- It generates a -h / --help menu automatically based on your code.
- It supports long flags (like --verbose) and short forms (like -v), exactly like standard Linux/Mac tools.
</aside>
# process.py
import argparse


def main():
    parser = argparse.ArgumentParser(
        description="Process a data file."
    )
    parser.add_argument("input", help="Input file path")
    parser.add_argument("output", help="Output file path")
    args = parser.parse_args()
    print(f"Input: {args.input}")
    print(f"Output: {args.output}")


if __name__ == "__main__":
    main()
Run it:
python process.py data.csv result.json
# Output:
# Input: data.csv
# Output: result.json
argparse generates help text automatically:
python process.py --help
# Output:
# usage: process.py [-h] input output
#
# Process a data file.
#
# positional arguments:
#   input       Input file path
#   output      Output file path
#
# options:
#   -h, --help  show this help message and exit
parser.add_argument("input_file", help="The file to process")
parser.add_argument("output_file", help="Where to save results")
Usage: python script.py input.csv output.json
parser.add_argument(
    "--verbose", "-v",
    action="store_true",
    help="Enable verbose output"
)
parser.add_argument(
    "--limit", "-l",
    type=int,
    default=100,
    help="Maximum number of records (default: 100)"
)
Usage: python script.py input.csv output.json --verbose --limit 50
<aside> ⚠️ Optional arguments start with -- (or - for short form). Positional arguments don't have dashes.
</aside>
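One way to see the difference between positional and optional arguments without opening a terminal is to pass an explicit list to `parse_args()`. This is a small sketch that reuses the argument definitions shown above; the `build_parser` function name is an assumption made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with two positional and two optional arguments."""
    parser = argparse.ArgumentParser(description="Demo of positional and optional arguments.")
    parser.add_argument("input_file", help="The file to process")
    parser.add_argument("output_file", help="Where to save results")
    parser.add_argument("--verbose", "-v", action="store_true", help="Enable verbose output")
    parser.add_argument("--limit", "-l", type=int, default=100,
                        help="Maximum number of records (default: 100)")
    return parser


parser = build_parser()

# Passing a list stands in for what the user would type after the script name.
args = parser.parse_args(["input.csv", "output.json", "-v", "--limit", "50"])
print(args.input_file, args.output_file, args.verbose, args.limit)
# prints: input.csv output.json True 50

# Optional arguments fall back to their defaults when omitted.
defaults = parser.parse_args(["input.csv", "output.json"])
print(defaults.verbose, defaults.limit)
# prints: False 100
```

Note that `--limit 50` arrives as the integer `50`, not the string `"50"`, because of `type=int`.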
#!/usr/bin/env python3
"""
Data processing pipeline script.

Usage:
    python pipeline.py input.csv output.json
    python pipeline.py input.csv output.json --verbose --skip-header
"""
import argparse
import sys


def create_parser() -> argparse.ArgumentParser:
    """Create and configure the argument parser."""
    parser = argparse.ArgumentParser(
        description="Process CSV data and output JSON.",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python pipeline.py data.csv result.json
  python pipeline.py data.csv result.json --verbose
  python pipeline.py data.csv result.json --limit 100
"""
    )
    # Positional arguments
    parser.add_argument(
        "input",
        help="Path to input CSV file"
    )
    parser.add_argument(
        "output",
        help="Path to output JSON file"
    )
    # Optional arguments
    parser.add_argument(
        "--verbose", "-v",
        action="store_true",
        help="Enable verbose output"
    )
    parser.add_argument(
        "--limit", "-l",
        type=int,
        default=None,
        help="Limit number of records to process"
    )
    parser.add_argument(
        "--skip-header",
        action="store_true",
        help="Skip the first row of the CSV"
    )
    return parser


def process_data(input_path: str, output_path: str,
                 verbose: bool = False, limit: int | None = None,
                 skip_header: bool = False) -> None:
    """Process data from input to output."""
    if verbose:
        print(f"Reading from: {input_path}")
        print(f"Writing to: {output_path}")
        # Compare against None so that an explicit --limit 0 is still reported
        if limit is not None:
            print(f"Limiting to {limit} records")
        if skip_header:
            print("Skipping the header row")
    # Actual processing would go here
    print("Processing complete!")


def main() -> int:
    """Main entry point."""
    parser = create_parser()
    args = parser.parse_args()
    try:
        process_data(
            input_path=args.input,
            output_path=args.output,
            verbose=args.verbose,
            limit=args.limit,
            skip_header=args.skip_header
        )
        return 0  # Success
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        return 1  # Failure


if __name__ == "__main__":
    sys.exit(main())
<aside> 💡 The pattern of main() returning an exit code and sys.exit(main()) is professional practice. Exit code 0 means success, non-zero means failure.
</aside>
# Integer argument
parser.add_argument("--count", type=int, default=10)
# Float argument
parser.add_argument("--threshold", type=float, default=0.5)
# Readable file (argparse opens it for you and reports an error if it cannot)
parser.add_argument("--config", type=argparse.FileType("r"))
parser.add_argument(
    "--format",
    choices=["json", "csv", "xml"],
    default="json",
    help="Output format (default: json)"
)
parser.add_argument(
    "--log-level",
    choices=["DEBUG", "INFO", "WARNING", "ERROR"],
    default="INFO",
    help="Logging level"
)
parser.add_argument(
    "--api-key",
    required=True,
    help="API key (required)"
)
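When a value fails a `type` conversion or is not one of the allowed `choices`, argparse prints a usage message to stderr and exits with code 2. The sketch below makes that visible by catching the `SystemExit` that `parse_args()` raises; `try_parse` and the program name `export.py` are hypothetical names used only for this illustration.

```python
import argparse

parser = argparse.ArgumentParser(prog="export.py")
parser.add_argument("--format", choices=["json", "csv", "xml"], default="json")
parser.add_argument("--count", type=int, default=10)


def try_parse(argv: list[str]):
    """Return the parsed namespace, or the exit code argparse would have used."""
    try:
        return parser.parse_args(argv)
    except SystemExit as exc:
        # argparse has already printed its error message to stderr here.
        return exc.code


ok = try_parse(["--format", "csv", "--count", "3"])
print(ok.format, ok.count)
# prints: csv 3

print(try_parse(["--format", "yaml"]))   # not in choices
# prints: 2

print(try_parse(["--count", "abc"]))     # not an integer
# prints: 2
```

Catching `SystemExit` like this is mainly useful for demonstrations and tests; a normal script simply lets argparse exit.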
Exit codes tell the calling process whether your script succeeded or failed.
import sys


def main():
    error_occurred = False  # Placeholder: your real processing logic sets this
    # ... do work ...
    if error_occurred:
        print("Error: something went wrong", file=sys.stderr)
        sys.exit(1)  # Non-zero = failure
    print("Success!")
    sys.exit(0)  # Zero = success
Common exit codes:
- 0: Success
- 1: General error
- 2: Command-line usage error
<aside> ⌨️ Hands-on: Create a script greet.py that accepts a --name argument (default: "World") and prints "Hello, {name}!". Add a --loud flag that makes it print in uppercase.
</aside>
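To see exit codes from the caller's side, the sketch below plays the role of the "calling process" (a scheduler or shell script, for instance) and uses the standard `subprocess` module to run tiny Python snippets. `run_snippet` is a hypothetical helper written for this illustration.

```python
import subprocess
import sys


def run_snippet(code: str) -> int:
    """Run a one-line Python program in a child process and return its exit code."""
    # sys.executable is the interpreter running this script (your active venv).
    completed = subprocess.run([sys.executable, "-c", code])
    return completed.returncode


print(run_snippet("import sys; sys.exit(0)"))
# prints: 0

print(run_snippet("import sys; sys.exit(1)"))
# prints: 1

# A pipeline runner typically stops as soon as a step reports failure.
if run_snippet("import sys; sys.exit(1)") != 0:
    print("Step failed, stopping the pipeline")
```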
Sometimes you want to configure scripts via environment variables instead of command-line arguments.
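<aside> 💡 Why use environment variables? They keep secrets such as API keys out of the command you type (and out of your shell history), and they let the same script run with different settings on different machines.
</aside>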
import argparse
import os
import sys


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--api-key",
        default=os.environ.get("API_KEY"),
        help="API key (or set API_KEY environment variable)"
    )
    args = parser.parse_args()
    if not args.api_key:
        print("Error: API key required", file=sys.stderr)
        sys.exit(1)
    print(f"Using API key: {args.api_key[:4]}...")


if __name__ == "__main__":
    main()
Usage:
# Via argument
python script.py --api-key secret123
# Via environment variable
export API_KEY=secret123
python script.py
# Inline (Linux/macOS)
API_KEY=secret123 python script.py
<aside> ⚠️ Never print full API keys or passwords! The example above only prints the first 4 characters.
</aside>
- Print errors to stderr with print(..., file=sys.stderr)
- Support -h / --help (argparse does this automatically)
- Use if __name__ == "__main__" to allow importing without running
<aside> 🤓 Curious Geek: argparse, optparse, and getopt
Python has had three command-line argument parsers in its standard library. getopt (Python 1.x) was a thin port of the C library function: minimal, tedious, no built-in help. optparse (Python 2.3, 2003) added richer parsing but used confusing class-based callbacks. argparse (Python 2.7 / 3.2, 2010) replaced both via PEP 389, introducing the declarative add_argument() style this chapter teaches. optparse was officially deprecated in Python 3.2; it still ships for backwards compatibility but the docs explicitly point users at argparse. The practical takeaway: any code using optparse you find in a 2010s tutorial should be rewritten as argparse.
</aside>
The Week 1 practice set ends with one CLI exercise where every habit above shows up in the same script.
<aside>
📝 Practice: The week's Practice chapter has Exercise 6 (the Pipeline CLI), a tiny CSV cleaner that you fix in three places: parse --input / --output with argparse, exit non-zero on bad input, and add --verbose for debug logging. It is the closest the practice set gets to a real production script.
</aside>
<aside> 🚀 Try it in the widget: Interactive Quiz: CLI Habits
</aside>
https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_1_ch6_cli_habits_quiz&embed=1
- Why is input() (asking for user typing) bad for automated data pipelines?
- What does an exit code of 0 usually signal to the operating system?
- How does argparse help other people use your script?
Next up: Errors and Debugging, where you learn to read Python tracebacks line by line and use the VS Code debugger to step through a failing script instead of guessing what went wrong.