A Python decorator puzzle
import pandas as pd
I recently wrote an ETL job to be run in a data processing pipeline. The job fetches data from four database tables in our data lake, stores them in pandas DataFrames and outputs a single DataFrame. So in pseudocode:
def fetch_data_x(t_start: pd.Timestamp, t_end: pd.Timestamp):
    print(f"Fetching data X between {t_start} and {t_end}")
    return t_start, t_end

def fetch_data_y(t_start: pd.Timestamp, t_end: pd.Timestamp):
    print(f"Fetching data Y between {t_start} and {t_end}")
    return t_start, t_end
Since the fetching operations are the most time-consuming part of the pipeline, I wanted to cache the fetched results. I wrote a decorator:
What's wrong with this piece of code?
from pathlib import Path
import pickle
def cache_result(cache_dir='./tmp'):
    def decorator(func):
        def wrapped(t_start, t_end):
            cache_dir = Path(cache_dir)
            function_details = f"{func.__code__.co_name},{t_start.isoformat()},{t_end.isoformat}"
            cache_filepath = cache_dir.joinpath(function_details)
            try:
                print(f"Reading from cache: {cache_filepath}")
                with open(cache_filepath, 'rb') as f:
                    res = pickle.load(f)
            except:
                res = func(t_start, t_end)
                print(f"Writing to cache: {cache_filepath}")
                with open(cache_filepath, 'wb') as f:
                    pickle.dump(res, f)
            return res
        return wrapped
    return decorator
@cache_result()
def fetch_data_x(t_start: pd.Timestamp, t_end: pd.Timestamp):
    print(f"Fetching data X between {t_start} and {t_end}")
    return t_start, t_end

@cache_result()
def fetch_data_y(t_start: pd.Timestamp, t_end: pd.Timestamp):
    print(f"Fetching data Y between {t_start} and {t_end}")
    return t_start, t_end
fetch_data_x(pd.Timestamp('2020-01-01'), pd.Timestamp('2020-02-02'))
What is going on?
This took me quite a while to figure out. I read through a couple of in-depth descriptions of how to write decorators, but was none the wiser.
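In the end, I figured out that this UnboundLocalError has less to do with decorators than with namespaces. In particular, an inner scope may reference a variable defined in an outer scope, but as soon as the inner function assigns to that name anywhere in its body, Python treats the name as local to the whole function, so reading it before the assignment fails. More details follow: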
Referencing a variable defined in an outer scope: OK
foo = "Hello "
def hello(x):
    return foo+x
hello("world")
Assigning to a variable already defined in an outer scope: also fine. A new local variable is created and the variable in the outer scope is not modified.
foo = "Hello "
def hello(x):
    foo = "Hi! "
    return foo+x
hello("world")
print(f"foo is: {foo}")
All hell breaks loose if you do both: reference foo and at the same time try to assign to it:
foo = "Hello "
def hello(x):
    if foo == "Hello":
        foo = "Hi"
    return foo+x
hello("world")
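One way to see why this fails is that Python decides whether a name is local when it compiles the function, not when a particular line runs. The sketch below illustrates this; the helper names `reads_only`, `reads_and_assigns`, `outer` and `inner` are made up for the example and are not part of the ETL job. It inspects the compiled code objects and then reproduces the same error inside a closure, which is the situation in the decorator.

```python
foo = "Hello "

def reads_only(x):
    # No assignment to foo in this body, so foo resolves to the module-level name.
    return foo + x

def reads_and_assigns(x):
    # The assignment below makes foo local for the WHOLE body,
    # so this comparison reads a local that has no value yet.
    if foo == "Hello ":
        foo = "Hi "
    return foo + x

# The compiler records its decision on the code object.
print("foo" in reads_only.__code__.co_varnames)         # False: not a local
print("foo" in reads_and_assigns.__code__.co_varnames)  # True: a local

try:
    reads_and_assigns("world")
except UnboundLocalError as err:
    print(f"UnboundLocalError: {err}")

# The same rule applies to closures (hypothetical stand-in for the decorator).
def outer(cache_dir="./tmp"):
    def inner():
        cache_dir = str(cache_dir)  # assignment makes cache_dir local to inner
        return cache_dir
    return inner

try:
    outer()()
except UnboundLocalError as err:
    print(f"UnboundLocalError: {err}")
```

The exact wording of the error message depends on the Python version, so the sketch prints it rather than assuming it.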
Now the fixed decorator
def cache_result(cache_dir='./tmp'):
    def decorator(func):
        def wrapped(t_start, t_end):
            # Bind the Path to a new name so cache_dir is still read from the outer scope
            cache_dir_path = Path(cache_dir)
            function_details = f"{func.__code__.co_name}_{t_start.isoformat()}_{t_end.isoformat()}"
            cache_filepath = cache_dir_path.joinpath(function_details)
            cache_dir_path.mkdir(parents=True, exist_ok=True)
            try:
                print(f"Reading from cache: {cache_filepath}")
                with open(cache_filepath, 'rb') as f:
                    res = pickle.load(f)
            except Exception:
                # Cache miss or unreadable file: fetch the data and store it
                print(f"Failed to read from cache: {cache_filepath}")
                res = func(t_start, t_end)
                print(f"Writing to cache: {cache_filepath}")
                with open(cache_filepath, 'wb') as f:
                    pickle.dump(res, f)
            return res
        return wrapped
    return decorator
@cache_result()
def fetch_data_x(t_start: pd.Timestamp, t_end: pd.Timestamp):
    print(f"Fetching data X between {t_start} and {t_end}")
    return t_start, t_end

@cache_result()
def fetch_data_y(t_start: pd.Timestamp, t_end: pd.Timestamp):
    print(f"Fetching data Y between {t_start} and {t_end}")
    return t_start, t_end
fetch_data_x(pd.Timestamp('2020-01-02'), pd.Timestamp('2020-02-02'))
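To check that the cache is actually hit on a second call, here is a simplified, self-contained variant of the decorator. It is only a sketch under a few assumptions that differ from the code above: it uses `datetime` from the standard library instead of `pd.Timestamp`, writes into a temporary directory, tests for the file with `exists()` rather than `try`/`except`, and uses a hypothetical date-only file name. The `calls` list is an illustrative helper that records every real fetch.

```python
import pickle
import tempfile
from datetime import datetime
from pathlib import Path

def cache_result(cache_dir="./tmp"):
    def decorator(func):
        def wrapped(t_start, t_end):
            # New local name, so cache_dir remains a closure variable.
            cache_dir_path = Path(cache_dir)
            cache_dir_path.mkdir(parents=True, exist_ok=True)
            # Hypothetical naming scheme: function name plus the two dates.
            name = f"{func.__name__}_{t_start:%Y%m%d}_{t_end:%Y%m%d}.pkl"
            cache_filepath = cache_dir_path / name
            if cache_filepath.exists():
                with open(cache_filepath, "rb") as f:
                    return pickle.load(f)
            res = func(t_start, t_end)
            with open(cache_filepath, "wb") as f:
                pickle.dump(res, f)
            return res
        return wrapped
    return decorator

calls = []  # records every real (uncached) fetch

with tempfile.TemporaryDirectory() as tmp:
    @cache_result(cache_dir=tmp)
    def fetch_data_x(t_start, t_end):
        calls.append((t_start, t_end))
        return t_start, t_end

    first = fetch_data_x(datetime(2020, 1, 2), datetime(2020, 2, 2))
    second = fetch_data_x(datetime(2020, 1, 2), datetime(2020, 2, 2))

print(first == second)  # the cached value equals the freshly fetched one
print(len(calls))       # number of times the undecorated function ran
```

Because `wrapped` only reads `cache_dir` and never assigns to it, the name shows up in `fetch_data_x.__code__.co_freevars`, meaning it is resolved from the enclosing `cache_result` call.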
Some useful articles on the topic: