# general
r
Is anyone having success/seeing benefit with managing a repo of Airflow DAGs with Pants? Is there anything to look out for?
c
I'm very much interested in trying that (<https://github.com/pantsbuild/pants/discussions/17858>) but haven't gotten far into a prototype yet.
r
Ok, thanks for the link. In my context we are deploying to GCP Cloud Composer, which is effectively Google’s managed Airflow, much like AWS MWAA.
We’re also currently on Airflow 1. I’m thinking about starting to migrate individual DAGs over to Airflow 2.x, probably with a new repo that introduces Pants from the start.
c
I am working on airflow deployments to MWAA as well. I am trying to transition my team to use a monorepo for data engineering. My current setup is:

1. Code is organised in a similar way to the pants repo itself, basically `src/python/{dags,airflowplugins,libairflow,libspark,utils…}`. `dags` contains the DAGs, `libairflow` contains operators, hooks and sensors, and there are other modules like `libspark` and so on for other stuff. Tests are in their own top-level folder. (Rough sketches of each of these pieces are at the end of this message.)
2. I sync the whole of `src/python` to the `dags` folder on S3. This is especially needed on Airflow v2, since operators/sensors are not supported via plugins; having them in the `dags` folder is the recommendation from airflow.
3. I use an `.airflowignore` file to ensure airflow only scans the `dags` folder.
4. We have 3 resolves. Code in `src/python/{dags,libairflow}` depends on an `airflow-default` resolve, which is py3.7 + the airflow constraints file + our requirements. Similarly we have `spark-default`, which is py3.9 + the databricks LTS recommendations, and a `python-default` for stuff packed into docker containers.
5. Because all the airflow-specific stuff is in an `airflow-requirements.txt` file, I simply upload it to MWAA.
6. The cool thing about pants is that I can use its understanding of transitive dependencies to figure out which airflow DAGs were impacted by a PR and run the full suite of integration tests. This is especially important since testing airflow DAGs is time consuming.

Basically my recommendation is to put all airflow-specific requirements into their own resolve. That ensures you have a `requirements.txt` specific to airflow, and pants ensures your airflow code only depends on library versions compliant with the constraints file. https://www.pantsbuild.org/docs/python-third-party-dependencies#multiple-lockfiles
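For step 1, the BUILD metadata looks roughly like this; the file locations and target names are illustrative, only the `resolve="airflow-default"` value comes from step 4:
```python
# src/python/dags/BUILD (hypothetical sketch)
python_sources(
    name="dags",
    resolve="airflow-default",  # pin DAG code to the airflow resolve from step 4
)

# src/python/libairflow/BUILD (hypothetical sketch)
python_sources(
    name="libairflow",
    resolve="airflow-default",
)
```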
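For step 2, the sync itself can be a one-liner; the bucket name and prefix below are placeholders for wherever your MWAA environment reads DAGs from:
```bash
# push the whole tree (dags + libraries) into the MWAA dags/ prefix;
# --delete removes files on S3 that no longer exist locally
aws s3 sync src/python s3://<your-mwaa-bucket>/dags --delete
```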
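For step 3, the `.airflowignore` just has to stop the scheduler from parsing the library folders that land next to the DAGs after the sync. The entries below assume the module names from step 1; airflow treats each line as a regex:
```
# .airflowignore at the root of the synced folder
airflowplugins
libairflow
libspark
utils
```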
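For steps 4 and 5, the resolves are declared in pants.toml; a minimal sketch, with the lockfile paths being illustrative:
```toml
[python]
enable_resolves = true

[python.resolves]
airflow-default = "3rdparty/python/airflow-default.lock"
spark-default = "3rdparty/python/spark-default.lock"
python-default = "3rdparty/python/python-default.lock"
```
The requirements file that gets uploaded to MWAA can be tied to the same resolve with a `python_requirements` target (location hypothetical):
```python
# 3rdparty/python/BUILD
python_requirements(
    name="airflow-reqs",
    source="airflow-requirements.txt",
    resolve="airflow-default",
)
```
Then `pants generate-lockfiles --resolve=airflow-default` rebuilds just that lockfile.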
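And for step 6, finding the impacted DAGs is just pants' changed-targets support (on older pants versions the second flag is spelled `--changed-dependees`):
```bash
# list every target transitively affected by the diff vs. main
pants --changed-since=origin/main --changed-dependents=transitive list

# or run only the tests that could have been affected
pants --changed-since=origin/main --changed-dependents=transitive test
```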