
Massive slowdown of DAG File processor due to JSON schema upgrade in Airflow's core. #28059


Closed · 2 tasks done
GeorgiosNikolopoulos opened this issue Dec 2, 2022 · 3 comments
Labels: area:core, kind:bug


GeorgiosNikolopoulos commented Dec 2, 2022

Apache Airflow version

2.4.3

What happened

We recently updated our dev environment from Airflow 2.2.5 (Python 3.9) to 2.4.3 (Python 3.10). For our workloads, we use a DAG Factory that parses JSON files and converts them into DAGs. In Airflow 2.2.5, the DAG Factory needed approximately 30-140 seconds to generate 100 DAGs. In Airflow 2.4.3, the same DAGs required considerably more time to load (from 3x to over 5x in some tests).
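For illustration, the pattern looks roughly like the sketch below (the paths, spec layout, and names are made up for this example; they are not our actual factory code):

```python
# Illustrative sketch only: paths, the spec layout, and names are invented
# for this example; this is not our actual factory code.
import json
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.models.param import Param
from airflow.operators.python import PythonOperator


def build_dag(spec: dict) -> DAG:
    # Each entry in spec["params"] becomes an Airflow Param; its kwargs are
    # turned into a JSON Schema that jsonschema validates at parse time.
    params = {
        name: Param(default=cfg.get("default"), type=cfg["type"])
        for name, cfg in spec["params"].items()
    }
    dag = DAG(
        dag_id=spec["dag_id"],
        start_date=datetime(2022, 1, 1),
        schedule_interval=None,
        params=params,
    )
    PythonOperator(task_id="noop", python_callable=lambda: None, dag=dag)
    return dag


# Expose one module-level DAG per JSON spec so the DagFileProcessor picks it up.
for spec_file in Path(__file__).parent.glob("specs/*.json"):
    spec = json.loads(spec_file.read_text())
    globals()[spec["dag_id"]] = build_dag(spec)
```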

We investigated by running the DagFileProcessor directly under Scalene (a Python profiler) and discovered the following:
[Screenshot: Scalene profiling output, 2022-12-02]

A huge percentage of CPU time was spent on JSON validation at line 91 of models/param.py. Indeed, our Factory generates quite a few params per DAG it creates, so it would make sense for it to need some time to validate all of them. However, upgrading Airflow shouldn't result in such a big, flat increase in parsing time, and we figured that jsonschema was the probable culprit.
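For reference, the per-param hot path boils down to a call like this (a paraphrase of what models/param.py does, not the exact Airflow source):

```python
# A paraphrase of the hot path (not the exact Airflow source): resolving a
# Param validates its value against the schema built from the Param's kwargs.
import jsonschema
from jsonschema import FormatChecker

schema = {"type": "integer"}  # e.g. from Param(default=42, type="integer")
value = 42

jsonschema.validate(value, schema, format_checker=FormatChecker())
```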

To verify that JSON validation was the reason for the increase, we checked Airflow's dependencies and found that the official image for 2.2.5 uses jsonschema 3.2.0, while the official image for 2.4.3 uses jsonschema 4.17.3.

As a final test, we uninstalled jsonschema 4.17.3 from our image and replaced it with 3.2.0. The DAG Factory immediately ran as expected, taking approximately 30 seconds to load 100 DAGs when the cluster was under little load, or about 100-140 seconds when the cluster was under heavy load.
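The difference can also be seen in isolation, outside Airflow, with a rough micro-benchmark like the sketch below (the schema and the count of 3000 validations, roughly 30 params x 100 DAGs, are invented to approximate our workload):

```python
# Rough micro-benchmark sketch: run once under jsonschema 3.2.0 and once
# under 4.17.3 and compare. The schema and the validation count are invented
# to approximate our workload.
import timeit

import jsonschema

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "batch_size": {"type": "integer"},
        "options": {"type": "object"},
    },
}
value = {"name": "job", "batch_size": 128, "options": {"nested": {"a": 1}}}

elapsed = timeit.timeit(lambda: jsonschema.validate(value, schema), number=3000)
print(f"jsonschema {jsonschema.__version__}: {elapsed:.2f}s for 3000 validations")
```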

Example logs:
Version 2.2.5:
{{processor.py:176}} INFO - Processing /opt/airflow/dags/{other_folders}/{file_name}.py took 125.556 seconds

Version 2.4.3:
{{processor.py:249}} WARNING - Killing DAGFileProcessorProcess (PID=5943)
This occurred constantly with a timeout setting of 300 seconds

What you think should happen instead

Airflow should require the same time to parse 100 DAGs in both versions.

How to reproduce

Create a DAG with many params (ideally over 20-30; the more the better), using mainly string, integer and nested dict types. Check how long it takes to load in Airflow 2.2.5, then do the same in Airflow 2.4.3. There should be a noticeable difference in loading times (at least 3x).
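A minimal repro sketch (all names illustrative) that can be dropped into the dags folder of each version:

```python
# Minimal repro sketch; all names are illustrative. A single DAG with dozens
# of params of mixed types, so parsing it triggers many jsonschema validations.
from datetime import datetime

from airflow import DAG
from airflow.models.param import Param
from airflow.operators.python import PythonOperator

params = {}
for i in range(30):
    params[f"str_param_{i}"] = Param(default="x", type="string")
    params[f"int_param_{i}"] = Param(default=i, type="integer")
    params[f"dict_param_{i}"] = Param(default={"nested": {"depth": i}}, type="object")

with DAG(
    dag_id="many_params_repro",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    params=params,
) as dag:
    PythonOperator(task_id="noop", python_callable=lambda: None)
```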

Operating System

Debian GNU/Linux 11

Versions of Apache Airflow Providers

No relevant providers used

Deployment

Other Docker-based deployment

Deployment details

We use an AKS cluster in combination with a customised Docker image based on the official full image (not slim).

Anything else

This problem may be especially noticeable for us and our deployment due to the way we build DAGs (many params), but it should impact all DAG generation where params are used.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct
GeorgiosNikolopoulos added the area:core and kind:bug labels on Dec 2, 2022

boring-cyborg bot commented Dec 2, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

potiuk (Member) commented Dec 5, 2022

Thanks for the detailed report. They had a similar issue reported before, where 4.0.1 was degrading performance when refs were used: python-jsonschema/jsonschema#853, but apparently they solved it in 4.3.1.
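(For context, the linked issue was about schemas that use "$ref" lookups, roughly like this illustrative one:)

```python
# Illustrative only: the kind of "$ref"-based schema the linked jsonschema
# issue was about. Schemas built from plain Param kwargs usually have no refs.
schema = {
    "definitions": {"positive_int": {"type": "integer", "minimum": 1}},
    "type": "object",
    "properties": {"retries": {"$ref": "#/definitions/positive_int"}},
}
```

So it is worth checking whether your generated schemas use refs at all.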

Could you please provide some examples of the generated jsonschema to validate, and open an issue in the jsonschema repository detailing it? Since you have all the test scenarios, without you we will not know whether the problem is fixed. They seem to react very fast and fix such problems, and they will likely need additional information and iterations, so it makes sense for you to open the issue (you can refer to this one and even copy the content, but maybe provide more info on the validated content).

Also, before that, maybe you should try 4.3.1 and see if it solves the problem; maybe the latest version has some regression. And maybe then it will be easy for you to bisect the version that causes the issue?
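(Something like this rough harness could help with the timing, one run per pinned jsonschema version; the dags path is illustrative:)

```python
# Hypothetical bisection helper: after `pip install "jsonschema==X.Y.Z"`,
# time one full DagBag parse and print the result for that version.
import time

import jsonschema
from airflow.models.dagbag import DagBag

start = time.perf_counter()
bag = DagBag(dag_folder="/opt/airflow/dags", include_examples=False)
elapsed = time.perf_counter() - start

print(
    f"jsonschema {jsonschema.__version__}: parsed {len(bag.dags)} DAGs "
    f"in {elapsed:.1f}s, {len(bag.import_errors)} import errors"
)
```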

potiuk (Member) commented Dec 18, 2022

Since we have not heard from the user, I am converting this into a discussion. Should more information/data be provided, we can consider what to do with it.

apache locked and limited conversation to collaborators on Dec 18, 2022
potiuk converted this issue into discussion #28445 on Dec 18, 2022
