Skip to content

October 4, 2024

High-level DAG compilation design

To compile a DAG representing a data processing pipeline from a set of configuration files, we need to perform three steps at the highest level (at least I think it's the highest level):

  1. Read the configuration files

  2. Clean & validate the configuration files' data

  3. Construct the DAG from that data

Here's a figure of the workflow. The config file paths are read, their data returned as dicts. After validation and cleaning, the data is returned as instances of custom classes. Finally, these classes are inputs to constructing the DAG. Figure showing the three step process described above, and their inputs and outputs

Note that this seems to only be the workflow for just one package's configuration files. But an important part of this whole process is managing dependencies of other projects. For that, we need a mechanism to list and install dependencies. The pyproject.toml file is Python's modern mechanism to do exactly that. It replaces older mechanisms using setuptools.

The problem this package needs to solve is that while the pyproject.toml file is readily available to the person developing the package, it is not typically installed along with the package when someone else pip install's it. How can I resolve this?

Option 1: Include the pyproject.toml file in the distributed package.

To include the pyproject.toml file in the distributed package, modify the pyproject.toml file to contain the following:

[tool.hatch.build]
include = [
    "pyproject.toml",
    "src/package_dag_compiler/**"
]
The idea is that I could loop through each subfolder in the virtual environment folder, checking for and reading each folder's pyproject.toml files to find all of the dependencies for a specific package. But I think Python and package versioning presents a problem, unless I want to loop through all of the folders.

Side detour: Learning about virtual environment folder structure

Using Python's venv module, the (abbreviated) folder structure is:

venv/
    bin/
        activate
    include/
    lib/
        python3.X/
            site-packages/
                package_name/
    pyvenv.cfg

Looking at the natsort Python package, I am learning a bit about how these packages are typically structured. The package is installed in the site-packages folder, and the package's source code is in the natsort folder. The natsort folder contains the __init__.py file, which is the package's entry point for import. The natsort folder also contains the __main__.py file, which is the package's entry point when running the package as a script. Looking at my own package, the package structure is root/src/package_name, but when installed, the package structure is root/site-packages/package_name.

Therefore, in the index.toml files, the relative file paths should be relative to the src/package_name folder, because that's the root folder when the package is installed. Therefore, the pyproject.toml may not actually need to have a path to the index.toml at all! I'll just assume that the index.toml is located in src/package_name.

But then the question is, without the inclusion of the pyproject.toml file, how do I know what the dependencies are? I think I need to include the pyproject.toml file in the distributed package instead of making people list dependencies in a separate file.

Maybe the better format is just for the index.toml file to contain the dependencies AND the paths to the configuration files? That way, the pyproject.toml file is not needed at all.

So, the index.toml file should contain the following:

dependencies = [] # Indicates that there are no dependencies
bridges = [] # Indicates that there are no bridges
paths = [
    "config_file1.toml",
    "config_file2.toml"
]

It also appears that whether or not I include the following dictates whether the package's sdist includes the "src" folder. E.g. if the below is included in the pyproject.toml, and I have "src/package_name/test.py", then when the sdist is created, it'll be "site-packages/package_name/test.py". If the below is not included, then the sdist and project directory folders will exactly match. It seems that the contents of this line are reflected in the dist/package_name.version.tar.gz file, but not in the installed package.

[tool.hatch.build]
include = [
    "src/package_dag_compiler/**"
]

I am also learning a valuable lesson about the difference between sdists (compressed .py files) and wheels (compiled binaries of the package's code). From the Python Packaging User Guide:

  1. Do I need both sdists and wheels?

    • "For pure-Python packages, the difference between sdists and wheels is less marked. There is normally one single wheel, for all platforms and Python versions. Python is an interpreted language, which does not need ahead-of-time compilation, so wheels contain .py files just like sdists."
  2. What's in a wheel?

    • "With that being said, there are still important differences between sdists and wheels, even for pure Python projects. Wheels are meant to contain exactly what is to be installed, and nothing more. In particular, wheels should never include tests and documentation, while sdists commonly do."
  3. Are wheels needed for pure Python projects?

    • "At a glance, you might wonder if wheels are really needed for “plain and basic” pure Python projects. Keep in mind that due to the flexibility of sdists, installers like pip cannot install from sdists directly – they need to first build a wheel, by invoking the build backend that the sdist specifies (the build backend may do all sorts of transformations while building the wheel, such as compiling C extensions). For this reason, even for a pure Python project, you should always upload both an sdist and a wheel to PyPI or other package indices. This makes installation much faster for your users, since a wheel is directly installable. By only including files that must be installed, wheels also make for smaller downloads."

Editable installs

This detour has turned into a deep dive on Python packaging, which is helpful. I discovered that essentially, by default the /src/project_name folder is treated as the root folder when the package is built and when it is installed, with the exception that the sdist build has all files if not specified.

The editable installs work by putting a direct_urls.json file in the site-packages/package_name folder. Sometimes it feels finicky, switching between editable and non-editable installs, but it seems to work.