Creating New Packages
Every data science project or data processing pipeline component starts with initializing the project.
-
Create a new directory for your project.
mkdir $project_folder
-
Create a new virtual environment in the project directory and activate it.
Warning
For now, the virtual environment MUST be named .venv to work with the dagpiler package.
cd $project_folder
python -m venv .venv
source .venv/bin/activate # Linux and MacOS
.venv\Scripts\activate # Windows
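The same environment can also be created from Python with the standard-library venv module, which may be convenient in a setup script. This is a minimal sketch, not part of dagpiler: the function name create_project_venv is hypothetical, and activation still has to be done in the shell as shown above.

```python
import venv
from pathlib import Path


def create_project_venv(project_folder: str, with_pip: bool = True) -> Path:
    """Create a virtual environment named .venv inside project_folder."""
    env_dir = Path(project_folder) / ".venv"
    # Roughly equivalent to running `python -m venv .venv` in the project folder.
    # with_pip=True installs pip into the new environment, as the CLI does by default.
    venv.create(env_dir, with_pip=with_pip)
    return env_dir
```

Calling create_project_venv("my_project") (a placeholder folder name) returns the path to the new .venv directory.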
-
Install the dagpiler package using pip.
pip install dagpiler
-
Initialize the project with the dagpiler init command. This creates the folder structure and files needed for the project.
dagpiler init
It will ask you for the following metadata to set up the pyproject.toml and mkdocs.yml files:
name: The name of the package (REQUIRED)
author name: The name of the author (OPTIONAL)
author email: The email of the author (OPTIONAL)
Tip
If you don't want to provide any metadata here (including the package name), you can provide it later in the pyproject.toml and mkdocs.yml files manually. This step is intended only to save you time later.
-
Write the TOML files that define your data processing pipeline components. For more information on the types of TOML files, see the Types of TOML Files page.
-
Compile the TOML files into a Directed Acyclic Graph (DAG), represented as a NetworkX MultiDiGraph object, using the dagpiler compile command. This command checks the TOML files for errors and compiles them into the DAG.
Warning
No matter how you run the compile command, the package name must match a package that has been pip installed in the current folder's virtual environment (.venv).
To run the command from Python:
from dagpiler import compile_dag
package_name = "my_package"
dag = compile_dag(package_name)
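Because compilation depends on the package being installed in the active environment, it can help to verify that first. The following is a minimal sketch using the standard-library importlib.metadata module; the helper name is_installed is hypothetical and not part of dagpiler.

```python
from importlib.metadata import PackageNotFoundError, distribution


def is_installed(package_name: str) -> bool:
    """Return True if package_name is installed in the active environment."""
    try:
        distribution(package_name)
    except PackageNotFoundError:
        return False
    return True


# "my_package" is the placeholder name used in the example above.
if not is_installed("my_package"):
    print("my_package is not installed in this environment; pip install it before compiling.")
```

Run this with the .venv environment activated, since the check only sees packages installed in the interpreter that executes it.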