# Creating Workflows

## Objective

Demonstrate how to use the `wf_spec` submodule to create Argo Workflow resources.
## Prerequisites
You should ensure you have completed the following guides before continuing:
- Installing the Argo Workflow Engine
- Developing First-party IaC Modules
- Argo's Getting Started Guide for Argo Workflows
## Background
While Argo Workflows offer incredible power and flexibility, these benefits come at the cost of increased complexity compared to simpler but less robust tools. Building a Workflow from scratch can be a daunting endeavor due to the number of configuration knobs available.

We have created a submodule, `wf_spec`, that aims to simplify the process by applying sane defaults that are compatible with the Panfactum Stack architecture.
`wf` is shorthand for "workflow," but why `spec`?

Well, a Workflow only defines a single run of the workflow execution graph. This is not very useful when you want to run the same execution graph many times. Thus, instead of creating Workflows directly, Workflows are typically created via Workflow Templates or Cron Workflows.[^1] `wf_spec` creates the "specification" of the Workflow that is added to a resource such as a Workflow Template.
We show a concrete example below.
## The Basics

This is a very simple example workflow that demonstrates the basics of using our `wf_spec` submodule:[^2]
module "workflow_spec" {
source = "${var.pf_module_source}wf_spec${var.pf_module_ref}"
name = "example"
namespace = local.namespace
eks_cluster_name = var.eks_cluster_name
entrypoint = "entry"
templates = [
{
name = "entry"
dag = {
tasks = [
{
name = "first-task"
template = "first-template"
},
{
name = "second-task"
template = "second-template"
dependencies = [ "first-task" ]
}
]
}
},
{
name = "first-template"
container = {
image = "docker.io/library/busybox:stable"
command = ["/bin/sh", "-c", "echo first-template"]
}
},
{
name = "second-template"
container = {
image = "docker.io/library/busybox:stable"
command = ["/bin/sh", "-c", "echo second-template"]
}
tolerations = concat(module.workflow_spec.tolerations, [
{ key = "foo", operator = "Exists", effect = "NoSchedule" }
])
}
]
}
resource "kubectl_manifest" "workflow_template" {
yaml_body = yamlencode({
apiVersion = "argoproj.io/v1alpha1"
kind = "WorkflowTemplate"
metadata = {
name = "example"
namespace = local.namespace
labels = module.workflow_spec.labels
}
spec = module.workflow_spec.workflow_spec
})
server_side_apply = true
force_conflicts = true
}
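As noted in the Background, the same specification can instead be attached to a Cron Workflow to run the execution graph on a schedule. A hypothetical variant of the resource above (the `schedule` value is a placeholder):

```hcl
# Embed the same spec output in a CronWorkflow; its workflowSpec field
# takes the same specification as a WorkflowTemplate's spec field.
resource "kubectl_manifest" "cron_workflow" {
  yaml_body = yamlencode({
    apiVersion = "argoproj.io/v1alpha1"
    kind       = "CronWorkflow"
    metadata = {
      name      = "example-cron"
      namespace = local.namespace
      labels    = module.workflow_spec.labels
    }
    spec = {
      schedule     = "0 * * * *" # hourly; placeholder value
      workflowSpec = module.workflow_spec.workflow_spec
    }
  })

  server_side_apply = true
  force_conflicts   = true
}
```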
A few critical points to notice:
- In the example, `module.workflow_spec` instantiates the Panfactum `wf_spec` module, but critically this module does not create a Workflow; it only creates the specification for a Workflow.

  However, the above example does create a WorkflowTemplate resource via `kubectl_manifest.workflow_template`. This WorkflowTemplate uses the `workflow_spec` output of the `module.workflow_spec` module for its `spec` field.

- The `templates` array defines all the nodes in the workflow's execution graph. There are many different types of Templates, but the most common are:

  - A Container Template (for running a single container in a Pod)
  - A DAG Template (for defining an execution graph of other templates)
  - A Container Set Template (for defining an execution graph of containers within a single Pod)
  - A Kubernetes Resource Template (for performing CRUD operations against Kubernetes resources)
  - A Suspend Template (for suspending a Workflow)

- You might want to override one or more pod-level defaults for a template (for example, providing certain tolerations for some pods but not others). See the `second-template` template above as an example.

  Note that OpenTofu supports self-references (i.e., using a module's outputs as a part of its inputs) since values are lazily evaluated. As a result, we have given the `wf_spec` module outputs such as `tolerations`, `volumes`, `env`, etc., that can be used as building blocks for overrides (see the sketch after this list).

  Finally, note that any overrides are shallow-merged by key, so if `concat` had not been used in the example, the default tolerations would have been dropped.

- Defining the execution graph of a Workflow is typically done via a DAG (Directed Acyclic Graph) template. This template references other templates and defines the order and conditions in which to execute them.

- The `entrypoint` defines which template from the `templates` array will be initially executed when the Workflow is created.

  Our example above starts by executing the DAG template (`entry`), which in turn runs the `first-template` Container template and then the `second-template` Container template.
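The override pattern extends beyond tolerations. Here is a minimal sketch adding an extra environment variable to one template's container, assuming the module's `env` output is a list of Kubernetes `{ name, value }` maps:

```hcl
{
  name = "second-template"
  container = {
    image   = "docker.io/library/busybox:stable"
    command = ["/bin/sh", "-c", "echo $EXTRA_VAR"]
    # Reuse the module's default env vars as building blocks by
    # concatenating them with template-specific additions.
    env = concat(module.workflow_spec.env, [
      { name = "EXTRA_VAR", value = "example" }
    ])
  }
}
```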
## Field Casing in Templates

When defining templates for your workflow spec, you must use camelCase instead of snake_case, as we pass the `templates` input directly through to the generated Kubernetes manifest (and Kubernetes expects camelCase).
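For instance, a template-level scheduling hint would use the Kubernetes field name `nodeSelector`, not `node_selector`. A purely illustrative template entry:

```hcl
{
  name = "pinned-template"
  container = {
    image   = "docker.io/library/busybox:stable"
    command = ["/bin/sh", "-c", "echo pinned"]
  }
  # camelCase, because this object is passed through verbatim to the manifest
  nodeSelector = {
    "kubernetes.io/arch" = "amd64"
  }
}
```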
## Resource Requests and Limits

The VPA unfortunately does not work for pods generated by Workflows. As a result, you must explicitly set the resource requests and limits for each container.

You can also use the `default_resources` input of `wf_spec` to set a sensible default for all containers.
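For example, a per-container setting uses the standard Kubernetes `resources` block (the values here are placeholders):

```hcl
{
  name = "first-template"
  container = {
    image   = "docker.io/library/busybox:stable"
    command = ["/bin/sh", "-c", "echo first-template"]
    # Standard Kubernetes resources block (camelCase, passed through verbatim)
    resources = {
      requests = {
        cpu    = "100m"
        memory = "100Mi"
      }
      limits = {
        memory = "130Mi"
      }
    }
  }
}
```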
## Advanced Workflow Patterns

For more details and examples of common workflow patterns that you might wish to deploy, see the documentation for `wf_spec`.
## Workflow Checklist

Below is a set of questions that we recommend asking when you create a Workflow using `wf_spec` (a consolidated sketch showing several of these inputs follows the list):

- Can your pods run on `arm64` nodes? If not, did you set `arm_nodes_enabled` to `false`?[^3]
- Can your pods tolerate disruptions? If not, set `disruptions_enabled` to `false` to avoid all voluntary disruptions. Additionally, you might want to set `spot_nodes_enabled` to `false` to avoid the (small) chance of spot node preemption.[^4]
- Are your pods a good fit for burstable spot (T-family) instances? If so, set `burstable_nodes_enabled` to `true` to save additional money.
- How many times do you want each node in the workflow to retry on failure? Have you set `retry_max_attempts` to the appropriate value? Have you tuned the retry backoff parameters, `retry_backoff_initial_duration_seconds` and `retry_backoff_max_duration_seconds`?
- Can multiple instances of this workflow be running at once? If so, have you increased `workflow_parallelism`?
- Do you want to limit the number of active pods that can be running at a single time in a workflow? If so, have you set `pod_parallelism`?
- Have you set the appropriate resource requests and limits for all your containers?
- Have you set an appropriate timeout for the workflow via `active_deadline_seconds`?
- Do the pods in the workflow need to communicate with the AWS API? Have you added the necessary permissions via `extra_aws_permissions`?
- Do the pods in the workflow need to communicate with the Kubernetes API? Have you added the necessary permissions to the ServiceAccount (referenced via the `wf_spec.service_account_name` output)?
- Do your workflow containers need to write to disk? If so, have you set `tmp_directories`?
- Do your workflow containers need access to a common set of environment variables? If so, have you set `common_env` and/or `common_secrets`?
- Do your workflow containers need access to mounted Secrets or ConfigMaps? If so, have you set `config_map_mounts` and/or `secret_mounts`?
- Can your workflow containers run as user `1000`? If not, have you explicitly set `uid`? If the containers need to run as root, have you set `run_as_root` to `true`?
- Do you need to parameterize the Workflow? If so, did you add `arguments`?
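For orientation, here is a hypothetical sketch that sets several of the inputs above. The values are placeholders, and the exact shapes of inputs such as `tmp_directories` and `common_env` are assumptions; consult the `wf_spec` module documentation for the authoritative interface.

```hcl
module "workflow_spec" {
  source = "${var.pf_module_source}wf_spec${var.pf_module_ref}"

  name             = "example"
  namespace        = local.namespace
  eks_cluster_name = var.eks_cluster_name

  # Scheduling: suppose these pods cannot run on arm64 or spot nodes
  arm_nodes_enabled  = false
  spot_nodes_enabled = false

  # Retries: up to three attempts with a capped exponential backoff
  retry_max_attempts                     = 3
  retry_backoff_initial_duration_seconds = 10
  retry_backoff_max_duration_seconds     = 300

  # Fail the workflow if it runs for longer than one hour
  active_deadline_seconds = 3600

  # Writable scratch space and shared environment variables for all
  # containers (input shapes assumed for illustration)
  tmp_directories = { "scratch" = { mount_path = "/scratch" } }
  common_env      = { LOG_LEVEL = "info" }

  entrypoint = "entry"
  templates  = [/* ... as in the example above ... */]
}
```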
## Examples

- The Argo project provides 100+ examples of various Workflow patterns.
- We provide many prebuilt WorkflowTemplates that you can examine for best practices.
[^1]: Similar to how Pods are created and managed by a higher-level resource like a Deployment.

[^2]: This example obviously omits some basic module setup such as deploying a `kube_namespace` and defining input variables.

[^3]: Running on `arm64` nodes will save about 20% of your workflow costs, so it might be worth it to create a multi-platform image. See our guide.

[^4]: Note that both of these will significantly increase the cost of running this workflow. If the workflow runs very often, you may want to architect your workflow to be able to tolerate the occasional disruption so you can at least keep `spot_nodes_enabled` as `true`.