Ask AI

You are viewing an unreleased or outdated version of the documentation

Understanding how asset definitions relate to ops and graphs#

Dagster’s main abstraction for building data pipelines is the asset definition. However, Dagster also has abstractions called ops and graphs.

If you're not sure which one to use, this guide is for you. In this guide, we'll cover:

  • When you should use assets or ops and graphs
  • How assets, ops, and graphs relate to each other

When should I use assets or ops and graphs?#

Dagster is mainly used to build data pipelines, and most data pipelines can be expressed in Dagster as sets of asset definitions. If you’re a new Dagster user and your goal is to build a data pipeline, we recommend starting with asset definitions and not worrying about ops or graphs. This is because most of the code you’ll be writing will directly relate to producing data assets.

However, there are some situations where you want to run code without thinking about data assets that the code is producing. In these cases, it’s appropriate to use ops and graphs. For example:

Situation 1: You’re not building a data pipeline#

You want to schedule a workflow where the goal is not to keep a set of data assets up-to-date. It might do something like:

  • Send emails to a set of users
  • Scan a data warehouse for tables that haven't been used in months and delete them
  • Record metadata about a set of data assets

In these cases, you should define your workflow in terms ops and graphs, not asset definitions. The Intro to ops and jobs guide is a good place to start learning how to do this.

Additionally, note that a single Dagster deployment can contain asset definitions and op/graph-based jobs side-by-side, which means that you’re not bound to one particular choice. If your workflow reads from asset definitions, you can model that explicitly in Dagster, which is discussed in a a later section.

Situation 2: You want to break an asset into multiple steps#

If you're in a situation like the following:

  • An asset requires multiple steps
  • Some of the steps don't produce assets of their own
  • You need to be able to re-execute individual steps

In this case, you might want to use a graph-backed asset. This is discussed more later in this guide.

Situation 3: You’re anchored in task-based workflows#

Task-based workflows have been a popular way of defining data pipelines for a long time. While we believe that asset definitions provide a superior way of writing and operating data pipelines, we acknowledge that teams often have existing codebases or mindsets that are heavily anchored in task-based workflows.

Op-based graphs resemble task-based workflows very closely, so they’re a natural choice for data pipelines that want to stick to that paradigm, either permanently or temporarily, while migrating to asset definitions.


How do assets relate to ops and graphs?#

Next, we'll discuss how assets relate to ops and graphs. By the end of this section, you should understand how each type of asset definition relates to ops and graphs.

Basic asset definitions#

An asset definitions is a description of how to compute the contents of a particular data asset.

Under the hood, every asset definition contains an op (or graph of ops), which is the function that’s invoked to compute its contents. In most cases, the underlying op is invisible to the user.

Op and basic asset definition

Multi-asset definitions#

When you use the @multi_asset decorator, you’re defining a single op that produces multiple assets:

Multi- asset definition

Graph-backed asset definitions#

Dagster supports composing a set of ops into an op graph, usually by using the @graph decorator. An asset definition can be backed by an op graph, instead of an op.

Op graph and graph-backed asset

Graph-backed assets are useful when you want to execute multiple separate steps to compute an asset and some of those steps don’t produce assets of their own.

For example, to compute the contents of a table, you need to fetch data from an API and then perform a heavy data transformation on it. You don’t care about writing the fetched, pre-transformed data to any known location, but you want the fetching and transforming to happen in two separate steps that can run in different processes. If there’s a failure, you’d like to be able to re-execute the transformation step without re-executing the fetching step.

Refer to the Graph-backed asset documentation for code examples and details on how to define and use graph-backed assets.

Op graphs that read from an asset#

In some cases, you might want to build a job that doesn't produce any assets, but does read from at least one asset. Dagster facilitates this by allowing you to designate assets as inputs to ops within a graph or graph-based job:

Op graph with source asset

For example, you have a table that represents a list of emails that you want to send. A job reads data from the table and uses it to send the emails:

from dagster import asset, job, op


@asset
def emails_to_send(): ...


@op
def send_emails(emails) -> None: ...


@job
def send_emails_job():
    send_emails(emails_to_send.get_asset_spec())

In this case, the asset - specifically, the table the job reads from - is only used as a data source for the job. It’s not materialized when the graph is run.

The Graph documentation contains more details on how this works.