Workflow workspace access

While workflow inputs can be freely defined in the CWL definition, and the workflow itself can load and process data however required, including loading it from external sources, the Workflow Runner also provides some advanced functionality to load inputs from various sources more simply, including loading private workspace data that is hosted on the Hub.

The Workflow Runner provides functionality to load STAC Items in an additional STAGEIN step that is run ahead of any workflow steps, whenever a particular input type is used in the workflow. We will discuss this additional functionality later in this page.

The simplest example is just to pass in a public URL pointing to a STAC item as a Directory input, the Workflow Runner will call that URL and download the data as a STAC Catalog, provided it is public, ready for processing. This URL can of course be data taken directly from the DataHub Resource Catalog, for example this Sentinel2_ARD item could be used as an input.

However, if you wish to pass in data from your own workspace, for example when trying to process private data, the WR supports passing in data files from both S3 Object Stores and AWS Block Stores.

Workflow Runner Stage In

In order to make sure the STAGEIN step is run for your workflow, you need to define an input with the type set to `Directory`.

inputs:
  stac:
    label: the image to convert as a STAC item
    doc: the image to convert as a STAC item
    type: Directory

This will trigger the STAGEING step to run before your workflow executes, loading and saving the STAC Item locally in it's own STAC catalog.

The simplest example when using the Workflow Runner STAGEIN step is to define the STAC Item via a public URL. An example workflow can be found here. Note, the STAC Item must be available publicly (as JSON), as well as any assets within the item you wish to download for processing.

{
  "inputs": {
    "stac": "https://raw.githubusercontent.com/EOEPCA/convert/refs/heads/main/stac/eoepca-logo.json"
  }
}

In our first example we will pass in a public URL pointing to a STAC item as a Directory input, the Workflow Runner will make a request to that URL and download the data, constructing a new STAC Catalog, ready for processing in your workflow steps. The directory containing this new STAC Catalog will then be passed as an input to your workflow, which can then find and use the Catalog dataset.

stacItemFile="$(cat "${dir}/catalog.json" | jq -r '[.links[] | select(.rel == "item")][0].href')"

This URL can also be for data available publicly on the DataHub, for example this Sentinel2_ARD item could be used as an input in the same way.

However, if you wish to pass in data from your own workspace, for example when trying to process private data in your block or object stores, or in your private catalog within the Resource Catalogue, the Workflow Runner can also make authorized requests to these locations, without any additional user configuration.

Note, in the following guidance, when we refer to "your workspace" we mean the workspace you are using to execute the workflow, i.e. the one which you are authenticated as when calling the execution endpoint. This does not depend on who deployed the workflow, only who is executing it, should you be calling a public workflow.

Given the same `stac` example above, you could also pass in a STAC item from your Resource Catalog private catalog, which will be prefixed by `api/catalogue/stac/catalogs/user/catalogs/<workspace>`.

{
  "inputs": {
    "stac": "https://prod.eodatahub.org.uk/api/catalogue/stac/catalogs/user/catalogs/<workspace>/catalogs/processing-results/catalogs/path/to/stac/item"
  }
}

Or you can pass in data from your S3 Object Storage by providing an S3 URI, prefixed with the bucket name and workspace name, for example `s3://workspaces-eodhp-prod/<workspace>/...`.

{
  "inputs": {
    "stac": "s3://workspaces-eodhp-prod/<workspace>/item.json"
  }
}

You can also mount Object Store data via HTTPS, using the EODatahub workspaces domain.

{
  "inputs": {
    "stac": "https://<workspace>.prod.eodatahub-workspaces.org.uk/files/workspaces-eodhp-prod/path/to/item.json"
  }
}

You can also stage data directly from your Block Store, which is mounted to the STAGEIN pod as a file system. This data is mounted at the `/workspace/pv-<workspace>/` path, so you can access your data by providing a file path prefixed with such path.

{
  "inputs": {
    "stac": "/workspace/pv-<workspace>/path/to/item.json"
  }
}

Accessing Data within Workflows

We have discussed above how using the STAGEIN functionality of the Workflow Runner can allow you to mount private STAC data into your workflow for processing. However, you can also load private data directly within your workflows, without relying on the STAGEIN step and also allowing you to load non-STAC data.

Accessing S3 data within your workflows

The Workflow Runner automatically generates and provides credentials to access your S3 Object Store data within your workflow steps. These credentials are provided as additional inputs to your workflow, so you will need to import these as in this example.

Here we define additional AWS credential inputs, which must be defined identically, as these will be provided by the Workflow Runner, with the user not having to provide these in their inputs.

inputs:
  AWS_ACCESS_KEY_ID:
    label: aws access key
    type: string
  AWS_SECRET_ACCESS_KEY:
    label: aws secret access key
    type: string
  AWS_SESSION_TOKEN:
    label: aws session token
    type: string

Now that these inputs are part of your workflow, you can use these however you wish. We suggest setting these as environment variables within your workflow definition, to allow S3 clients to detect and load these credentials automatically. The below shows how to set these environment variables within the CWL script itself.

EnvVarRequirement:
  envDef:
    AWS_ACCESS_KEY_ID: $( inputs.AWS_ACCESS_KEY_ID)
    AWS_SECRET_ACCESS_KEY: $( inputs.AWS_SECRET_ACCESS_KEY)
    AWS_SESSION_TOKEN: $( inputs.AWS_SESSION_TOKEN)

This then means the variables can be loaded inside your code, for example in Python.

import os

# You can load the AWS credentials manually from the environment variables
AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY= os.getenv("AWS_SECRET_ACCESS_KEY")
AWS_SESSION_TOKEN= os.getenv("AWS_SESSION_TOKEN")

# Or you can let the boto3 client find these automatically
import boto3

s3_client = boto3.Client("s3")

Accessing Workspace data within your workflows via HTTPS

The Workflow Runner automatically generates and provides credentials to access your workspace data within your workflow steps via HTTPS, either via access to S3 using HTTPS or through the Resource Catalogue API. These credentials are provided as additional inputs to your workflow, so you will need to import these as in this example.

This access is granted based on an additional workspace access token input, which is scoped to the workspace executing the workflow (whether executing a standard workflow or a user service). So first make sure your workflow accepts this additional input. The `WORKSPACE_ACCESS_TOKEN` is generated and provided by the Workflow Runner, there is no need for the user to provide this token as an input.

inputs:
  WORKSPACE_ACCESS_TOKEN:
    label: the workspace-scoped access token
    type: string

Just as with the AWS credentials we suggest you set this input as an environment variable so that you can then use it throughout your workflow when making authorized requests to the Hub.

EnvVarRequirement:
   envDef:
     WORKSPACE_ACCESS_TOKEN: $( inputs.WORKSPACE_ACCESS_TOKEN)

You can then load and pass this token in the authorization header as a Bearer token, when making HTTPS requests to the Hub that need to be authorized. For example when accessing private workspace data in your catalog in the Resource Catalogue. A full workflow and Python script to test this access is provided here.

import os
import requests

WORKSPACE_ACCESS_TOKEN = os.getenv("WORKSPACE_ACCESS_TOKEN", "your_api_token_here")

url = "https://prod.eodatahub.org.uk/api/catalogue/stac/catalogs/user/catalogs/<workspace>/path/to/item"
response = requests.get(url, headers={"Authorization": f"Bearer {WORKSPACE_ACCESS_TOKEN}"})

Accessing Workspace data within your workflows via Mounted Block Store

The Workflow Runner also mounts your workspace Block Store into each step of your workflow. This means you can access files in your Block Store just as if using local files within your workflow.

The Block Store is mounted at a specific file prefix, `/workspace/pv-<workspace>/...`, so you can access files here just as you would any other files in your workflow. This code first lists the files in the Block Store mount and then loads an example JSON file as a dictionary. You could add some logging statements to this example to print out the available files and file contents.

import os

workpace = "<workspace>"

block_store_dir = f"/workspace/pv-{workspace}"

files = os.listdir(block_store_dir)

if files:
  with open(f"{block_store_dir}/file.json", "r") as f:
    json_file = f.read()
    json_dict = json.loads(json_file)

Managing datasets in your workspace

To first manage datasets in your workspace Object and Block Stores you can use the Notebooks (Jupyter Notebooks) application on the Hub. This allows you to upload data and create directories to organize your data as you wish. Make sure you open the Notebook in the correct workspace, which you can select from the drop-down at the top when starting your server.

You are also able to harvest datasets into your catalog within the Resource Catalogue, using the data loader functionality, available in the left-side menu on the workspaces UI. This will allow you to load STAC directly into your catalog and access this via the Resource Catalogue API.

Once you have the files you want to use in your Stores or Resource Catalogue you can construct a workflow that accesses this data either:

Via STAGEIN when loading STAC Items for processing
Via workflow access using S3 credentials, HTTPS or Block Store mounts