# Jobs

> Run and manage your Spark artifacts and jobs with oleander

Run your Spark applications on oleander-managed infrastructure or on your own registered Spark clusters. Upload artifacts to oleander for the managed cluster, or keep jobs in your environment for registered clusters. Manage runs and capture lineage metadata for full observability of your data transformations.

## Installation

<CodeGroup>
  ```bash Homebrew theme={null}
  brew tap OleanderHQ/tap
  brew install oleander-cli
  ```

  ```bash Ubuntu (APT) theme={null}
  sudo apt-get update
  sudo apt-get install -y curl ca-certificates gnupg

  curl -fsSL https://oleander-cli-releases.s3.amazonaws.com/keys/oleander-archive-keyring.gpg \
    | sudo tee /usr/share/keyrings/oleander-archive-keyring.gpg >/dev/null

  sudo tee /etc/apt/sources.list.d/oleander.sources >/dev/null <<'EOF'
  Types: deb
  URIs: https://oleander-cli-releases.s3.amazonaws.com/apt
  Suites: stable
  Components: main
  Signed-By: /usr/share/keyrings/oleander-archive-keyring.gpg
  EOF

  sudo apt-get update
  sudo apt-get install -y oleander-cli
  ```
</CodeGroup>

## Configuration

Authenticate with the API key from your [oleander settings](https://oleander.dev/app/settings/api-keys).

```bash theme={null}
oleander configure --api-key <YOUR_API_KEY>
```

## Oleander Managed Spark

The artifact commands (upload, list, and delete) apply only to the oleander-managed cluster.

### Initialize a PySpark workspace

Create a new PySpark job workspace:

```bash theme={null}
oleander spark init <dirname>
```

**Example:**

```bash theme={null}
oleander spark init my-job
```

You can also initialize the current directory:

```bash theme={null}
oleander spark init .
```

The initialized workspace includes:

* `entrypoint.py` as the Spark job entrypoint
* `mylib/` for Python modules packaged as `pyFiles`
* `pyproject.toml` and `uv.lock` for project and dependency management with `uv`
* `Makefile` targets for building deployable artifacts

Use `uv` to manage dependencies:

```bash theme={null}
uv sync --dev
uv add <package>
uv add --dev <package>
```

Use `make` to build the deployment artifacts:

```bash theme={null}
make
```

This builds:

* `out/pyfiles.zip`
* `out/environment.tar.gz`

You can also run individual `make` targets:

```bash theme={null}
make pyfiles
make environment
make rebuild
```

After building, upload and submit from the initialized workspace:

```bash theme={null}
oleander spark jobs upload entrypoint.py \
  --py-files out/pyfiles.zip \
  --virtualenv out/environment.tar.gz
```

```bash theme={null}
oleander spark jobs submit entrypoint.py \
  --namespace <namespace> \
  --name <job-name> \
  --wait
```
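Uploading before the build has finished is an easy mistake to make. The following is a minimal preflight sketch in Python, assuming the default output paths listed above; `missing_artifacts` is a hypothetical helper for illustration and not part of the oleander CLI.

```python
from pathlib import Path

# Build outputs named in this guide, relative to the workspace root.
EXPECTED_ARTIFACTS = ("out/pyfiles.zip", "out/environment.tar.gz")


def missing_artifacts(workspace="."):
    """Return the expected build outputs that are absent or empty."""
    root = Path(workspace)
    missing = []
    for rel in EXPECTED_ARTIFACTS:
        path = root / rel
        # Treat a zero-byte file as missing: it usually means a failed build.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing
```

Calling `missing_artifacts(".")` from the workspace root returns an empty list when both archives exist, so a wrapper script can stop and ask for `make` otherwise.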

### List your Spark artifacts

List your uploaded Spark artifacts:

```bash theme={null}
oleander spark jobs list
```

This lists all Spark artifacts available to run.

### Upload your Spark artifact

Upload a local `.py` or `.jar` artifact to oleander:

```bash theme={null}
oleander spark jobs upload <your_artifact_path>
```

**Example:**

```bash theme={null}
oleander spark jobs upload ./transformations/process_sales_data.py
```

**JAR example:**

```bash theme={null}
oleander spark jobs upload ./jobs/process_sales_data.jar
```

Every upload creates a new artifact version on the backend.

#### Include Python dependencies

If your Python artifact needs additional Python modules, package them in a ZIP and include them with `--py-files`:

```bash theme={null}
oleander spark jobs upload <your_artifact_path> --py-files <module_archive_zip>
```

**Example:**

```bash theme={null}
oleander spark jobs upload ./etl_pipeline.py --py-files ./dependencies.zip
```
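If you are not using the initialized workspace and its `make pyfiles` target, the ZIP can be built by hand. The sketch below uses only the Python standard library and assumes oleander hands the archive to Spark as `pyFiles`, in which case the package directory must sit at the root of the archive to be importable. `build_pyfiles_zip` and the directory names are hypothetical.

```python
import zipfile
from pathlib import Path


def build_pyfiles_zip(package_dir, out_zip):
    """Zip a Python package so it is importable from the archive root."""
    package_dir = Path(package_dir).resolve()
    out_zip = Path(out_zip)
    out_zip.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(package_dir.rglob("*.py")):
            # Store entries as "mylib/..." rather than as absolute paths,
            # so `import mylib` resolves once the ZIP is on the Python path.
            arcname = path.relative_to(package_dir.parent).as_posix()
            zf.write(path, arcname)
    return out_zip
```

For example, `build_pyfiles_zip("mylib", "dependencies.zip")` produces an archive whose entries start with `mylib/`, which can then be passed to `--py-files`.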

#### Include a virtual environment

If your Python artifact depends on a packaged virtual environment, include it with `--virtualenv`:

```bash theme={null}
oleander spark jobs upload <your_artifact_path> --virtualenv <virtualenv_archive>
```

**Example:**

```bash theme={null}
oleander spark jobs upload ./etl_pipeline.py --virtualenv ./venv.tar.gz
```

### Delete a Spark artifact

Delete a Spark artifact:

```bash theme={null}
oleander spark jobs delete <artifact_name>
```

**Example:**

```bash theme={null}
oleander spark jobs delete process_sales_data.py
```

Use the exact uploaded artifact name, including the file extension.

### Submit and execute a Spark job

Submit your uploaded artifact to the oleander-managed cluster. Use the exact uploaded filename without the path, such as `process_sales_data.py` or `analytics-batch.jar`. The `--wait` flag keeps the command running until the job finishes.

```bash theme={null}
oleander spark jobs submit <entrypoint> --namespace <namespace> --name <job_name> --wait
```

**Example:**

```bash theme={null}
oleander spark jobs submit process_sales_data.py --namespace finance --name process-sales-data --wait
```

### Common submit options

* `--cluster`: Cluster name. Defaults to the oleander-managed cluster when omitted.
* `--namespace` (required): Namespace for the job, a logical group such as a team or project.
* `--name` (required): Job name. Runs with the same namespace and name are grouped under the same job.
* `--args`: Spark job entrypoint arguments.
* `--sparkConf`: Spark configuration properties in `key=value` form, without the `--conf` prefix, for example `spark.default.parallelism=8`. Separate multiple properties with whitespace.
* `--packages`: Extra package coordinates.
* `--jobTags`: Job-specific tags in `key=value` form. Separate multiple tags with whitespace.
* `--runTags`: Run-specific tags.
* `--wait`: Wait until the job finishes.
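When submits are scripted, it can help to assemble the command as an argument list and inspect it before running it. The sketch below is a dry-run helper for the oleander-managed cluster only and uses just the flags listed above. Two things are assumptions: `submit_command` itself is a hypothetical name, and multiple `--sparkConf` properties are passed as a single whitespace-separated argument, which this page does not confirm.

```python
import shlex
from pathlib import Path


def submit_command(artifact_path, namespace, name, spark_conf=None, wait=True):
    """Build the argv list for `oleander spark jobs submit` (managed cluster)."""
    # The managed cluster expects the uploaded filename without its path.
    argv = [
        "oleander", "spark", "jobs", "submit", Path(artifact_path).name,
        "--namespace", namespace,
        "--name", name,
    ]
    if spark_conf:
        # ASSUMPTION: several key=value properties travel as one
        # whitespace-separated argument after --sparkConf.
        pairs = " ".join(f"{key}={value}" for key, value in spark_conf.items())
        argv += ["--sparkConf", pairs]
    if wait:
        argv.append("--wait")
    return argv


argv = submit_command(
    "./transformations/process_sales_data.py",
    namespace="finance",
    name="process-sales-data",
    spark_conf={"spark.default.parallelism": 8},
)
# Print a shell-quoted form for review; pass `argv` to subprocess.run to execute.
print(shlex.join(argv))
```

Keeping the command as a list avoids shell-quoting mistakes when a value contains spaces.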

### Oleander-managed submit options

* `--driverMachineType`: oleander Spark driver machine type.
* `--executorMachineType`: oleander Spark executor machine type.
* `--executorNumbers`: Number of executor instances.

## Registered EMR Serverless Spark

Register your EMR Serverless cluster and target it by name when submitting jobs. Include `--cluster <name>` and pass the S3 URI of a `.py` or `.jar` artifact as the entrypoint.

### Register an EMR Serverless cluster

```bash theme={null}
oleander spark clusters register <name> \
  --type emr-serverless \
  --region <region> \
  --account-id <awsAccountId> \
  --controller-role-arn <controllerRoleArn> \
  --execution-role-arn <executionRoleArn> \
  --application-id <applicationId> \
  --log-bucket <logBucket>
```

### Register options

* `--region`: AWS region of the EMR Serverless application.
* `--account-id`: AWS account ID of the EMR Serverless application.
* `--controller-role-arn`: IAM role ARN oleander assumes to start job runs. Add this to the role's trust policy so oleander can assume it:

```json theme={null}
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::579897423473:root"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "oleander"
                }
            }
        }
    ]
}
```

Add this permissions policy to the controller role so oleander can run the job:

```json theme={null}
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowStartJobRun",
            "Effect": "Allow",
            "Action": "emr-serverless:StartJobRun",
            "Resource": "arn:aws:emr-serverless:<REGION>:<ACCOUNT_ID>:/applications/<APPLICATION_ID>"
        },
        {
            "Sid": "AllowGetJobRun",
            "Effect": "Allow",
            "Action": "emr-serverless:GetJobRun",
            "Resource": "arn:aws:emr-serverless:<REGION>:<ACCOUNT_ID>:/applications/<APPLICATION_ID>/jobruns/*"
        },
        {
            "Sid": "PassExecutionRole",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::<ACCOUNT_ID>:role/<JOB_EXECUTION_ROLE_NAME>"
        },
        {
            "Sid": "ReadLogFromS3",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::<LOG_BUCKET_NAME>/*"
        }
    ]
}
```

* `--execution-role-arn`: IAM role ARN the job uses; the Spark application runs with this role's permissions.
* `--application-id`: EMR Serverless application ID.
* `--log-bucket`: S3 bucket for job logs.
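Filling the `<REGION>`, `<ACCOUNT_ID>`, and other placeholders in the permissions policy by hand is error-prone. The sketch below generates the same four statements from parameters. `controller_policy` and its argument names are hypothetical; the ARN formats are copied from the policy above.

```python
import json


def controller_policy(region, account_id, application_id,
                      execution_role_name, log_bucket):
    """Render the controller-role permissions policy shown above."""
    # AWS account IDs are 12-digit numbers; catch pasted ARNs or typos early.
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit AWS account ID")
    app_arn = (f"arn:aws:emr-serverless:{region}:{account_id}"
               f":/applications/{application_id}")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "AllowStartJobRun", "Effect": "Allow",
             "Action": "emr-serverless:StartJobRun",
             "Resource": app_arn},
            {"Sid": "AllowGetJobRun", "Effect": "Allow",
             "Action": "emr-serverless:GetJobRun",
             "Resource": f"{app_arn}/jobruns/*"},
            {"Sid": "PassExecutionRole", "Effect": "Allow",
             "Action": "iam:PassRole",
             "Resource": f"arn:aws:iam::{account_id}:role/{execution_role_name}"},
            {"Sid": "ReadLogFromS3", "Effect": "Allow",
             "Action": "s3:GetObject",
             "Resource": f"arn:aws:s3:::{log_bucket}/*"},
        ],
    }


# Placeholder values for illustration only.
policy = controller_policy("us-east-1", "123456789012", "00example",
                           "my-execution-role", "my-log-bucket")
print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the IAM console or saved to a file for your infrastructure tooling.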

### Submit a job to EMR Serverless

```bash theme={null}
oleander spark jobs submit <entrypoint_s3_uri> --cluster <cluster_name> --namespace <namespace> --name <job_name> --wait
```

**Example:**

```bash theme={null}
oleander spark jobs submit s3://my-bucket/jobs/process_sales_data.py --cluster my-emr --namespace finance --name process-sales-data --wait
```

### Submit options

* `--cluster` (required): Name of the registered cluster.
* `--namespace` (required): Namespace for the job, a logical group such as a team or project.
* `--name` (required): Job name. Runs with the same namespace and name are grouped under the same job.
* `--args`: Spark job entrypoint arguments.
* `--sparkConf`: Spark configuration properties in `key=value` form, without the `--conf` prefix, for example `spark.default.parallelism=8`. Separate multiple properties with whitespace.
* `--packages`: Extra package coordinates.
* `--jobTags`: Job-specific tags in `key=value` form. Separate multiple tags with whitespace.
* `--runTags`: Run-specific tags.
* `--executionIamPolicy`: IAM policy for job permissions. Final permissions are the intersection of the job execution role and this policy.
* `--pyFiles`: Extra `pyFiles` for the PySpark job. Mutually exclusive with `--mainClass`.
* `--virtualenv`: Virtual environment archive for Python jobs.
* `--mainClass`: Entrypoint main class for the Java/Scala Spark job. Use instead of Python-specific options such as `--pyFiles` and `--virtualenv`.
* `--wait`: Wait until the job finishes.
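The constraints above (an S3 entrypoint ending in `.py` or `.jar`, and `--mainClass` used instead of the Python-specific options) can be checked before a submit is attempted. This is a minimal sketch; `check_emr_submit` is a hypothetical helper, and whether the CLI performs the same checks itself is not stated here.

```python
def check_emr_submit(entrypoint, py_files=None, virtualenv=None, main_class=None):
    """Return a list of problems; an empty list means the options look valid."""
    problems = []
    if not entrypoint.startswith("s3://"):
        problems.append("entrypoint must be an S3 URI")
    if not entrypoint.endswith((".py", ".jar")):
        problems.append("entrypoint must be a .py or .jar artifact")
    # --mainClass is for Java/Scala jobs and replaces the Python-only options.
    if main_class and (py_files or virtualenv):
        problems.append("--mainClass cannot be combined with --pyFiles or --virtualenv")
    return problems


print(check_emr_submit("s3://my-bucket/jobs/process_sales_data.py"))
print(check_emr_submit("./local.jar", py_files="s3://my-bucket/deps.zip",
                       main_class="com.example.Main"))
```

The first call prints an empty list; the second reports both the non-S3 entrypoint and the conflicting options.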

## Registered Glue Spark

Register your Glue cluster and target it by name when submitting jobs. Include `--cluster <name>`. Instead of an artifact path, the submit command takes the name of an existing Glue job in your environment.

### Register a Glue cluster

```bash theme={null}
oleander spark clusters register <name> \
  --type glue \
  --controller-role-arn <controllerRoleArn>
```

### Register options

* `--controller-role-arn`: IAM role ARN oleander assumes to start job runs. Add this to the role's trust policy so oleander can assume it:

```json theme={null}
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::579897423473:root"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "oleander"
                }
            }
        }
    ]
}
```

Add this permissions policy to the controller role so oleander can run the job:

```json theme={null}
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowStartAndGetJobRun",
            "Effect": "Allow",
            "Action": [
                "glue:StartJobRun",
                "glue:GetJobRun"
            ],
            "Resource": "arn:aws:glue:<REGION>:<ACCOUNT_ID>:job/*"
        },
        {
            "Sid": "PassExecutionRole",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::<ACCOUNT_ID>:role/<JOB_EXECUTION_ROLE_NAME>"
        },
        {
            "Sid": "ReadGlueLogs",
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogStreams"
            ],
            "Resource": "arn:aws:logs:<REGION>:<ACCOUNT_ID>:log-group:/aws-glue/jobs/output"
        },
        {
            "Sid": "CloudWatchLogsGetLogEvents",
            "Effect": "Allow",
            "Action": [
                "logs:GetLogEvents"
            ],
            "Resource": "arn:aws:logs:<REGION>:<ACCOUNT_ID>:log-group:/aws-glue/jobs/output:log-stream:*"
        }
    ]
}
```
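Before registering, you may want to confirm that the controller role's trust policy actually contains the statement shown above. The sketch below inspects a trust policy document that has already been loaded as a Python dict (for example from a JSON file). `trusts_oleander` is a hypothetical helper, and it does exact matching only, so it is stricter than IAM's own evaluation.

```python
OLEANDER_PRINCIPAL = "arn:aws:iam::579897423473:root"  # from the policy above
EXTERNAL_ID = "oleander"                               # from the policy above


def _as_list(value):
    """IAM allows a single string where a list is expected; normalize it."""
    return [value] if isinstance(value, str) else list(value)


def trusts_oleander(trust_policy):
    """Return True if a statement lets oleander assume the role."""
    for stmt in _as_list_of_statements(trust_policy):
        principal = stmt.get("Principal", {})
        if stmt.get("Effect") != "Allow" or not isinstance(principal, dict):
            continue
        if "sts:AssumeRole" not in _as_list(stmt.get("Action", [])):
            continue
        external_id = (stmt.get("Condition", {})
                           .get("StringEquals", {})
                           .get("sts:ExternalId"))
        if (OLEANDER_PRINCIPAL in _as_list(principal.get("AWS", []))
                and external_id == EXTERNAL_ID):
            return True
    return False


def _as_list_of_statements(policy):
    """Statement may be a single object or a list of objects."""
    statements = policy.get("Statement", [])
    return [statements] if isinstance(statements, dict) else statements
```

A `False` result means the principal, the action, or the `sts:ExternalId` condition differs from the statement above.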

### Submit a job to Glue

Use `--cluster` to select the registered cluster:

```bash theme={null}
oleander spark jobs submit <glue_job_name> --cluster <cluster_name> --namespace <namespace> --name <job_name> --wait
```

**Example:**

```bash theme={null}
oleander spark jobs submit process-sales-data --cluster my-glue --namespace finance --name process-sales-data --wait
```

### Submit options

* `--cluster` (required): Name of the registered cluster.
* `--namespace` (required): Namespace for the job, a logical group such as a team or project.
* `--name` (required): Job name. Runs with the same namespace and name are grouped under the same job.
* `--args`: Spark job entrypoint arguments.
* `--sparkConf`: Spark configuration properties in `key=value` form, without the `--conf` prefix, for example `spark.default.parallelism=8`. Separate multiple properties with whitespace.
* `--packages`: Extra package coordinates.
* `--jobTags`: Job-specific tags in `key=value` form. Separate multiple tags with whitespace.
* `--runTags`: Run-specific tags.
* `--executionIamPolicy`: IAM policy for job permissions. Final permissions are the intersection of the job execution role and this policy.
* `--workerType`: Glue worker type.
* `--numberOfWorkers`: Number of Glue workers.
* `--enableAutoScaling`: Set to `true` to enable auto scaling or `false` to disable it.
* `--executionClass`: Glue execution class. Either `STANDARD` or `FLEX`.
* `--timeoutMinutes`: Glue job timeout in minutes.
* `--wait`: Wait until the job finishes.
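The "intersection" rule for `--executionIamPolicy` means the policy can only narrow what the execution role already grants; it never adds permissions. The sketch below is a deliberately simplified model that looks at allowed action names only and ignores resources, conditions, and explicit denies. The role and policy contents are hypothetical.

```python
from fnmatch import fnmatchcase


def _matches(action, patterns):
    # IAM action names are case-insensitive and may use * and ? wildcards.
    return any(fnmatchcase(action.lower(), p.lower()) for p in patterns)


def effective_allow(action, role_actions, policy_actions):
    """An action is usable only when both the role and the policy allow it."""
    return _matches(action, role_actions) and _matches(action, policy_actions)


role_actions = ["s3:*", "glue:GetTable"]  # hypothetical execution role
policy_actions = ["s3:GetObject"]         # hypothetical --executionIamPolicy

for action in ("s3:GetObject", "s3:PutObject", "glue:GetTable"):
    print(action, effective_allow(action, role_actions, policy_actions))
# s3:GetObject True
# s3:PutObject False
# glue:GetTable False
```

Here `glue:GetTable` is denied even though the role grants it, because the policy passed at submit time does not.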

When your Spark job runs, oleander captures OpenLineage metadata for lineage and dependencies. View results and the lineage graph in your [oleander dashboard](https://oleander.dev/app/pipelines).
