
Documentation Index

Fetch the complete documentation index at: https://docs.oleander.dev/llms.txt

Use this file to discover all available pages before exploring further.

Run your Spark applications on oleander-managed infrastructure or on your own registered Spark clusters. Upload artifacts to oleander for the managed cluster, or keep jobs in your environment for registered clusters. Manage runs and capture lineage metadata for full observability of your data transformations.

Installation

Using Homebrew

Install the oleander CLI:
brew tap OleanderHQ/tap
brew install oleander-cli
Upgrade the oleander CLI:
brew update
brew upgrade oleander-cli

Configuration

Authenticate with your API key. Find it in your oleander settings.
oleander configure --api-key <YOUR_API_KEY>

Oleander Managed Spark

Artifact upload, listing, and deletion are available only on the oleander-managed cluster.

Initialize a PySpark workspace

Create a new PySpark job workspace:
oleander spark init <dirname>
Example:
oleander spark init my-job
You can also initialize the current directory:
oleander spark init .
The initialized workspace includes:
  • entrypoint.py as the Spark job entrypoint
  • mylib/ for Python modules packaged as pyFiles
  • pyproject.toml and uv.lock for project and dependency management with uv
  • Makefile targets for building deployable artifacts
Use uv to manage dependencies:
uv sync --dev
uv add <package>
uv add --dev <package>
Use make to build the deployment artifacts:
make
This builds:
  • out/pyfiles.zip
  • out/environment.tar.gz
You can also build artifacts individually, or force a rebuild:
make pyfiles
make environment
make rebuild
After building, upload and submit from the initialized workspace:
oleander spark jobs upload entrypoint.py \
  --py-files out/pyfiles.zip \
  --virtualenv out/environment.tar.gz
oleander spark jobs submit entrypoint.py \
  --namespace <namespace> \
  --name <job-name> \
  --wait
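For reference, a minimal entrypoint.py might look like the sketch below. The argument names (--input, --output), the paths, and the appName are illustrative assumptions, not part of the generated template; the pyspark import is deferred into main() so the module can be inspected without Spark installed.

```python
# entrypoint.py -- illustrative sketch, not the template generated by `oleander spark init`
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Example Spark job")
    parser.add_argument("--input", required=True, help="input data path (hypothetical)")
    parser.add_argument("--output", required=True, help="output data path (hypothetical)")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Deferred import so the module can be imported without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("my-job").getOrCreate()
    df = spark.read.parquet(args.input)              # read source data
    df.write.mode("overwrite").parquet(args.output)  # write transformed output
    spark.stop()


if __name__ == "__main__":
    main()
```

Pass job arguments at submit time with --args, as described under the submit options below.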

List your Spark artifacts

List your uploaded Spark artifacts:
oleander spark jobs list
This lists all Spark artifacts available to run.

Upload your Spark artifact

Upload a local .py or .jar artifact to oleander:
oleander spark jobs upload <your_artifact_path>
Example:
oleander spark jobs upload ./transformations/process_sales_data.py
JAR example:
oleander spark jobs upload ./jobs/process_sales_data.jar
Every upload creates a new artifact version on the backend.

Include Python dependencies

If your Python artifact needs additional Python modules, package them in a ZIP and include them with --py-files:
oleander spark jobs upload <your_artifact_path> --py-files <module_archive_zip>
Example:
oleander spark jobs upload ./etl_pipeline.py --py-files ./dependencies.zip
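If you are not using the Makefile from an initialized workspace, you can build the module archive yourself with the standard library. In the sketch below, the package directory name and output filename are assumptions; any ZIP whose entries mirror your import paths will work.

```python
import pathlib
import zipfile


def build_pyfiles_zip(src_dir: str, out_path: str) -> str:
    """Zip a Python package directory for use with --py-files.

    Entries are stored relative to the package's parent directory, so a
    package like mylib/ stays importable as `mylib` on the executors.
    """
    src = pathlib.Path(src_dir)
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for py in sorted(src.rglob("*.py")):
            zf.write(py, py.relative_to(src.parent))
    return out_path
```

After building, pass the archive to the upload command with --py-files.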

Include a virtual environment

If your Python artifact depends on a packaged virtual environment, include it with --virtualenv:
oleander spark jobs upload <your_artifact_path> --virtualenv <virtualenv_archive>
Example:
oleander spark jobs upload ./etl_pipeline.py --virtualenv ./venv.tar.gz
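Virtual environment archives for Spark are typically produced with a tool such as venv-pack or conda-pack, which also rewrite interpreter paths so the environment is relocatable. Purely to illustrate the expected layout (environment contents at the archive root, e.g. bin/python at the top level), here is a standard-library sketch; the directory and output names are assumptions, and it does not perform the path rewriting those tools do.

```python
import pathlib
import tarfile


def pack_environment(venv_dir: str, out_path: str) -> str:
    """Archive a virtual environment's contents at the tar root.

    Spark unpacks the archive into a working directory, so entries like
    bin/python must sit at the top level rather than under the venv name.
    """
    with tarfile.open(out_path, "w:gz") as tf:
        for item in sorted(pathlib.Path(venv_dir).iterdir()):
            tf.add(item, arcname=item.name)
    return out_path
```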

Delete a Spark artifact

Delete a Spark artifact:
oleander spark jobs delete <artifact_name>
Example:
oleander spark jobs delete process_sales_data.py
Use the exact uploaded artifact name, including the file extension.

Submit and execute a Spark job

Submit your uploaded artifact to the oleander-managed cluster. Use the exact uploaded filename without the path, such as process_sales_data.py or analytics-batch.jar. The --wait flag keeps the command running until the job finishes.
oleander spark jobs submit <entrypoint> --namespace <namespace> --name <run_name> --wait
Example:
oleander spark jobs submit process_sales_data.py --namespace finance --name process-sales-data --wait

Common submit options

  • --cluster: Cluster name. Defaults to the oleander-managed cluster when omitted.
  • --namespace (required): Namespace for the job, a logical group such as a team or project.
  • --name (required): Job name. Runs with the same namespace and name are grouped under the same job.
  • --args: Spark job entrypoint arguments.
  • --sparkConf: Spark configuration properties without the --conf prefix, for example spark.default.parallelism=8. Separate multiple properties with whitespace.
  • --packages: Extra package coordinates.
  • --jobTags: Job-specific tags in key=value form. Separate multiple tags with whitespace.
  • --runTags: Run-specific tags.
  • --wait: Wait until the job finishes.

Oleander-managed submit options

  • --driverMachineType: oleander Spark driver machine type.
  • --executorMachineType: oleander Spark executor machine type.
  • --executorNumbers: Number of executor instances.

Registered EMR Serverless Spark

Register your EMR Serverless cluster and target it by name when submitting jobs. Include --cluster <name> and pass an S3 URI to a .py or .jar artifact as the entrypoint.

Register an EMR Serverless cluster

oleander spark clusters register <name> \
  --type emr-serverless \
  --region <region> \
  --account-id <awsAccountId> \
  --controller-role-arn <controllerRoleArn> \
  --execution-role-arn <executionRoleArn> \
  --application-id <applicationId> \
  --log-bucket <logBucket>

Register options

  • --region: AWS region of the EMR Serverless application.
  • --account-id: AWS account ID of the EMR Serverless application.
  • --controller-role-arn: IAM role ARN oleander assumes to start job runs. Add this to the role’s trust policy so oleander can assume it:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::579897423473:root"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "oleander"
                }
            }
        }
    ]
}
Add this permissions policy to the controller role so oleander can run the job:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowStartJobRun",
            "Effect": "Allow",
            "Action": "emr-serverless:StartJobRun",
            "Resource": "arn:aws:emr-serverless:<REGION>:<ACCOUNT_ID>:/applications/<APPLICATION_ID>"
        },
        {
            "Sid": "AllowGetJobRun",
            "Effect": "Allow",
            "Action": "emr-serverless:GetJobRun",
            "Resource": "arn:aws:emr-serverless:<REGION>:<ACCOUNT_ID>:/applications/<APPLICATION_ID>/jobruns/*"
        },
        {
            "Sid": "PassExecutionRole",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::<ACCOUNT_ID>:role/<JOB_EXECUTION_ROLE_NAME>"
        },
        {
            "Sid": "ReadLogFromS3",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::<LOG_BUCKET_NAME>/*"
        }
    ]
}
  • --execution-role-arn: IAM role ARN the job uses; the Spark application runs with this role’s permissions.
  • --application-id: EMR Serverless application ID.
  • --log-bucket: S3 bucket for job logs.

Submit a job to EMR Serverless

oleander spark jobs submit <entrypoint_s3_uri> --cluster <cluster_name> --namespace <namespace> --name <run_name> --wait
Example:
oleander spark jobs submit s3://my-bucket/jobs/process_sales_data.py --cluster my-emr --namespace finance --name process-sales-data --wait

Submit options

  • --cluster (required): Name of the registered cluster.
  • --namespace (required): Namespace for the job, a logical group such as a team or project.
  • --name (required): Job name. Runs with the same namespace and name are grouped under the same job.
  • --args: Spark job entrypoint arguments.
  • --sparkConf: Spark configuration properties without the --conf prefix, for example spark.default.parallelism=8. Separate multiple properties with whitespace.
  • --packages: Extra package coordinates.
  • --jobTags: Job-specific tags in key=value form. Separate multiple tags with whitespace.
  • --runTags: Run-specific tags.
  • --executionIamPolicy: IAM policy for job permissions. Final permissions are the intersection of the job execution role and this policy.
  • --pyFiles: Extra pyFiles for the PySpark job. Mutually exclusive with --mainClass.
  • --virtualenv: Virtual environment archive for Python jobs.
  • --mainClass: Entrypoint main class for the Java/Scala Spark job. Use instead of Python-specific options such as --pyFiles and --virtualenv.
  • --wait: Wait until the job finishes.

Registered Glue Spark

Register your Glue cluster and target it by name when submitting jobs. Include --cluster <name> and pass the name of an existing Glue job in your environment as the entrypoint.

Register a Glue cluster

oleander spark clusters register <name> \
  --type glue \
  --controller-role-arn <controllerRoleArn>

Register options

  • --controller-role-arn: IAM role ARN oleander assumes to start job runs. Add this to the role’s trust policy so oleander can assume it:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::579897423473:root"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "oleander"
                }
            }
        }
    ]
}
Add this permissions policy to the controller role so oleander can run the job:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowStartAndJobRun",
            "Effect": "Allow",
            "Action": [
                "glue:StartJobRun",
                "glue:GetJobRun"
            ],
            "Resource": "arn:aws:glue:<REGION>:<ACCOUNT_ID>:job/*"
        },
        {
            "Sid": "PassExecutionRole",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::<ACCOUNT_ID>:role/<JOB_EXECUTION_ROLE_NAME>"
        },
        {
            "Sid": "ReadGlueLogs",
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogStreams"
            ],
            "Resource": "arn:aws:logs:<REGION>:<ACCOUNT_ID>:log-group:/aws-glue/jobs/output"
        },
        {
            "Sid": "CloudWatchLogsGetLogEvents",
            "Effect": "Allow",
            "Action": [
                "logs:GetLogEvents"
            ],
            "Resource": "arn:aws:logs:<REGION>:<ACCOUNT_ID>:log-group:/aws-glue/jobs/output:log-stream:*"
        }
    ]
}

Submit a job to Glue

Use --cluster to select the registered cluster:
oleander spark jobs submit <job_name> --cluster <cluster_name> --namespace <namespace> --name <run_name> --wait
Example:
oleander spark jobs submit process-sales-data --cluster my-glue --namespace finance --name process-sales-data --wait

Submit options

  • --cluster (required): Name of the registered cluster.
  • --namespace (required): Namespace for the job, a logical group such as a team or project.
  • --name (required): Job name. Runs with the same namespace and name are grouped under the same job.
  • --args: Spark job entrypoint arguments.
  • --sparkConf: Spark configuration properties without the --conf prefix, for example spark.default.parallelism=8. Separate multiple properties with whitespace.
  • --packages: Extra package coordinates.
  • --jobTags: Job-specific tags in key=value form. Separate multiple tags with whitespace.
  • --runTags: Run-specific tags.
  • --executionIamPolicy: IAM policy for job permissions. Final permissions are the intersection of the job execution role and this policy.
  • --workerType: Glue worker type.
  • --numberOfWorkers: Number of Glue workers.
  • --enableAutoScaling: Set to true for auto scaling, false otherwise.
  • --executionClass: Glue execution class. Either STANDARD or FLEX.
  • --timeoutMinutes: Glue job timeout in minutes.
  • --wait: Wait until the job finishes.

When your Spark job runs, oleander captures OpenLineage metadata for lineage and dependencies. View results and the lineage graph in your oleander dashboard.
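As background, an OpenLineage run event is a small JSON document describing one state change of a run. The example below is an abridged, illustrative START event following the public OpenLineage spec; the runId, namespace, and job name are placeholders, and oleander's actual payloads may carry additional facets and input/output datasets.

```json
{
  "eventType": "START",
  "eventTime": "2024-01-01T00:00:00Z",
  "run": { "runId": "d46e465b-d358-4d32-83d4-df660ff614dd" },
  "job": { "namespace": "finance", "name": "process-sales-data" },
  "inputs": [],
  "outputs": [],
  "producer": "https://github.com/OpenLineage/OpenLineage",
  "schemaURL": "https://openlineage.io/spec/OpenLineage.json"
}
```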