TensorPool Jobs are a git-style interface for GPUs. They let you run, monitor, and manage fire-and-forget training tasks on GPU clusters. Jobs follow a cluster-first flow: you first create (or select) a cluster, then push a job to it. This gives you full control over the infrastructure your job runs on. Jobs are configured using TOML configuration files that specify your training commands and output files.

Job Configuration

Commands

The commands array specifies shell commands to run sequentially. Each job starts from a fresh virtual environment:
commands = [
    "pip install torch torchvision",
    "python -m pip install -e .",
    "python train.py --epochs 100",
]

Output Files

Define which files to save after job completion. Supports glob patterns:
outputs = [
    "checkpoints/",           # Entire directory
    "model_*.pth",           # Glob pattern
    "results.json",          # Single file
    "/logs/*",               # All files in logs/
]
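The `model_*.pth` entry above uses shell-style glob matching. As an illustration of how such a pattern selects files (using Python's standard-library `fnmatch` and a hypothetical file list; TensorPool's exact matching semantics may differ for directory entries like `checkpoints/`):

```python
from fnmatch import fnmatch

# Hypothetical files present after training finishes
files = ["model_final.pth", "model_epoch10.pth", "optimizer.pt", "results.json"]

# "model_*.pth" matches any file starting with "model_" and ending in ".pth"
saved = [f for f in files if fnmatch(f, "model_*.pth")]
print(saved)  # ['model_final.pth', 'model_epoch10.pth']
```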

Ignored Files

Exclude files from being uploaded with your job:
ignore = [
    ".venv",
    "venv/",
    "__pycache__/",
    ".git",
    "*.pyc",
    "data/",                 # Exclude large datasets
]
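Putting the three sections together, a complete configuration file might look like the sketch below (the file name and values are illustrative; run `tp job init` to generate the exact schema for your project):

```toml
# example.tp.toml -- illustrative config combining the sections above

commands = [
    "pip install torch torchvision",
    "python train.py --epochs 100",
]

outputs = [
    "checkpoints/",
    "model_*.pth",
    "results.json",
]

ignore = [
    ".venv",
    "__pycache__/",
    "*.pyc",
    "data/",
]
```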

Job Statuses

Jobs progress through various statuses throughout their lifecycle:
| Status | Description |
| --- | --- |
| Pending | Job is uploading and waiting to be assigned to a cluster. |
| Running | Job commands are being executed. |
| Completed | All job commands have returned an exit code of 0 and output files have been saved. |
| Error | User-level problem: a command returned a non-zero exit code. Check the logs for details. |
| Failed | System-level problem: the cluster executing the job has failed (e.g., node failure, GPU error). TensorPool will investigate. |
| Canceling | Job cancellation in progress. The job outputs are being saved. The cluster is preserved unless `--teardown` was set. |
| Canceled | Job was successfully canceled. |
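When scripting around jobs, it helps to distinguish terminal statuses from in-progress ones. A minimal sketch, assuming the status names exactly as listed in the table above:

```python
# Terminal statuses: the job will make no further progress
TERMINAL = {"Completed", "Error", "Failed", "Canceled"}
# In-progress statuses: worth polling again later
IN_PROGRESS = {"Pending", "Running", "Canceling"}

def is_terminal(status: str) -> bool:
    """Return True once a job has reached a final state."""
    return status in TERMINAL

print(is_terminal("Running"))    # False
print(is_terminal("Completed"))  # True
```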

Managing Jobs

List Jobs

View all your jobs:
tp job list
List all jobs in your organization:
tp job list --org

Job Information

Get detailed information about a specific job:
tp job info <job_id>

Monitor Jobs

Stream real-time logs from a running job:
tp job listen <job_id>

Pull Output Files

Download output files from a completed job:
tp job pull <job_id>
Force overwrite existing local files:
tp job pull <job_id> --force

Cancel Jobs

Cancel a running job:
tp job cancel <job_id>

Multiple Configurations

You can create multiple configuration files for different experiments:
# Create configs for different experiments
tp job init  # enter "baseline" → creates baseline.tp.toml
tp job init  # enter "experiment" → creates experiment.tp.toml

# Run specific configs on a cluster
tp job push baseline.tp.toml <cluster_id>
tp job push experiment.tp.toml <cluster_id>

Next Steps