TensorPool Jobs are a git-style interface for GPUs. They let you run, monitor, and manage fire-and-forget training tasks on GPU clusters. Jobs follow a cluster-first flow: you first create (or select) a cluster, then push a job to it. This gives you full control over the infrastructure your job runs on. Jobs are configured using TOML configuration files that specify your training commands and output files.

Job Configuration

Commands

The commands array specifies shell commands to run sequentially. Each job starts from a fresh virtual environment:
commands = [
    "pip install torch torchvision",
    "python -m pip install -e .",
    "python train.py --epochs 100",
]

Output Files

Define which files to save after job completion. Supports glob patterns:
outputs = [
    "checkpoints/",           # Entire directory
    "model_*.pth",           # Glob pattern
    "results.json",          # Single file
    "/logs/*",               # All files in logs/
]
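The `model_*.pth` entry above uses shell-style glob matching. As an illustration of how such a pattern selects files (using Python's standard-library `fnmatch` and a hypothetical file list; TensorPool's exact matching semantics may differ for directory entries like `checkpoints/`):

```python
from fnmatch import fnmatch

# Hypothetical files present after training finishes
files = ["model_final.pth", "model_epoch10.pth", "optimizer.pt", "results.json"]

# "model_*.pth" matches any file starting with "model_" and ending in ".pth"
saved = [f for f in files if fnmatch(f, "model_*.pth")]
print(saved)  # ['model_final.pth', 'model_epoch10.pth']
```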

Ignored Files

Exclude files from being uploaded with your job:
ignore = [
    ".venv",
    "venv/",
    "__pycache__/",
    ".git",
    "*.pyc",
    "data/",                 # Exclude large datasets
]
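Putting the three sections together, a complete configuration file might look like the sketch below (the file name and values are illustrative; run `tp job init` to generate the exact schema for your project):

```toml
# example.tp.toml -- illustrative config combining the sections above

commands = [
    "pip install torch torchvision",
    "python train.py --epochs 100",
]

outputs = [
    "checkpoints/",
    "model_*.pth",
    "results.json",
]

ignore = [
    ".venv",
    "__pycache__/",
    "*.pyc",
    "data/",
]
```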

Job Statuses

Jobs progress through various statuses throughout their lifecycle:
| Status | Description |
| --- | --- |
| Pending | Job is uploading and waiting to be assigned to a cluster. |
| Running | Job commands are being executed. |
| Completed | All job commands have returned an exit code of 0 and output files have been saved. |
| Error | User-level problem: a command returned a non-zero exit code. Check the logs for details. |
| Failed | System-level problem: the cluster executing the job has failed (e.g., node failure, GPU error). TensorPool will investigate. |
| Canceling | Job cancellation in progress. The job outputs are being saved. The cluster is preserved unless `--teardown` was set. |
| Canceled | Job was successfully canceled. |
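When scripting around jobs, it helps to distinguish terminal statuses from in-progress ones. A minimal sketch, assuming the status names exactly as listed in the table above:

```python
# Terminal statuses: the job will make no further progress
TERMINAL = {"Completed", "Error", "Failed", "Canceled"}
# In-progress statuses: worth polling again later
IN_PROGRESS = {"Pending", "Running", "Canceling"}

def is_terminal(status: str) -> bool:
    """Return True once a job has reached a final state."""
    return status in TERMINAL

print(is_terminal("Running"))    # False
print(is_terminal("Completed"))  # True
```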

Managing Jobs

List Jobs

View all your jobs:
tp job list
List all jobs in your organization:
tp job list --org

Job Information

Get detailed information about a specific job:
tp job info <job_id>

Monitor Jobs

Stream real-time logs from a running job:
tp job listen <job_id>

Pull Output Files

Download output files from a completed job:
tp job pull <job_id>
Force overwrite existing local files:
tp job pull <job_id> --force

Cancel Jobs

Cancel a running job:
tp job cancel <job_id>

Multiple Configurations

You can create multiple configuration files for different experiments:
# Create configs for different experiments
tp job init  # enter "baseline" → creates baseline.tp.toml
tp job init  # enter "experiment" → creates experiment.tp.toml

# Run specific configs on a cluster
tp job push baseline.tp.toml <cluster_id>
tp job push experiment.tp.toml <cluster_id>

Next Steps