TensorPool makes it easy to deploy and manage GPU clusters of any size, from single GPUs to large multi-node configurations.

Core Commands

  • tp cluster create - Deploy a new GPU cluster
  • tp cluster list - View all your clusters
  • tp cluster info <cluster_id> - Get detailed information about a cluster
  • tp cluster edit <cluster_id> - Edit cluster settings
  • tp cluster attach <cluster_id> <storage_id> - Attach a storage volume to a cluster
  • tp cluster detach <cluster_id> <storage_id> - Detach a storage volume from a cluster
  • tp cluster destroy <cluster_id> - Terminate a cluster

Creating Clusters

Deploy GPU clusters with simple commands. TensorPool supports both single-node and multi-node cluster configurations.

Single-Node Clusters

Single-node clusters are ideal for development, experimentation, and smaller training workloads. They provide direct access to GPU resources without the complexity of distributed training.

Supported Instance Types

Single-node clusters support a wide variety of GPU configurations:
# Single H100
tp cluster create 1xH100 -i ~/.ssh/id_ed25519.pub

# Single node with 8x H200
tp cluster create 8xH200 -i ~/.ssh/id_ed25519.pub

# Single node with 8x B200
tp cluster create 8xB200 -i ~/.ssh/id_ed25519.pub

The -i flag is optional if you have SSH keys saved on your account via tp me sshkey.

Accessing Single-Node Clusters

Single-node clusters provide direct SSH access. Once your cluster is ready:
# Get cluster information to find the instance ID
tp cluster info <cluster_id>

# SSH directly into the instance
tp ssh <instance_id>

# Run your training script directly on the node
python train.py

Multi-Node Clusters

Multi-node clusters are designed for distributed training workloads that require scaling across multiple machines. All multi-node clusters come with SLURM preinstalled for job scheduling and resource management.

Supported Instance Types

Multi-node support is currently available for:
  • 8xH200 - 2 or more nodes, each with 8 H200 GPUs
  • 8xB200 - 2 or more nodes, each with 8 B200 GPUs

Creating Multi-Node Clusters

Create multi-node clusters by specifying the number of nodes with the -n flag:
# 2-node cluster with 8xH200 each (16 GPUs total)
tp cluster create 8xH200 -i ~/.ssh/id_ed25519.pub -n 2

# 4-node cluster with 8xB200 each (32 GPUs total)
tp cluster create 8xB200 -i ~/.ssh/id_ed25519.pub -n 4
The -i flag is optional if you have SSH keys saved on your account via tp me sshkey.
Multi-node support is currently available for 8xH200 and 8xB200 instance types only.

Accessing Multi-Node Clusters

All multi-node clusters come with SLURM preinstalled and configured. For detailed information about using SLURM for distributed training, see the Multi-Node Training Guide.
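As a quick illustration, a minimal SLURM batch script for a 2-node run might look like the sketch below. This is a hedged, config-style example, not a TensorPool default: the job name, output path, rendezvous port, and train.py are placeholders, and torchrun is only one of several launchers you could use.

```shell
#!/bin/bash
# Hypothetical SLURM job script (illustrative only; adjust counts and paths).
#SBATCH --job-name=train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --output=%x-%j.out

# Rendezvous on the first node of the allocation (port 29500 is arbitrary).
srun torchrun \
  --nnodes=2 \
  --nproc-per-node=8 \
  --rdzv-backend=c10d \
  --rdzv-endpoint="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1):29500" \
  train.py
```

You would submit this from the jumphost with sbatch, since the jumphost acts as the SLURM login node.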

Cluster Architecture

Multi-node clusters use a jumphost architecture for network access and consist of:
  • Jumphost: {cluster_id}-jumphost - The SLURM login/controller node with a public IP address
  • Worker Nodes: {cluster_id}-0, {cluster_id}-1, etc. - Compute nodes with private IP addresses only

Accessing Your Cluster

Follow these steps to access your multi-node cluster:
  1. Get cluster information to see all nodes and their instance IDs:
tp cluster info <cluster_id>
  2. SSH into the jumphost (this is the only node with direct public access):
tp ssh <jumphost-instance-id>
  3. Access worker nodes from the jumphost. You can use either the instance name or private IP:
# Using instance name (replace <cluster_id> with your actual cluster ID)
ssh <cluster_id>-0
ssh <cluster_id>-1

# Or using the private IP address (found in cluster info)
ssh <worker-node-private-ip>
Note: The jumphost serves as the SLURM login node where you submit distributed training jobs. Worker nodes are only accessible from within the cluster network.
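If you prefer reaching worker nodes from your local machine in one hop, an SSH config entry using ProxyJump can tunnel through the jumphost. This is a sketch under stated assumptions: the host aliases, user name, IP placeholders, and key path are hypothetical values you would substitute from tp cluster info, not TensorPool defaults.

```
# ~/.ssh/config (illustrative; fill in values from `tp cluster info`)
Host my-jumphost
    HostName <jumphost-public-ip>
    User ubuntu
    IdentityFile ~/.ssh/id_ed25519

Host my-worker-0
    HostName <worker-node-private-ip>
    User ubuntu
    ProxyJump my-jumphost
```

With this in place, `ssh my-worker-0` connects through the jumphost transparently.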

Container Images

When creating a cluster, you can optionally specify a container image to run on your nodes. Container images are pre-built, GPU-ready Docker environments that come with CUDA, Python, and common ML tooling pre-installed — so you can start training immediately without manual setup.

Using Container Images

Pass the --container flag when creating a cluster:
# Minimal GPU environment
tp cluster create 1xH100 --container base

# Full PyTorch ML stack
tp cluster create 8xH200 --container pytorch
Container images are only supported on single-node clusters. Multi-node clusters do not support --container.

Available Images

  • Base (--container base) - CUDA 12.8 + cuDNN, Python 3.12, git, curl, rsync, rclone, unzip, build-essential
  • PyTorch (--container pytorch) - Everything in Base plus PyTorch 2.6, torchvision, torchaudio, transformers, accelerate, datasets, peft, safetensors, wandb, tensorboard, scikit-learn, einops, opencv
Clusters without --container continue to work exactly as before — bare-metal SSH access with no container layer.

Cluster and Instance Statuses

A cluster’s status is derived from the statuses of its individual instances. Each instance within a cluster progresses through its own lifecycle, and the cluster’s displayed status reflects the highest-priority status among all its instances.

Instance Status Lifecycle

Each instance in a cluster follows this lifecycle: PENDING → PROVISIONING → CONFIGURING → CONTAINER_CREATING (container clusters only) → RUNNING → DESTROYING → DESTROYED. A FAILED status can occur at any point.

Status Definitions

  • PENDING - Instance creation request has been submitted and is queued for provisioning.
  • PROVISIONING - Instance has been allocated and is being provisioned.
  • CONFIGURING - Instance is being configured with software, drivers, networking, and storage.
  • CONTAINER_CREATING - Container image is being bootstrapped on the instance. Occurs only when the cluster was created with a container image.
  • RUNNING - Instance is ready for use.
  • DESTROYING - Instance shutdown is in progress; resources are being deallocated.
  • DESTROYED - Instance has been successfully terminated.
  • FAILED - A system-level problem occurred (e.g., hardware failure, no capacity).

Cluster Status Priority

A cluster’s status is determined by the highest-priority status among its instances. Priority order (highest to lowest):
  1. FAILED - Any failed instance causes the cluster to show as failed
  2. DESTROYING - Cluster is being torn down
  3. PENDING - Instances are waiting to be provisioned
  4. PROVISIONING - Instances are being provisioned
  5. CONFIGURING - Instances are being configured
  6. CONTAINER_CREATING - Container image is being bootstrapped
  7. RUNNING - All instances are running
  8. DESTROYED - All instances have been terminated
For example, if a cluster has 3 instances where 2 are RUNNING and 1 is CONFIGURING, the cluster status will show as CONFIGURING.
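The derivation rule above can be sketched as a small shell function. This is an illustration of the priority logic described in this section, not part of the tp CLI.

```shell
# Return the highest-priority status among the instance statuses
# passed as arguments, scanning the priority order from highest to lowest.
cluster_status() {
  for s in FAILED DESTROYING PENDING PROVISIONING CONFIGURING CONTAINER_CREATING RUNNING DESTROYED; do
    for inst in "$@"; do
      [ "$inst" = "$s" ] && { echo "$s"; return 0; }
    done
  done
}

cluster_status RUNNING RUNNING CONFIGURING   # prints CONFIGURING
```

The function mirrors the example above: two RUNNING instances and one CONFIGURING instance yield a cluster status of CONFIGURING.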
Clusters targeted by jobs with --teardown will be automatically destroyed after the job completes or is canceled.
