## Core Commands
- `tp cluster create` - Deploy a new GPU cluster
- `tp cluster list` - View all your clusters
- `tp cluster info <cluster_id>` - Get detailed information about a cluster
- `tp cluster edit <cluster_id>` - Edit cluster settings
- `tp cluster attach <cluster_id> <storage_id>` - Attach a storage volume to a cluster
- `tp cluster detach <cluster_id> <storage_id>` - Detach a storage volume from a cluster
- `tp cluster destroy <cluster_id>` - Terminate a cluster
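As a quick orientation, a typical end-to-end session with these commands might look like the following sketch (all IDs are placeholders to be replaced with values from your own account):

```shell
# Deploy a new cluster, then inspect it
tp cluster create
tp cluster list
tp cluster info <cluster_id>

# Attach persistent storage while the cluster runs, detach when finished
tp cluster attach <cluster_id> <storage_id>
tp cluster detach <cluster_id> <storage_id>

# Tear the cluster down when you are done
tp cluster destroy <cluster_id>
```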
## Creating Clusters
Deploy GPU clusters with simple commands. TensorPool supports both single-node and multi-node cluster configurations.

### Single-Node Clusters
Single-node clusters are ideal for development, experimentation, and smaller training workloads. They provide direct access to GPU resources without the complexity of distributed training.

#### Supported Instance Types
Single-node clusters support a wide variety of GPU configurations.

The `-i` flag is optional if you have SSH keys saved on your account via `tp me sshkey`.

#### Accessing Single-Node Clusters
Single-node clusters provide direct SSH access. Once your cluster is ready, use `tp cluster info` to find its address and connect over SSH.

### Multi-Node Clusters
Multi-node clusters are designed for distributed training workloads that require scaling across multiple machines. All multi-node clusters come with SLURM preinstalled for job scheduling and resource management.

#### Supported Instance Types
Multi-node support is currently available for:

- 8xH200 - 2 or more nodes, each with 8 H200 GPUs
- 8xB200 - 2 or more nodes, each with 8 B200 GPUs
#### Creating Multi-Node Clusters
Create multi-node clusters by specifying the number of nodes with the `-n` flag:
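A minimal sketch of such an invocation, using only the flags documented on this page (`-n` for node count, `-i` for the SSH public key; the key path is a placeholder for your own):

```shell
# Create a 2-node cluster; -n sets the node count,
# -i points at the SSH public key to install on the nodes
tp cluster create -n 2 -i ~/.ssh/id_ed25519.pub
```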
The `-i` flag is optional if you have SSH keys saved on your account via `tp me sshkey`.

Multi-node support is currently available for the 8xH200 and 8xB200 instance types only.
#### Accessing Multi-Node Clusters
All multi-node clusters come with SLURM preinstalled and configured. For detailed information about using SLURM for distributed training, see the Multi-Node Training Guide.

#### Cluster Architecture
Multi-node clusters use a jumphost architecture for network access. Each multi-node cluster consists of:

- Jumphost: `{cluster_id}-jumphost` - the SLURM login/controller node, with a public IP address
- Worker Nodes: `{cluster_id}-0`, `{cluster_id}-1`, etc. - compute nodes with private IP addresses only
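Because only the jumphost is publicly reachable, one optional convenience is an SSH `ProxyJump` entry so you can reach workers in a single command from your laptop. The host aliases, username, and addresses below are placeholders, not values TensorPool provides:

```
# ~/.ssh/config (illustrative; fill in real values from `tp cluster info`)
Host my-jumphost
    HostName <jumphost_public_ip>
    User <user>

Host my-worker-0
    HostName <worker_private_ip>
    User <user>
    ProxyJump my-jumphost
```

With an entry like this, `ssh my-worker-0` tunnels through the jumphost automatically.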
#### Accessing Your Cluster
Follow these steps to access your multi-node cluster:

1. Get cluster information to see all nodes and their instance IDs.
2. SSH into the jumphost (this is the only node with direct public access).
3. Access worker nodes from the jumphost. You can use either the instance name or private IP.
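A sketch of this flow, with the cluster ID, username, and addresses as placeholders:

```shell
# 1. List all nodes in the cluster and their instance IDs
tp cluster info <cluster_id>

# 2. SSH into the jumphost, the only node with a public IP
ssh <user>@<jumphost_public_ip>

# 3. From the jumphost, hop to a worker node by instance name...
ssh <cluster_id>-0
# ...or by its private IP
ssh <worker_private_ip>
```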
## Container Images
When creating a cluster, you can optionally specify a container image to run on your nodes. Container images are pre-built, GPU-ready Docker environments that come with CUDA, Python, and common ML tooling pre-installed, so you can start training immediately without manual setup.

### Using Container Images
Pass the `--container` flag when creating a cluster:
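For instance, a sketch of creating a cluster that boots into the PyTorch image (flag values come from the table below; the key path is a placeholder):

```shell
# Create a cluster whose nodes start with the PyTorch container image
tp cluster create --container pytorch -i ~/.ssh/id_ed25519.pub
```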
### Available Images
| Image | Flag Value | What’s Included |
|---|---|---|
| Base | base | CUDA 12.8 + cuDNN, Python 3.12, git, curl, rsync, rclone, unzip, build-essential |
| PyTorch | pytorch | Everything in Base + PyTorch 2.6, torchvision, torchaudio, transformers, accelerate, datasets, peft, safetensors, wandb, tensorboard, scikit-learn, einops, opencv |
Clusters without `--container` continue to work exactly as before: bare-metal SSH access with no container layer.

## Cluster and Instance Statuses
A cluster’s status is derived from the statuses of its individual instances. Each instance within a cluster progresses through its own lifecycle, and the cluster’s displayed status reflects the highest-priority status among all its instances.

### Instance Status Lifecycle
Each instance in a cluster follows this lifecycle: PENDING → PROVISIONING → CONFIGURING → CONTAINER_CREATING (container clusters only) → RUNNING → DESTROYING → DESTROYED, with FAILED possible if a system-level problem occurs.

### Status Definitions
| Status | Description |
|---|---|
| PENDING | Instance creation request has been submitted and is being queued for provisioning. |
| PROVISIONING | Instance has been allocated and is being provisioned. |
| CONFIGURING | Instance is being configured with software, drivers, networking, and storage. |
| CONTAINER_CREATING | Container image is being bootstrapped on the instance. Only occurs when a cluster is created with a container image. |
| RUNNING | Instance is ready for use. |
| DESTROYING | Instance shutdown in progress, resources are being deallocated. |
| DESTROYED | Instance has been successfully terminated. |
| FAILED | System-level problem (e.g., hardware failure, no capacity). |
### Cluster Status Priority
A cluster’s status is determined by the highest-priority status among its instances. Priority order (highest to lowest):

- FAILED - Any failed instance causes the cluster to show as failed
- DESTROYING - Cluster is being torn down
- PENDING - Instances are waiting to be provisioned
- PROVISIONING - Instances are being provisioned
- CONFIGURING - Instances are being configured
- CONTAINER_CREATING - Container image is being bootstrapped
- RUNNING - All instances are running
- DESTROYED - All instances have been terminated
For example, if a cluster has three instances where two are RUNNING and one is CONFIGURING, the cluster status will show as CONFIGURING.
Clusters targeted by jobs with `--teardown` will be automatically destroyed after the job completes or is canceled.

## Next Steps
- Explore instance types available
- Learn about storage volumes for persistent data
- Read the CLI reference for detailed command options