Follow these best practices to get the most out of TensorPool while optimizing performance and cost.

SSH Key Management

Keep Your Private Keys Secure

  • Never share your private key - Only share the public key (.pub file)
  • Use strong passphrases - Protect your private keys with passphrases
  • Proper permissions - Set correct permissions on your private key:
    chmod 600 ~/.ssh/id_ed25519
    
  • Backup your keys - Keep secure backups of your SSH keys

Organize Your Keys

If you use multiple SSH keys, organize them clearly:
~/.ssh/
├── tensorpool_id_ed25519
├── tensorpool_id_ed25519.pub
├── personal_id_ed25519
└── personal_id_ed25519.pub
Configure SSH to use specific keys:
# ~/.ssh/config
Host *.tensorpool.dev
    IdentityFile ~/.ssh/tensorpool_id_ed25519
    User tensorpool

Cluster Naming

Use Descriptive Names

Use clear, descriptive names that indicate the purpose:
# Good names
tp cluster create 8xB200 -n 4 --name pretraining
tp cluster create 1xH100 --name joshua-workbench
tp cluster create 8xH200 -n 2 --name research-experiments

# Avoid generic names
tp cluster create 8xH100 --name cluster1  # Not descriptive
If you’re in a TensorPool Organization, other people can see your clusters! Descriptive names avoid misunderstandings.

Cost Management

Destroy Clusters When Not in Use

The most important cost-saving practice:
# When you're done training
tp cluster destroy <cluster_id>
Set reminders to check for unused clusters:
# Check your active clusters
tp cluster list

# Destroy unused clusters
tp cluster destroy c-abc123

Monitor Your Resources

Regularly check your active resources:
# List all clusters
tp cluster list

# List all storage volumes
tp storage list

# Check account usage
tp me

Data Persistence

Use Storage Volumes for Important Data

Never rely solely on cluster local storage for important data:
# Create storage volume for datasets
tp storage create 1000 --name training-data

# Attach to your cluster
tp cluster attach <cluster_id> <storage_id>

# Store datasets, checkpoints, and results on storage
/mnt/<storage_id>/

Checkpoint Regularly

Save checkpoints frequently to shared storage:
# Save every N epochs
if epoch % 5 == 0:
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, f"/mnt/{storage_id}/checkpoints/epoch_{epoch}.pt")
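To resume after an interruption, load the newest checkpoint from the volume. A minimal sketch; the `latest_checkpoint` helper is hypothetical and assumes the `epoch_N.pt` naming used above:

```python
from pathlib import Path
import re

def latest_checkpoint(ckpt_dir):
    """Return the highest-numbered epoch_N.pt in ckpt_dir, or None."""
    ckpts = sorted(
        Path(ckpt_dir).glob("epoch_*.pt"),
        key=lambda p: int(re.search(r"epoch_(\d+)", p.name).group(1)),
    )
    return ckpts[-1] if ckpts else None

# Resume sketch, assuming the checkpoint dict saved above:
# ckpt = torch.load(latest_checkpoint(f"/mnt/{storage_id}/checkpoints"))
# model.load_state_dict(ckpt["model_state_dict"])
# optimizer.load_state_dict(ckpt["optimizer_state_dict"])
# start_epoch = ckpt["epoch"] + 1
```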

Download Results Before Cleanup

Before destroying storage volumes, download important results:
# Download to local machine
rsync -avz tensorpool@<cluster_ip>:/mnt/<storage_id>/results/ ./results/

Multi-Node Training

Configure Distributed Training Properly

Use appropriate distributed training frameworks:
# Option 1: PyTorch DDP (launch with torchrun so LOCAL_RANK is set)
import os, torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")  # Use NCCL for best performance
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
ddp_model = DDP(model.cuda(), device_ids=[local_rank])

# Option 2: DeepSpeed (initialize needs the model, not just the config)
import deepspeed
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)
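The `ds_config` above has to come from somewhere. A minimal illustrative sketch of one (the values are examples, not tuned recommendations):

```python
# Illustrative DeepSpeed config; tune these values for your model and cluster
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},          # mixed precision
    "zero_optimization": {"stage": 2},  # shard optimizer state and gradients
}
```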

Scale Batch Size with Nodes

When scaling to multiple nodes, adjust your batch size:
# Keep the per-GPU batch size fixed; the global batch grows with GPU count:
#   1 node  (8 GPUs):  global batch = 8  * 32 = 256
#   2 nodes (16 GPUs): global batch = 16 * 32 = 512
#   4 nodes (32 GPUs): global batch = 32 * 32 = 1024

world_size = dist.get_world_size()  # total number of GPUs/processes
global_batch_size = per_gpu_batch_size * world_size
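When the global batch grows, the learning rate usually needs to grow with it. One common heuristic is the linear scaling rule; a minimal sketch (the function name and baseline values here are illustrative, not TensorPool defaults):

```python
def scaled_lr(base_lr, base_world_size, world_size):
    """Linear scaling rule: scale the learning rate proportionally
    to the increase in global batch size (i.e., in GPU count)."""
    return base_lr * world_size / base_world_size

# Tuned at 1 node (8 GPUs) with lr=1e-3, moving to 4 nodes (32 GPUs):
# scaled_lr(1e-3, 8, 32) -> 4e-3
```

At large scale this rule is usually paired with a warmup period during the first few epochs.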

Use Shared Storage for Multi-Node

For multi-node clusters, use shared storage to share data and checkpoints:
# All nodes access the same data
cd /mnt/<storage_id>/dataset
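Because every node mounts the same dataset path, each rank should read a disjoint shard rather than the whole set. PyTorch's `DistributedSampler` does this for sample indices; the idea in plain Python (the helper name is illustrative):

```python
def shard_for_rank(files, rank, world_size):
    """Round-robin split of a shared file list so each of the
    world_size ranks processes a disjoint, near-equal subset."""
    return files[rank::world_size]

# With 2 nodes x 8 GPUs (world_size=16), rank 0 reads files 0, 16, 32, ...
```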

Performance Optimization

Use Mixed Precision Training

Enable FP16 or BF16 for faster training:
# PyTorch AMP (torch.cuda.amp is deprecated in favor of torch.amp)
from torch.amp import autocast, GradScaler

scaler = GradScaler("cuda")
with autocast("cuda"):
    output = model(input)
    loss = criterion(output, target)

scaler.scale(loss).backward()  # scale the loss to avoid FP16 underflow
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()

Profile Your Code

Find bottlenecks with profiling:
import torch.profiler

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as prof:
    # Your training code
    pass

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Monitoring

Regularly Check Cluster List

Keep track of your active resources:
# Set up an alias for quick checking
echo "alias tpl='tp cluster list'" >> ~/.bashrc

# Use it frequently
tpl

Monitor GPU Usage

When SSH’d into your cluster, monitor GPU usage:
# Watch GPU usage
watch -n 1 nvidia-smi

# Or use gpustat
pip install gpustat
gpustat -i 1

Track Training Metrics

Use tools like Weights & Biases or TensorBoard:
import wandb

wandb.init(project="my-training")
wandb.log({"loss": loss, "accuracy": acc})

Security

API Key Security

Keep your API key secure:
# Use environment variables
export TENSORPOOL_KEY="your_key_here"

# Never commit API keys to git; keep them in an untracked .env file
echo ".env" >> .gitignore
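In your own scripts, read the key from the environment rather than hard-coding it. A minimal sketch (the helper name is illustrative):

```python
import os

def tensorpool_key():
    """Fetch the API key from the environment; fail loudly if unset."""
    key = os.environ.get("TENSORPOOL_KEY")
    if not key:
        raise RuntimeError("TENSORPOOL_KEY is not set")
    return key
```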

Network Security

  • Only open necessary ports
  • Use SSH key authentication (no passwords)
  • Regularly rotate SSH keys
  • Monitor for unauthorized access

Workflow Recommendations

Development Workflow

  1. Start small: Use 1xH100 for development
  2. Test your code: Verify everything works on small scale
  3. Scale up: Move to larger instances once tested
  4. Monitor: Watch metrics and GPU utilization
  5. Clean up: Destroy clusters when done

Production Workflow

  1. Use storage volumes: Store all important data on storage volumes
  2. Checkpoint frequently: Save progress regularly
  3. Monitor costs: Track usage and spending
  4. Automate: Script common workflows
  5. Document: Keep notes on experiments and configurations

Common Mistakes to Avoid

Leaving Clusters Running

Don’t forget to destroy clusters when you’re done:
# Always clean up
tp cluster destroy <cluster_id>

Not Using Storage for Important Data

Don’t store critical data only on cluster local storage:
# Use storage for persistent data
tp storage create 500 --name important-data

Choosing Wrong Instance Type

Don’t use oversized instances for small tasks:
# For development, start small
tp cluster create 1xH100 --name dev  # Good
tp cluster create 8xH200 -n 4 --name dev  # Overkill

Ignoring Checkpoints

Don’t train without saving checkpoints:
# Save regularly
if epoch % checkpoint_interval == 0:
    torch.save(model.state_dict(), checkpoint_path)

Not Monitoring Usage

Don’t let resources run without monitoring:
# Check regularly
tp cluster list
tp me
