Follow these best practices to get the most out of TensorPool while optimizing performance and cost.

SSH Key Management

Keep Your Private Keys Secure

  • Never share your private key - Only share the public key (.pub file)
  • Use strong passphrases - Protect your private keys with passphrases
  • Proper permissions - Set correct permissions on your private key:
    chmod 600 ~/.ssh/id_ed25519
    
  • Backup your keys - Keep secure backups of your SSH keys

Organize Your Keys

If you use multiple SSH keys, organize them clearly:
~/.ssh/
├── tensorpool_id_ed25519
├── tensorpool_id_ed25519.pub
├── personal_id_ed25519
└── personal_id_ed25519.pub
Configure SSH to use specific keys:
# ~/.ssh/config
Host *.tensorpool.dev
    IdentityFile ~/.ssh/tensorpool_id_ed25519
    User tensorpool

Cluster Naming

Use Descriptive Names

Use clear, descriptive names that indicate the purpose:
# Good names
tp cluster create 8xB200 -n 4 --name pretraining
tp cluster create 1xH100 --name joshua-workbench
tp cluster create 8xH200 -n 2 --name research-experiments

# Avoid generic names
tp cluster create 8xH100 --name cluster1  # Not descriptive
If you’re in a TensorPool Organization, other people can see your clusters! Descriptive names avoid misunderstandings.

Cost Management

Destroy Clusters When Not in Use

The most important cost-saving practice:
# When you're done training
tp cluster destroy <cluster_id>
Set reminders to check for unused clusters:
# Check your active clusters
tp cluster list

# Destroy unused clusters
tp cluster destroy c-abc123

Monitor Your Resources

Regularly check your active resources:
# List all clusters
tp cluster list

# List all storage volumes
tp storage list

# Check account usage
tp me

Data Persistence

Use Storage Volumes for Important Data

Never rely solely on cluster local storage for important data:
# Create storage volume for datasets
tp storage create 1000 --name training-data

# Attach to your cluster
tp cluster attach <cluster_id> <storage_id>

# Store datasets, checkpoints, and results on storage
/mnt/<storage_id>/

Checkpoint Regularly

Save checkpoints frequently to shared storage:
# Save every N epochs
if epoch % 5 == 0:
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, f"/mnt/{storage_id}/checkpoints/epoch_{epoch}.pt")
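To resume after an interruption, load the newest checkpoint from the volume. A minimal sketch; the `latest_checkpoint` helper is hypothetical and assumes the `epoch_N.pt` naming used above:

```python
from pathlib import Path
import re

def latest_checkpoint(ckpt_dir):
    """Return the highest-numbered epoch_N.pt in ckpt_dir, or None."""
    ckpts = sorted(
        Path(ckpt_dir).glob("epoch_*.pt"),
        key=lambda p: int(re.search(r"epoch_(\d+)", p.name).group(1)),
    )
    return ckpts[-1] if ckpts else None

# Resume sketch, assuming the checkpoint dict saved above:
# ckpt = torch.load(latest_checkpoint(f"/mnt/{storage_id}/checkpoints"))
# model.load_state_dict(ckpt["model_state_dict"])
# optimizer.load_state_dict(ckpt["optimizer_state_dict"])
# start_epoch = ckpt["epoch"] + 1
```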

Download Results Before Cleanup

Before destroying storage volumes, download important results:
# Download to local machine
rsync -avz tensorpool@<cluster_ip>:/mnt/<storage_id>/results/ ./results/

Multi-Node Training

Configure Distributed Training Properly

Use appropriate distributed training frameworks:
# Option 1: PyTorch DDP (launch with torchrun so LOCAL_RANK is set)
import os, torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")  # Use NCCL for best performance
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
ddp_model = DDP(model.cuda(), device_ids=[local_rank])

# Option 2: DeepSpeed (initialize needs the model, not just the config)
import deepspeed
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)
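The `ds_config` above has to come from somewhere. A minimal illustrative sketch of one (the values are examples, not tuned recommendations):

```python
# Illustrative DeepSpeed config; tune these values for your model and cluster
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},          # mixed precision
    "zero_optimization": {"stage": 2},  # shard optimizer state and gradients
}
```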

Scale Batch Size with Nodes

When scaling to multiple nodes, adjust your batch size:
# Keep the per-GPU batch size fixed; the global batch grows with GPU count:
#   1 node  (8 GPUs):  global batch = 8  * 32 = 256
#   2 nodes (16 GPUs): global batch = 16 * 32 = 512
#   4 nodes (32 GPUs): global batch = 32 * 32 = 1024

world_size = dist.get_world_size()  # total number of GPUs/processes
global_batch_size = per_gpu_batch_size * world_size
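When the global batch grows, the learning rate usually needs to grow with it. One common heuristic is the linear scaling rule; a minimal sketch (the function name and baseline values here are illustrative, not TensorPool defaults):

```python
def scaled_lr(base_lr, base_world_size, world_size):
    """Linear scaling rule: scale the learning rate proportionally
    to the increase in global batch size (i.e., in GPU count)."""
    return base_lr * world_size / base_world_size

# Tuned at 1 node (8 GPUs) with lr=1e-3, moving to 4 nodes (32 GPUs):
# scaled_lr(1e-3, 8, 32) -> 4e-3
```

At large scale this rule is usually paired with a warmup period during the first few epochs.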

Use Shared Storage for Multi-Node

For multi-node clusters, use shared storage to share data and checkpoints:
# All nodes access the same data
cd /mnt/<storage_id>/dataset
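Because every node mounts the same dataset path, each rank should read a disjoint shard rather than the whole set. PyTorch's `DistributedSampler` does this for sample indices; the idea in plain Python (the helper name is illustrative):

```python
def shard_for_rank(files, rank, world_size):
    """Round-robin split of a shared file list so each of the
    world_size ranks processes a disjoint, near-equal subset."""
    return files[rank::world_size]

# With 2 nodes x 8 GPUs (world_size=16), rank 0 reads files 0, 16, 32, ...
```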

Performance Optimization

Use Mixed Precision Training

Enable FP16 or BF16 for faster training:
# PyTorch AMP (torch.cuda.amp is deprecated in favor of torch.amp)
from torch.amp import autocast, GradScaler

scaler = GradScaler("cuda")
with autocast("cuda"):
    output = model(input)
    loss = criterion(output, target)

scaler.scale(loss).backward()  # scale the loss to avoid FP16 underflow
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()

Profile Your Code

Find bottlenecks with profiling:
import torch.profiler

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as prof:
    # Your training code
    pass

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Monitoring

Regularly Check Cluster List

Keep track of your active resources:
# Set up an alias for quick checking
echo "alias tpl='tp cluster list'" >> ~/.bashrc

# Use it frequently
tpl

Monitor GPU Usage

When SSH’d into your cluster, monitor GPU usage:
# Watch GPU usage
watch -n 1 nvidia-smi

# Or use gpustat
pip install gpustat
gpustat -i 1

Track Training Metrics

Use tools like Weights & Biases or TensorBoard:
import wandb

wandb.init(project="my-training")
wandb.log({"loss": loss, "accuracy": acc})

Security

API Key Security

Keep your API key secure:
# Use environment variables
export TENSORPOOL_KEY="your_key_here"

# Never commit API keys to git; keep them in an untracked .env file
echo ".env" >> .gitignore
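In your own scripts, read the key from the environment rather than hard-coding it. A minimal sketch (the helper name is illustrative):

```python
import os

def tensorpool_key():
    """Fetch the API key from the environment; fail loudly if unset."""
    key = os.environ.get("TENSORPOOL_KEY")
    if not key:
        raise RuntimeError("TENSORPOOL_KEY is not set")
    return key
```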

Network Security

  • Only open necessary ports
  • Use SSH key authentication (no passwords)
  • Regularly rotate SSH keys
  • Monitor for unauthorized access

Workflow Recommendations

Development Workflow

  1. Start small: Use 1xH100 for development
  2. Test your code: Verify everything works on small scale
  3. Scale up: Move to larger instances once tested
  4. Monitor: Watch metrics and GPU utilization
  5. Clean up: Destroy clusters when done

Production Workflow

  1. Use storage volumes: Store all important data on storage volumes
  2. Checkpoint frequently: Save progress regularly
  3. Monitor costs: Track usage and spending
  4. Automate: Script common workflows
  5. Document: Keep notes on experiments and configurations

Common Mistakes to Avoid

Leaving Clusters Running

Don’t forget to destroy clusters when you’re done:
# Always clean up
tp cluster destroy <cluster_id>

Not Using Storage for Important Data

Don’t store critical data only on cluster local storage:
# Use storage for persistent data
tp storage create 500 --name important-data

Choosing Wrong Instance Type

Don’t use oversized instances for small tasks:
# For development, start small
tp cluster create 1xH100 --name dev  # Good
tp cluster create 8xH200 -n 4 --name dev  # Overkill

Ignoring Checkpoints

Don’t train without saving checkpoints:
# Save regularly
if epoch % checkpoint_interval == 0:
    torch.save(model.state_dict(), checkpoint_path)

Not Monitoring Usage

Don’t let resources run without monitoring:
# Check regularly
tp cluster list
tp me
