SSH Key Management
Keep Your Private Keys Secure
- Never share your private key - Only share the public key (the .pub file)
- Use strong passphrases - Protect your private keys with passphrases
- Proper permissions - Set correct permissions on your private key (mode 600, readable and writable only by you)
- Back up your keys - Keep secure backups of your SSH keys
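As a minimal sketch of the permissions point above, the following helper checks a private key file and restricts it to owner read/write (mode 600) if needed. The function name `ensure_private_key_mode` is hypothetical and only for illustration; it assumes a POSIX system.

```python
import os
import stat
from pathlib import Path


def ensure_private_key_mode(key_path: Path) -> bool:
    """Restrict key_path to owner read/write (0o600).

    Returns True if the mode had to be changed, False if it was already 0o600.
    """
    current_mode = stat.S_IMODE(key_path.stat().st_mode)
    if current_mode == 0o600:
        return False
    os.chmod(key_path, 0o600)
    return True


# Example call (the key path is an assumption; use your own key's location):
# ensure_private_key_mode(Path.home() / ".ssh" / "id_ed25519")
```

SSH clients generally refuse to use a private key that other users can read, so running a check like this after copying or restoring a key from backup can save a confusing login failure.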
Organize Your Keys
If you use multiple SSH keys, organize them clearly.
Cluster Naming
Use Descriptive Names
Use clear, descriptive names that indicate the purpose of each cluster.
Cost Management
Destroy Clusters When Not in Use
The most important cost-saving practice is to destroy clusters when you are not using them.
Monitor Your Resources
Regularly check your active resources.
Data Persistence
Use Storage Volumes for Important Data
Never rely solely on cluster local storage for important data.
Checkpoint Regularly
Save checkpoints frequently to shared storage.
Download Results Before Cleanup
Before destroying storage volumes, download important results.
Multi-Node Training
Configure Distributed Training Properly
Use appropriate distributed training frameworks.
Scale Batch Size with Nodes
When scaling to multiple nodes, adjust your batch size.
Use Shared Storage for Multi-Node
For multi-node clusters, use shared storage to share data and checkpoints.
Performance Optimization
Use Mixed Precision Training
Enable FP16 or BF16 for faster training.
Profile Your Code
Find bottlenecks with profiling.
Monitoring
Regularly Check Cluster List
Keep track of your active resources.
Monitor GPU Usage
When SSH’d into your cluster, monitor GPU usage.
Track Training Metrics
Use tools like Weights & Biases or TensorBoard to track training metrics.
Security
API Key Security
Keep your API key secure.
Network Security
- Only open necessary ports
- Use SSH key authentication (no passwords)
- Regularly rotate SSH keys
- Monitor for unauthorized access
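One way to verify the "no passwords" item above is to inspect the SSH server configuration. The sketch below assumes an OpenSSH server, whose `sshd_config` uses the `PasswordAuthentication` keyword. The function name is hypothetical, and the parser is deliberately simplified: it only looks at the global section and ignores `Include` files and `Match` blocks.

```python
def password_auth_disabled(sshd_config_text: str) -> bool:
    """Return True if the global section sets 'PasswordAuthentication no'.

    sshd keywords are case-insensitive and the first value found for a
    keyword is the one that applies. If the keyword is absent, OpenSSH
    defaults to allowing passwords, so this returns False.
    """
    for raw_line in sshd_config_text.splitlines():
        # Drop comments and surrounding whitespace (simplified handling).
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split()
        keyword = parts[0].lower()
        if keyword == "match":
            break  # conditional blocks are out of scope for this sketch
        if keyword == "passwordauthentication" and len(parts) > 1:
            return parts[1].lower() == "no"
    return False


# Example call (the path is the usual OpenSSH location, assumed here):
# with open("/etc/ssh/sshd_config") as f:
#     print(password_auth_disabled(f.read()))
```

Treat a False result as a prompt to look closer rather than as proof of a problem, since settings in included files or `Match` blocks are not evaluated here.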
Workflow Recommendations
Development Workflow
- Start small: Use 1xH100 for development
- Test your code: Verify everything works at small scale
- Scale up: Move to larger instances once tested
- Monitor: Watch metrics and GPU utilization
- Clean up: Destroy clusters when done
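The "Monitor" step above can be scripted. The sketch below assumes `nvidia-smi` is available on the cluster and uses its CSV query mode to read per-GPU utilization and memory. The function names and the dictionary keys are hypothetical choices made for this example.

```python
import subprocess
from typing import Optional

QUERY_FIELDS = "index,utilization.gpu,memory.used,memory.total"


def _to_int(field: str) -> Optional[int]:
    """Convert a CSV field to int, or None for values such as '[N/A]'."""
    try:
        return int(field)
    except ValueError:
        return None


def parse_gpu_stats(csv_text: str) -> list:
    """Parse 'nvidia-smi --format=csv,noheader,nounits' output into dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem_used, mem_total = (f.strip() for f in line.split(","))
        stats.append({
            "gpu": _to_int(index),
            "util_pct": _to_int(util),
            "mem_used_mib": _to_int(mem_used),
            "mem_total_mib": _to_int(mem_total),
        })
    return stats


def read_gpu_stats() -> list:
    """Run nvidia-smi once and return the parsed per-GPU stats."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_stats(result.stdout)
```

Calling `read_gpu_stats()` periodically during a development run gives a quick signal of whether the GPUs are actually busy before you scale up to larger instances.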
Production Workflow
- Use storage volumes: Store all important data on storage volumes
- Checkpoint frequently: Save progress regularly
- Monitor costs: Track usage and spending
- Automate: Script common workflows
- Document: Keep notes on experiments and configurations
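The "checkpoint frequently" and "use storage volumes" items above can be combined in a small helper. This is a sketch under stated assumptions: the checkpoint directory is a path on a mounted storage volume, JSON stands in for whatever serialization your training framework provides, and the function names are hypothetical. Writing to a temporary file and then renaming it means a crash mid-write should not leave a truncated checkpoint under the final name.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(state: dict, checkpoint_dir: Path, step: int) -> Path:
    """Write state to checkpoint_dir as step_<step>.json via temp file + rename."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    final_path = checkpoint_dir / f"step_{step:08d}.json"
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=checkpoint_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_name, final_path)
    return final_path


def latest_checkpoint(checkpoint_dir: Path) -> Optional[Path]:
    """Return the checkpoint with the highest step, or None if there is none."""
    # Zero-padded step numbers make lexicographic order match numeric order.
    candidates = sorted(checkpoint_dir.glob("step_*.json"))
    return candidates[-1] if candidates else None


# Example call (the mount point is an assumption; use your volume's path):
# save_checkpoint({"epoch": 3, "loss": 0.42}, Path("/mnt/storage/run1"), step=3000)
```

On resume, `latest_checkpoint` gives the most recent file to load, which makes it cheaper to destroy a cluster when it is idle and pick the run back up later.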