High-Performance Computing (HPC) Cluster on Google Cloud
Why Choose This Project?
High-Performance Computing (HPC) clusters are critical for scientific simulations, AI/ML model training, weather forecasting, genomics research, and large-scale data analysis. Traditionally, HPC setups require massive upfront investment in physical servers and networking.
With Google Cloud HPC, students can deploy a scalable, on-demand HPC cluster without purchasing hardware. This project helps students learn parallel computing, workload scheduling, cluster scaling, and cloud-based orchestration — all essential skills for research and enterprise applications.
What You Get
- On-demand deployment of an HPC cluster on Google Cloud
- Multiple compute nodes for parallel processing
- Job scheduling and workload management via Slurm or PBS
- Real-time monitoring of cluster performance and resources
- Scalable infrastructure (add/remove compute nodes dynamically)
- Centralized storage for input/output datasets
- Secure access with SSH and IAM roles
- Cost optimization using preemptible VMs
Key Features
| Feature | Description |
|---|---|
| Cluster Deployment | Deploy multi-node HPC clusters on Google Cloud Compute Engine |
| Parallel Computation | Run parallel jobs using MPI (Message Passing Interface) or OpenMP |
| Job Scheduling | Use Slurm or PBS for automated job allocation and management |
| Scalable Nodes | Add or remove compute nodes based on workload demands |
| Centralized Storage | Use Google Cloud Storage or Filestore for shared access across nodes |
| Monitoring & Logging | Track CPU, GPU, memory usage, and job status via Cloud Monitoring (formerly Stackdriver) |
| Secure Access | Manage access using SSH and Google Cloud IAM authentication |
| Cost Optimization | Use preemptible VMs to lower costs for short-running HPC workloads |
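The Job Scheduling and Parallel Computation features above meet in a Slurm batch script. A minimal sketch (the node counts, time limit, and the `./my_mpi_app` binary are placeholder assumptions, not part of this project's fixed setup):

```bash
#!/bin/bash
#SBATCH --job-name=hpc-demo        # job name shown in squeue
#SBATCH --nodes=4                  # number of compute nodes to allocate
#SBATCH --ntasks-per-node=8       # MPI ranks per node
#SBATCH --time=00:30:00            # wall-clock limit
#SBATCH --output=hpc-demo_%j.out   # stdout/stderr file (%j = job ID)

# Launch the MPI program across all allocated ranks
srun ./my_mpi_app input.dat
```

Submit with `sbatch job.sh` and check progress with `squeue -u $USER`; the scheduler handles node allocation automatically.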
Technology Stack
| Layer | Tools/Technologies |
|---|---|
| Compute Nodes | Google Compute Engine VMs (with optional GPUs) |
| Job Scheduler | Slurm / PBS for job management |
| Parallel Processing | OpenMPI / OpenMP |
| Storage | Google Cloud Storage / Filestore |
| Monitoring | Cloud Monitoring (formerly Stackdriver) |
| Automation | Deployment Manager / Terraform |
| Authentication | SSH keys / Google Cloud IAM roles |
| Networking | VPC, subnets, firewall rules for cluster communication |
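The Automation layer above can be sketched with Terraform. This is a minimal, hypothetical configuration; the project ID, zone, machine type, and image are placeholder assumptions, and the `scheduling` block shows how the preemptible-VM cost optimization is expressed:

```hcl
# Hypothetical sketch: provision preemptible compute nodes for the cluster.
provider "google" {
  project = "my-hpc-project"   # placeholder project ID
  zone    = "us-central1-a"    # placeholder zone
}

resource "google_compute_instance" "compute_node" {
  count        = 4                     # number of compute nodes
  name         = "hpc-node-${count.index}"
  machine_type = "c2-standard-8"       # compute-optimized VM (assumption)

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network = "default"
  }

  scheduling {
    preemptible       = true    # lower cost for short-running workloads
    automatic_restart = false   # preemptible VMs cannot auto-restart
  }
}
```

Running `terraform apply` would create all four nodes in one step; scaling the cluster is then a matter of changing `count` and re-applying.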
Google Cloud Services Used
| Service | Purpose |
|---|---|
| Compute Engine | Provision virtual machines for HPC nodes |
| Cloud Storage | Centralized storage for datasets |
| Filestore | Shared file system across compute nodes |
| Cloud Monitoring (formerly Stackdriver) | Monitor resource utilization and logs |
| Cloud IAM | Secure access control and permissions |
| Deployment Manager/Terraform | Automate provisioning of the HPC cluster |
| GPU-enabled VMs (optional) | Accelerate computations for AI/ML |
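In production, per-node metrics would come from the Cloud Monitoring agent listed above. As a rough illustration of what a node-side probe gathers, here is a sketch using only Python's standard library (the function name and the chosen metrics are illustrative assumptions):

```python
# Minimal node-side resource snapshot using only the standard library.
# A real cluster would run the monitoring agent and view these metrics
# in Cloud Monitoring; this only shows the kind of data collected.
import os
import shutil

def node_snapshot(path="/"):
    """Return a small dict of load, CPU, and disk metrics for this node."""
    load_1m, load_5m, load_15m = os.getloadavg()   # 1/5/15-minute load averages
    disk = shutil.disk_usage(path)                 # total/used/free bytes
    return {
        "load_1m": load_1m,
        "cpu_count": os.cpu_count(),
        "disk_free_gb": disk.free / 1e9,
    }

if __name__ == "__main__":
    print(node_snapshot())
```

A scheduler-side script could poll such snapshots to decide when to add or remove nodes, which is the scaling step described in the working flow.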
Working Flow
1. Cluster Provisioning: Deploy multiple Compute Engine instances as compute nodes with shared networking and storage.
2. Install HPC Software: Configure MPI/OpenMP, the job scheduler (Slurm/PBS), and required scientific libraries.
3. Upload Input Data: Store input datasets in Cloud Storage or Filestore, accessible by all nodes.
4. Submit Jobs: Users submit computational jobs to the scheduler for allocation across nodes.
5. Parallel Processing: Compute nodes process workloads in parallel, exchanging data via MPI/OpenMP.
6. Monitor Performance: Use Cloud Monitoring (formerly Stackdriver) to track CPU, GPU, memory, and job status.
7. Collect Results: Output datasets are aggregated in Cloud Storage or Filestore for analysis.
8. Scale Cluster: Dynamically add or remove nodes depending on workload demand.
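The parallel-processing step follows the scatter-compute-gather pattern that MPI programs use. A real job would run via OpenMPI across nodes; as a stand-in that illustrates the same pattern on one machine, here is a sketch using Python's standard multiprocessing (function names and the sum-of-squares workload are illustrative assumptions):

```python
# Stand-in for MPI-style scatter-compute-gather, using Python multiprocessing.
# Each worker process plays the role of a compute node.
from multiprocessing import Pool

def partial_sum_of_squares(chunk):
    """Work done independently on one 'node': square and sum a data slice."""
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, nprocs=4):
    """Scatter data into chunks, compute partials in parallel, gather results."""
    chunks = [data[i::nprocs] for i in range(nprocs)]          # scatter
    with Pool(nprocs) as pool:
        partials = pool.map(partial_sum_of_squares, chunks)    # parallel compute
    return sum(partials)                                       # gather / reduce

if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1000))))  # prints 332833500
```

The same structure maps onto the cluster: the scheduler scatters work to nodes, each node computes its partial result, and outputs are gathered into Cloud Storage or Filestore for analysis.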