Performance Computing DevOps Architect
Applied Materials
Job Description
-
As an architect, you will be responsible for designing HPC infrastructure solutions, including compute, networking, storage, and workload management components.
-
You will work closely with cross-functional teams, including hardware, software, product management, and business stakeholders, to understand compute workloads and translate them into platform architecture and designs that meet business needs.
-
You will create and maintain detailed system architecture diagrams and specifications.
-
You will evaluate and select appropriate hardware and software components for HPC environments.
-
You will install, configure, and maintain HPC systems, including hardware, software, and networking components.
-
You will develop and implement automation scripts for system management and deployment.
-
You will be a subject matter expert to unblock dependent teams in the HPC domain.
-
You will be expected to develop system benchmarks, profile systems to understand bottlenecks, and optimize workflows and processes to improve cost of ownership.
-
You will identify and mitigate technical risks and issues throughout the HPC development life cycle.
-
You will ensure that the compute cluster is resilient, reliable, and maintainable.
-
You will be expected to stay abreast of the latest HPC technologies, including hardware, software, and networking solutions.
-
Your primary focus will be to understand the compute workload and design an HPC cluster with the right combination of nodes, CPU/GPU, memory, interconnects, and storage to achieve optimum performance at minimum cost of ownership.
Our Ideal Candidate
Someone who has the drive and passion to learn quickly, and the ability to multitask and switch contexts based on business needs.
Qualifications
-
In-depth experience with Linux system administration and hardware/software configuration.
-
Strong knowledge of HPC technologies, including cluster computing, high-speed interconnects (InfiniBand, RoCE), and parallel filesystems (Lustre, GPFS, BeeGFS, etc.).
-
Experience in creating and maintaining operating system images with different installation and boot schemes.
-
Strong proficiency with automation tools such as Ansible, Chef, and SaltStack, and with scripting languages (Python and Bash).
-
Experience in creating and maintaining storage solutions with different RAID configurations.
-
Ability to design storage solutions for different IOPS requirements and access patterns (random vs. sequential read/write), and to tune storage and filesystems for better performance.
-
Good knowledge of networking concepts, including IP addressing, routing, protocols, switch configuration for RDMA, VLAN configuration, network bonding, etc.
-
Good knowledge of virtualization and of hardware and software hypervisors.
-
Good knowledge of containerization technologies such as Docker and Singularity.
-
Experience in software-defined networking and storage.
-
Experience in setting up remote management protocols such as IPMI, Redfish, etc.
-
Experience in setting up and using monitoring systems such as Prometheus and Grafana.
-
Experience in system profiling and custom tuning for target workloads, for higher performance and lower cost of ownership.
-
Very good written and verbal communication skills.
-
Very good at technical documentation meant to serve as manuals for non-experts in the field.
Additional Qualifications:
-
Experience in HPC cluster management and workload orchestration software (e.g. SLURM,