Performance Computing DevOps Architect

Applied Materials

Bengaluru, India

Job Description

  • As an architect, you will be responsible for designing HPC infrastructure solutions, including compute, networking, storage, and workload management components.

  • You will work closely with cross-functional teams, including hardware, software, product management, and business stakeholders, to understand compute workloads and translate them into platform architectures and designs that meet business needs.

  • You will create and maintain detailed system architecture diagrams and specifications. 

  • You will evaluate and select appropriate hardware and software components for HPC environments.

  • You will install, configure, and maintain HPC systems, including hardware, software, and networking components.

  • You will develop and implement automation scripts for system management and deployment. 

  • You will be a subject matter expert who unblocks dependent teams in the HPC domain.

  • You will be expected to develop system benchmarks, profile systems to understand bottlenecks, and optimize workflows and processes to reduce cost of ownership.

  • Identify and mitigate technical risks and issues throughout the HPC development life cycle.

  • Ensure that the compute cluster is resilient, reliable, and maintainable.

  • You will be expected to stay abreast of the latest HPC technologies, including hardware, software, and networking solutions.

  • Your primary focus will be to understand the compute workload and design an HPC cluster with the right combination of nodes, CPU/GPU, memory, interconnects, and storage to achieve optimum performance at minimum cost of ownership.

 

 

Our Ideal Candidate

Someone who has the drive and passion to learn quickly, and the ability to multitask and switch contexts based on business needs.

Qualifications

  • In-depth experience with Linux system administration and hardware/software configuration.

  • Strong knowledge of HPC technologies, including cluster computing, high-speed interconnects (InfiniBand, RoCE), and parallel filesystems (Lustre, GPFS, BeeGFS, etc.).

  • Experience in creating and maintaining operating system images with different installation and boot schemes.

  • Extremely good with automation tools like Ansible, Chef, and SaltStack, and with scripting languages (Python and Bash).

  • Experience in creating and maintaining storage solutions with different RAID configurations.

  • Ability to design storage solutions for different IOPS requirements and access patterns (random vs. sequential read/write), and to tune storage and filesystems for better performance.

  • Good knowledge of networking concepts, including IP addressing, routing, protocols, switch configuration for RDMA, VLAN configuration, network bonding, etc.

  • Good knowledge of virtualization, including hardware and software hypervisors.

  • Good knowledge of containerization technologies like Docker and Singularity.

  • Experience in software-defined networking and storage.

  • Experience in setting up remote management protocols like IPMI, Redfish, etc.

  • Experience in setting up and using monitoring systems like Prometheus and Grafana.

  • Experience in system profiling and custom tuning for target workloads to achieve higher performance and lower cost of ownership.

  • Very good written and verbal communication skills.

  • Very good at writing technical documentation meant to serve as manuals for non-experts in the field.

 

Additional Qualifications:

 

  • Experience in HPC cluster management and workload orchestration software (e.g. SLURM,
