Site Reliability Senior Engineer
Veeva
Job Description
What You’ll Do
- Rapidly build new applications on an existing, robust enterprise platform
- Build new cloud infrastructure from scratch following software development best practices
- Drive new features and improvements in a fast-changing environment
- Partner with product management, design, and QA to deliver cutting-edge solutions and direct value to our customers
- Work on multiple layers of our stack, including backend (primary), front-end, and infrastructure
- Build tools and automation that eliminate manual work and reduce the time it takes to resolve issues
- You want to make the system better every day and are self-driven to learn everything necessary to provide full-stack diagnostics and determine the root cause of problems
- Ensure our platform meets the scalability and reliability needs of our customers
- During incidents, lead triage and mitigation. You may need to perform periodic on-call duty when issues are escalated
- Strategize with engineering teams on complex problems. You know how to support a system used by 3 million users and can advise development teams on what will work in production before it ships
- Participate in engineering design reviews of new features. Drive focused initiatives that improve operational efficiency and scalability of the platform
- Communicate effectively with engineering teams and describe problems succinctly, with enough detail that you can hand off an ongoing problem to another team or a peer for completion. Engage in real-time communication during outages with both technical and non-technical audiences
Requirements
- 5+ years of experience in Java, preferably at an enterprise cloud software company
- Proven ability to write clean, testable, readable code in a team environment
- Hands-on experience with open-source technologies such as Spring, MySQL, Hibernate, Solr, Maven, Git, Tomcat, Linux, AWS, Vagrant, Docker, and Kubernetes
- 3+ years of experience with relational databases and mastery of SQL
- Demonstrated history of incident management and leadership ability
- Experience in handling production outages and root-cause analysis
- Hands-on operational experience in a high-volume or critical production service environment
- Effective communication skills across all levels, whether talking to individual contributors or executives
- Solid scripting skills; experience with Shell, Bash, Ansible, Python, Go, Ruby, etc.
- Ability to handle periodic on-call duty
- Fluent in written and spoken English
- We are looking for strong mentors with a proven record of making their teams better