Site Reliability Engineer
NTT DATA
Job Description
Key Responsibilities
- Deploy, configure, and manage LogicMonitor for cloud, on-prem, and hybrid environments.
- Integrate LogicMonitor with servers, network devices, databases, and applications.
- Develop and optimize custom dashboards, alerts, and reports to improve system visibility.
- Perform log analysis, event correlation, and performance tuning for proactive monitoring.
- Automate monitoring tasks using APIs, scripting (Python, PowerShell, Bash), and integrations.
- Collaborate with DevOps, SRE, and IT teams to define and implement observability best practices.
- Troubleshoot and resolve monitoring configuration issues, false positives, and alert noise.
- Implement best practices for monitoring, alerting thresholds, and anomaly detection.
- Document monitoring configurations, integrations, and troubleshooting guides.
- Provide technical support and training to internal teams on monitoring best practices.
Required Qualifications
- 3-5 years of experience in IT monitoring, observability, or system administration.
- Hands-on experience with LogicMonitor setup, configuration, and administration.
- Strong knowledge of cloud platforms (AWS, Azure, GCP) and on-prem infrastructure.
- Proficiency in monitoring protocols and data collection methods (SNMP, WMI, Syslog, API-based integrations).
- Experience with automation tools (Terraform, Ansible) and scripting languages.
- Familiarity with networking, server administration, and application monitoring.
- Strong troubleshooting skills, including identifying performance bottlenecks and optimizing monitoring strategies.
- Excellent communication and documentation skills.
Preferred Qualifications
- LogicMonitor Certified Associate or Professional certification.
- Experience with ITSM, incident management, and change control processes.
- Knowledge of containerized environments (Docker, Kubernetes) and microservices monitoring.
- Experience integrating LogicMonitor with SIEM, AIOps, and analytics tools.