Service Reliability Engineer
Date: Mar 10, 2023
Location: Toronto, ON, CA
Company: LifeLabs
LifeLabs is the largest community diagnostics laboratory in Canada, serving the healthcare needs of Canadians for over 50 years. Our team members are truly centred around our customers, and we know that behind every lab requisition, sample being tested, or investment in technology is an individual and their family counting on us.
Consistently named one of Canada's Best Employers by Forbes, LifeLabs has also been recognized for having an award-winning Mental Health Program from Benefits Canada. The passion and commitment of over 6,000 diverse and innovative team members unites and motivates us to ensure our customers receive high-quality tests and results that they can trust. Agile, customer-centred, caring and teamwork: we live these values every day in what we do to support our customers and healthcare providers, driving forward our vision of empowering a healthier you.
Make a difference – join the LifeLabs team today!
Reports to: Senior Manager, Service Reliability Engineering
Purpose of the Role
The SRE will be part of the team responsible for helping to support 24x7 uptime and availability of mission-critical, customer-facing production cloud services distributed across multiple regions. You'll help to create more consistent, automated, push-button environments across all tiers, proactively test and tune all aspects of the infrastructure, monitor and respond to system notifications and alerts, and continually work to optimize and improve the performance, security and reliability of our systems as our business scales.
This is a full-time remote position. Preferred location is Ontario.
Core Accountabilities
- Apply automation and software to any tasks or parts of the system that would benefit from it or are performed manually
- Troubleshoot complicated cross-platform issues spanning OS, networking, and database layers in a cloud-based SaaS environment; handle live production incidents; debug application and infrastructure issues; and follow and implement SRE best practices
- Monitor application performance, take steps to improve overall application performance and stability, and follow through with implementation
- Conduct system analysis and configuration management, and develop improvements for system software performance, availability, and reliability
- Design, write, ship, and motivate the creation of software and systems to increase observability, product reliability and organizational efficiency
- Work closely with software engineers and testers to ensure the system meets non-functional requirements such as performance, security, and availability
- Document your system knowledge as you acquire it over time, create runbooks, and ensure critical system information is readily available to those who need it
- Maintain and monitor the deployment and orchestration of servers, containers, databases, and general backend infrastructure
- Keep up to date with security and proactively identify, diagnose, and solve complex security issues
- Develop standards and maintain self-provisioning infrastructure using tools like Ansible, Terraform, and Docker
- Facilitate effective problem management by working with other stakeholders in response to incidents, facilitate post-mortems, and ensure closure of follow-up action items
Minimum Qualifications and Skills
- Post-secondary degree or diploma in Business and/or Computer Science (or related education)
- 5+ years’ experience as an SRE/DevOps Engineer
- 3+ years’ experience in containerization and orchestration
- ITIL v4.0 or DevOps Certification preferred
- Experience working with engineering teams to understand their product requirements and how they build/test/deploy their software applications
- Experience with Infrastructure as Code (Terraform, CloudFormation, Ansible)
- Experience with distributed storage technologies like NFS, HDFS, Ceph, and S3, as well as dynamic resource management frameworks (Mesos, Kubernetes, YARN)
- Knowledge and proven hands-on experience in large-scale databases and distributed technologies, such as Kafka
- Basic programming and scripting skills (preferably Golang, Bash or other shell scripting, Python, etc.)
- Ability to provide advice, best practices and recommendations for the operation and deployment of Microsoft Azure
- Experience in monitoring and analyzing infrastructure performance using standard performance monitoring tools - BHOM, New Relic, Perfmon, PerfView, ProcDump, DebugDiag
- Familiarity with Linux and UNIX systems (e.g. CentOS, Red Hat) and command-line system administration tools such as Bash, Vim, and SSH
- Hands-on experience in configuration management of server farms (using tools such as Puppet, Chef, Ansible, etc.)
- Knowledge of network routing, load balancing, and networking protocols, including a base knowledge of TCP/IP and an understanding of HTTP and DNS
At LifeLabs, we strive to create an inclusive and equitable workplace where our team members and the communities we serve feel accepted, valued, and respected.
In accordance with LifeLabs’ Accessibility Policy, the Accessibility for Ontarians with Disabilities Act, and the Ontario Human Rights Code, accommodations are available by request for candidates taking part in all aspects of the recruitment and selection process. For a confidential inquiry or to request an accommodation, please contact your recruiter or email careers@lifelabs.com.
LifeLabs is committed to providing a safe environment for our employees, customers, and the communities we serve. We have been a leader throughout the COVID-19 pandemic regarding health and safety measures and have always put our employees and customers at the center of every decision that we make. As an organization in the health care sector, we believe the COVID vaccination adds a layer of protection that complements the extensive and necessary health and safety protocols that we have taken to date. With this in mind, we currently require all LifeLabs employees, contractors, students and volunteers to be fully vaccinated.
LifeLabs operates under a distributed workforce model, where employee flexibility is a key priority. Further information will be provided during the interview process on what this means for employees.