NUS-ISS

Digital Resilience Certification

Overview

Duration: 3 days
Course Time: 9:00am - 5:00pm
Enquiry: Please email ask-iss@nus.edu.sg for more details.
Digital systems are now mission-critical to businesses and essential to individuals.

Hence, they need to be highly resilient: they must maintain high uptime despite component failures, perform well under heavy workloads, and be free from functional errors.

The need for high resilience is evident from the public impact, publicity and penalties for businesses and their leaders that result from system resilience issues. Examples include an entire country's air services being suspended after a critical aviation system failed; foreign banks being fined and their leaders resigning over system resilience issues; and government scrutiny of chaotic system performance during online booking for popular concerts, or of errors in public digital services.

Organisations recognise the importance of high resilience for their systems and are willing to invest significantly in hardware, software and skills to achieve it, so as to keep their business running smoothly, satisfy their customers, scale to meet growing business volumes, and avoid the losses and penalties that arise from resilience issues.

This course, taught by industry practitioners, equips IT professionals with industry best practices (e.g. from Netflix, Google and AWS) for managing systems effectively across the entire system lifecycle to ensure high resilience.

Key Takeaways

 

At the end of this course, participants will have the following competencies to OPTIMISE and IMPROVE the RESILIENCY of their digital systems:

  • DEFINE optimal, cost-effective resilience targets for digital services, manage expectations accordingly, and collaborate with users on business requirements that support such resilience
  • DESIGN for the following capabilities using industry best practices:
    • Availability
    • Recoverability/Maintainability
    • Performance
  • BUILD & TEST the services using best practices (e.g. "shift-left" development, scaling tests and failure recovery tests)
  • OBSERVE services using relevant observability and monitoring tools and best practices, including the use of AI
  • OPERATE services with best practices that keep pace with changing risks and rising costs while ensuring high resilience





Who Should Attend

The course is designed for:
• IT Managers
• Solution Architects
• Application System Team Leads
• DevSecOps Team Lead/Manager
• Senior Developers
• Ops / SRE members
• Incident/Crisis team members

Pre-requisites
• Currently in one of the above roles, or equivalent
• At least 3 years of working experience in designing, developing or managing digital services (this may include relevant experience from earlier job roles)




What Will Be Covered

  • DEFINE optimal, cost-effective availability and performance targets for digital services, and use error budgets (from Google's Site Reliability Engineering methodology) or other techniques to keep technical debt low so that these targets can be met; also collaborate with users on the choice of business requirements (e.g. whether real-time updates are really needed, or near real-time is sufficient) that can reduce complexity and support resilience
  • DESIGN for capabilities such as:
    • Availability - by applying relevant best practices for redundancy, distribution, etc.
    • Recoverability/Maintainability - through modularity and flexibility
    • Performance - through appropriate scaling and optimised design, as well as designing to avoid bottlenecks
  • BUILD & TEST the services using best practices such as "shift-left" development; thorough automated functional and non-functional testing, including relevant scaling tests, performance tests, failure recovery tests and observability tests; and safe, progressive releases
  • OBSERVE services using relevant observability and monitoring tools and best practices, including monitoring of infrastructure, applications and users, and the use of AI to speed up problem diagnosis and resolution
  • OPERATE services to ensure high resilience through best practices such as ongoing, proactive risk management; continuous improvement (e.g. chaos engineering from Netflix, for example via AWS Fault Injection Simulator; tuning; capacity planning and upgrades; and Google SRE's Elimination of Toil); and sound incident/crisis management plans and processes (including Google SRE's Blameless Post-Mortem)



Fees & Subsidies

 




Certificate

The ISS Certificate of Completion will be issued to participants who attend at least 75% of the course and pass the required assessments.




Preparing for Your Course

NUS-ISS Course Registration Terms and Conditions


NUS-ISS and Learner’s Commitment and Responsibilities


Wi-Fi Access

Wi-Fi access will be made available to participants.

Venue

NUS-ISS
25 Heng Mui Keng Terrace
Singapore 119615


In the event of a change of venue, participants are advised to refer to the acceptance email sent one week prior to the commencement date.

Course Confirmation

All classes are subject to confirmation, and NUS-ISS will send an acceptance email to participants one week prior to the commencement date. Confirmed registrants are expected to attend and complete all lectures, class exercises, workshops and assessments (where applicable). Additionally, participants must submit responses to all feedback requests and surveys conducted by NUS-ISS and its partners. All training and assessments will be delivered as described on the course webpage.

General Enquiry

Please feel free to write to ask-iss@nus.edu.sg if you have any enquiries or feedback.




Course Resources

Develop Your Career in the Following Training Roadmap(s)

View the training roadmap of the discipline(s) below to see related courses and assess your training needs and goals.

Software Systems

Architecting the backbones of smart cities
