- Time: T/Th 14:00 - 15:20
- Room: Livingston BE-013
- Canvas Link: CS 672
- Instructor: Minesh Patel
- Office Hours: After class or by appointment.
- Office Location: CoRE 310
Announcements
- 1/16: [UPDATE @ 9:15 am] The first lecture will be on Zoom (link in Canvas) due to the ongoing extreme weather advisory.
- 1/16: First day of class!
Overview
This course exposes students to reliability challenges and practices in modern large-scale systems, including cloud, data center, and supercomputing platforms. The goal is to provide a strong technical background for pursuing research, practical application, or further study in building robust systems.
We will look at how systems fail and recover, broadly touching on reliability and security topics and their implications for the sustainability of large-scale computing. We will explore relevant case studies centered on current challenges for production systems, reviewing both state-of-the-art techniques and recent academic proposals.