Class Format
This class will take place in person on campus. Active participation from students is expected and will be graded.
Prerequisites
I recommend introductory familiarity with computer architecture and statistics to get the most out of this course. Prior programming experience building, running, and debugging C/C++ projects will be helpful for assignments and the final project.
Please reach out to me if you have specific questions or concerns.
Textbooks
No books are required. Course material will be primarily based on research articles. Links for references and readings will be provided through this website.
Optional textbooks that we will reference include:
- Heather Adkins, Betsy Beyer, Paul Blankinship, Ana Oprea, Piotr Lewandowski, Adam Stubblefield. Building Secure & Reliable Systems. O'Reilly Media, 2020.
- Beyer, Betsy, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. Site Reliability Engineering. O'Reilly Media, 2016.
- Barroso, Luiz André, et al. The Datacenter as a Computer: Designing Warehouse-Scale Machines. Springer Nature, 2019.
Course Structure and Expected Work
This is an advanced graduate-level course comprising paper readings, discussions, and debates as well as assignments and a final project. Students are expected to participate by critically analyzing assigned paper(s) and expressing their thoughts and ideas during class discussions.
Expected coursework includes:
- Paper reviews: One required paper review per class; additional reviews may be submitted for extra credit.
- Participation: Preparing for and participating in in-class discussions and debates.
- Assignment 1: Programming assignment to gain some familiarity with tools used in reliability analysis and evaluation.
- Assignment 2: Writing a short "position paper" based on experiences from Assignment 1, in-class debates, and paper readings.
- Final project: Mini-research project involving programming, writing, and a 10-minute presentation.
Grading
- 10% class participation
  - 5% in-class discussions (asking + answering questions)
  - 5% in-class debates
- 15% paper reviews (one per class)
- 10% assignment #1 (programming)
- 15% assignment #2 (writing)
  - 10% position paper
  - 5% anonymous peer evaluation
- 50% final project
  - 13% proposal report (at 0%)
  - 5% checkpoint discussion 1 (at 33%)
  - 5% checkpoint discussion 2 (at 66%)
  - 15% final report (at 100%)
  - 12% ten-minute final presentation
Class Participation
Class participation is essential because this is largely a discussion-based course. Students are expected to prepare for each lecture based on assigned readings and share their thoughts during in-class discussions.
Assignment 1: Programming
Reproduction of David Bacon's back-of-the-envelope calculations for silent data corruption.
Assignment 2: Programming
Implementation of fuzz testing by proxy using a software model of a 16-bit adder.
Final project
The project will be an open-ended mini-research project involving design, analysis, and/or implementation. I will provide a list of ideas to base your proposals on, but I highly encourage you to use them as inspiration for your own ideas, which typically lead to better motivation and results.
Details TBA
Exams
None
Policies
Late Submissions: Starting Tuesday, 13 February, late submissions will NOT be accepted. Exceptions for true emergencies (e.g., a medical emergency with proof) are subject to instructor approval.
Academic Integrity: Students should be familiar with the University's academic integrity policy. Any violations will be treated seriously according to the policy. Please don't cheat; it's not worth it.
Acknowledgments
This course draws inspiration from:
- [UIUC] CS 598XU - Reliability of Cloud-scale Systems
- [Drexel] SE 576 Software Reliability and Testing
The course website is inspired by: