
User:J-perdue:20180406

From TAMU HPRC

Best Practices in System Administration

Intro to the Intro

Talk given at the 2015 Texas Linux Festival about HPRC framework

Intro

About me

  • Introduced to computers in 1978. Spent four hours a day after hours during high school learning about computers on an Apple II. Fell in love with a device that would do things for me once I had told it (correctly) how to do them (I'm lazy, and doing matrix multiplications isn't something I get a thrill out of... plus I, as a human, invariably get it wrong sometimes).
  • Came to TAMU in Fall 1982. Generally unimpressed with TAMU computing resources at the time. Bought an IBM-XT in 1984 while working as a programmer at a firm in my hometown (Kingwood, TX). (I previously had a part-time job running backups and doing clerical jobs at night... after that was done, I learned to program their systems.)
  • Held a number of computer-related jobs.
  • Earned a BS in Computer Science from TAMU in 1996
  • Earned an MS in Computer Science from TAMU in 1998
  • Worked on a PhD for a number of years.
  • Was a graduate researcher and system administrator for the Parasol group from 1997-2010
  • Joined the TAMU Supercomputing Facility (now TAMU HPRC) in March of 2013.

Overview of this talk

Be proactive

  • Being "reactive" is MUCH more difficult if you don't have a plan... especially when there are people counting on you to solve problems when they occur (see Murphy's Law below). If you haven't prepared for a problem before it happens then somebody is going to be unhappy when it happens.
    • Note: There is such a thing as overthinking the problem or being too proactive... e.g. if a thermonuclear missile hits the machine room, then we will probably have bigger problems than getting the HPRC clusters back online. Similarly, there is no point in trying to install the latest and greatest software if nobody is going to use it yet (I may be guilty of that). This just requires some judgement about how to prioritize the available people-power/people-hours.
  • e.g. Deep Neural Network software: cuDNN, TensorFlow, OpenCV, Caffe, PyTorch

Murphy's Law

""stuff" happens" - (Forest Gump). Best to be prepared for it when it does (see above).

KISS

  • Keep It Simple Stupid - lack of complication makes it easier to resolve issues when new things come along or Murphy's Law kicks in
    • e.g. writing up a wiki page instead of a full-blown PowerPoint presentation when the former will suffice :)

Don't reinvent the wheel

A possible corollary to KISS. [1]

RTFM

Read The "Fine" Manual

Automate what you can

Centralized Administration

Cluster Deployment

Manual

Rocks

xCAT

Daily operation

  • Puppet
  • Chef
  • Ansible
  • cfengine - (brief tour of HPRCLAB workstation configs)

Software Management

Precompiled packages vs locally built software

Precompiled packages

Windows

Debian/Ubuntu

Redhat/CentOS/Fedora

Linux x86_64 binary files

Limitations

Locally installed packages

Installation

DO NOT BLINDLY INSTALL EVERYTHING TO /usr/local
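
One way to read the warning above, as a minimal sketch only: give every package and version its own install prefix so that versions can coexist and be removed cleanly. The `/sw` root and the function names here are hypothetical, not the layout HPRC actually uses.

```python
from pathlib import PurePosixPath

# Hypothetical software root; the talk does not name one.
SOFTWARE_ROOT = PurePosixPath("/sw")


def install_prefix(name, version, toolchain=""):
    """Return a per-package, per-version install prefix under SOFTWARE_ROOT."""
    if not name or not version:
        raise ValueError("name and version are required")
    # Optionally record the compiler/toolchain used for the build in the path.
    leaf = f"{version}-{toolchain}" if toolchain else version
    return SOFTWARE_ROOT / name / leaf


def configure_command(name, version, toolchain=""):
    """Build an autotools-style configure command line for that prefix."""
    return ["./configure", f"--prefix={install_prefix(name, version, toolchain)}"]


if __name__ == "__main__":
    # Prints the command that would be run; nothing is installed.
    print(" ".join(configure_command("fftw", "3.3.4", "gcc-4.9")))
```

With a layout like this, removing a version is a matter of deleting one directory, and nothing installed earlier is overwritten.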

Version management

Modules
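
Environment modules tools work, roughly, by editing environment variables such as PATH so that one installed version of a package becomes the active one. The sketch below only illustrates that idea in Python; it is not how any particular modules implementation is written, and the `/sw/gcc/4.9.2` prefix and function names are hypothetical.

```python
def prepend_path(env, var, entry):
    """Return a copy of env with entry placed first in the colon-separated var."""
    # Drop empty pieces and any earlier copy of entry so it appears only once.
    parts = [p for p in env.get(var, "").split(":") if p and p != entry]
    new_env = dict(env)
    new_env[var] = ":".join([entry] + parts)
    return new_env


def load(env, prefix):
    """Mimic loading a module: put the package's bin and lib directories first."""
    env = prepend_path(env, "PATH", f"{prefix}/bin")
    env = prepend_path(env, "LD_LIBRARY_PATH", f"{prefix}/lib")
    return env


if __name__ == "__main__":
    before = {"PATH": "/usr/bin:/bin"}
    after = load(before, "/sw/gcc/4.9.2")
    print(after["PATH"])
```

Because `load` returns a new dictionary rather than changing the one passed in, "unloading" in this sketch is simply going back to the earlier environment.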

Documentation

"Self-documentation"

Manual documentation

Disasters

Possible disasters

  • thermonuclear missile hits machine room
  • terrorists blow up machine room
  • fire engulfs machine room
  • disgruntled employee trashes the machine room
  • flood (knocking out all power)
  • earthquake (not typical but possible given the increased use of "fracking")
  • etc.

In general, these are events that would take out our entire infrastructure, if not the infrastructure supporting our infrastructure.


Redundancy

These may seem antithetical to "KISS", given the work needed up front to make them work well, but in the end they will make your life simpler.

Replicated filesystems

Fallback servers

Networks

Power

Backups

Local media

Remote backups

Recovery Plan

Security

Limiting access

Monitoring

Audits

  • "we were told the data was deleted" -- Sheryl Sandberg, Facebook, April 5 2018

Summary

https://www.theregister.co.uk/2018/04/05/this_damn_war_power_out/