Tom’s Rules

OK, these aren’t really Rules.  They’re more like Guidelines.  Or lessons learned the hard way, over almost 30 years as a sysadmin.  Some of these I stumbled over myself, others came from former colleagues, friends and other people I respect in the industry.  Some are universal, some only apply in certain situations.  Many were inspired by particularly spectacular failures.  Some were inspired by successes.  All have exceptions.  There’s a story behind each one.  Most should make you think at least a little about how you do your job and how you interact with your peers and your customers.  Your mileage may vary.

In no particular order.

      1. Crypto is Easy, Key Management is Hard.
      2. Never change anything on a Friday.  Really don’t change anything on the Friday before a three-day weekend, just as school is about to let out for the summer and all those students are about to add to the throngs of people visiting your sites.
      3. Disk is Cheap.  Real Storage (with support, performance, redundancy, failover, backups) is not.  Just Disk is so cheap that it will eventually cost you much more than its purchase price when it fails and you lose critical data, or spend hours or days recovering it or recreating it.  So, Disk isn’t really cheap, is it?
      4. My systems will be designed with enough capacity and redundancy to survive failure or attack for at least 8 hours.  I need my sleep, and so does my team.
      5. A Cray “compute cycle” that isn’t used is lost forever.
      6. A log message not created or captured at the moment can never be recovered.
      7. The log messages that you decide not to collect and archive today will be the log messages you need to diagnose the problem or track the intrusion next month (or next year).
      8. Smart, capable and interesting people know smart, capable and interesting people.  Use this to your (and your team’s) advantage when interviewing and hiring.  It also makes for a happier and more interesting workplace.
      9. When there is a multi-zone fire alert in the datacenter, and you can’t find any smoke or fire, check the air handlers themselves.
      10. Many datacenter EPO switches are either dummies, or were installed three power system renovations ago, and don’t actually turn off all the power in the room.
      11. Clean, near perfect datacenter cabling takes longer initially, but pays for itself in the long run through fewer failures, less hassle, and happier sysadmins.
      12. Skippy is the undisputed master of Internet lists.
      13. Very, very few organizations spend enough on security or disaster recovery until it’s too late.  Then it costs more.
      14. New sysadmins follow directions.  Better sysadmins solve problems. Great sysadmins anticipate problems and engineer them out of the system before it is launched.
      15. A System Administrator should be able to: install a server, patch a kernel, create a RAID group, back up (and restore) a system, plan and build a network, configure a firewall, fix email, use at least three operating systems, debug a SCSI chain, write a script, install a database, replace a switch, mentor a colleague, explain what they do to their manager, find out what a customer really needs, not what they want, fix “teh Internet”, investigate a security incident, and never, ever panic. Specialization is for insects.  (with apologies to Robert A. Heinlein)
      16. All software sucks.  Choose the software that sucks in a way that you can live with or work around.
      17. All hardware sucks, too.  Some just sucks a little less, or has good enough support that you can tolerate or accept that level of suck.
      18. The fabric of the Universe is far from perfect.  Accept that, and route around the damage.
      19. No company, job or project is perfect.  Choose one that has the positives that are most important to you, and has problems that you can either tolerate or fix over time.
      20. A Vendor has a list of products that they want to sell to you or anyone else that they can snare.  A Partner has some “skin in the game” and their success is tied to yours.  Partners want to help you solve your problems and help you be successful, so they can be successful.  Partners will help solve your problem, even if it means they help you find a competitor’s product and they don’t make a dime.  Partners are rare and exotic creatures.  When you find a good one, nurture the relationship and be successful together.  Treat Vendors the way they treat you, as a commodity to be acquired as cheaply as possible, used as long as needed and discarded whenever you’ve outlived your usefulness.
      21. Beware the user who brings you a Solution instead of a Problem to be solved.  Investigate carefully.
      22. Every developer should read The Mythical Man-Month.  Every sysadmin should read Slack.  Every manager should read Peopleware.  Everyone should read, dammit!
      23. Not everything is a Process.  But many things are, or should be.  If it is a Process, it had better be repeatable.  And not just by the person who created it.
      24. A Sysadmin who only knows one operating system is like a person who can only drive one specific model of car.  Two Linux distributions aren’t two OSes.  Windows Server 2003 and 2008 aren’t two OSes either.
      25. It is easier to direct and channel a fanatic than to animate a corpse.  You can influence someone who has a deep and abiding passion, you can’t install that passion in someone who just doesn’t care.  Cultivate, encourage and lead your fanatics.
      26. 100% uptime on an individual server was cool and a great thing to aspire to… in the ’80s and ’90s.  How could you have patched a server that was up for 1000 days?
      27. Uptime for a service, however, is a wonderful goal to aspire to.  That service should be redundant, fault tolerant, scalable and secure.  You should be able to do “rolling upgrades” or patches on the individual resources behind that service, without impacting the service.
      28. Don’t feed the trolls.
      29. If you don’t have a ticketing system, you’re doing it wrong.  If you have a bad ticketing system that you or your users hate, or that makes it harder to get things done, you’re doing it wronger.
      30. Emacs vs vi.  csh vs sh. Windows vs *nix (or MacOS).  These are the great unresolvable questions.  Too bad the answer doesn’t really matter.  All have their places.  None can do everything.  Except maybe Emacs.
      31. Every sysadmin will come to a time when they aren’t getting as much as they used to out of training or specific talks at conferences.  That’s the time that you should begin giving training, writing more papers and giving talks at those conferences.
      32. The hallway track is often the most valuable time spent at a conference.  Face-to-face communication is so high-bandwidth, and the opportunities for immediate collaboration are so incredibly valuable, that it will always be time well spent.
      33. Always mount a scratch monkey.  Documentation is important.  People need to know the “why” of a thing, not just “do this” or “don’t do that”.
      34. Rule 34 is taken.  Really.  Don’t go there.  Here be Dragons.
      35. Validate all input.  The lesson of little “Bobby ; Droptables” is only the most recent incarnation of this.  This lesson was important when it was first enumerated in the 60s, and it’s a lesson that our industry still struggles with.  Assume all input is hostile until proven otherwise.  Trust but verify!
      36. Scale or die.  Automation, like voting, should be done early and often.  Planning for scale is one of the ways to plan for unexpected change, deal with massive success (!), and keep yourself sane.  If you automate things while you’re managing just a few systems, you’re more likely to survive when you suddenly grow to managing hundreds or thousands.
      37. Just because you can do a thing, doesn’t mean that you should.  Just having the technical or operational capability to be able to achieve or deliver something doesn’t guarantee that it is needed or even a good thing to do.  Always ask yourself “why am I doing this?” and if you don’t have a good reason, then Just Don’t Do It.
      38. Access does not imply authorization.  Just because you have “root” or “Administrator” or “SYS” or “admin” doesn’t mean that you are authorized to use those privileges for anything you like.  Those privileges are a “sacred trust” and must only be used for functions within your job responsibilities.  Randomly examining user file contents, sniffing individual network ports or scanning system logs “just to see what people are doing” is a violation of that trust.
      39. Outsourcing is not a panacea. It does not always result in lower costs, or higher quality. It can often be more difficult than doing it yourself. Make sure you know what value you want from outsourcing, and then make a decision. Then revisit that decision. And one more time.
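
Rule 35 can be made concrete with a small example.  What follows is only a sketch, assuming Python and its built-in sqlite3 module, neither of which the rules themselves mention.  The table name, the allow-list pattern and the function names are all hypothetical.  It does two things: it checks the input against a conservative pattern, and it hands the value to the database as a bound parameter instead of pasting it into the SQL text.

```python
import re
import sqlite3

# Hypothetical allow-list: starts with a letter, then letters, spaces,
# hyphens or apostrophes, 64 characters at most.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z '\-]{0,63}")


def add_student(conn: sqlite3.Connection, name: str) -> None:
    """Validate the name, then insert it using a bound parameter."""
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"rejected input: {name!r}")
    # The ? placeholder keeps the value out of the SQL text entirely.
    conn.execute("INSERT INTO students (name) VALUES (?)", (name,))


def count_students(conn: sqlite3.Connection) -> int:
    """Return the number of rows in the (hypothetical) students table."""
    return conn.execute("SELECT COUNT(*) FROM students").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT NOT NULL)")

add_student(conn, "Alice")
try:
    add_student(conn, "Robert'); DROP TABLE students;--")
except ValueError as err:
    print(err)
print(count_students(conn))
```

The pattern check and the bound parameter are deliberately redundant.  Even if the allow-list turns out to be too permissive, the placeholder still means the input is treated as data, not as SQL.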
  1. #1 by Scott Nyce on June 9, 2011 - 7:12 pm

    And of course one of the most important lessons you ever taught me: “Just because you can, doesn’t mean you should.”

