Process Slaughterhouse

Cattle not pets

If you are running a stateless, scale-out application, the cattle-not-pets mantra will be stuck in your thoughts. Sick processes and misbehaving container environments are taken out back of the server and executed. For the people handling on-call support, and for immediate recovery, the easy button is great. However, it ignores the real problem. If you had a leaky roof, the long-term solution would not be more buckets to catch the water. To be clear, I am talking about non-functional problems; if your login doesn’t handle case-sensitive input, debug that in test. We are talking about the application that passes all your automated and manual testing, then goes catatonic or nuclear in production.

What could go wrong…

  • The failure/replacement rate could exceed capacity
    • Slaughterhouse overload
  • Failure of failure detection
    • Zombie apocalypse
  • Continuous improvement blind spot
    • All evidence of the bug was lost when the sick process and container were eliminated
  • They said reproduce it in the test environment
    • rolls eyes! Sure, production is just like test
    • Put a check in this space _ if all firmware, hardware, and software are the same between test and production
  • In the absence of data
    • Development “Tunes the application” to fix the issues
      • Meaning, we added worse bugs that we know how to fix

Clearly we should have the ability to gather information when warranted. When is it warranted? Your situations will vary. I suggest some practice before you find yourself on a call with a 2:1 manager-to-engineer ratio in either the zombie apocalypse or the slaughterhouse overload scenario. Rehearse before the main performance.

How to gather information

  • Gain observability without blowing up the container
    • What limits trigger the failure action
      • What commands can you run without triggering the reaper
    • Where can you save output for later research (a sketch for these checks follows this list)
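
For example, here is a minimal pre-flight sketch. It assumes a cgroup v2 host and a writable /var/log/triage directory reachable from inside the container; both are assumptions, so substitute the limits and evidence location your platform actually uses.

    # What memory ceiling will the OOM killer or orchestrator enforce, and how close are we?
    cat /sys/fs/cgroup/memory.max
    cat /sys/fs/cgroup/memory.current
    cat /sys/fs/cgroup/memory.events     # the oom and oom_kill counters show past reaper visits
    # Confirm a place outside the doomed filesystem to stash evidence
    OUT=/var/log/triage/$(hostname)-$(date +%Y%m%dT%H%M%S)
    mkdir -p "$OUT" && echo "evidence goes under $OUT"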

Warning: The current state of GNU/Linux process inspection tools is uneven; many man pages carry vague warnings about limitations, or note that the tool no longer works. See pldd(1) for an example.

Familiarize yourself with Brendan Gregg’s performance material and ensure that it can be used in your environment.

Run the app on a system with DTrace for more visibility.
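
For example, this hedged DTrace one-liner counts system calls made by a suspect process for ten seconds and then prints the tally; the pid 1234 is a placeholder.

    # count syscalls by name for the (hypothetical) pid 1234, stop after 10 seconds
    dtrace -n 'syscall:::entry /pid == 1234/ { @calls[probefunc] = count(); } tick-10s { exit(0); }'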

Have a conversation with the system

  • Tell me about yourself
    • cat /proc/cpuinfo
    • free -h
  • Are you feeling ok
    • dmesg
    • vmstat 1 10
    • netstat -tcp
  • Are you feeling stressed today, or have you given up on computing (a snapshot script sketch follows this list)
    • DTrace
    • perf
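
As one way to practice, the conversation above can be bundled into a snapshot script so the answers survive even if the node is recycled mid-investigation. A rough sketch, assuming a persistent /var/log/triage directory and that perf is installed and permitted; adjust paths and durations to taste.

    #!/bin/sh
    # system-snapshot.sh: save the "conversation with the system" for later research
    OUT=/var/log/triage/system-$(date +%Y%m%dT%H%M%S)
    mkdir -p "$OUT"
    cat /proc/cpuinfo > "$OUT/cpuinfo.txt"
    free -h           > "$OUT/free.txt"
    dmesg             > "$OUT/dmesg.txt" 2>&1
    vmstat 1 10       > "$OUT/vmstat.txt"
    # -c makes netstat loop forever, so cap it
    timeout 10 netstat -tcp > "$OUT/netstat.txt" 2>&1
    # optional: 30 seconds of whole-system CPU profile
    perf record -a -g -o "$OUT/perf.data" -- sleep 30
    echo "snapshot saved to $OUT"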

Have a conversation with the process

  • Tell me about yourself
    • ps -C {process name} -T --sort s -o pid,lwp,s,nlwp,pcpu,cputime,stime,ucmd
    • ls -l /proc/{pid}/cwd
    • ls -l /proc/{pid}/fd
    • sed -z 's/$/\n/' /proc/{pid}/environ
    • cat /proc/{pid}/maps
  • What are you doing (a collection sketch follows this list)
    • pstack
    • DTrace
    • perf
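
A rough collection sketch for this conversation, assuming pstack is installed and /var/log/triage is writable (where pstack is missing, gdb's "thread apply all bt" is the usual stand-in):

    #!/bin/sh
    # proc-snapshot.sh <pid>: save one process's /proc details plus a few stack samples
    PID=$1
    OUT=/var/log/triage/pid-$PID-$(date +%Y%m%dT%H%M%S)
    mkdir -p "$OUT"
    ls -l /proc/$PID/cwd                > "$OUT/cwd.txt"
    ls -l /proc/$PID/fd                 > "$OUT/fd.txt"
    sed -z 's/$/\n/' /proc/$PID/environ > "$OUT/environ.txt"
    cat /proc/$PID/maps                 > "$OUT/maps.txt"
    # three stack samples a few seconds apart show whether the threads are moving
    for i in 1 2 3; do
        pstack $PID >> "$OUT/pstack.txt"
        sleep 5
    done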

Ask the process tough questions

  • Do you have any hangups (a sketch for these follows the list)
    • For example in java
      • kill -3 {pid}
      • jstack
  • gcore
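
For a Java process those questions might look like the following sketch; the pid 1234 is a placeholder, and note that gcore (shipped with gdb) briefly stops the process and can write a core file as large as its address space.

    # SIGQUIT asks the JVM to print a thread dump on its own stdout/stderr
    kill -3 1234
    # jstack writes the same thread dump somewhere you can keep it
    jstack 1234 > /var/log/triage/jstack-1234.txt
    # gcore saves a core image without killing the process
    gcore -o /var/log/triage/core-1234 1234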

Endgame

Given enough data on the bug state, try to recreate the scenario in test. Failing that, add additional tracing in the suspected areas.
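
For example, if the suspected area is networking, a bounded strace run is one hedged way to add that tracing without touching the binary; it assumes strace is available and that the slowdown it causes will not itself trigger the reaper.

    # follow threads, trace only network syscalls, stop after 60 seconds
    timeout 60 strace -f -ttt -e trace=network -p 1234 -o /var/log/triage/strace-1234.txt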

