DevOps Notes

Three ways

• Systems Thinking
• Amplifying feedback loops
• Culture of continuous experimentation and learning

Systems thinking

Think of the overall outcome when optimizing parts.

Amplify feedback loops

Observe how the output changes and quickly adjust the input that caused it.

Continuous experimentation and learning

Try things out and see what works. Avoid analysis paralysis by getting to working code. Try new ideas and encourage sharing.

DevOps methodologies

People over Process over Tools
Continuous delivery has a quick feedback cycle for small batches of:

• code
• build and automated test
• delivery

Lean Management:

• Work in small batches
• Work in progress limits
• Feedback loops
• Visualization

Visible Ops-Style Change Control:

• Eliminate fragile artifacts
• Create a repeatable build process
• Manage dependencies
• Create an environment of continuous improvement

Infrastructure as Code:

• System treated like code
• Checked into source control
• Reviewed, built, and tested

10 Practices for DevOps success

• Purposeful chaos and recovery 
• Blue/Green deployment on identical live/offline systems for quick forward testing and rollback
• Dependency Injection (or service discovery and other similar patterns of decoupling)
• Andon Cords (anyone on the line can stop the process to work on a problem)
• The Cloud (API-driven way to create and control system infrastructure as a program)
• Embedded Teams
• Blameless Postmortems (there is never a single cause of an incident)
• Public Status Pages (see transparent status pages document)
• Developers on Call
• Incident Command System
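
The blue/green item in the list above can be sketched in a few lines. This is a minimal illustration, not a real deployment tool: the names `Environment`, `BlueGreenRouter`, `deploy`, and `rollback` are hypothetical, and the health check is passed in as a plain function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Environment:
    name: str
    version: str


class BlueGreenRouter:
    """Holds two identical environments; only one receives live traffic."""

    def __init__(self, blue: Environment, green: Environment) -> None:
        self.live = blue
        self.idle = green

    def deploy(self, version: str,
               health_check: Callable[[Environment], bool]) -> bool:
        # Release to the idle environment first; live traffic is untouched.
        self.idle.version = version
        if not health_check(self.idle):
            return False  # the live side never changed, nothing to roll back
        # Cut over by swapping roles; the old live side is kept for rollback.
        self.live, self.idle = self.idle, self.live
        return True

    def rollback(self) -> None:
        # The previous release is still running on the idle side.
        self.live, self.idle = self.idle, self.live


# Hypothetical usage: both sides start on 1.0, then 1.1 goes out via green.
router = BlueGreenRouter(Environment("blue", "1.0"), Environment("green", "1.0"))
switched = router.deploy("1.1", health_check=lambda env: True)
```

The point of the pattern is that rollback is just another swap, because the previous version never stopped running.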

The wall of confusion

Teams are often set up to work in ways that conflict, which encourages conflicts of interest and limits cooperation as each team tries to “perform”.

Blameless Postmortems

Do it within 48 hours of the incident
Have the team build a timeline
Have a third party run the meeting
Acquire the human and system events in a timeline, preferably in UTC if multiple time zones are involved
Describe the incident
Describe the root cause
How was the incident resolved
Timeline of events and actions taken
How were customers impacted
Remediation actions
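
A possible way to build the UTC timeline mentioned above: a small sketch that merges human and system events recorded in different time zones into one ordered list. The function names and the sample events are invented for illustration only.

```python
from datetime import datetime, timedelta, timezone


def to_utc(ts: datetime) -> datetime:
    # Refuse naive timestamps: without an offset the UTC time is a guess.
    if ts.tzinfo is None:
        raise ValueError("timestamp needs a time zone offset")
    return ts.astimezone(timezone.utc)


def build_timeline(events):
    """events: iterable of (timestamp, source, description) tuples."""
    normalized = [(to_utc(ts), source, text) for ts, source, text in events]
    return sorted(normalized, key=lambda event: event[0])


# Invented sample data: two people in different offsets plus a system alert.
utc_minus_5 = timezone(timedelta(hours=-5))
utc_plus_1 = timezone(timedelta(hours=1))
events = [
    (datetime(2024, 3, 1, 9, 5, tzinfo=utc_minus_5), "human", "on-call acknowledges page"),
    (datetime(2024, 3, 1, 15, 30, tzinfo=utc_plus_1), "human", "fix deployed"),
    (datetime(2024, 3, 1, 14, 2, tzinfo=timezone.utc), "system", "error rate alert fires"),
]
timeline = build_timeline(events)
for when, source, text in timeline:
    print(when.isoformat(), source, text)
```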

Rules for communication

Admit failure
Sound human
Have a communication channel
Be authentic. Have someone find the stakeholders and communicate the issue.

Don’t add process unless everyone thinks it is needed. Remove process that is no longer needed. The more process walls you have the more Conway’s law will rule and progress will slow.

Kaizen

Plan → Do → Check → Act

Go see for yourself where the issue is. Reports and metrics are useful, but go see the real system.

5 Whys

Why did it happen
Why did that happen…
Look for underlying causes, not symptoms. “Not enough time” and “human error” are not acceptable answers.

Infrastructure as code (IaC)

Treat systems like code: use an editor, version control, automated testing, and so on.

CMDBs

Suggestion: move towards containers and orchestration, without app config and state embedded in the image. Use an external configuration or discovery system to keep the container image and its config separate, for flexibility.
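
A minimal sketch of keeping config outside the image, assuming the settings arrive as environment variables. The variable names `APP_DB_URL` and `APP_CACHE_TTL_SECONDS` and the `AppConfig` type are made up for this example; in a real setup the values might instead be fetched from a configuration or discovery service.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AppConfig:
    db_url: str
    cache_ttl_seconds: int


def load_config(env: Optional[Mapping[str, str]] = None) -> AppConfig:
    # Read settings from the environment instead of baking them into the
    # image, so the same image can run unchanged in every environment.
    env = os.environ if env is None else env
    return AppConfig(
        db_url=env["APP_DB_URL"],  # required: fail fast if it is missing
        cache_ttl_seconds=int(env.get("APP_CACHE_TTL_SECONDS", "60")),
    )


# Hypothetical usage with an explicit mapping (handy in tests).
config = load_config({"APP_DB_URL": "postgres://db.example.internal/app"})
```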

App configuration & orchestration tools

• etcd
• zookeeper
• consul

Orchestration

• Mesos
• Kubernetes
• Docker Swarm

CD Test patterns

Early and rapid feedback is the goal. Tests should run fast and reliably so developers get good feedback on changes. Unreliable and slow tests must be dealt with. If required, deal with slow tests by moving them to a separate process, such as a nightly full test run.

Test Driven Development (TDD): Write the test, then develop code to reach the desired state.
Behavior Driven Development (BDD): Work with stakeholders to document the desired state, then write code to meet it. Given-When-Then (GWT) is an example of a DSL, and Cucumber of a system that executes it.
Acceptance Test Driven Development (ATDD): Write tests from the user's perspective.
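
A small sketch of the TDD order of work using Python's standard `unittest` module. The `slugify` function and its expected behavior are invented for the example; what matters is that the tests exist first and the code is only as big as they require.

```python
import unittest


# Step 1: the tests are written first and describe the desired state.
class TestSlugify(unittest.TestCase):
    def test_lowercases_and_joins_words_with_hyphens(self):
        self.assertEqual(slugify("Blue Green Deploy"), "blue-green-deploy")

    def test_collapses_extra_whitespace(self):
        self.assertEqual(slugify("  on   call "), "on-call")


# Step 2: just enough code to make the tests pass.
def slugify(title: str) -> str:
    return "-".join(title.lower().split())


def run_tests() -> bool:
    # Runs quickly, so it fits both the desktop and the check-in gate.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSlugify)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```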

Build tools

All the usual tools apply, but the goal is a build that developers can run automatically on the desktop, that runs as a requirement on check-in, and that is reliable and appropriate for the need. Avoid waste: get something that does the job and does it quickly. Unused features are unneeded features.

Failure patterns

Many failures are due to integration points.
Example: Cascading outage – countered by circuit breakers
Deviations from 12 factor applications
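
A rough sketch of the circuit breaker idea mentioned above: after a number of consecutive failures, calls to the dependency fail fast for a cool-down period instead of piling up and cascading. The class name, thresholds, and the injectable `clock` parameter are assumptions for illustration, not any specific library's API.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable so tests can control time
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap each call to the remote service, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is whatever function talks to the dependency.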

Lean approach in a nutshell

• Build
• Measure
• Learn

repeat

Security areas

• System
• Application 
• Application events 
• Anomalies

Logging

• What
• When
• Where did it happen
• Who
• Where did it come from
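
One way to make sure every log entry answers the questions above is to emit a structured record with a field for each. A minimal sketch; the field names and the sample values are made up for illustration.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def audit_record(what: str, who: str, where: str, origin: str,
                 when: Optional[datetime] = None) -> str:
    # One JSON object per event is easy to ship to a central log store.
    when = when or datetime.now(timezone.utc)
    return json.dumps(
        {
            "what": what,               # what happened
            "when": when.isoformat(),   # when, with an explicit offset
            "where": where,             # where it happened (host, service)
            "who": who,                 # who did it (user, account)
            "origin": origin,           # where it came from (client address)
        },
        sort_keys=True,
    )


# Invented example event.
line = audit_record(
    what="login_failed",
    who="alice",
    where="auth-service",
    origin="203.0.113.7",
    when=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
print(line)
```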

SRE Tools Topics

In general, tools that can be used by multiple teams are more useful. Collect distributed logs into a central, or logically central, view for monitoring, reporting and analysis.

Splunk, ELK
PagerDuty, VictorOps, Flapjack
Status pages and metrics automation
Security monitoring tools

Conferences

• DevOpsDays
• Velocity
• DevOps Enterprise Summit

Books

• Visible Ops
• Continuous Delivery – Farley & Humble
• Release It!
• Effective DevOps – Davis & Daniels
• Lean Software Development 
• Web Operations – Allspaw
• The Practice of Cloud Systems Administration
• The DevOps Handbook
• Leading The Transformation – Mouser & Gruver
• The Phoenix Project – new version of Goldratt’s “The Goal”

Ongoing learning

devopsweekly.com @garethr
devops.com @devopsdotcom @ashimmy
devopscafe.org @botchgalupe @damonedwards
DevOps Audit Defense Toolkit
Rugged Manifesto

DevOps skills

T-shaped individuals: deep in one area, but knowing enough about the rest of the whole to work with all teams.
