Whenever you are building and deploying a complex system, there are always going to be bugs, defects, and unforeseen problems with usability — commonly referred to as Production Support issues. Today, our ScrumMaster and their Team grapple with these issues, to help you understand how they affect a Scrum Team and what you can do to prevent them from dragging you down.
ScrumMaster Steve and his Team have released the World’s Smallest Online Bookstore. It’s live, and books are selling and shipping. This turns out to be a blessing and a curse because the business is making money, but with it comes support issues.
In the first two Sprints after the release, the Team noticeably struggles and often fails to meet its planning commitments. At first, Steve is okay with this, believing it’s inevitable post-release hiccups but, when it’s clear that the trend is continuing into the third Sprint, he starts to get worried. He spends some time watching the Development Team work and notices that Team members are often interrupted several times a day. Most of these interruptions are from production support issues.
At the company developing the World’s Smallest Online Bookstore, they have a chronically understaffed helpdesk group who provides two lines of support. They primarily respond to and resolve customer queries, known as first-line support. When an issue can’t be resolved with a simple phone call, chat, or email, they investigate to see if the problem is a defect or the platform being used in unexpected ways, known as second-line support.
It is this second line support that is causing frequent interruptions, as whenever they find a defect or usability issue, they submit a production support ticket to Steve’s Team. To document this, Steve adds a “swimlane” at the bottom of the Team’s Sprint Backlog board, labelled “Production Support and Other Interruptions.”
After four or five Sprints, he charts the resulting effect on the Team’s Sprint Burnup (Burndowns and Cumulative Flow Diagrams could be used as well), to show the correlation between Production Support Issues each Sprint and the number of Backlog Items completed.
During the next retrospective, Steve chooses a timeline to help the Team review what has happened.[1] He also uses the data from the burnup chart to help them see the resulting effects. They discover that it’s even worse than Steve realized – Team members aren’t just being interrupted, they’re spending most of their time handling support issues.
How to stop production support issues from disrupting the Team
The first line of defence, before the Team even sees a support issue, is the Product Owner. Paula should take a look at a support issue first, then ask herself, “Is this problem worth interrupting the Team during this Sprint?” If the issue, or even a defect, isn’t absolutely critical to delivering value to the customer, it may be best addressed by adding it to the Product Backlog and asking the Team to address it as a User Story in a later Sprint.
The Team can also do a number of things to mitigate interruptions:
- Choose one Team member each Sprint who will become the primary point of contact for support issues. This member would involve others on the Team as needed. This job should rotate, so it’s not always the same person. Note: I have, on occasion, called this “Rotating One Victim Per Sprint Strategy.”
- If working with multiple Teams, in any one Sprint have one Team handle all production support work. Each Sprint the Team doing the work changes.
- Wait until the end of the workday before interrupting another Team member for help on a support issue.
- Monitor production support issues and track their effects on the Team’s productivity. I.e. in the days following the resolution of a support issue, measure the cost of fixing the issue in terms of fewer User Stories completed, but also in associated costs, such as time deploying fixes and effects of multitasking on productivity.
- Based on historical needs, acknowledge in planning that there is a “tax” of 20-30% of Team capacity needed for production support. Note: This is my least favourite solution because it doesn’t help make the situation better, it just increases the Team’s awareness of it.
- Extreme version of this solution is Pattern: Illegitimus Non Interruptus – cancel the Sprint if interruptions exceed the timebox. Note: This works well to highlight the problem, however, I would only recommend it as a last resort.
- Limit the number of production support items “in Progress” to one. This forces hard decisions to be made about the value of fixing an item immediately but limits the ability to fix another one before the first is finished.
The obvious problem is that these defects detract from the Team’s ability to finish User Stories in a Sprint. However, the problem is much deeper than just that. When any interruption happens in a Sprint, the Team loses focus. When they lose focus, it harms the quality of the work they were doing on the new Stories.
These band-aid solutions help the Team keep going, minimizing harm, but they fail to address the systemic issue. In any environment where a problem occurs, there are always going to be multiple and interrelated causes and effects – there is never just one root cause or single source of problems.
“In accident investigation, as in most other human endeavours, we fall prey to the What-You-Look-For-Is-What-You-Find or WYLFIWYF principle. This is a simple recognition of the fact that assumptions about what we are going to see (What-You-Look-For), to a large extent will determine what we actually observe (What-You-Find).”[2]
Instead of finding causes and eliminating problems – which suffers from hindsight bias[3] – everything is clear looking back. We run experiments that we hypothesize will improve the situation (Unit Testing, Test-Driven Development, Behaviour-Driven Development) and look to see if the data from these experiments supports this outcome.
However, to permanently solve the problem requires a mindset change. In a real-world example, groups at Microsoft have done this by targeting Zero Defects.[4] By making defects the exception and not the norm, people start building habits earlier to avoid them altogether. You will never get to zero, but you should find changing the mindset around defects has a radical effect on quality.
Small changes can overcome big, systemic problems
It took the Team over 20 Sprints (almost an entire year) to get to this state, so they’re not going to resolve this problem in a few Sprints. Steve, knowing this is going take a long time for the Team to grow past, suggests that the Team make a few small commitments during the next Retrospective that they think will help. The Team decide:
- There is a limit of one Production Support issue worked on at one time. When these are being worked on, at least two members work together (pair program) to find the simplest fix.
- All new code has acceptance tests written for it at the time it is created.
- When Production Support Issues are found, an acceptance test is created that demonstrates the defect.
- They set a goal every Sprint of having zero net new defects, agreeing to drop Stories rather than push new defects out of the door.
This a good start to turning the quality issues around. Six months later we check back in with the Team – they have reduced their support workload to ~20% of the time (still not great, but a far cry from the 50%+ they had before), they have slowed down, and with renewed focus on quality are finally delivering on zero net new bugs every Sprint. The original goal of Acceptance Tests for all new code wasn’t achieved in the first few Sprints after it was suggested. It took the better part of the next two months, but they would eventually get there.
Scrum by Example is a narrative-style blog series designed to help people new to Scrum, especially new ScrumMasters. If you are new to the series, we recommend you check out the introduction to learn more about the series and discover other helpful articles.
Do you want to coach your team towards better solutions?
The above scenario, and others experienced by our well-intentioned but often misguided fictional ScrumMaster Steve and Team, are topics commonly discussed in our training workshops. Fundamentally changing the way a Team works as it transitions to practicing Scrum is one of the more challenging issues that organizations face in their Scrum evolution. We would love to have you join us to receive hands-on learning of the challenges – and solutions – as well as tips on how to integrate them into your Scrum.
[2] The ETTO Principle: Efficiency-Thoroughness Trade-Off: Why Things That Go Right Sometimes Go Wrong by Erik Hollnagel
[3] https://en.wikipedia.org/wiki/Hindsight_bias
[4] https://docs.microsoft.com/en-gb/archive/blogs/ericgu/no-bugs-journey-episode-2-its-a-matter-of-values
Mark Levison has been helping Scrum teams and organizations with Agile, Scrum and Kanban style approaches since 2001. From certified scrum master training to custom Agile courses, he has helped well over 8,000 individuals, earning him respect and top rated reviews as one of the pioneers within the industry, as well as a raft of certifications from the ScrumAlliance. Mark has been a speaker at various Agile Conferences for more than 20 years, and is a published Scrum author with eBooks as well as articles on InfoQ.com, ScrumAlliance.org an AgileAlliance.org.
Vaishali says
Thank you. This is a piece of good information to start with. I am looking for guidance, information to start a production support team with the shared resources of the development team.
Mark Levison says
Viashali – you’re welcome. We have more material in our reference library: https://agilepainrelief.com/scrummaster-resources-and-references#maintenance – with a caveat, in the next few months the library will evolve (its grown too large for the original container) and so links like this will break. The material will still there, but the link won’t work any more.
Shalini says
Hi
Few teams shift from scrum to kanban approach once the product in live/in production so that they can accommodate the production defects. Please let me know the pros/cons of this.
Thao says
Hello, this is a great article with useful content from the many that I have come across. I would like to receive info on whether a production build should be held or pulled back to an earlier release while the bugs are being fixed, or do we leave at the current promoted build and fix the bugs at the same time. Then do we route in the fixed production for validation in the lower environments which are at a higher build?
Mark Levison says
Thao – interesting. I think all the options can make sense in different circumstances:
My leaning, keep deployments small – once a day (or more frequent), at most once a Sprint. Then rollbacks are lower cost, because less is being taken out of production. This also has the advantage that each deployment is less risky. In that case I would only leave minor defects live in production.
foaud says
Hello,
it’s a great article with useful content.
I am preparing to take the Scrum Master certification and start working as a Scrum Master, so the Scrum articles for example help me understand their reality.
My question is this: when the scum master notices that there is something wrong with the team, they are working in the wrong ways, or they are influenced by the external team, in this case the support team.
What position should the scrum master take? Should he give a solution, or just ask a question and observe.
I know that the scrum master must observe and raise the problems to the scrum team and ask the right questions to the team to push them to find a solution (coaching). Is this always the case in all situations?
Thanks for all your efforts.
Mark Levison says
Foaud – As usual in the game of Scrum there are several stock answers: It depends and Ask the team. The later is the most important piece of Scrum advice, the team know more than we do.
Now a more serious answer. I see the ScrumMaster’s role is to help the team to become more effective in the long run. If you solve their problem right now, they will learn the the ScrumMaster is good at solving their problems. I would prefer the team learn to solve their own problems. So my preference is to ask questions and observe. If you’re worried that things will blow up then make it safe to fail, help them find an approach where failure can easily be unwound.