No, this wasn’t a pasting accident in the title. No, it’s probably not grammatically correct. I was putting our toddler to bed the other day and the book she picked out was one about a little blue train. The book was well worn and the little blue toy train that I imagine came with it was missing. I had to use my finger and imagination to guide this invisible train along the tracks that followed its winding journey on each page.
This train story was a perfect analogy for life in general and the need for a positive outlook (see “Have to vs. Get to”).
In the past, I’ve been put in several scenarios where success was not a foregone conclusion and things looked incredibly bleak. I’ve heard “impossible” and “no” far more than I would like to admit. One situation I recall vividly was when our company’s internal performance management SharePoint web site started crashing at the worst possible moment: the week of our yearly performance reviews. Servers would crash one at a time and stay down, each after about 20-30 minutes, whenever more than just a handful of people tried to use the site to finish people’s evaluations during this mad-rush week.
What’s interesting is that I was one of the last people who should have been able to help with the issue:
- I didn’t even work in that department
- I didn’t write SharePoint apps
- I didn’t write anything even related to the performance management application
I was a Software Architect who designed, wrote, and debugged other web applications, and that appeared to be credential enough to take a look.
What I do possess is a certain set of skills for attempting to fix difficult challenges, along with a never-give-up attitude. I once read that what makes someone an “expert” in software problem solving is largely their experience and their ability to rule out non-answers quickly, without having to follow every branch of the decision tree to its end. Like “Mr. Wolf” in Quentin Tarantino’s classic movie Pulp Fiction, when there are brains in the backseat, my phone often starts buzzing.
The lack of scientific rigor in most computer triage and debugging scenarios surprises me. I used a simple method of ruling out things that it couldn’t be and gathering evidence of things it could be. One of the first stops along this journey was to learn as much as I could about the symptoms. I showed how we could use Performance Monitor to watch the memory usage and thread counts of the SharePoint process. When the process would crash after about 20-30 users, it left a memory dump file around. This dump referenced pointers in C++ code, a language that I didn’t write and honestly only had a basic, textbook-based understanding of. With the help of Google and some dump analysis tools, the dump pointed me towards a function that was throwing an exception. That internal function appeared to be related to working with “AD”. This led me to check some basic things: the machine appeared to work normally, DNS entries worked, and it booted up just fine. One interesting and ultimately critical “quirk” we discovered was that the machine would lock up for nearly 2 minutes when you tried to give someone permissions, via Active Directory, to a folder or any files, including the desktop.
Within an hour with Microsoft’s urgent follow-the-sun support department, we had the symbols for SharePoint source code telling us a bit more about the named method and its purpose in authorization lookups (vs. authentication) for domain users. Many people still insisted that it wasn’t an AD or AD-accessibility problem. Rather than accept that something might be wrong with “their” system, they would say it had to be application related because people could still log in to the machine, ignoring the glaring machine hang issue. They pointed out that other servers in other datacenters could set security permissions just fine, and argued that the issue with these servers must therefore still be app-related.
For four days, day and night, I would join triage calls and help guide people to the information we had already retrieved about the issue within the first hour.
The root cause of the problem turned out to be related to a team’s seemingly unrelated upgrades to AD servers and a missing change needed to the firewall. In Windows 2008 AD, there are fixed privileged ports (0-1024) that AD communicates authorization calls on. In post-Windows 2008 AD, they use a random set of high ports (>~50,000) which needed to be opened. The AD team had recently upgraded some of the AD servers over to Windows 2012, but neglected to open the right ports. This led to timeouts and exceptions when the application, using SharePoint functions, tried to authorize certain users, culminating in a thread exhaustion problem within 20-30 minutes after a restart.
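A quick way to gather evidence for or against a blocked-port theory like this is to try a plain TCP connection from the affected server to the AD server on a handful of ports. The following is a minimal sketch of that idea in Python, not what we actually ran at the time. The function names, the host name, and the sample ports in the usage note are all hypothetical; substitute whatever ports your own environment is expected to use.

```python
import socket
from typing import Dict, Iterable


def probe_port(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP connection and classify the outcome.

    "open"    - something accepted the connection
    "refused" - the host answered but nothing is listening (or a firewall
                actively rejected the attempt)
    "timeout" - no answer at all, which is what a firewall silently
                dropping the traffic usually looks like
    "error"   - anything else (for example, the name did not resolve)
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "error"


def probe_ports(host: str, ports: Iterable[int], timeout: float = 3.0) -> Dict[int, str]:
    """Probe several ports on one host and return {port: outcome}."""
    return {port: probe_port(host, port, timeout) for port in ports}
```

Calling something like `probe_ports("dc01.example.internal", [135, 389, 50000])` from both a failing server and a healthy one in another datacenter, then comparing the two results, would be one way to show whether the network path differs. A connect test only proves reachability for ports where a service is actually listening, so a "refused" on a high port is not by itself a sign of a firewall problem, while a consistent "timeout" is worth chasing.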
Sadly, it took four days with Microsoft and our internal teams to open the needed port on the firewalls. I would persistently suggest that we open the firewall ports wide between these machines and the AD servers to rule that out as an issue. Each time, I would be told that I didn’t know what I was talking about. I would wait another day-cycle to suggest it again, providing more research towards this cause. Countless Wireshark traces and other pieces of empirical evidence pointing at the AD link would need to be provided to several key people (who are no longer at the company) before they would finally admit to the mistake in configuration and before it was easily resolved with a small firewall rule change.
What the process taught me was that even though I was not an AD expert or a C++ programmer, and had never written a SharePoint widget or web part myself, a positive mental outlook, an analytical approach, dogged persistence in the face of unwilling detractors, and an unending desire to help achieve the goal will ultimately prevail. I could have easily stopped digging when they said it wasn’t “AD” or connectivity related, but I didn’t give up.
Like a little blue train that doesn’t give up, having the right attitude and willingness to try to help in an area I was unfamiliar with not only proved successful, but taught me new things about C++ and how to debug those applications. Since then I’ve been incredibly interested in how to use subtle influence to help others see past their implicit biases and to help them adopt a more pragmatic approach to problem solving.
Ultimately, everyone’s performance reviews, for good or bad, were completed the week after. I received recognition for my tireless efforts in identifying the actual problem and my persistence in joining each call for the four days.
I wonder what the reviews looked like for the people who gave up looking at their areas as a possible issue during hour one?