How a production server got rebooted (twice!) in the middle of the day | Tales From Tech Support
- How a production server got rebooted (twice!) in the middle of the day
- We're a modern company.... Why are you laughing?
- "It can't be that bad"
- I can haz maths
- Database Support 14: Red Screen of Death
How a production server got rebooted (twice!) in the middle of the day Posted: 03 Dec 2018 09:40 AM PST This is a tale of the day a major production server rebooted, twice, without notice or warning. A couple of years ago, I worked for a company that developed accounting software. I was responsible for administering our SaaS infrastructure. Because our SaaS infrastructure was actually managed by an MSP, I did more babysitting of the MSP than actually administering servers. The characters in this story:
- $Me: your narrator
- $Boss: my boss
- $SupportTech: one of our support techs
- $MSPRep: our account rep at the MSP
- $MSPMonkey1 and $MSPMonkey2: techs at the MSP
This story starts while I'm in the middle of writing an automation script to cut down the time it takes to onboard a new customer on our SaaS platform. Then I get an IM from $SupportTech:
I'm really focused and don't want to divert my attention, so I respond with the truth: I'm busy. A few minutes later, my boss walks up to my desk.
I then open my IM window with $SupportTech and start chatting with him again.
Now the troubleshooting begins. First things first: let's log in to the server and check that the database service is running. Okay, the service is running, that's great! We also have a script that checks for basic configuration problems and other things known to break the database. Let's run that! Well, that looks okay. There are some outdated packages, but that's fine; we haven't run updates on the server in a couple of months, and the MSP regularly pulls new packages into their managed repository. I'll make a quick note to schedule this server for patching. I open my IM window again:
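(For the curious, here's a minimal sketch of what that kind of first-pass health check might look like, assuming a Linux host running systemd. The service name, config path, and settings below are hypothetical stand-ins for whatever our real script actually verified.)

```python
#!/usr/bin/env python3
"""Hypothetical first-pass DB health check: is the service up, and do a few
known-fragile settings exist in the config? Names and paths are illustrative."""
import subprocess
import sys

SERVICE = "postgresql"               # hypothetical service name
CONFIG = "/etc/example-db/db.conf"   # hypothetical config file

def service_is_active(unit: str) -> bool:
    # `systemctl is-active --quiet` exits 0 only when the unit is active.
    return subprocess.run(["systemctl", "is-active", "--quiet", unit]).returncode == 0

def config_has(path: str, key: str) -> bool:
    # Crude check: does any line in the config start with the given key?
    try:
        with open(path) as fh:
            return any(line.strip().startswith(key) for line in fh)
    except OSError:
        return False

def main() -> int:
    ok = True
    if not service_is_active(SERVICE):
        print(f"FAIL: service {SERVICE} is not running")
        ok = False
    for key in ("max_connections", "listen_addresses"):  # illustrative settings
        if not config_has(CONFIG, key):
            print(f"WARN: {key} not found in {CONFIG}")
    print("Basic checks passed." if ok else "Basic checks FAILED.")
    return 0 if ok else 1

if __name__ == "__main__":
    sys.exit(main())
```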
Back to the server... Huh, that's odd. It certainly looks like $MSPMonkey1 rebooted this database server! To my e-mail I go:
I then pick up my phone and call $MSPRep. If I want a response in anything less than their SLA of 5 days, $MSPRep has to escalate it for me.
I send a quick e-mail to $Boss and $SupportTech with what I found, letting them know that $MSPRep will escalate the ticket I opened with them. I put my headphones on and get back to my automation script. Everything's going great until a few minutes later when my IM window starts flashing:
Let's hop on the server and see: That's not cool at all. Just as I'm about to pick up my phone to call the MSP's support desk, I get a call from a number in our MSP's area code.
I walk into $Boss's office and tell him the story. We both simultaneously facepalm, have a good laugh over it, and start talking about what we're going to tell our customers. Afterwards, I go back to my desk to continue working on my automation script. Later that day, I get a new ticket notification from the MSP that looks like this:
I take a minute to read it and then everything clicks into place. tl;dr: $MSPMonkey1 rebooted a production server by sending CTRL+ALT+DEL to a blank VM console window, then proceeded to tell $MSPMonkey2 that you can type...
We're a modern company.... Why are you laughing? Posted: 03 Dec 2018 08:42 PM PST I work for a trucking company, not in IT. When I was hired, the HR person was impressed with my AA in InfoSystemsAdmin (yeah, I got a two-year degree in Tech Support). "Well, you'll like it here, we've modernized." Yeah. Sure. Maybe in the main office, where I'm not allowed. Not in the shop, where I am.
So on top of all these things, we're supposed to be able to remotely connect to the trucks and get real-time audio and video in order to check on the safety of the drivers. Current lag time is 43 minutes. We actually had one test drive FINISH and the driver return in time to watch the whole drive. I'm so frustrated with the situation that I've decided, after 9 months, to move on. Maybe the Amish are hiring...

On to why I've said all the above. For the third day in a row, our shop computer cannot connect to the web. It's how we order parts and track trucks (to know when drivers are coming in). So we call IT. They suggest turning on the WiFi. We do not have that option. They don't understand and try to walk us through the steps. My co-worker, an ex-Marine tanker, hits his limit. He leaves. I continue working with IT, getting nowhere. Since the other machines connect fine, I figure it's just an IP conflict (or maybe a dead network card). It's about this time that I hear the shop door close. My co-worker has brought in an IT tech. The tech looks at the machine and says, "I see. I'll make a call." An hour later, we had a visit from an outsourced pro troubleshooter. It was an IP conflict.
"It can't be that bad" Posted: 03 Dec 2018 03:49 PM PST Someone made a comment in another post on provider-client interaction: "It can't be that bad." Oh?

This was about 10 years ago; I was supporting our biggest and most important client. At the time we had a service that would try to identify fraud via purchase habits. When the service saw a potentially fraudulent transaction, what happened beyond that point was largely up to the client. We could stop the transaction or let it go through. We could block the card or not. We could add it to reports in a variety of ways. We could notify the client via fax or secure email. The whole point was to stop fraud while causing minimal inconvenience to the end customer. Given enough data and some leeway for our risk analysts, we could develop a pretty good strategy. The rate of false positives would be low, the losses from fraud would also be low, and we would reliably be able to find the point of compromise. It was a good product; I was routinely impressed by it.

I especially liked how the service learned. If it flagged potential fraud, the client was supposed to indicate whether it actually was fraud. The service then incorporated that result into its modeling. Over time it got better at stopping fraud while also letting legitimate transactions go through. However, this required the client to collaborate with our risk analysts (the guys who modified the filters), the developers (who understood how the program worked), and me (their primary contact, who knew the program and acted as the go-between). And since this product could prevent some hundreds of thousands of dollars' worth of losses annually, there was a lot of incentive for good collaboration. You'd think.

Anywho, we set up this VIP client on our standard filters while I tried to engage with them on an overall strategy. How aggressive did they want to be? What sort of trends did they see in their own data that I could pass on to the analysts? What process did they have to review the fraud notifications? They never did engage with me on that, but they did from time to time issue edicts based on a single case, which always did more harm than good. We had a loss at a gas station? Okay, stop all transactions of that type at gas stations! Now we have customers screaming at us that they can't buy gas! Okay, authorize all transactions at gas stations! Now we have fraud again? Why doesn't your stupid program stop these criminals! Etc. Occasionally they would go into the self-service tool and make changes themselves. At one point they accidentally changed a setting to deny all US-sourced transactions. They blamed us for that; we shouldn't have made it that easy. They had wanted to deny all transactions outside of the US, of course.

In parallel with that, the client contact who got the notifications of fraud kept complaining that they were all wrong. I again tried to engage on that (uh, what do you mean?) and was ignored until the next complaint that they were all wrong. Repeat. Eventually my client contact agreed to a meeting, where I hoped we would finally be able to discuss and settle on an overall strategy. To make sure that happened I lined up a developer, a risk analyst (ditto), and a VP just in case. (After my experience with Bob I vetted these people. They wouldn't scoff at the client.) So the client comes on, I introduce everyone, and I turn the meeting over to the client side. My contact begins by pointing out that the notifications were very expensive. "Who is going to pay for all this paper?" Paper? Uh, what?
Over the next few minutes my contact explained that their fax machine goes through hundreds of pages a day, and that when they come back in after a weekend there is often close to a ream of paper sitting on the machine. I explained that those pieces of paper were notifications of possible fraud and that we suggest they review them. She said, "Isn't that your job?" She had just been throwing them out.

At the end of the meeting my contact sent me an email asking me to turn off the fax notifications. A few minutes later I did, since she had sufficient privileges to request that change. She said they'd rely on the secure emails, which I knew for a fact someone else at the client was ignoring because they were confused by the secure email interface (again I tried to engage: happy to help you!). That contact also knew that we faxed the notifications over too, so she didn't have to read the emails anyway; someone else was handling it. I tried to explain the situation to both of those folks and a number of others, but got nowhere.

So, to recap:
- No learning by the service, so it wasn't getting better.
- Little or no review of possible fraud.
- No stopping a pattern of fraud while it was small, before the big losses.
- No attempt to engage on a better strategy with the experts on our side.
- No real attempt to understand how the service worked at all, but let's make company-wide changes in production anyway.
- No understanding of their own internal organization and who was doing what. Etc.

Shortly after they turned off the fax notifications, they had a major point of compromise and lost a fair amount of money. Which they blamed us for. Our product failed. Why did it suck so bad? I guarantee others have seen a lot worse. Gaur-Un-Teed.
I can haz maths Posted: 04 Dec 2018 03:53 AM PST Up late (early) with an earache, browsing the interwebz, and remembered this funny story from earlier today. Me: senior team lead for a region of a state government agency. Sally: nice caseworker.

Scene: I'm visiting one of the offices I support to handle a trouble ticket and check up on hardware I hope to deploy soon. This office has a whiteboard the staff use for interesting things and games. There's been a sudoku puzzle on it for a few weeks that appears to have stalled, and I'm standing in front of it pondering the possibilities when Sally walks by and asks if I like maths. I tell her I'm OK with numbers and that I enjoy the occasional sudoku. Sally heads back to her desk but comes back a minute later with a pad of paper and a pen and asks my age and birth year, which I give her, wondering a little what she's up to. She excitedly jots down the two numbers, stacked one over the other, proudly strikes a line under them, sums the two, and with great panache writes the total under it all: 2018. She says something about "... only one in a thousand..." and I'm thinking "really?", but her enthusiasm is intense. Apparently someone had pointed out to Sally's mum that a birth year and an age will only sum to the current year for a very small number of very special people, and Sally's mum had passed along this special knowledge to Sally, who had just then pointed out this special fact to me, and I was one of the special ones.

I put on my friendly skeptic face and mention that after one's birthday, and until the end of the current year, everyone's age and birth year sum to the current year, because one's age is simply the number of whole years between one's birth year and the current year. Full credit to Sally: her eyes gain that gleam of recognition very quickly, and she gets a little shamefaced. We had a good laugh, as did several others around the office. It was a funny little event that resulted in laughs all around and maybe a little sympathy for Pythagoras.
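(If you want the arithmetic spelled out, here's a minimal sketch; the birth date below is a made-up example, but the assertion holds for anyone whose birthday has already passed in the current year.)

```python
from datetime import date

def age_on(today: date, birth: date) -> int:
    """Whole-year age as of `today`."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)

birth = date(1985, 3, 14)   # hypothetical birth date
today = date(2018, 12, 3)   # the date of this story

# Once your birthday has passed, age = current year - birth year,
# so birth year + age is always the current year. Nothing special about it.
assert birth.year + age_on(today, birth) == today.year
```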
Database Support 14: Red Screen of Death Posted: 03 Dec 2018 02:42 AM PST Last time on Database Support: Fool me once, shame on you. Fool me twice.... As mentioned in a previous tale, there are wall-mounted televisions scattered throughout my office, all constantly displaying our internal test status page to serve as a handy visual indicator for the current state of every test suite, with each suite being represented by its own colored box in the grid. The overall state of the test screens tells us whether it's safe to push our code:
Er, at least I'm pretty sure that last part is in our "best practices" documentation somewhere. (A quick refresher for the non-coding readers: "pushing" code, as in publishing your local changes to the shared repository that everyone else builds and tests from.)

In the leadup to the major release mentioned in my last post (and partly due to the events described in said post), upper management decided at one point that the release schedule was still too slow for their liking and our database needed some serious TLC in general. Having finally acknowledged that we lacked the headcount to give it the necessary attention, they hired a few outside developers with relevant domain expertise to give us a hand. The most notable of those developers was RockStar.

RockStar was a coding Superman. Faster than typing on a Dvorak keyboard! More powerful than database superuser privileges! Able to commit thousands of lines of well-written code in a single pull request! And unlike the late and un-lamented Gilderoy, he actually was a rock star developer, with the skills, knowledge, years of experience, and commit history to match, so the whole department was looking forward to working with him. Granted, we wouldn't be working directly with him, since he was working remotely from overseas, but he'd be making valuable contributions to the product and that's what counted.

Looking forward to it, that is, until we came in on the Monday after he was hired to find that every television screen in the office was a solid block of red. Someone had pushed code over the weekend (a big no-no) and nearly every last test suite broke; out of dozens of different suites across the entire product, only the very basic "nothing segmentation faulted so everything is probably mostly okay we hope?" suites had passed. Checking the commit history, we quickly determined that RockStar was the culprit and arranged a meeting. We had to wait a few hours given the time difference, and by the time we got him on the phone everyone was slightly panicking. Non-managers were not invited to this meeting, alas, so I don't know exactly what happened, but from the...
(He's right, actually. They were flaky as hell and pretty fragile. We inherited them from a previous generation of developers and there was only so much we could do at the time to improve them.)
The local managers and team leads were understandably pissed off about the whole situation, but none of them could force RockStar to listen to them and no one who did have that ability wanted to upset him and possibly make him want to quit, so nothing was done about it. All the teams spent most of the week fixing test failures, and a few days later tiny isolated boxes of yellow and green began to poke their heads hopefully out of the sea of red. And then those hopes were dashed and the passing suites went red again. RockStar had, as promised, pushed more code with zero regard for the state of any test suites. There were, of course, no consequences for this--to RockStar, that is; we mere local devs had to fix all of them. Another week of fixing tests went by. Another few lonely test suites went green, before promptly going red again. Another conference call.
(It really wasn't. The UI was a painful "early '90s website" veneer slapped on top of some backing database logic that had been hacked together by someone who knew literally nothing about SQL.)
(We really didn't. "When in doubt, add several 30-some-minute tests for every tiny new feature you add" was the order of the day, and had been since long before I got there.)
This time there was progress, of a sort. They got RockStar to agree to only push when the local teams informed him that everything was green; he was free to work on his own machine on his own development branches in the meantime so as not to impede his own progress, but bringing the rest of the teams' progress to a screeching halt while they fixed things had to stop. Several days later...success! We spent a week or so getting things back to a good state, and once every TV in the office was a solid cheery green we gave RockStar the go-ahead. At which point everything failed again when he pushed his code, because apparently no one had ensured that he would actually run his dev branch code against the test suites before merging his changes. Which, in hindsight, should have been expected, but was still incredibly frustrating.

It was especially frustrating for my team in particular, because as mentioned before we were somewhat of a grab-bag team: not only did we develop and support 30-some separate utilities (or, let's be honest, actively develop and support 10ish of them and ignore the other 20ish until something exploded in a customer environment and we had to frantically put out fires, because, again, woefully insufficient headcount), we were also the ones who handled infrastructure, release engineering, and all the other miscellaneous functions that had lacked a dedicated team since the Great QA Purge, partly because the few survivors of that purge had ended up joining our team and partly because no one else was willing to do it. So of the several dozen boxes on every test screen, over half represented test suites belonging to my team, and where other teams tended to (for instance) have five suites testing the same product and could usually make them all go green by fixing the same one or two root problems, each box of ours that went red meant an entirely separate time-consuming adventure for us to fix. We thus added "monitor RockStar so he doesn't break stuff too often" to our ever-growing list of duties, and would frequently run his dev branches in our test environment to catch issues before they broke stuff (though not frequently enough to prevent breakages entirely). The cycle of "RockStar pushes code -> we do nothing but fix tests -> we complain to our manager -> nothing changes -> repeat" continued for a good two months or so at least; it's kind of a blur, honestly.

Now, since some of you, dear readers, may understandably be wondering why management didn't get rid of RockStar at some point in this mess, I want to digress to emphasize that RockStar wasn't a terrible person, this wasn't an entirely negative experience, and he was having a major positive impact on the codebase with his efforts. RockStar was a pleasure to work with directly through this whole period and always framed things in terms of "Our test setup sucks terribly, I'm doing this for all our benefit" rather than "I hate testing, it's your problem now!"; his disdain was reserved solely for managers who refused to admit that our testing setup was fundamentally broken. He was more than willing to help us fix test failures when we brought them to his attention, since one of the major reasons he wasn't running the tests was that he couldn't see what failed in our test environment.
And the test suites were greatly improved as a side effect of this whole process: because we were spending so much time fixing tests, we finally got the leverage we needed with management to actually sit down and give them the major overhaul they needed; the old "feature work is higher priority than infrastructure work" excuse rings a bit hollow when you haven't pushed any feature changes in the past two weeks because your test environment is on fire. Minor miracles like deleting 50+ redundant tests from one of our test suites, or getting a particularly annoying suite down from 14+ hours per run to under 5 hours per run, were big bright spots in the soul-crushing black cloud we were otherwise working under.

What finally broke the cycle was not RockStar having a change of heart, or management stepping in, or even my team having a mental breakdown. Rather, it was my department being involuntarily migrated to a different test framework and environment (a debacle which will get a tale to itself, don't worry), because this new test environment was both accessible from RockStar's remote location and modular enough that he could hack together a minimally useful set of tests to run on his end. Despite all of the other pains involved with this transition, it did cut the frequency of a single code commit breaking everything from weekly to once every two or three months. And finally--finally!--my team was able to get back to doing things other than fixing the same tests over and over and over again.

With RockStar humming along happily and the most glaring flaws in our test setup addressed, I hoped that that would be the last time someone else not giving a shit about our tests would become my problem.

Coming up next: The next time someone else didn't give a shit about our tests and made it my problem.