    How a production server got rebooted (twice!) in the middle of the day

    Posted: 03 Dec 2018 09:40 AM PST

    This is a tale of the day a major production server rebooted, twice, without notice or warning.

    A couple years ago, I worked for a company that developed accounting software. I was responsible for administering our SaaS infrastructure. Because our SaaS infrastructure was actually managed by an MSP, I did more babysitting of the MSP than actual server administration.

    The characters in this story:

    • $Me - Aardvark, a sysadmin growing increasingly frustrated at the incompetence of our MSP.
    • $Boss - My boss who does his best to provide me with support, but ultimately answers to the bean counters.
    • $SupportTech - Timid customer support engineer who handles customer calls and triages alerts. Nice guy, but is always afraid to bother me.
    • $MSPMonkey1 - First of our monkeys at MSP.
    • $MSPMonkey2 - Second of our monkeys at MSP.
    • $MSPRep - Our account rep at MSP.
    • $DatabaseServer - A single-point-of-failure server that everyone's afraid to touch.

    This story starts when I'm in the middle of writing an automation script to cut down the time it takes to onboard a new customer onto our SaaS platform. I then get an IM:

    $SupportTech - Hey, Aardvark, I hate to bug you. Are you busy?

    $Me - Yes. Can I chat with you later?

    $SupportTech - Ok

    I'm really focused and don't want to divert my attention, so I respond with the truth: I'm busy.

    A few minutes later my boss walks up to my desk.

    $Boss - Hey Aardvark, do you have any idea about what's wrong?

    $Me - Something's wrong?

    $Boss - Yeah, $SupportTech says he was letting you know we're having a problem with $DatabaseServer.

    $Me - Oh sh*t, really? He sent me an IM asking me if I was busy. I didn't know he was trying to talk to me about $DatabaseServer.

    $Boss - Can you please drop what you're doing and take a look at it?

    $Me - Of course.

    I then open my IM window with $SupportTech and start chatting with him again.

    $Me - Hey, were you needing to talk to me about $DatabaseServer?

    $SupportTech - Yeah, we keep getting calls from customers that everything is down, and we're getting alerts about $DatabaseServer

    $Me - Okay, I'll take a look at it

    $Me - BTW, thank you for being appreciative of my time, but you're more than welcome to interrupt me about alerts with $DatabaseServer. We're all aware how fragile that artifact is

    Now the troubleshooting begins. First things first: log in to the server and check that the database service is running.

    $ ssh aardvark@$DatabaseServer
    Last login: <six months ago> from <myip>
    [aardvark@$DatabaseServer ~]$ service <dbserver> status
    <dbserver> is running...

    Okay, the service is running, that's great! We also have a script that checks for basic configurations and other things known to break the database. Let's run that!

    [aardvark@$DatabaseServer ~]$ sudo /opt/automation/<dbserver>_checks.sh
    <dbserver>_checks.sh v1.13
    Checking integrity of database...[OK]
    Checking all processes are responding...[OK]
    Checking network connectivity...[OK]
    Checking firewall rules...[OK]
    Checking date of last backup...<3 hours ago>[OK]
    Checking for outdated packages...[WARNING]
    Found 32 outdated packages. Consider updating system packages.
    See <dbserver>_checks.log for a list of packages needing update.
    5 checks passed and 1 warning
    See <dbserver>_checks.log for more details

    Well, that looks okay. There are some outdated packages, but that's fine. We haven't run updates on the server in a couple of months, and the MSP regularly pulls down new packages into their managed repository. I make a quick note to schedule this server for patching.
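    (The real <dbserver>_checks.sh obviously isn't reproduced here, but a minimal sketch of that style of health-check script might look something like this; the service name, backup path, and update tool are assumptions, not the actual script.)

    #!/bin/bash
    # Minimal sketch of a <dbserver>_checks.sh-style health check -- illustrative only.
    SERVICE="dbserver"
    BACKUP_DIR="/var/backups/dbserver"

    if service "$SERVICE" status >/dev/null 2>&1; then
        echo "Checking all processes are responding...[OK]"
    else
        echo "Checking all processes are responding...[FAILED]"
    fi

    if ping -c 1 -W 2 8.8.8.8 >/dev/null 2>&1; then
        echo "Checking network connectivity...[OK]"
    else
        echo "Checking network connectivity...[FAILED]"
    fi

    # Warn if no backup file newer than ~6 hours exists
    if find "$BACKUP_DIR" -type f -mmin -360 2>/dev/null | grep -q .; then
        echo "Checking date of last backup...[OK]"
    else
        echo "Checking date of last backup...[WARNING]"
    fi

    # yum check-update exits 100 when updates are available (RHEL/CentOS)
    yum -q check-update >/dev/null 2>&1
    if [ $? -eq 100 ]; then
        echo "Checking for outdated packages...[WARNING] updates available"
    else
        echo "Checking for outdated packages...[OK]"
    fi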

    I open my IM window again:

    $Me - Hey, $DatabaseServer looks fine. Can you confirm for me that customers are able to connect?

    $SupportTech - Yeah! Everything started working again shortly after you last IM'd me. What happened?

    $Me - No clue. I'm going to take some time and look.

    Back to the server...

    [aardvark@$DatabaseServer ~]$ last -w -n 10
    aardvark     pts/1        <myip>            <15 minutes ago>   still logged in
    $MSPMonkey1  pts/0        <mspIp>           <20 minutes ago>   (00:08)
    reboot       system boot  <kernel version>  <25 minutes ago>
    $MSPMonkey1  pts/0        <mspIp>           <40 minutes ago>   (00:15)
    ... and so on

    Huh, that's odd. It certainly looks like $MSPMonkey1 rebooted this database server! To my e-mail I go:

    To: support@msp

    CC: $MSPRep, $Boss

    Subject: Reboot of $DatabaseServer at <time of reboot> by $MSPMonkey1

    $MSP,

    Based on system logs, $DatabaseServer was rebooted at <approximate time of reboot>. Can you please let me know why this server was rebooted?

    <Inserted screenshot of the `last` command>

    Please respond ASAP. $DatabaseServer is critical to production and we can't have it being rebooted in the middle of the day like this.

    Thanks,

    Ashamed Aardvark

    I then pick up my phone and call $MSPRep. If I want a response in anything less than their SLA of 5 days, $MSPRep has to escalate it for me.

    $MSPRep - Hey Aardvark! How are things? I'm guessing you're calling about the e-mail you just sent.

    $Me - Yes, I am.

    $MSPRep - Oh man, that looks awful! I'm so sorry $DatabaseServer was rebooted.

    $Me - Thanks for your concern. Can you have someone from your team investigate this ASAP?

    $MSPRep - Absolutely! I'll escalate this and get $MSPMonkey2 to look at it right away!

    $Me - That would be great.

    $MSPRep - Oh hey, since I have you on the phone, I wanted to talk to you about <some new MSP service> that we're offering all of our customers, have you had a chance to look at the pricing I sent over to you last week?

    $Me - Nope, and I don't care right now. $Boss is looking at it. Please let me know what $MSPMonkey2 finds out. Bye.

    SIGHUP

    I send a quick e-mail to $Boss and $SupportTech with what I found, letting them know that $MSPRep will escalate the ticket I opened with them.

    I put my headphones on and get back to my automation script. Everything's going great until a few minutes later when my IM window starts flashing:

    $SupportTech - Hey Aardvark, sorry to bother you. Something's wrong with $DatabaseServer again. We're getting lots of alerts and calls.

    $Me - What? No way. :facepalm:

    Let's hop on the server and see:

    $ ssh aardvark@$DatabaseServer
    ssh: connect to host $DatabaseServer port 22: Connection refused

    That's not cool at all. Just as I'm about to pick up my phone to call the MSP support desk, I get a call from the area code of our MSP.

    $Me - This is Aardvark.

    $MSPMonkey2 - Yeah, hey Aardvark. This is $MSPMonkey2 from MSP. I'm so sorry, but I think something just happened with $DatabaseServer.

    $Me - Okay... what do you mean?

    $MSPMonkey2 - So, I was talking with $MSPMonkey1 and he said he didn't reboot it. So I logged in to see if I could see who rebooted it. I typed reboot and then the machine rebooted.

    $Me - Can you repeat that? You typed reboot?

    $MSPMonkey2 - Well, yeah, $MSPMonkey1 said you can see who rebooted a server by typing reboot.

    $Me - You're f***ing kidding me?

    $MSPMonkey2 - and, well, I think reboot actually just reboots the server.

    $Me - Yes, that's exactly what it does. Is it back now? (Simultaneously, $SupportTech IMs and says it's back up and running)

    $MSPMonkey2 - Yeah, it looks fine.

    $Me - Please don't attempt to log in to $DatabaseServer again. I'll conduct my own investigation into the first reboot.

    $MSPMonkey2 - Again, I'm so sorry it happened. Is there anything else I can do for you?

    $Me - Nope. Bye.

    SIGHUP
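    For the record, the non-destructive way to see who rebooted a Linux box and when is the last command (or who -b for just the most recent boot time), not reboot. A quick sketch:

    # Show recent reboots recorded in /var/log/wtmp
    $ last reboot | head -5

    # Or just the most recent boot time
    $ who -b

    # And who was logged in around then (what I ran earlier)
    $ last -w -n 10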

    I walk into $Boss's office and tell him the story. We both simultaneously facepalm and have a good laugh over this and start talking about what we're going to tell our customers. Afterwards, I go back to my desk to continue working on my automation script. Later that day, I get a new ticket notification from MSP that looks like this:

    Subject: Scheduled maintenance for all Red Hat Enterprise Linux 6 Servers

    Dear customer,

    We at MSP have identified a configuration in our systems that can lead to accidental reboots of servers. The default configuration for RHEL 6 is to immediately reboot a server when CTRL+ALT+DEL is pressed at the console. Technicians have inadvertently rebooted servers by logging into VM consoles, seeing a blank screen, and sending the CTRL+ALT+DEL keystroke to the console. On a Windows server, this would bring up a login prompt, but it will reboot a RHEL 6 server in the default configuration.

    This change has already been applied to your Linux servers; however, it will not take effect until a reboot has occurred. We have scheduled this to occur during <next normally scheduled maintenance window>.

    Please contact $MSPRep for more information.

    I take a minute to read it and then everything clicks together.
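    For anyone curious what that fix probably looked like: RHEL 6 handles Ctrl+Alt+Del through Upstart, via /etc/init/control-alt-delete.conf. The ticket doesn't show the exact change the MSP made, but a minimal sketch of that kind of override would be:

    # /etc/init/control-alt-delete.conf -- RHEL 6 ships roughly this:
    #   start on control-alt-delete
    #   exec /sbin/shutdown -r now "Control-Alt-Delete pressed"
    #
    # Hypothetical override: log the keystroke instead of rebooting.
    start on control-alt-delete
    exec /usr/bin/logger -p authpriv.notice "Ctrl-Alt-Del pressed at console, ignoring"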

    tl;dr

    $MSPMonkey1 rebooted a production server by sending CTRL+ALT+DEL to a blank VM console window, then proceeded to tell $MSPMonkey2 that you can type reboot in Linux to see who last rebooted a server. $MSPMonkey2 reboots a production server for the second time in one day by typing reboot.

    submitted by /u/Ashamed_Aardvark
    [link] [comments]

    We're a modern company.... Why are you laughing?

    Posted: 03 Dec 2018 08:42 PM PST

    I work for a trucking company, not in IT. When I was hired, the HR person was impressed with my AA in InfoSystemsAdmin (yeah, I got a two-year degree in Tech Support). "Well, you'll like it here, we've modernized."

    Yeah. Sure. Maybe in the main office where I'm not allowed. Not in the shop where I am.

    • The machine that runs the pumps? It has a floppy disk drive.
    • The machine in the parts room? No USB ports.
    • The machine in the admin office? 182 days since last reboot (until I started doing so every weekend).
    • The most powerful machine? 2.8 GHz dual core.

    So on top of all these things, we're supposed to be able to remotely connect to the trucks and see real-time audio and video in order to check the safety of the drivers. Current lag time is 43 minutes. Actually had one test drive FINISH and the driver return to watch the whole drive.

    I'm so frustrated with the situation that I've decided, after 9 months, to move on. Maybe the Amish are hiring...

    On to why I've said all the above.

    For the third day in a row, our shop computer cannot connect to the web. It's how we order parts and track trucks (to know when drivers are coming in). So we call IT. They suggest turning on the WiFi. We do not have that option. They don't understand and try to walk us through the steps. My co-worker, an ex-Marine tanker, hits his limit. He leaves. I continue working with IT, getting nowhere. Since the other machines connect fine, I figure it's just an IP conflict (or maybe a dead network card).

    It's about this time I hear the shop door close. My co-worker has brought an IT tech. The tech looks at the machine and says, "I see. I'll make a call." An hour later, we had a visit from an outsourced pro troubleshooter. It was an IP conflict.
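    (For what it's worth, a duplicate-IP suspicion like this is quick to confirm from any Linux box on the same segment using arping's duplicate address detection mode; the interface and address below are made-up examples.)

    # -D: duplicate address detection; exits non-zero if some other host
    # already answers for the address.
    $ sudo arping -D -c 2 -I eth0 192.168.10.42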

    submitted by /u/BrogerBramjet
    [link] [comments]

    "It can't be that bad"

    Posted: 03 Dec 2018 03:49 PM PST

    Someone made a comment in another post about provider-client interaction: "It can't be that bad."

    Oh?

    This was about 10 years ago; I was supporting our biggest and most important client. At the time we had a service that would try to identify fraud via purchase habits. When the service saw a potentially fraudulent transaction, what happened beyond that point was largely up to the client. We could stop the transaction or let it go through. We could block the card or not. We could add it to reports in a variety of ways. We could notify the client via fax or secure email. The whole point was to stop fraud while causing minimal inconvenience to the end customer.

    Given enough data and some leeway for our risk analysts we could develop a pretty good strategy. The rate of false positives would be low, the losses from fraud would also be low, and we would reliably be able to find the point of compromise. It was a good product, I was routinely impressed by it. I especially liked how the service learned. If it flagged potential fraud the client was supposed to indicate if it was actually fraud. The service then incorporated that result into its modeling. Over time it got better at stopping fraud but also letting legitimate transactions go through.

    However, this required the client to collaborate with our risk analysts (the guys who modified the filters), the developers (who understood how the program worked), and me (their primary contact, who knew the program and acted as the go-between). And since this product could prevent some 100K's worth of losses annually, there was a lot of incentive for good collaboration.

    You'd think.

    Anywho, we set up this VIP client on our standard filters while I tried to engage with them on an overall strategy. How aggressive did they want to be? What sort of trends did they see in their own data that I could pass on to the analysts? What process did they have to review the fraud notifications?

    They never did engage with me on that, but they did from time to time issue edicts based on a single case, which always did more harm than good. We had a loss at a gas station? Okay, stop all transactions of that type at a gas station! Now we have customers screaming at us they can't buy gas! Okay, authorize all transaction at a gas station! Now we have fraud again? Why doesn't your stupid program stop these criminals! etc.

    Occasionally they would go into the self-service tool and make changes themselves. At one point they accidentally changed a setting to deny all United States-sourced transactions. They blamed us for that; we shouldn't have made it that easy. They wanted to deny all transactions outside of the US, of course.

    In parallel to that, the client contact who got the notifications of fraud kept complaining that they were all wrong. I again tried to engage on that (uh what do you mean) and was ignored until the next complaint that they were all wrong. Repeat.

    Eventually my client contact agreed to a meeting, where I hoped we would finally be able to discuss and settle on an overall strategy. To make sure that happened I lined up a developer, a risk analyst (ditto), and a VP just in case. (After my experience with Bob I vetted these people. They wouldn't scoff at the client.)

    So the client comes on and I introduce everyone and turn the meeting over to the client side. My contact begins with pointing out that the notifications were very expensive. "Who is going to pay for all this paper?"

    Paper? Uh what?

    Over the next few minutes, my contact explained that their fax machine went through hundreds of pages a day, and that when they came back in after a weekend there was often close to a ream of paper sitting on the machine. I explained that those pieces of paper were notifications of possible fraud and that we suggested they review them. She said, "Isn't that your job?" She was just throwing them out.

    At the end, my contact sent me an email asking me to turn off the fax notifications. A few minutes later I did, since she had enough privileges to request this change. She said they'd rely on the secure emails, which I knew for a fact someone else at the client was ignoring because they were confused by the secure email interface (again I tried to engage: happy to help you!). That contact also knew that we faxed them over too, so she didn't have to read the emails anyway; someone else was handling it.

    I tried to explain the situation to both of those folks and a number of others but got nowhere.

    So to recap.

    • No learning by the service; it wasn't getting better.
    • Little or no review of possible fraud.
    • No stopping a pattern of fraud while it was small, before big losses.
    • No attempt to engage in a better strategy with the experts on our side.
    • No real attempt to understand how the service worked at all, but let's make company-wide changes in production anyway.
    • No understanding of their own internal organization and who was doing what.
    • Etc.

    Shortly after they turned off the fax notifications they had a major point of compromise and lost a fair amount of money. Which they blamed us for. Our product failed. Why did it suck so bad?

    I guarantee others have seen a lot worse.

    Gaur-Un-Teed.

    submitted by /u/dave999dave
    [link] [comments]

    I can haz maths

    Posted: 04 Dec 2018 03:53 AM PST

    Up late (early) with an earache, browsing the interwebz, and remembered this funny story from earlier today.

    Me: senior team lead for a region of a state government agency.

    Sally: nice caseworker.

    Scene: I'm visiting one of the offices I support to handle a trouble ticket and check up on hardware I hope to deploy soon.

    This office has a whiteboard the staff use for interesting things and games. There's been a sudoku puzzle on it for a few weeks that appears to have stalled, and I'm standing in front of it pondering the possibilities when Sally walks by and asks if I like maths.

    I tell her I'm OK with numbers and that I enjoy the occasional sudoku.

    Sally heads back to her desk but comes back a minute later with a pad of paper and a pen and asks my age and birth year, which I give her, wondering a little what she's up to.

    She excitedly jots down the two numbers, stacked one over the other, proudly strikes a line under them, sums the two and with great panache writes the sum under it all: 2018.

    She says something about "... only one in a thousand..." and I'm thinking "really?" but her enthusiasm is intense.

    Apparently someone had pointed out to Sally's mum that a birth year and an age will only sum up to the current year for a very small number of very special people, and Sally's mum had passed along this special knowledge to Sally, who had just then pointed out this special fact to me, and I was one of the special ones.

    I put on my friendly skeptic face and mention that, after one's birthday and until the end of the current year, everyone's age and birth year sum to the current year, because one's age is just the number of whole years between the birth year and the current year. Full credit to Sally: her eyes gain that shade of recognition very quickly, and she gets a little shamefaced. We had a good laugh, as did several others around the office.
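    In other words, once the birthday has passed, age is just current year minus birth year, so birth year plus age trivially lands back on the current year. A throwaway example (the birth year here is made up):

    $ birth_year=1985                # hypothetical
    $ age=$((2018 - birth_year))     # 33, assuming the 2018 birthday has already passed
    $ echo $((birth_year + age))
    2018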

    It was a funny little event that resulted in laughs all around and maybe a little sympathy for Pythagoras.

    submitted by /u/music2myear
    [link] [comments]

    Database Support 14: Red Screen of Death

    Posted: 03 Dec 2018 02:42 AM PST

    Last time on Database Support: Fool me once, shame on you. Fool me twice....


    As mentioned in a previous tale, there are wall-mounted televisions scattered throughout my office, all constantly displaying our internal test status page to serve as a handy visual indicator for the current state of every test suite, with each suite being represented by its own colored box in the grid. The overall state of the test screens tells us whether it's safe to push our code:

    • Green means the current code is passing all the tests in the suite and you can push new code that touches that suite at will.
    • Yellow means the tests are currently running and you should keep an eye on that suite before pushing anything.
    • Red means don't push any code and go yell at whoever broke that suite and glare at them until they fix it, and anyone who pushes code when tests are red should be smacked upside the head repeatedly until they reach the instruction-following proficiency of the average kindergartener because pushing on red will just break things more and this shouldn't be difficult to understand, god fucking dammit.

    Er, at least I'm pretty sure that last part is in our "best practices" documentation somewhere.

    (A quick refresher for the non-coding readers: "pushing" code, as in git push, means submitting it to the central repository and merging it with everyone else's code. "Pushing on green" means merging things when all the tests have passed, so if the tests break you know your changes caused it; "pushing on red" means merging things when code is already broken, so tracking down the cause of a given failure gets very difficult, and in general doing that is bad practice.)
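    (We never automated "don't push on red," but it's the kind of thing a pre-push hook could enforce if the dashboard exposed its overall state somewhere scriptable. The URL and endpoint below are entirely invented; our real dashboard had no such API.)

    #!/bin/sh
    # .git/hooks/pre-push -- hypothetical guard; dashboard URL/endpoint is made up.
    STATUS=$(curl -sf http://test-dashboard.internal/api/overall-status 2>/dev/null)

    if [ "$STATUS" != "green" ]; then
        echo "Test dashboard reports '${STATUS:-unknown}'. Don't push on red." >&2
        exit 1
    fi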

    In the leadup to the major release mentioned in my last post (and partly due to the events described in said post), upper management decided at one point that the release schedule was still too slow for their liking and our database needed some serious TLC in general. Having finally acknowledged that we lacked the headcount to give it the necessary attention, they hired a few outside developers with relevant domain expertise to give us a hand. The most notable of those developers was RockStar.

    RockStar was a coding Superman. Faster than typing on a Dvorak keyboard! More powerful than database superuser privileges! Able to commit thousands of lines of well-written code in a single pull request!

    And unlike the late and un-lamented Gilderoy, he actually was a rock star developer, with the skills, knowledge, years of experience, and commit history to match, so the whole department was looking forward to working with him. Granted, we wouldn't be working directly with him, since he was working remotely from overseas, but he'd be making valuable contributions to the product and that's what counted.

    Looking forward to it, that is, until we came in on the Monday after he was hired to find that every television screen in the office was a solid block of red. Someone had pushed code over the weekend (a big no-no) and nearly every last test suite broke; out of dozens of different suites across the entire product, only the very basic "nothing segmentation faulted so everything is probably mostly okay we hope?" suites had passed.

    Checking the commit history, we quickly determined that RockStar was the culprit and arranged a meeting. We had to wait a few hours given the time difference, and by the time we got him on the phone everyone was slightly panicking. Non-managers were not invited to this meeting, alas, so I don't know exactly what happened, but from the detailed after-action report I got from CoolBoss (er, I mean, vague sourceless rumors I heard), the meeting went something like this:

    $manager: RockStar! I'm glad we finally got a hold of you! Have you seen the state of the tests?
    RockStar: No, but I've been emailed the details.
    $manager: You know your commits were what broke them, right?
    RockStar: Yeah.
    $manager: Well? Did you even run all the tests before you committed your changes?
    RockStar: No.
    $manager: Uh...what? Why not?
    RockStar: Because your tests are terrible and useless and I refuse to run them?

    (He's right, actually. They were flaky as hell and pretty fragile. We inherited them from a previous generation of developers and there was only so much we could do at the time to improve them.)

    $manager: Wha--but--you can't just not run tests!
    RockStar: Yeah I can. I can't run them on my local machine because the setup and teardown scripts can't handle that, and I can't access the fancy dashboard that's only accessible from your office--which is also terrible, by the way--so I don't know when any of those suites fail.
    RockStar: And I don't care.
    RockStar: I'm not going to wait over 12 hours to see if something passed or failed when there's no way of knowing whether any failures are actually real failures or just networking errors from your shitty test environment or whatever.
    $manager: But--
    RockStar: If something fails, have the team that owns those tests get in touch with me and we can figure it out. Or don't. Again, I don't care.
    click

    The local managers and team leads were understandably pissed off about the whole situation, but none of them could force RockStar to listen to them and no one who did have that ability wanted to upset him and possibly make him want to quit, so nothing was done about it.

    All the teams spent most of the week fixing test failures, and a few days later tiny isolated boxes of yellow and green began to poke their heads hopefully out of the sea of red.

    And then those hopes were dashed and the passing suites went red again. RockStar had, as promised, pushed more code with zero regard for the state of any test suites. There were, of course, no consequences for this--to RockStar, that is; we mere local devs had to fix all of them.

    Another week of fixing tests went by. Another few lonely test suites went green, before promptly going red again. Another conference call.

    $manager: Look, RockStar, the teams here in the $location office are basically at a standstill because you keep breaking their tests.
    RockStar: I'm sorry about that, but I'm not going to spend three solid workdays on a ten-minute task because the test cycle is too long and I'd have to run every suite two or three more times after a failure to make sure it's an actual failure and not a networking issue or whatever.
    $manager: Still, can't you run at least some of the tests?
    RockStar: I do. The standard suites. Locally.
    $manager: We mean running tests for each team, and doing it in a multi-host environment.
    RockStar: No.
    $manager: What can we do to make it easier for you to develop without ignoring the tests?
    RockStar: You could rewrite your test dashboard and change your testing strategy to not suck.
    $manager: I really don't think we can do that. Our test monitoring dashboard is already well-written--

    (It really wasn't. The UI was a painful "early '90s website" veneer slapped on top of some backing database logic that had been hacked together by someone who knew literally nothing about SQL.)

    $manager: --and we really need to run all of those tests to ensure our code is solid.

    (We really didn't. "When in doubt, add several 30-some-minute tests for every tiny new feature you add" was the order of the day, and had been since long before I got there.)

    $manager: We need to keep running the tests as they are!
    RockStar: Then I'm going to keep not running the tests as they are.

    This time there was progress, of a sort. They got RockStar to agree to only push when the local teams informed him that everything was green; he was free to work on his own machine on his own development branches in the meantime so as not to impede his own progress, but bringing the rest of the teams' progress to a screeching halt while they fixed things had to stop.

    Several days later...success! We spent a week or so getting things back to a good state, and once every TV in the office was a solid cheery green we gave RockStar the go-ahead.

    At which point everything failed again when he pushed his code, because apparently no one had ensured that he would actually run his dev branch code against the test suites before merging his changes. Which, in hindsight, should have been expected, but was still incredibly frustrating.

    It was especially frustrating for my team in particular, because as mentioned before we were somewhat of a grab-bag team: not only did we develop and support 30-some separate utilities (or, let's be honest, actively develop and support 10ish of them and ignore the other 20ish until something exploded in a customer environment and we had to frantically put out fires, because, again, woefully insufficient headcount), we were also the ones who handled infrastructure, release engineering, and all the other miscellaneous functions that had lacked a dedicated team since the Great QA Purge, partly because the few survivors of that purge had ended up joining our team and partly because no one else was willing to do it.

    So of the several dozen boxes on every test screen over half of them represented test suites belonging to my team, and where other teams tended to (for instance) have five suites testing the same product and could usually make them all go green by fixing the same one or two root problems, each box of ours that went red meant an entirely separate time-consuming adventure for us to fix.

    We thus added "monitor RockStar so he doesn't break stuff too often" to our ever-growing list of duties, and would frequently run his dev branches in our test environment to catch issues before they broke stuff (though not frequently enough to prevent breakages entirely). The cycle of "RockStar pushes code -> we do nothing but fix tests -> we complain to our manager -> nothing changes -> repeat" continued for a good two months or so at least; it's kind of a blur, honestly.


    Now, since some of you, dear readers, may understandably be wondering why management didn't get rid of RockStar at some point in this mess, I want to digress to emphasize that RockStar wasn't a terrible person, this wasn't an entirely negative experience, and he was having a major positive impact on the codebase with his efforts.

    RockStar was a pleasure to work with directly through this whole period and always framed things in terms of "Our test setup sucks terribly, I'm doing this for all our benefit" rather than "I hate testing, it's your problem now!"; his disdain was reserved solely for managers who refused to admit that our testing setup was fundamentally broken.

    He was more than willing to help us fix test failures when we brought them to his attention, since one of the major reasons for him not running the tests was him not being able to see what failed in our test environment.

    And the test suites were greatly improved as a side effect of this whole process: Because we were spending so much time fixing tests, we finally got the leverage we needed with management to actually sit down and give them the major overhaul they needed; the old "feature work is higher priority than infrastructure work" excuse rings a bit hollow when you haven't pushed any feature changes in the past two weeks because your test environment is on fire. Minor miracles like deleting 50+ redundant tests from one of our test suites or getting a particularly annoying suite down from 14+ hours per run to under 5 hours per run were big bright spots in the soul-crushing black cloud we were otherwise working under.


    What finally broke the cycle was not RockStar having a change of heart, or management stepping in, or even my team having a mental breakdown. Rather, it was my department being involuntarily migrated to a different test framework and environment (a debacle which will get a tale to itself, don't worry), because this new test environment was both accessible from RockStar's remote location and modular enough that he could hack together a minimally-useful set of tests to run on his end. Despite all of the other pains involved with this transition, it did cut down the frequency of a single code commit breaking everything from weekly to once every two or three months.

    And finally--finally!--my team was able to get back to doing things other than fixing the same tests over and over and over again. With RockStar humming along happily and with the most glaring flaws in our test setup addressed, I hoped that that would be the last time someone else not giving a shit about our tests would become my problem.


    Coming up next: The next time someone else didn't give a shit about our tests and made it my problem.

    submitted by /u/db_dev
    [link] [comments]
