Don't press the big red button! Tech Support

Don't press the big red button!
It's always DNS... and now I'm in trouble...
Short story of why I left the wiring rack unorganized and my boss praised me for it.
Mission Critical Monitor
of running

Posted: 29 Nov 2019 05:39 PM PST

Hello talesfromtechsupport! I fix Xray machines for a living and thought it would be interesting to share a few stories about my job.

To set the scene.

I just got done with a job at a site and made it home. Walked in the door and my phone buzzed with a service call, at 5:30PM, on a Friday. I wasn't thrilled.

I call the customer up (a customer that is still a thorn in my teams side) and ask what the issue is. They explain that their system is not working. The computer is on but the table and OTS (overhead xray tube suspension) has no power whatsoever. I ask if they have checked both emergency power off switches on the table to see if they had been accidentally pressed (this was an ER and a common occurrence). They explain that they had indeed checked them and they wanted me there ASAP. I explain that it will be an additional charge as they are not under contract and it is after normal hours. They say no problem we are getting a PO (purchase order).

So I hop in the man van and hightail it to the site. I arrive to a few frantic xray techs not too happy they are doing everything in another room they hate. They want me to get it running NOW. Ok let me look at it I say. First thing I check is both EPOs that I mentioned above. One was pressed in, most likely from a bed hitting it. I pull it out and what do you know? It's now working perfectly.

I grab my stuff, say y'all are good to go, and mention the bill that will be coming shortly. Their response was "We're in trouble".

Total round trip was 2.5 hours of travel and the minimum 2 hours of labor. All for 10 seconds of work they could have done themselves if they just listened to me. Don't press the big red button and check it if your FE asks you to!

I got a few more if folks are interested.

submitted by /u/Iamonly
[link] [comments]

It's always DNS... and now I'm in trouble...

Posted: 29 Nov 2019 01:35 PM PST

So me being the sole herder of electrons in the company has been a point of contention for a while, namely around holiday. I hadn't taken a full day's holiday this year. My father works overseas and finally came back home after 4 years, so I booked the same time off work - 2 weeks. As it was a family occasion, I finally managed to arrange us a break at a forest resort we used to frequent back in the 90s for the second week. As such, I decided that I would go off the grid for week 2. However, I would have my laptop at home and be available for support for week 1. I hoped anything that would go bang, would do so during the first week.

For the most part, the stuff I've built is all automated. It's not quite Puppet levels yet, but day to day is pretty hands-off. Of course, 2 weeks is time enough for anything to start going wrong, so I tried to arrange cover. A downside is that we're 50% Ubuntu, 50% Mac, so finding an MSP willing to support Linux is difficult, if not impossible. Nobody I asked wanted to know. I was able to arrange phone support with a local AASP to cover the Macs, but Ubuntu support eventually fell to getting one of the developers to cover for me. Not doubting him, he did well given the circumstances, but apparently I sowed the seeds of my own downfall even before I went away.

Our backend is entirely Linux - Devuan VMs running on KVM providing DNS, LDAP and several other services. Because this was running on quite old equipment (quite literally machines I'd pulled out of my junk pile as a PoC for why it would benefit a cloud-heavy company to have development stuff on-premises), I recently got approval to build out proper servers - 3 VM hosts plus a file server, fitting my design of N+2 redundancy for active systems. I was very pleased with the hardware I was able to get my hands on (10 cores, 64GB RAM, 240GB SSD, 4x spinning disks, 10Gb ethernet, ~£1,000 per machine), but my plan to build the machines myself (Supermicro boards in some very neat rackmount chassis) meant that I was dependent on the delivery of bits and pieces to get the machines built. And they came in drips and drabs. The last motherboard didn't turn up until the first week of my holiday. And demand for Windows VMs to test with was rising faster than my ability to find places to run them. So I had to throw the machines together, rack them and get them usable far ahead of schedule (I was intending the switchover for after I came back).

On the plus side, building the machines worked okay. I knew what I was doing, documented the first one and built the next one identically. I was able to migrate the existing VMs without downtime and shut down the original hosts (one was a 2950. Our reduced power bills will soon cover the cost of the new servers...). Machine #3 was on my workbench in pieces before I completed the migration, but that wasn't the cause of the issue.

So, trying to build a modern, secure LAN, I tried to set up everything encrypted and validated. This included SSH fingerprints in DNS. Now, I've used BIND for a long time now, for over a year at work with no problems. I have a master and 2 replicated clients; the master is not given out via DHCP, just the replicas. This has worked well, and generally if I cause a problem on the master, the replicas don't get the updates and I can fix it without affecting the workstations.

And this is where things went wrong. I was also trying to get BIND to accept dynamic updates. I've done so before, running dhcpd on the same server and pushing between daemons, but we have DHCP from the Ubiquiti USG, so I was intending for clients to push their own IPs and SSH keys into DNS. I got this half-working; the rest was a work in progress. Obviously with dynamic BIND zones, making any manual changes requires a 3-step process:

Freeze the zone to prevent dynamic updates interfering
Make the update
Thaw the zone

So, I built the new VM hosts and sent their SSH key fingerprints to the DNS master manually. I followed the 3 steps to add the keys in; I've done this before, no problem. Well, apart from the dynamic zone files being very differently formatted to the static ones, which BIND does by itself. So at step 2, I manually adjusted the entries to match the munged format.

Then step 3 didn't work. The zone refused to thaw. Restarting BIND had no effect. The error message I got made no sense, and turned up minimal results when searching; I had to read the BIND source to figure out what was causing it! But that didn't get me any closer, and removing the entries I added didn't work either. Worse, BIND decided the whole zone, lan.company.com, was now invalid and refused to load it. On the plus side, the change didn't reach the replicas, which continued to serve lan.company.com without any of the updates that broke it. Okay, it's not ideal, but it's close to my holiday, so I'll take it. I decided not to do anything else in case I broke it further.

So week 1 goes smoothly with the expected people bugging me on Slack for assistance. For the most part, I don't mind, since it would take me a mere minute or two to VPN in and make the necessary change, versus my minion having to do something he hasn't done before. I've shared all the instructions and documentation with him, and walked him through some examples, but most of it relates to the hideous expansion of Windows that I want to keep under tight control.

Nothing looks ominous so I go all in for week 2 - I shut off my Galaxy S5, pop the SIM card into my ancient Nokia 6280, leave my work laptop and most of the distracting electronics at home and head into the middle of a forest with my family (elderly grandmother, my dad and his wife) for a week. And it's beautiful. Silent, calm, peaceful.

Until Wednesday.

Leaving the swimming pool for a fireworks display that evening, my phone finally catches the poor signal in the middle of nowhere and informs me of a text. It's my boss, who has my personal phone number, reporting a serious problem and asking me to call. Immediately thereafter, my phone runs flat (the lack of signal is burning through the battery; for those that don't know, phones adjust their transmission power based on the received signal strength, so when you have no signal, your phone is practically screaming in all directions to find a cell tower) and we're heading to a restaurant for dinner. And wouldn't you know it, my dad's and stepmother's phones have the same issue, the spotty signal has caused them to run out of battery as well. So for a few hours I start wondering what could be bad enough to interrupt my holiday. I've accounted for more or less any problem I can think of at this point. But there's always the old failsafe.

switch ($problem): default: return DNS()

I'm really hoping it's not that, because I'm the only one in the company that knows it, even though I'm using fairly standard and straightforward configurations.

Later that night, after cramming some power into my phone (has anyone noticed that old Nokias weren't designed to charge fast? Smartphones are a race to 75%, but dumbphones are a gentle stroll), and searching the resort for signal, I manage to text my boss back. It's around 9PM at this point but she still wants to speak to me. I ask how catastrophic the problem is, and it's bad.

It started off quite innocently. A user couldn't print (don't we all love those?). Now that's unusual since our printers have been pretty reliable. My boss relays that $Minion did some debugging and determined they could still print via IP address. Further research concludes that the round-robins and aliases aren't working, and the whole internal domain is non-functional - lan.company.com isn't resolving, full stop. However, external sites are still working.

Yup, this is a DNS problem, our side. My first solution would be to give BIND a nudge, but there's a further problem - I never granted $Minion access to the backend servers. I couldn't foresee a problem bad enough that he'd need it. Okay, there's a physical option, although it's quite drastic. I instruct my boss how to reboot the physical VM hosts holding the replica VMs (tap the power button, wait for all the VMs to shut down, wait for the host to power off, then start it back up). Figuring that's the best I can do for now, I hang up and go back to our lodge.

I grab my personal laptop (of course I took one with me), jump on the free wifi and start drafting an email to $Minion, outlining how to add himself to the necessary LDAP group to access the backend servers (using the emergency cn=admin password I left in the safe), and what he would then have to do to massage BIND into cooperating. It's then that I get a flash of inspiration. Some of you may have already worked it out.

For those who don't have to deal with DNS: when you define a DNS zone, you set various parameters. One of those is the TTL. (Time To Live) Often, this is set per record, since you don't want a silly-short TTL so the value can be cached, but equally not an excessively long one in case you need to push an update. However, the zone as a whole has a TTL. The replicas periodically pull from the master. The pieces fit together and I figure out that the TTL has expired. Ordinarily, the replicas would pull, but I broke the zone on the master, didn't I. The replicas worked exactly as designed, serving their caches until they expired, but then it all came to a screeching halt.

At this stage I realise there is nothing $Minion can do, and nothing I can do until I get in on Monday. Even the instructions I gave to $Minion to grant him access to the servers won't help - all of the servers use key-only auth, and pull SSH keys from LDAP. How do they query LDAP? DNS round robin, of course. So in short, DNS failing has completely locked everyone out. My only course of action will be to add a screen and keyboard to the physical VM hosts, tinker with /etc/hosts, regain access to the KVM consoles and then massage BIND back into shape. I'm probably going to have to head in on Sunday to do that.

So we got home from our holiday a few hours ago. In that time, I've put my SIM card back in my smartphone, and as soon as it got an internet connection, Slack went crazy with notifications. There's a conversation between $Minion, my boss, the VP of Engineering and the CEO that I've been pulled into. And oh boy, they're (quite rightfully) not happy. In addition to the inconveniences of having to use IP addresses to print, the VPN isn't working (some of the QA people work from home regularly and haven't been able to access their test VMs) and two people's desktops are locked out (some quirk of sssd seems to cause it to randomly abandon cached credentials if it can't access its backend for long enough, even though I tested it extensively by having machines off the LAN for weeks). I sent an apology and my best guess at what happened, including what I'm going to do to fix it.

Of course, I get the flurry of 'what can we do in future?' questions, but not the ones I'm expecting. Several are asking why the LDAP servers depend on internal DNS. Well, SSL, right? Can't validate SSL against an IP address, has to be a FQDN. This is when I get curveballs thrown. The VP of Engineering asks me why I'm running SSL on 'just our internal domain'. Well, passwords? Okay, but then why do I need to validate SSL, not just allow the machines to accept any old certificate? Um, because that would discard about 90% of the security of SSL and allow easy MITM'ing. The CEO stepped in and asked us to continue this conversation on Monday.

Now, I get that I caused the problem, that I'm not denying. But ironically, by trying not to make things worse, I seem to have made them much, much worse, with a neatly delayed reaction. First thing I'm going to do when BIND starts working again is increase the zone TTL to a month...

I suspect I'm also about to have a very difficult conversation about why I'm following best practises when 'it works' will get the job done...

TL;DR - It's always DNS. I managed to cause a time-delayed major outage while off the grid for a week.

submitted by /u/gargravarr2112
[link] [comments]

Short story of why I left the wiring rack unorganized and my boss praised me for it.

Posted: 30 Nov 2019 01:22 AM PST

So I work for a chain of hospitals that are newly purchased from another company. So I was tasked with a small team to change over the complete network to our servers and network equipment. So at the end of the week everyone else had gone home. I had 3 hrs till my flight home and it seemed like the work was only mostly done. I tell my coworker that I dont think I'll make my flight and to change it for me. He comes back after two hours saying that there isn't a flight until the next day in the afternoon. So I say shit I have to get on my flight. So I stop working grab my tools and run out the door. My flight leaves in 1 hr and I'm not even at the airport yet. It takes me 30 minutes to drive to the airport. I drop the rental car and book it to the terminal. I get through security in 15 minutes and make it to my gate 10 minutes before my flight leaves. I make my flight and when I get back to the office the next week my boss calls and I told him the state of the track and his response was "Are you kidding. That is the smoothest one of these things has ever gone. We will go back when the old servers shut down and finish the job then." I feel lost .

submitted by /u/ChronosHorse
[link] [comments]

Mission Critical Monitor

Posted: 29 Nov 2019 06:10 PM PST

So this happened yesterday. Unfortunately. Keep in mind its Thanksgiving, and I'm basically the last line of defense for those lucky enough to get to stay home with their families and have a holiday.

$Me: standard dead inside greeting

$RelativelyInnocentUser: Hey, our second monitor isn't turning on. No lights, no messages on screen.

$Me: Ok, let's make sure all those cables are seated properly, un-plug and replug the power and then display cables

$RelativelyInnocentUser: The what?

$Me: Internal monologue: oh goody, one of these calls. Good thing my end users just keep people alive for a living but has no idea what a power cable looks like. The power cables. Just trace the largest, likely black, cable on the back of the monitor to the power strip and unplug it then plug it back in.

$RelativelyInnocentUser: Still nothing after an appropriate length of time/scrambling noises ugh so dusty...

$Me: Ok, what's the device ID, I'll remote in and see if there's anything I can tell.

Fast forward the story a bit, I can remote in, she can see me moving on the second screen, so I set it up to be the main screen and think, here's a fix for less than 18 hours. I'll throw the ticket over to the team that can fix it on site or at least check it out and see if the end user just doesn't know how to hit a Source button or if it's an actually dead monitor. Some brain filtering of non-important information occurs during this conversation.

Do that.

Not 5 mins later....

ring ring ring banana phone

$Me: standard dead inside greeting

$THIS_MOTHERF'er :Hi, I'm doctor $THIS_MOTHERF'er, and I need this monitor working.

cue my annoyance because a workaround is never good enough no matter how short a time and its clearly a patient critical issue OH WAIT NO ITS NOT.

$Me: Is this the so and so department PC? Computer ID such and such?

$THIS_MOTHERF'er: That's right. And I need this computer working. It's critical. I'm the head of long spiel of bullsh* only $THIS_MOTHERF'er cares about*.

$Me: My understanding is that you have a monitor working? I worked with $RelativelyInnocentUser to get one up and running. Is that not the case?

$THIS_MOTHERF'er : One monitor is, but we need the other. It's critical to the entire city!

$Me: wut?

$THIS_MOTHERF'er : This monitor is critical for the whole city. I can deal with it until 11 tonight, but my replacement the king sh of wtfever ville for all I was listening* won't do it. He's the head of some bs. So you need to just bring me a new one.

$Me: ...You've got a working monitor though, the ticket is high priority they'll be in first thing in the morning to replace/repair it. So-

$THIS_MOTHERF'er: You have people on call don't you?!

$Me: Theoretically, yes, but that's for life threaten-

$THIS_MOTHERF'er : So just call them in and tell them to get me a new one. What's your name again?

$Me: $Me.

$THIS_MOTHERF'er : And your boss?

$Me: $Boss.

$THIS_MOTHERF'er : I expect a status call about it soon. click

Now you get the name. I'm livid. It's a second monitor. You've got a working one, just switch tabs you f*. Not only are you an a-hole, but you somehow have a 'city critical' device in a place that is privatized and not state funded wtf ever?! What does that even mean!?

I had to call someone in on Thanksgiving. I failed. On a dumb issue that our department 'had' to cave in to this tyrannical jerk. This is more rant-y than funny, but I mean this dude's job is to save lives and apparently he has to make statements like "this monitor is important to the entire city".

The tech I had to page in about it was super cool, and if I ever meet him I'll buy him whatever he wants, but the sound of his family in the background laughing and having a good time hurt a little bit, because I was the one calling him away from that.

And to cap it all off, when the tech called me back to update me, all he had to do was reseat the cables and turn the monitor on (then reset the display settings to show on both) but SERIOUSLY. $RelativelyInnocentUser lied to me OF COURSE. And didn't do the simple fix that would have saved calling this tech away from his family. Users, man.

Sometimes this whole helping other people career sucks.

submitted by /u/Lokirial
[link] [comments]

of running

Posted: 29 Nov 2019 06:06 AM PST

2 years ago

$me: obvious

$cw: my hero

$dba: another hero

One of these days, busy with too many things at the same time. Is this relevant? Well yes ofcourse!

Clickety-clickety

Import file?

Clickety Yes

Replace rule?

Clickety Yes

WTAF

Cue empty list that used to consist of well over 100 rules. Just my newly imported rule in there.

Oh shite

Cue me running across the hallway

$me: "Houston we have a problem"

$cw: "What's up"

$me: "Check $UEM-software console, the policies tab"

$cw: *whistles*

$me: "Now what?"

$cw: "Let me check. I have this habit of making backups of everything I am going to touch before actually touching it. Yep I got an export of this table, it's a few months old."

$cw Imports backup file.

$me: "Thank you, that looks a lot healthier already."

$cw: "Yep, now we need to restore an actual backup of the database because some other stuff might not be in sync."

We walk over to the database guys

$cw: "$dba, what do we need to do to get the database restored from $UEM-software so we can extract the missing pieces and put them back where they belong? It's on this server in this database, tables so-and-so"

$dba: "Ah. You know the actual name of the server, the database AND what tables you want. No need for a ticket then, it's gonna be done this afternoon."

It pays to do your homework with these guys

*** later that day ***

$cw: "I restored the missing rules from the database export, they were just 4 or 5. Chances are, if nobody logged in during the empty list time, nobody even noticed."

$me: "Good to know, but I still will be bringing this up in the next team-meeting. If only because we found out there's no daily granular backups made from those tables in $UEM-software and there's nothing to spice up those meetings like a good old-fashioned $%^##$-up by yours truly"

$cw: *grins*

TL;DR ALWAYS READ before you click, saves you from running across the hallway.

Aftermath: As a result of that team-meeting, a lot of backup/DR/recovery/update/patching/monitoring plans were made and executed, not only for $UEM-software. It seems I had gently touched a closet and the wall behind came down showing a complete roomfull of skeleton-filled-closets.

submitted by /u/evasive2010
[link] [comments]

Breaking News

Tech Support

Saturday, November 30, 2019

Don't press the big red button! Tech Support

Don't press the big red button! Tech Support

No comments:

Post a Comment

Popular

Blog Archive

Fashion

Beauty

Travel