
    Sunday, June 2, 2019

    Disaster Recovery or: Redundancy is great until that fails, too Tech Support


    Disaster Recovery or: Redundancy is great until that fails, too

    Posted: 01 Jun 2019 09:16 PM PDT

I hate disaster recovery (DR) testing. It's such a pain in my ass. Wrangling all the vendors and educating the users, coming up with the plans and formalizing all the documentation for the C-levels. But I know firsthand why it's so important. All the little annoyances and frustrations- they're all worth it.

I hadn't been in my position for very long- only about 6 months or so. My $boss and $coworker-not-appearing-in-this-story were also new. We'd all been hired at the same time and tasked with managing our central database and its related hardware. The system, while not in bad shape, wasn't exactly in the best of shape either. Crucially (and as we would later come to regret), it needed firmware updates and a few other housekeeping things done.

I was waiting to finish my last task of the night: import a file into the database when it came in, then notify my coworker in another department. She would then do....something...with that data and then we could both go home. The system had a little console utility that let us view the console messages in real time from the comfort of our desks. I normally kept it up all day just to keep an eye on things.

That night the file was later than usual. So I sat at my desk, busying myself with something or other while the console utility scrolled serenely on my second monitor. I'd gotten used to its messages: users disconnecting, backup logs being created, background processes starting and finishing. But then something caught my eye- an unfamiliar message. I turned my full attention to the console and read:

    HARDWARE FAILURE DETECTED

Oh. Well. That's...probably not good. But obviously the server was still alive since the console utility was still scrolling messages. And it wasn't a hard drive failure, since that would have displayed the somewhat more helpful "Hard drive failure detected". I poked at what I could but couldn't figure out what was wrong. So I phoned my boss.

    $me: Hey, it's $me. I'm getting hardware errors on the console.

    $boss: ....Ok. Well I'll take a look. I'll call you back.

    Meanwhile, the file we were waiting for had come in but I wasn't comfortable importing it with the hardware error still scrolling across my console. I told my coworker what was going on and she decided to go home. We could deal with the file in the morning.

$Boss called me back and said he couldn't figure out the source of the error, either. Luckily, he lived nearby, and 5 minutes later he walked in the door.

    $me: Do you want me to stick around?

    $boss: Nah, go home. I'll figure it out.

    $me: Ok. Call me if you need me.

    $boss: Will do.

    Ah, the benefits of being an hourly employee, I thought to myself as I drove home. Consequently, the remainder of this story was told to me by $boss the following morning.

He called our hardware vendor and they determined that one of our drive controllers had failed. Not great, but not terrible either. The secondary controller had picked up when the primary failed, just as intended. But then everything went to shit. All our hard drives suddenly dropped- the secondary controller had just failed. The vendor realized that our controllers were on an old firmware version, one that had a serious bug. The controllers were programmed by default to undergo a self-test every so often. Normally this wouldn't be a problem, since the secondary controller would take over until the primary rebooted itself. Only, because of the bug, the controller never came back up from the self-test. So the primary self-tested and died, and then the secondary self-tested and died. Great.

    They managed to get one of the controllers back up in the wee hours of the morning, but the damage to the database had been done. Apparently it hadn't liked all of its drives disappearing and had suffered unrecoverable corruption. Our CIO had, by that time, also come into the office and he made the decision to declare a Disaster Recovery Event.

    We enacted our DR plan and most of the work to switch over to our DR system was done by the time I got in the next morning. We told our users what had happened and that yes, the DR system was noticeably slower than our production system; please don't call us about it. That mostly worked. Mostly.

Not long after I got in, $boss and the CIO went home to get some well-deserved rest, and I was left to deal with some minor quirks of the DR system, mostly printer-related (don't we all love printers?). But the transition was largely seamless for our users, which is the true goal of any fail-over event. We were on the DR system for about a week while our failed controllers were replaced. And then, well, that was it. We switched back to our production system with no further issues.

    Pretty anti-climactic, I know. But isn't that a good thing with disaster events? We had a plan, we followed it, and no data was lost. And everyone lived happily ever after. Well, except for that one department: no matter which printer they printed to while we were on the DR system, the document always printed on the manager's printer. I never did figure that one out.

    submitted by /u/hyacinth17

    The Email is down. A story totally not like the other one.

    Posted: 01 Jun 2019 09:36 AM PDT

At the time of this story I worked in a chase-the-sun support location for a large multinational financial institution.

Onsite we had a variety of support groups who had 24/7 coverage of the institution's vital infrastructure and IT services.

    I myself worked on the America shift and was a lowly 1st/2nd level help desk op. On my shift I would say we had another 20 people in my team alone, and then there were the guys and gals working the UK/Euro shift so another 30 or so of them. We were the front line for pretty much everything that had a power cable and allowed the institution to make millions of dollars every hour.

We had good working relationships with the 3rd level support contacts and the escalation points for the specialist teams who actually dealt with the real stuff beyond turning it off and on again, resetting passwords, or teaching someone how to use Excel. We still had to know stuff about things, but most anything we couldn't solve would get bumped across to a specialist team.

It was an average night for us: we had calls in the queue, but nothing crazy, and everyone was just taking it as it came. But suddenly the queues spiked. All of them. From 2 or 3 waiting to 99 waiting. Guess how many digits the queues could display?

But because this is a modern finance institution, we also have chat channels. All the support channels for general users start pinging with red notifications from people in them, all with roughly the same questions.

    WHAT THE FUCK HAPPENED TO THE EMAIL?

    Is your email down too?

If we can't send this email we are going to lose hundreds of thousands of dollars! Who's your manager?!?!?!!!!

So, hey, good news: we can clearly see this is an institution-wide outage. We have a procedure for this.

Priority one: get the message out that we are aware and looking into it.

Concurrent priority one: contact the email escalation service desk and find out if they are aware (never assume), and if so, what they know at this point.

Getting the message out was simple: record a new message for the IVR and update all the support channels. We had that sorted in about 5 or 6 minutes.

Contacting the email service desk was hard. They weren't answering. Given they were in the same building as us and worked the same hours, it's not like they could just disappear. We already knew it wasn't the network, because all those chat channels were still going. They used the same infra as the email services did, except for the actual servers, but even then those were in the same data centers.

We had to send one of the team leads down to actually go speak to the email service desk to get some info so we could actually start telling the users something helpful.

Good news is that before he gets back, email services resume globally. YAY, the institution is back to making millions per hour again! Woohoo!

When he comes back, it's clear he's pissed off. We didn't get any other info, though, until about 20 minutes later, when everything had settled down and we'd cleared the queues.

What happened? Well, apparently the email service desk realised it was necessary to carry out some emergency maintenance on the global Exchange servers, primary and backup. So they pulled the servers down and let the escalation and business contacts know they were doing it.

    By sending an email......

    When the global email exchange servers were down.....

    When asked why they didn't answer their phones, they were in a meeting discussing the maintenance procedure steps and didn't want to be disturbed.

Color me surprised when I read the next week that we had positions available in the email service desk.

    submitted by /u/DonkeyDingleBerry

    I can't type! 'Did you plug in the keyboard?' I can't type!

    Posted: 01 Jun 2019 02:37 PM PDT

This one happened a while ago, back when I was interning at an IT company, but I thought I'd share it with you guys anyway. So basically, to graduate from my high school I had to do 8 weeks of internship (there is no minimum wage, but most companies in Austria pay their interns, albeit a laughable amount; then again, my duties consisted of sitting in an office and watching YouTube videos, so I'm not in any position to complain about the pay being too low), so I was interning at that company for a month, or 4 weeks. Since my school specializes in biomedical engineering, I interned at a hospital location, where they would provide tech support for the entire site. Then we get a call.

    I won't insult your intelligence by telling you who is who.

    Boss: OP, User in Room is having trouble with their keyboard. It's not gonna be too hard, just take a keyboard from the shelf, go there and swap it out, you know the drill.

So I went there and interrupted what seemed to be a very private talk between patient and doctor to find the keyboard that was causing trouble, which wasn't plugged in. I stared at it for a minute in disbelief and then asked:

    OP: Why is this not plugged in?

    User: Well I unplugged it, because it can't be THAT important and I had to copy some data from this USB-Stick and there were no free slots...

    OP: are you done copying the data?

    User: Yes, but the keyboard isn't working.

    Still processing what just happened, I just wordlessly plugged it back in and left.

The hospital does technically have a policy against USB sticks because of the huge IT security issues they can cause, but they didn't pay me enough to listen to a rant from an old doctor 5 years from retirement about me, a lowly intern from an external company, telling him, a fucking high-and-mighty doctor, what to do and what not to do. The USB ban is, by the way, the reason it's impossible to plug anything into a PC without unplugging something else.

    I do feel sorry for the patient, but I'm glad to report that in the two years after this exchange the keyboard hasn't caused any issues ever again that I know of. (That was towards the end of the third week, so I was basically only there for one more week, and they don't keep me informed about what happens to that one keyboard I just so happened to plug in. I do like to think that they're still getting a decent laugh about it though...)

    submitted by /u/Deus0123

    Passive aggression

    Posted: 01 Jun 2019 09:51 AM PDT

    I support healthcare software that keeps electronic medical records. My company has a number of products in the healthcare industry but my team only supports two of them.

    The team I'm on has a running Skype chat going all day to chit chat and bounce questions off one another since we're broken into smaller teams that support parts of those two programs. I support interfaces. Another tech, I'll call him Dave, supports another area of the software.

A call comes in. Dave posts to our chat what the customer is giving him for an issue, which Dave doesn't understand. But it sounds like an interface issue. So Dave requests help and I jump on.

Dave in chat: This guy is asking about some DMS files that he's not seeing crossing the interface. I don't know what DMS is.

Me: Yeah, never heard of that. But I'll join. FYI, I have 10 minutes till I need to call a customer.

    I get on the call.

    Me: Hi. I understand you have an issue with an interface? I'm not sure I fully understood the issue. Would you mind summarizing for me?

    Caller: I'm an interface manager for <hospital> and we have a DMS interface with your server which runs your company's product. That server isn't allowing connections and I'm calling to find out if there's been a change or if you want me to shut off the interface.

    Me: I'm sorry but I can't say to shut it off since it's not my data.

    Caller: It's your server though. You guys configured it. $install_engineer (a name I am not familiar with) set it up and configured the interface. We were told not to email him directly and to call into support if we needed help.

    Me: Again, I'm sorry that you were told that but we don't own servers or manage them. I don't have any access to that server. We're strictly software. The server is not ours. Though it makes sense that you were told not to contact $install_engineer since his team just does builds and not support.

    Caller: Okay. So, you don't care what happens and I'm just supposed to let this data build up and not go anywhere?

    Me: I didn't say I didn't care. I want you to get the interface going again but I'm not the support for the other side. And I'm not familiar with the product, so can't say who to call.

    Dave and I have been chatting in our team chat while this conversation is taking place trying to figure out WTF this guy is talking about.

    Dave: While you guys were chatting, I found an old case for <hospital> and found just one case that seems to be related due to the terms being similar. I can create a case and send it to the same team. I'll include your contact information if that would be okay with you.

    Caller: Yeah. I guess if that's what has to happen.

    Me: Okay. It sounds like Dave has this well in hand. I've got to drop for my scheduled meeting with another customer. Dave will get this to the right team and get you taken care of.

    I still don't know how he ended up talking to us. Our team's support line doesn't include a menu option for whatever he was calling about.

    submitted by /u/saint_of_thieves

    Placement matters.

    Posted: 01 Jun 2019 09:35 AM PDT

    LTL, FTP

Way back in the dark ages of my IT career, I worked help desk for a restaurant chain's corporate offices, supporting POS (point of sale) systems, and would handle calls from managers and owners whenever their system would act up. This was 13-14 years ago, so there may be some paraphrasing.

Our players:

$CM: Clueless Manager

$HS: Helpful server

$Me: me

Our drama unfolds in the fluorescent cube maze of the upper corporate floors. RIINGGG

$Me: "This is Grumble, how may I save your pancakes today?"

$CM: "Hi, we can't make any sales." (I wish I could make that sentence up.)

Troubleshooting ensues, going fairly normally until the rails fall off.

$Me: "Ok, what I want you to do is move your mouse to the bottom right-hand corner of the monitor and..." I cut off as I hear an audible thunk from the other side of the phone.

$CM: "Ok, I moved it there, now what?"

Deep breath.

$Me: "Ma'am, did you physically put your mouse on the bottom right-hand corner of the monitor?"

$CM: "Just like you asked."

Cue me asking $CM to hold while I mute them and proceed to laugh myself to tears. Regaining control, I unmute my headset.

$Me: "Ma'am?"

$HS: "She had to go, so she handed the phone off to me."

We proceed with troubleshooting with no further mousecommunications.

I have since moved on and up, but this is one of those things that will stay with me forever.

    submitted by /u/Grumble128

    The Case Of The "Broken" $ThinClient!

    Posted: 01 Jun 2019 09:47 AM PDT

So, I work for IT at my university, supporting a subset of the whole university. I do a mix of helpdesk and hardware work, and at the time of this story I was basically Tier 1 (but we don't really have a tiering system).

    Cast of characters:

    $Potato: Me, a self-conscious humanities/ "soft science" student who somehow managed to land an IT job

$Scott: The head of the off-campus housing center in question--I'd only ever seen his picture on the housing bulletin board, which he replaced with a stock picture of Steve Carell as Michael Scott

    $Janitor: A well-meaning janitor

    On this story's fateful day, I found a ticket in our queue from a division of our university housing department that was off campus (we have off-campus apartment complexes that are managed by the university). Apparently, the apartment community center's lab computer, which is there mostly to access the Internet and Microsoft, was "non-functional." After trying to remote in, we decided an on-site visit would be best. I live in that particular apartment complex, so I figured I could head over and take a look at the end of my shift.

    Now, an important bit of background for this story--for dorm and apartment lab computers, we use a certain brand of thin clients, I'll call them $thinclients. They're cheap, and easy to replace--all departments put in a sum of money at the beginning of the year, and in exchange they get access to our stock of $thinclients. So, when one goes down, we can take another over, replace it, and troubleshoot the old one. If the old one is beyond repair, there's no cost associated with it for the department. Because of this, I went over at the end of my shift, figuring I'd just either reconnect it to the Internet, change some settings, or just remove it and bring it into work tomorrow, and then bring a new one at the end of the day. (It was a low priority ticket, so delays weren't a big deal).

    Yeah, that wasn't the case.

    I get there to find the community center lab in a state of...disrepair. The table next to it is a sort of apartment supply exchange, where you leave something and take something, and someone, presumably $Scott, had decided to use one of the large paintings from the exchange to block off the community center computer. After moving it, I discovered...just a monitor. Literally, just a monitor. No mouse, no keyboard, no $thinclient. I find $Scott milling around in the back.

    Me: "Hey, Scott, um...I'm $Potato, from IT on campus, here to look at the Community Center computer...are you, um...did you take the keyboard and mouse?

    $Scott: "Nah, some of the students probably took them." *sigh*

    Me: "Okay, we can bring some from our stock, but, um...where's the $thinclient that isn't working?

    $Scott: "Oh, in that cabinet there." *points to a wooden cabinet with carvings* "Let me see if $Janitor has the key."

I look through the carvings, but I can just see a mess of cables and no $thinclient. After a moment, $Janitor comes and explains that he can't find the key. He does, however, help me jimmy it open. (Lockpicking skill increase!) The door swings open to reveal...a completely different model of $thinclient that I've never seen before. After consulting with my supervisor via $Workplacemessagingapp, I learn that these were phased out completely a couple of years ago, before my time--or so we thought. We hadn't supported these in literal YEARS. As such, the network port had been deactivated for the whole community center (no one was using it, and it costs money to keep them up).

    After sorting through the mess of cables, I retrieved the $ancientthinclient and did my best to explain the situation to $Scott. He gave me a couple of solutions to propose to my supervisors, and after a long time waiting for the network port to be activated and fiddling with IP settings, we managed to get the Community Center functioning. *muted applause*

    submitted by /u/potatoweeb
