Re: RGnet's excessive problems with service, ref. ticket #1523.

Dear Susan,

I'm sorry this took so long. The event itself didn't end until the 2nd of April, and I've been busy with the usual (as you can see, it's after 2AM once again...).

The recap is incredibly long, I'm sorry to say; there are just a lot of screwed-up events.

For the most part, INSC staff were as helpful as they could be under the circumstances. They don't seem to have any useful tools. Things were OK with one person on a shift, but a day or two later the information was lost or garbled, i.e. scheduled repairs turned into redundant "testing".

To make matters worse, the escalation process in the only document we received from Sprint, a quotation/agreement emailed to us when we signed up, proved to be completely useless. For starters, most of the phone numbers and pager numbers were wrong.

We had approximately 18 hours of downtime from 22 March through 2 April. This is ridiculous, to say the least.

In compensation, I'd like to see Sprint waive our bill for March/April. I would like to have a current escalation process document. And I would hope Sprint upgrades its NOC to have the tools necessary to do the job.

A regular mailing list to customers' NOCs wouldn't hurt -- you could mail out new escalation phone and pager numbers, inform customers of scheduled downtime, send one- or two-line post-mortems on unscheduled repairs of dead systems, troubles in peer systems, etc. In other words, let us know you're alive and thinking of us. Once a week or even once a month would help!

In any case, let's get going on making things work better. I realize hardware breaks, and systems break down. If there's anything we can do, please ask.

Sincerely,
Tom Jennings

Recap follows:

22 March, the line went down about 0530PST. Functional again at 0920. Error rate remained high; arranged loopback testing for later that night, 2300PST. Obtained ticket #1523 (stayed in force throughout this fiasco).
23 March, INSC calls and says they "forgot" to test. Scheduled testing for that night at 2300PST.

1630PST that afternoon, line fails. INSC can't loop our DSU. A number of calls in the next few hours, "we're still looking". INSC asks who installed the line, are there any COs between us, who's responsible, etc. I point out that it was required that Sprint install the line themselves to avoid this very problem. Eventually INSC dispatches an MFS tech to the site; he arrives in one hour.

2130PST, MFS can loop our equipment, cannot see Sprint line. The following were involved: actual equip failure in Stockton; staff failure left some loopback jumper in place; router protocol changed from PPP to HDLC. Loopback testing postponed once again.

24 Mar 1300. Scheduled downtime for 0000 - 0200PST for PLSC to repair lash-up in place now.

25 Mar 94 0000PST. INSC says testing deferred again; our interface is broken, and would not run with patch removed. They thought slot was bad; turns out it's the LMU. We're still on the patch; after much fooling around the problem was not found. We knew this; INSC thinks they're testing when we had arranged repair (why test a known-bad system?). Ticket is still open.

25 Mar, 0057PST. Presumably repair has begun; I called to check progress. Not testing -- LMU pulled or something. Someone pulled the patch cords out. In and out as they try to substitute a new LMU. Contradictory stories of bad LMU, "LMU incompatible with [customers] DSU", bad slot. Slot 788. We're apparently patched into an adjacent slot.

0151 we're running on patches again. Arrangement was made: tech at the slot/LMU site said they can replace the LMU when some daytime-only tech crew gets in, and downtime was to be "less than one minute". We arranged that iff (iff == IF and ONLY IF) downtime would be < 5 minutes, to do it approx. noon with prior notice to me.

26 Mar 1100PST line down. INSC doesn't know what is wrong.
When I insisted on immediately escalating, all I got in response was "I will page my manager". I insisted, quite formally, "I want to escalate this to the 2nd highest level, it has been more than 6 hours downtime." Again he said he'd call his manager. At this point, I lost it and became decidedly non-polite. NOC tried to leave it at "PLSC will test". I insisted this was not acceptable.

On this call or another within the first half hour of arriving here (noon?) I was asked, "Do you want regular reports? Half hour, every hour? The next shift comes in at 11:30pm." I told them, do not even talk to me about 1130 tonight, this line must be up immediately.

When I got off the phone with INSC, I attempted to call all the numbers in the SprintLink quotation, and found many of them inoperable. Only Bob Castelli returned my call. (In the hour that followed, Steve Burke and Helen ??? called back. Helen said she now had that pager number, and would contact people for me.) Bob Collet apparently called the NOC. NOC kept insisting the PLSC was going to test RSN. I received contradictory reports.

We were called at 1450PST and told that the patch cord in place had become intermittent. INSC then wanted to arrange more "testing" and I said no, I need to stay up, I'll make arrangements later. INSC staff then told me, "we were told the customer will not tolerate any downtime". I told him this was absolutely not true: we had arranged up to 2-hour downtime windows and Sprint had been unable to meet them, and downtime Wed alone was 10 hours, unscheduled. We are more than practical and willing to work with Sprint. Sprint has the problem, not us.

I asked INSC, can you tell me approximately how much downtime is needed to repair (plug in new module? total rewire? 10 minutes? 2 hours? 4 hours? 6 hours? etc.); if we knew the requirement we could then schedule. 10 minutes easy, 2 hrs at night, > 2 we may have a problem.
---

Tim Pozar was in our 444 Market POP at the time of the outage, and was following up on his own as well.

Tim Pozar - Fri Mar 25 12:50:03 PST 1994

Called Sprint trouble number (800-726-0201) and was told that there is an existing trouble ticket (ticket # 4442449) and that the line was up and Sprint would be doing after-hours testing tonight. I told him I was at the site and saw that the line is down. He argued the point with me and finally got it. I also told him that this is a major emergency with us. He put me on hold and during that time the line came back up.

I was then transferred to "James Baker", a tech for Sprint. He told me that the patch in Stockton had a bad connector on the patch cord. It was replaced in the last 5 minutes. His number is 404-859-8106. He told me that the ticket will stay open for testing on the line tonight.

--
Tim Pozar / TLG / PO Box 410923 / San Francisco CA 94141-0923
POTS: +1 415 487-1902

After assurances that all would go well this time, with Sprint people breathing down their necks, I arranged for a four-hour repair window, 2100PST 26 March, Saturday.

26 Mar 94 2221PST. Line went down, within window. Called INSC to ask status. Patrick said that during the patch-up, they changed line card, LMU and M13 which was bad. But none of this explains the patch, bad slot, additional downtimes, etc., which were all supposedly related to the failed hardware? And pulled patch cords? The "new" ticket system was down for maintenance, so Patrick could not tell me exactly what PLSC had done, or if they had removed the patch (e.g. replaced broken equip).

0115PST line up again. Patrick cleared the stats for our port, serial #2 router Stockton 3. We can ask for stats later to compare.

27 Mar INSC calls, says they tested last night, found no errors. No repairs were done. I asked why they tested a known-bad line, and what happened to the repair order.

*** I lost a log entry here.

27 Mar arranged testing/repair, once again.
Patrick in INSC, Cathy Rogers in PLSC, someone in Stockton. We are all on the phone. Various testing, etc. Bad wire-wrap (?) found in rack, preventing substitution from working. Wide assortment of bad components found.

31 Mar 94. Arranged end-to-end BERT testing of our link, 0900PST Saturday, 2 April. MFS ticket #9403300049. Jerry Mercado in Stockton at 9:00am.

02 Apr 94 0930PST, ticket #1523. INSC said there's some issue re: where MFS will test to. Explained the MFS jack in our POP, ordered by Sprint. Sprint will test to our jack, please.

1105 Full hour of testing. No errors found. Rebooted router gw1-sf; counters are now all 0's except 0/32 abort errors. Low numbers of errors persist in our router; I now believe they are not likely to be in our Sprint line. May be router bug, other problem. Very low numbers.