Learn from Other's Mistakes

Excerpt from the IBM mainframes newsgroup

Date:         Thu, 4 Jan 2001 15:42:59 -0500
From:         Mike Ray Buechlein 
Subject:      Learn from Other's Mistakes

This may be a bit off topic, but since this is a forum for sharing
information I thought it might be appropriate.  Ever had one of those
days where everything seemed to be going fine?  You're doing your job,
cruising along and then, "Uh oh",  you take the system down and it
requires a Standalone restore to bring it back up?  I was just wondering
if anyone would be willing to share some of the "Oopsies" they've
encountered during their careers,  whether you've done it or it's
happened to someone else you know.    Perhaps it will help some of us
avoid the same type of mistakes in the future.

The one that sticks out most in my mind was when I was a new Systems
Programmer (about 2 months of experience).  One other new person and
myself were assigned to do the daily DASD maintenance (this was before
we had SMS, and DASD management was a manual process).  One of my tasks was
to go Volume to Volume compressing data sets to clear up extents using
3.4.  I was going along, thinking I was doing a wonderful job and then I
hear a Captain (I was in the Air Force) from across the room say, "Why
can't I log on?"  Then several others said they couldn't either.  Then
my cohort asked what I had compressed.  I told him, "Some PDS called
SYS1.LINKLIB."  Fortunately, he remembered seeing our trainer use "F
LLA,REFRESH" and thought he'd try it.  Fortunately, it was fixed before
anything major had happened.

Anyone care to share one?

- Mike Buechlein
  Systems Programmer
  National Processing Company
  Louisville, KY 40213
  mbuechlein@npc.net


Date:         Thu, 4 Jan 2001 14:50:15 -0600
From:         "McKown, John"
Subject:      Re: Learn from Other's Mistakes

Oh, "'Fess Up" time, is it? The worst thing that I ever did was mess up the ATCCON00 member in the VTAM library. OOPS, VTAM won't come up. No TSO. No other MVS images available either. Luckily this was many years ago. We still had a card reader, card punch, and keypunch machine. I punched up a job to unload the ATCCON00 member, corrected it, punched up an IEBUPDTE job to replace ATCCON00 and got VTAM up.

John McKown
HealthAxis
All opinions are my own and are not the opinions of my employer.

Date:         Thu, 4 Jan 2001 15:01:12 -0600
From:         Edward Gould
Subject:      Re: Learn from Other's Mistakes

John,

Sounds like you were lucky (GRIN). We had one sysprog (not me) screw it up and because we didn't have a card reader he was SOL. It was a major outage, can we all say stand alone restore together? Management was too cheap to get a card reader, what can one say? I still insist to this day on having a card reader on all systems.

Ed

Date:         Fri, 5 Jan 2001 15:49:30 +0100
From:         "Vernooy, C.P. - SPLXM"
Subject:      Re: Learn from Other's Mistakes

A card reader won't solve all your problems. A set of IPL-able volumes can save a lot more.

Kees

Date:         Thu, 4 Jan 2001 15:06:20 -0600
From:         "McKown, John"
Subject:      Re: Learn from Other's Mistakes

I don't think that I need a card reader/punch/keypunch any more since I now have four separate MVS images with totally shared DASD. If one system is kaput, I'll just use one of the others for recovery (We connect to each LPAR via TCP/IP on an EMIF'd CISCO, so there is no problem with finding a working terminal).

John McKown
HealthAxis
All opinions are my own and are not the opinions of my employer.

Date:         Thu, 4 Jan 2001 15:51:02 -0500
From:         "Metz, Seymour"
Subject:      Re: Learn from Other's Mistakes

Well, this didn't affect anyone but me, but it was a classic "I can't believe that I just did that" snafu. I was developing some applications in SAS, running under CMS, and wanted to get rid of a file list that I no longer needed. Right after I hit Enter I realized that I had transposed two words and typed EXEC CMS ERASE where I meant to type ERASE CMS EXEC. My finger was heading to PA1 before I saw the first line of output, but I still lost a few files and had to restore them from backups.

Shmuel (Seymour J.) Metz

Date:         Thu, 4 Jan 2001 15:02:52 -0600
From:         Rick Fochtman
Organization: Board of Trade Clearing Corporation
Subject:      Re: Learn from Other's Mistakes

I once pulled the same "boner" to a different library: it was an OS/360 system and I compressed SYS1.SVCLIB. (A number of SVC modules contained TTR pointers to the next module in the chain!) MAJOR OOOOOOOPS! Luckily, I had a single-pack system I could bring up and rerun IEHIOSUP on my prod system. Lots of unhappy college kids 'cuz we were down for an hour during finals week!

Date:         Thu, 4 Jan 2001 16:13:36 -0500
From:         Mike Ray Buechlein
Subject:      Re: Learn from Other's Mistakes

I'm glad to know that I'm not the only one that's goofed up that way. *grin* I've thought of a couple more. These didn't happen to me, but happened to people I knew.

We had a Captain that was using ICKDSF to initialize a volume. His device address statement was off by one number and he initialized the SYSRES.

Another was from the same Captain (since he initially installed the system). We had two separate processors, each with its own SYSRES (using 3380 DASD at this point). Several years later (after he was gone) we had an HDA crash and it took both systems down. Turns out that we were using shared DASD between the two processors and that both SYSRES volumes were on the same stack of platters. Just different addresses for the HDAs. Was the worst 18 hours of my life. 8(

- Mike Buechlein
  Systems Programmer
  National Processing Company
  Louisville, KY 40213
  mbuechlein@npc.net

Date:         Thu, 4 Jan 2001 15:11:21 -0600
From:         John Eatherly
Subject:      Re: Learn from Other's Mistakes

We had a dasd guy do an FDR dump of our whole site. We just sat and watched data sets disappear and could not figure out what was happening. Took several days to restore.

Date:         Thu, 4 Jan 2001 16:13:13 -0500
From:         "Sitko, Bob"
Subject:      Re: Learn from Other's Mistakes

FDR dumps delete the datasets?

Date:         Thu, 4 Jan 2001 15:15:00 -0600
From:         John Eatherly
Subject:      Re: Learn from Other's Mistakes

I am not sure how he did it. It was a finger check in his job. It happened about 10 years ago.

Date:         Thu, 4 Jan 2001 16:21:48 -0500
From:         Bruce Black
Organization: Innovation Data Processing
Subject:      Re: Learn from Other's Mistakes

"Sitko, Bob" wrote:
>
> FDR dumps delete the datasets?

No, but probably he did an ABR ARCHIVE which does delete them after backing them up. No doubt he messed up his SELECT statements and selected most of the datasets in the shop.

--
Bruce A. Black
Senior Software Developer for FDR, CPK, ABR, SOS, UPSTREAM, FATS/FATAR
Innovation Data Processing
Little Falls, NJ 07424
973-890-7300
personal: bblack@fdrinnovation.com
sales info: sales@fdrinnovation.com
tech support: support@fdrinnovation.com

Date:         Thu, 4 Jan 2001 15:25:27 -0600
From:         John Eatherly
Subject:      Re: Learn from Other's Mistakes

He selected all of them.

Date:         Thu, 4 Jan 2001 16:21:38 -0500
From:         "Wenger, Joseph"
Subject:      Re: Learn from Other's Mistakes

I used to work for a well known software vendor in a distant galaxy many many many years ago and we were developing a Macro CICS based financial system which at the time was state of the art both technically and functionally. For many months we would code and test by bringing up the appropriate releases of CICS, test and then shut down, and then repeat the cycle for various system configurations.

After the QA and Beta were done, a very large client with multistage network sites was to be our first client. We flew into their corporate office data center in all our glory, proceeded to unload our system, install the product, which was a very large and lengthy process, on a large production machine (their choice), fired up our system, checked it out with flying colors, and being very proud of the great job we did........ proceeded, out of force of habit, to issue an immediate shutdown to CICS.....bringing down their entire CICS production network without warning throughout the country.

I wished I had a camera to get the look on everyone's face (theirs and ours) the very next moment after we all realized what just happened. The fan blades got big dents from all the stuff that proceeded to hit them. But it was a shared gotcha, we shouldn't have been on a production machine (their call), and the shutdown command should have been secured. Had it not been for that, it would have been a very very long walk back home.

That's my story and I'm sticking to it.

Date:         Thu, 4 Jan 2001 15:23:00 -0600
From:         Ned Hedrick
Subject:      Re: Learn from Other's Mistakes

A number of years ago, working on my first MVS system, I "allowed" a PROCLIB to migrate. Next time I IPL'ed, I couldn't bring up JES because the PROCLIB needed to be recalled, but I couldn't bring up DFHSM to recall it because JES wasn't active. It's funny to think about now, but then it caused a lot of perspiration!!!

Ned Hedrick
Gordmans, Inc. (www.gordmans.com)
Omaha, NE, USA

Date:         Thu, 4 Jan 2001 16:45:53 -0500
From:         Jim Horne
Subject:      Re: Learn from Other's Mistakes

Been there, done that (something very similar). And, when JES couldn't come back up after the IPL, it gave me the opportunity to do some testing I had been meaning to do. Two hours later, I walked into my boss's office and informed him that I had successfully tested my MVS standalone restore procedure.

I'm not sure what I would have done if the "test" hadn't worked. :)

Jim Horne
Lowe's Companies

Date:         Thu, 4 Jan 2001 16:30:25 -0500
From:         "Cummings, Jennifer (ECSS)"
Subject:      Re: Learn from Other's Mistakes

Ok.....three months after I started my first job I was working on a test application and was having problems getting it to work. Unfortunately, I was testing in a production CICS 2.1.2 region. The region periodically crashed all day. The senior systems programmers were all over the application programmers. I finally figured out what was wrong with my program.......I had an out of bounds array. I asked the CICS systems programmer if an out of bounds could cause a region to crash and he promptly told me he would wring my neck (these were not the actual words) if I ever ran another program in a production CICS region again.....I told him there should have been better security in the region........(that was my second mistake).

Date:         Thu, 4 Jan 2001 15:35:37 -0600
From:         Eric Bielefeld
Subject:      Re: Learn from Other's Mistakes

I had several instances over the past 22 years where I have caused IPLs. This incident was one where I saved the system from a situation where it couldn't be IPL'd.

I happened to be looking at our production GDG storage packs looking at stuff that was unneeded so I could free up space. Behold, I find the production PROCLIB on one of these packs. I didn't think much of it at first, but then I realized that all PROCLIB datasets should be on my master catalog pack. This was back in SP1.3.6, even before XA. The PROCLIB datasets had to be in the master catalog, or have VOL=SER= & UNIT= on the DD card in the JES2 proc if it wasn't in the master catalog. Fortunately, I discovered it before the weekly IPL, or JES2 would never have come up. We didn't have a RESCUE pack or any other system to make changes on at that time.

I finally figured out who the likely culprit was, the operations supervisor, and he confessed. I did get a promise from him to let me know before he tried doing any other things like that.

Eric Bielefeld
Sr. MVS Systems Programmer
P&H Mining Equipment
Milwaukee, WI
414-671-7849
ebie@hii.com

Date:         Thu, 4 Jan 2001 17:46:25 -0500
From:         "Mullen, Patrick"
Subject:      Re: Learn from Other's Mistakes

I have a similar PROCLIB tale from the days of pre-XA. I was cleaning up the master catalog and one of the datasets I chose to move to a usercat was a PROCLIB. Next IPL, no JES2... So no panic, we started a stand alone restore of the catalog pack. It was at this time that we came to the horrible realization that our DFDSS backup cycles were a tad on the short side, being less than the elapsed time between IPLs... My change was already on every backup we had. Now it was time to panic! I was saved by finding and restoring a backup that had been gathering dust at the bottom of the "miscellaneous tapes used by the sysprogs" cupboard since our initial MVS installation (we had converted from VM/VSE) about 2 years previously.

Date:         Thu, 4 Jan 2001 15:39:51 -0600
From:         Alan Schwartz
Subject:      Re: Learn from Other's Mistakes

How many others (I can't have been the only one) omitted a comma in IEASYS00... and of course it was before the PAGE= parameter. I wish I could remember how I got around the "insufficient paging resources" error.

Date:         Thu, 4 Jan 2001 17:50:10 -0500
From:         Bob Rutledge
Subject:      Re: Learn from Other's Mistakes

When I did that to myself I was building an MVS for the first time all by my lonesome at a remote site 120 miles from home. What I did after some serious introspection, coffee and nicotine (and what you probably did) was try to remember (or guess) what the DSNs of three page datasets were and start IPLing and replying PAGE=... until I got three that worked. And immediately thereafter dumped sysres, before trying to fix my blunder.

Bob

Date:         Thu, 4 Jan 2001 16:30:47 -0500
From:         Lockwood Lyon
Subject:      Re: Learn from Other's Mistakes

You won't believe this one ...

At a major site in the Midwest U.S. some time ago they had just done a major upgrade: h/w to 3090, OpSys to MVS. New operations staff as well.

Wellsir, as luck would have it, my cubemate's initials were JS. Our TSO user-Ids were our initials (First, Middle, Last) followed by a single digit (guess where this is going?!).

One day, operations called my companion's phone. He wasn't there, so I picked it up. A harried operator said, "Hey, John, one of your jobs has been running on our system for more than two days! Can I cancel it?"

Not knowing any better, I said, "Sure". After all, how long can a compile run?

Well, jeepers, things got really hairy after the operator cancelled "test" job JES2. A red letter day for operations training, standards, and newbieness.

Told you that you wouldn't believe it ...

- - LL
Lockwood Lyon -- Meijer Technical Support
(616) 735-7553 (office)  (616) 791-5131 (fax)
Copyright (c) 2000 by Lockwood Lyon. All rights reserved.
These opinions are mine and not necessarily those of my employer, Meijer, Inc.

Date:         Thu, 4 Jan 2001 15:45:48 -0600
From:         "McKown, John"
Subject:      Re: Learn from Other's Mistakes

How can you cancel JES2? The command "c jes2" should be rejected because JES2 should be non cancelable (force jes2 may work, but I'm not checking to make sure!).

But this does remind me of a story which I don't know if it is true or not. Way back, in the MVT and HASP days, I was told one operator would constantly issue "c hasp" for some reason. He wouldn't stop even though he was told never to do that (I don't know why he wasn't fired.) Well, HASP was written to intercept the MVT console commands. So the sysprog modified HASP to check the operand of the CANCEL command. Next time the operator did a "c hasp", he received a very profane message to do something that is anatomically impossible.

John McKown
HealthAxis
All opinions are my own and are not the opinions of my employer.

Date:         Thu, 4 Jan 2001 16:36:02 -0500
From:         Dave Cole
Subject:      Re: Learn from Other's Mistakes

>Anyone care to share one?

Sure, why not.

Once upon a time, in the dim and distant past (the mid 70s), JES2 source mods were my passion. I was working for Yale at the time, and I had spent something like the last 24 hours or so writing and testing some changes that I was making to the HASPINIT module.

At the time I had already written some rudimentary breakpointing and instruction stepping support routines, and so I was sitting in the machine room at an operator's console, just barely hanging onto consciousness, running a secondary JES2, stepping through some checkpoint initialization code.

The particular mod I was testing required me to manually defeat various checkpoint integrity checks. As I was stepping along, I remember getting rather irritated at these damned warning WTORs that JES2 was throwing at me, and so every time one came up, I would find a branch to zap, then resume to let the code go merrily on its way. I did this two or three times.

The third time ... Well, do you know how awesomely quiet a machine room sounds when an entire row of 1403 printers suddenly stops?

It turned out that I was not testing a secondary JES2, but rather a second copy of the primary JES2! And that I had cold started the primary checkpoint record!

That mistake didn't just crash the system (a no big deal thing in those days anyway.) What I had done was wipe out somewhere around a thousand jobs. The effect was felt university wide.

When what I had done was understood, the operations manager very gently suggested that I should go home and get some sleep. (Tom O'Neil was one of the nicest guys I've ever known. He's no longer with us, and from time to time, I miss him.)

Dave Cole              REPLY TO: dbcole@colesoft.com
Cole Software          WEB PAGE: http://www.colesoft.com
736 Fox Hollow Road    VOICE:    540-456-8536
Afton, VA 22920        FAX:      540-456-6658

Date:         Thu, 4 Jan 2001 17:05:00 -0500
From:         Martin Strudwick
Subject:      Re: Learn from Other's Mistakes

>Fortunately, it was fixed before anything major had happened.

We had a nameless (clueless, whatever) employee reverse the DD's for a RACF database copy to prep an LPAR, effectively wiping out RACF in one fell-swoop. We had to restore from a full volume back-up, 1/2 hour downtime.

I also remember a systems guy at another shop compressing PROD1.LOADLIB while CICS had it allocated to DFHRPL. Ouch! Had to bounce Production CICS at that point. 15 minute outage.

Martin

Date:         Thu, 4 Jan 2001 16:30:35 -0600
From:         Edward Gould
Subject:      Re: Learn from Other's Mistakes

> >Fortunately, it was fixed before anything major had happened.
>
>We had a nameless (clueless, whatever) employee reverse the DD's for a RACF
>database copy to prep an LPAR, effectively wiping out RACF in one
>fell-swoop. We had to restore from a full volume back-up, 1/2 hour
>downtime.

There is a *RUMOR* going around several Chicago shops. An operator at a ******** IPLed with a date far into the future, causing a lot of datasets to be deleted. They also had some RACF issues... ask one of the frequent contributors to this list for the details.

Ed

Date:         Thu, 4 Jan 2001 18:04:51 -0500
From:         Bob Rutledge
Subject:      Re: Learn from Other's Mistakes

We're somewhat east of Chicago, but we once upon a long time ago had an operator miss the year by one while IPLing for a time change early on a Sunday. Of course nobody noticed that the system had warped a year into the future and of course nobody noticed that the TLMS scratch list (this was long enough ago that it was printed) was many, many times its normal height. Until our users about mid-morning on the following Monday started wondering just what was going on and where their tapes had gone.

We took the works down for the better part of a day to put the tape library back together. The only "good" thing that happened was that none of the IMS log tapes had been re-used.

Bob

Date:         Sun, 7 Jan 2001 17:09:41 -0600
From:         Edward Gould
Subject:      Re: Learn from Other's Mistakes

>We're somewhat east of Chicago, but we once upon a long time ago had an
>operator miss the year by one while IPLing for a time change early on a
>Sunday. Of course nobody noticed that the system had warped a year into
>the future and of course nobody noticed that the TLMS scratch list (this
>was long enough ago that it was printed) was many, many times its normal
>height. Until our users about mid-morning on the following Monday started
>wondering just what was going on and where their tapes had gone.
>
>We took the works down for the better part of a day to put the tape
>library back together. The only "good" thing that happened was that none
>of the IMS log tapes had been re-used.
>
>Bob
----Snip------

Bob,

It wasn't you ... but their first name does begin with R:)

Ed

Date:         Fri, 5 Jan 2001 13:27:07 +0100
From:         Beate Kawelke
Organization: debis Systemhaus Abtlg. TSA-DI
Subject:      Re: Learn from Other's Mistakes

I heard a similar story from somebody here in Germany. Seems like the operator entered '98 instead of '89 and all their RACF userids were revoked during startup...

Me? Well, long time ago I asked the operator for the "main console" because I wanted to initiate a dump of an address space. This was in the days when there was a separate JES3 console. I entered "DUMP COMM=..." and immediately brought down JES3 - it just asked what sort of dump it should take. Ouch. The best thing was listening to the lady at the help desk - she told the people on the phone that "the experts are already looking into the problem". She didn't mention that one would-be-expert started it ;-)

And then there was the day I installed a new release of our self-written software. We had some problems, so we went back to the previous release (which had run smoothly for months). To my horror, that release didn't work anymore - in fact the database was destroyed several times. It took some time to find out that I still had the *newer* release of the administration interface loaded in my TSO session. It used a new feature which wasn't supported by the started task on the other end - thus killing the database every time I checked the system's status...

Beate

Date:         Fri, 5 Jan 2001 08:33:10 -0600
From:         Eric Bielefeld
Subject:      Re: Learn from Other's Mistakes

I had a similar experience around 1987. We were converting some major systems from DOS to MVS running under VM. On Sunday morning, the operator had to IPL VM so we could make the V=R guest for MVS bigger. He IPL'd with next year's date, and when we IPL'd MVS, it picked up the same date. I finally noticed it while working in TSO maybe an hour after the IPL. We shut everything down and re-IPL'd with the correct date. The only problem we had was any TSO user who had logged on during that period couldn't log on again, including myself. I finally was able to reach someone with full RACF authority, and logged on with their ID and reset everyone. That led to my always having at least 2 TSO IDs with special authority.

The thing that bothered me most was that Sunday was the day of the CART Indy car race at State Fair Park in Milwaukee. I missed one of the support races, but did see the main race.

Eric Bielefeld
Sr. MVS Systems Programmer
P&H Mining Equipment
Milwaukee, WI
414-671-7849
ebie@hii.com

>>> deerhome@IX.NETCOM.COM 01/04/01 05:04PM >>>
We're somewhat east of Chicago, but we once upon a long time ago had an operator miss the year by one while IPLing for a time change early on a Sunday.

Date:         Thu, 4 Jan 2001 16:26:27 -0600
From:         "Blaicher, Chris"
Subject:      Re: Learn from Other's Mistakes

This did not happen to me, but a friend of mine.

Picture this: Disaster recovery test. Operations isolates a machine from the very large DASD complex, theoretically. My friend's job is to run a job to clean up the catalog for a 13,000 data set DB2 system so that they can re-allocate and load the recovery files. Operations isolated everything except the catalogs! My friend asked operations if all was ready. He asked the DASD people if they were ready. He asked the project manager if he was ready. All said go, so he let the job go. Everybody wondered why the production DB2 system started to fail a few seconds later.

The CIO wanted him fired, NOW. Luckily, the people who were really at fault said what really happened and he kept his job.

Oh, and the real data set restores that had to be done? It seems that people had not been checking the outputs of the backup jobs, and over half the 13,000 data sets had NO backups. Luckily, they could create the data from other sources, but it did upset things for about 2 weeks.

One thing that I did do was to compress SYS1.LINKLIB on an OS/MFT system. Only problem is IEBCOPY was an overlay structure and dies when segments get moved. Only make that mistake once.

Chris Blaicher

Date:         Thu, 4 Jan 2001 15:32:02 -0800
From:         Bob Richards
Subject:      Re: Learn from Other's Mistakes

Several come to mind:

Had a pompous Sr. Sysprog 16 years ago who thought very highly of himself. Always put me down, especially if the Tech Support Mgr was within earshot. Well, one weekend, he made simultaneous changes to both SYS1.PARMLIBs on two CECs AND IOCDS changes also!!!! Guess what happened? Yup, he took BOTH machines down at the same time, switched the IOCDS datasets and attempted IPLs on both. Neither came up. He tried everything he knew and finally called me, near tears. It seems he was supposed to be at a wedding in 30 minutes (best man). I told him to go and I'd fix it. When I got there, I told the operators to leave the room, because there was no way I was going to let that pompous *ss know how I resolved it. I calmly switched to an IOCDS that I had saved that he didn't write over, got my standalone restore tape out, restored from my weekly backup, IPL'ed one system, corrected the SAME mistake he had made in both PARMLIBs. Next I took that system down, switched to his IOCDS datasets, and both systems IPL'ed correctly. It was probably the only time I have been overtly SMUG on Monday morning. Bottom line: Always know more than your boss! And ALWAYS, ALWAYS have a backout for everything you do.

The second mistake WAS my fault. Same time period. Using SMP/E, I accidentally RESTORED everything that was not ACCEPTed (six months worth of maintenance, several thousand PTFs). No problem you say? Well, there would not have been, except Operations cancelled the job, after 20 hours, to IPL even after I had told them not to do it. It took an IBM PSR and myself "28" hours more to get the SMP/E environment back to some semblance of where it was before. God bless PSRs! Hated to see them disappear.

There are other war stories I could tell, but I have promised to protect the guilty on those!

=====
Bob Richards, OS/390 Consultant
Internet: richardsrb@yahoo.com

Date:         Thu, 4 Jan 2001 15:48:14 -0800
From:         John Donnelly
Subject:      Re: Learn from Other's Mistakes

Two real happenings....

Model brought into computer room for some publicity photos of a 360/75...this is a few years ago...model sits at console in front of system...watches flashing lights...is quite amazed...model told to "do something"...model selects big button and pushes...model has just powered off the 360/75...

Operator coming on shift has habit of issuing a $DJ1-9999 command just to get grip on what is in system...this is a service bureau...operator comes on shift one day and issues $CJ1-9999...all output for all customers just lost...

Date:         Thu, 4 Jan 2001 16:01:46 -0600
From:         Russell Witt
Subject:      Re: Learn from Other's Mistakes

Had a similar experience with the JES2 proc, had someone remove the "vol=ser= and unit=" since the proclib was cataloged (but not in the master, which was why the vol=ser and unit= were there to begin with). The real problem is that we didn't have a rescue pack at the time. So I (remote emergency sysprog at the time) told them to restore the master-catalog volume from the last weekly backup (that is the pack where the proclibs were also stored). They said fine and would call me in a couple of hours when everything was back up.

They called in less than 15 minutes asking what volsers the backup was on? I asked how I was supposed to know that, how do they normally look up volsers; "simply do a LISTC or run a TMSGRW report (UCC-1 back then of course)" was their reply. Turned out they had NEVER done a printed report of the contents of the tape library, and if the system was down they couldn't do an online inquiry.

On the upside, it turns out that by simply flipping a switch the string of dasd for this "dedicated system" could be accessed by another running system. Once the switch was flipped, the packs varied online, it took an entire 30-seconds to correct the problem and re-ipl.

Still, I always felt sorry for the operators that spent 2 hours looking at the tapes in the library one reel at a time to see the label on it (we still used gummed labels back then) trying to find the last 2-volume backup.

Date:         Thu, 4 Jan 2001 18:41:47 -0600
From:         Len Rugen
Subject:      Re: Learn from Other's Mistakes

How about hardware. We had an IPL pack (3375) crash on a VM system. No problem, stop the system, call CE. Count down to 3rd box on the string and replace the HDA. Try to IPL and fail, repeat a few times just to make sure. The disk was still toast. It turns out that the first 2 boxes were one string, then another string was reversed and butted up against the first, we had swapped the wrong HDA!

Date:         Fri, 5 Jan 2001 13:50:46 +1000
From:         "Ginnane, Shane"
Subject:      Re: Learn from Other's Mistakes

All the best ones are simple;
- the afore-mentioned COMPRESS of linklib
- using IEFBR14 to delete a PDS member
- "losing" LPALIB midway through a stage-1
- cleaning up redundant PAGE datasets while some other system was happily using them.
- ditto redundant "user" catalogs that some-one else just happens to be using as a master cat
.....

Sometimes you wonder if having trainees is really as beneficial as it's made out to be.

Shane ...

Date:         Fri, 5 Jan 2001 06:09:29 -0600
From:         reza heydarpour
Subject:      Re: Learn from Other's Mistakes

Not disagreeing, but adding another side:

I've known some BSP's who really get a kick out of this:
  jumping up & down & pointing fingers & kicking
  raise their image (to 'mngmnt')
  boost their EGO

In fact once the junior SP asked the senior guy to review the exit & the plan, the senior OK'd it. When the system hung, the senior guy was running around shouting 'he' did it ...

...Reza

>> Sometimes you wonder if having trainees is really as beneficial as it's made out to be.

Date:         Thu, 4 Jan 2001 22:17:54 -0600
From:         "Joel C. Ewing"
Subject:      Re: Learn from Other's Mistakes

I can remember several that made a lasting impression:

(1) My first "near death" experience with MVS, the one that prompted me to build a one-drive stand alone system, was when I changed something in the JES2 proc that caused a JCL error and couldn't get JES2 up at the next IPL--the first time I realized how helpless you are if there is any failure in getting JES2, VTAM, or TSO functional. I got lucky and had a volume backup of SYSRES along with a DISKMAP and was able to recover using a stand alone restore of just the tracks containing SYS1.PROCLIB; otherwise, a full volume restore would have back leveled some other installation data sets on SYSRES and caused additional confusion.

(2) In early DFP 3.0 days, had an ICF catalog go belly up, totally unusable. Happened just an hour after our once a day catalog backup. We were able to track by various manual techniques what data sets had changed in the interim, and with 5 people keying furiously we were able to get things back in sync and resume normal operation in 4 or 5 hours. We felt incredibly lucky it happened just after, not just before the backup, and also on a weekend. After that we increased catalog backup frequency to 4 or 5 a day, added weekly full catalog diagnose runs, and invested in a product capable of catalog forward recovery from SMF records (and spent several weeks chasing down ICF catalog bugs with level 2).

(3) More recently made a "trivial" PROCLIB member change, so trivial I made it on both the production and test systems, violating my normal rule of not changing both until surviving an IPL. You guessed -- at next weekly IPL time neither system would. Had to do a standalone restore of one-drive test system (the easiest for SA restore) in order to fix the production system. Learned the hard way to never put same change, no matter how trivial, on both systems without an intervening IPL test.

(4) Operations called one weekend. Console log filling up with nasty red error messages indicating I/O errors on RACF database. Asked what had been running. Defrags. Oops. RACF backup data set was on a volume recently added to conditional defrag list, system does not enqueue on this data set, and defrag had moved database out from under RACF. Fortunately had many backups and functional test system from which to restore. Have since taken multiple steps, any of which should be sufficient to prevent reoccurrence.

(5) Then there were the usual number of near misses from hardware failures, the 3380 and 3390 HDA failures that convinced us that mirroring and RAID-5 were definitely the only way to go; and the forced cold start and loss of our JES2 queues that taught us that concurrent RAMAC-2 maintenance isn't always.

--
Joel C. Ewing, Fort Smith, AR
jcewing@acm.org

Date: Fri, 5 Jan 2001 07:15:16 -0500 From: Dave Jousma Subject: Re: Learn from Other's Mistakes Ok me too.... When I worked for a large outsourcing company a couple of years ago as an MVS SYSPROG, one of my Storage Admin compatriots was busy setting up a new customer in our shared datacenter. We had a central tech support lpar that had access to *all* dasd in the shop. This is where all SMPE was done, and all other tech support software maintenance. Well, this Storage guy was busy initing strings of DASD for this new customer prior to their restores. Since he was doing several hundred volumes, he had some automation in place to autoreply to the 'U' for ICKDSF, and coded noverify on the job. Well guess what, while he thought he was initing unused disk, it was really used, and several customers lost real data, and in some cases their systems too...... Glad I wasn't him.... Dave
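[Editor's sketch] The failure above came down to initializing volumes without first checking what was on them. The idea of a verify step can be sketched roughly as below. This is only an illustration in Python, not ICKDSF syntax: the function name, the device addresses, and the volume serials are all made up for the example, and a dictionary stands in for the real volume labels.

```python
# Hypothetical illustration of a "verify before init" guard.
# The dict maps a device address to the volume serial currently on it;
# all names and values here are invented for the example.

class VolumeMismatch(Exception):
    """Raised when the volume found is not the one we expected to wipe."""


def init_volume(volumes, address, expected_volser, new_volser):
    """Relabel the volume at `address` only if it currently carries
    `expected_volser`; otherwise refuse and leave it untouched."""
    current = volumes[address]
    if current != expected_volser:
        raise VolumeMismatch(
            f"{address}: expected {expected_volser}, found {current}; not initialized"
        )
    volumes[address] = new_volser
    return new_volser


if __name__ == "__main__":
    volumes = {"0A80": "SCR001", "0A81": "CUST07"}
    init_volume(volumes, "0A80", "SCR001", "NEW001")      # scratch pack: proceeds
    try:
        init_volume(volumes, "0A81", "SCR002", "NEW002")  # live customer pack
    except VolumeMismatch as err:
        print(err)
```

The point of the guard is that automation which auto-replies to every prompt can still be stopped by a check it cannot answer on its own: the expected serial has to match what is actually mounted.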
Date: Fri, 5 Jan 2001 07:42:25 -0500 From: William Ball Subject: Re: Learn from Other's Mistakes I'm sure we ALL have "war stories" or, as I have heard them referred to, "learning experiences", "Ah Shits", "chances to excel" (by any other name). Two come to mind: The sysprog decided to move ALL of the Proclibs off of the RES pack over the weekend. It was a time when DASD maint. was semi-automated. We had developed some procedures that were run daily to clean off work packs etc. You guessed it, he moved them to a work pack. That wouldn't have been quite so bad because we DID have an exclusion list of datasets that were allowed to be on work packs; he just forgot to put the entries in the exclusion list. He made the move on Saturday and on Sunday night the work pack cleanup procedure was run. By the time I got in Monday morning, we were dead in the water. It took a little while to put it all together, and fortunately the weekly backups had been done just prior to him taking the system and I had the FDR SAR tape in my desk drawer. The second one I've kind of blotted the details from my mind, but I had made a change to something on the RES pack that couldn't easily be backed out, so my back out was going to be a SAR of the pack if things went belly up. They did, so I pulled the current backup tapes and tried to do the SAR, and the tapes were junk. Now I'm reduced to going to the vault for the ONLY other set of backups we had, and just to add a little more pressure, my boss wasn't aware of the problem and was pressuring me to hurry up so we could get to that LSU meeting in another town. Things worked out. The second set was good and there hadn't been anything but a couple of minor changes to the RES volume, but I was starting to sweat bullets. We've also had people delete SYS1.PROCLIB. Fortunately I had track map listings and AGAIN FDR let me restore it to the tracks it had come from. And yes, we've also had the old COMPRESS SYS1.LINKLIB trick played on us.
I got into the habit of making my own copy of IEBCOPY so that if LINKLIB did get compressed it wouldn't hose up IEBCOPY. And there was the time the company I was working for had "water sprinklers" in the computer room and one went off and dumped several thousand gallons of water. Fortunately (or not) there were holes in the floor and the water went on through to the basement......where the payroll department was with the checks and W2's and the telephone room. There must have been 15-20 telephone trucks in the parking lot when I got there in the morning. Everyone had burnishing tools trying to clean up the telephone contacts that had gotten wet. The telephone system never did work right after that; they finally had to replace it. The funny thing is the shut-off valve was located in the women's restroom with a chain and padlock around it, and no one could find the key to the lock. They finally put a bolt cutter to it. Bill Ball Technical Support Kent State University
Date: Fri, 5 Jan 2001 08:41:46 -0500 From: William Ball Subject: Re: Learn from Other's Mistakes A few more I thought of: The guard didn't think the red button on the computer room wall did anything, so he pushed it one night. He was looking for a new job before 8am. We had a "vendor" with THEIR version of OLTEP come in on a DASD problem one night. When it asked him what he wanted to do, he said all tests and to ALL drives. Where "big blue" had restricted their code to write only to a pack named CEPACK, this vendor had not. It promptly walked its way around to EVERY 3330 that was online and wiped out the VTOC. It took me the better part of a day to get them all back. The CE was moved to another city before the next day was out. Back in the days when the master console was a hardcopy console (can you spell 360) and syslog wasn't even a gleam in someone's eye yet, we had an operator that ran out of something to do and decided the "master console" never gets cleaned, so he decided to do it. And of course you can't get to those dirty old rods and springs without taking the rods holding the key bars out.....oooops. So to see how to correct what he had done, he decided a keypunch (remember those) was put together the same way and he took that apart.....ooooops. NOW he decides he has to call for service cause he can't get either of them back together. It was reported to me that the CE took one look (it was 3am), walked to the phone, called his manager and told him it WAS BILLABLE. Same operator some months later.....the programming staff had been working on a new system for over 6 months. The first night it ran they took their backups, but the operator reset ALL of the tapes AFTER the system had created the label on the tapes....ooooops. The next night they needed those tapes to do a restore of the database......oooops. (Three ooops's is DEFINITELY worth one Ah Shit). The operator was transferred.....to me.....Ah Shit.
I worked with him about a year when he decided moving freight around on a dock was better and quit....."thank you lord". Bill Ball Technical Support Kent State University
Date: Fri, 5 Jan 2001 08:37:06 -0500 From: "Burrell, C. Todd" Subject: Re: Learn from Other's Mistakes The worst screw-up I ever saw was when I worked in a JES3 shop. One of our two production systems got hung up one night, and around 3 in the morning the operator IPL'ed. He called me back around 4:30 and said that JES3 had not initialized yet, but he was seeing some format messages! The genius had replied 'C' to the startup parm for JES3 and confirmed it, thus cold starting and wiping out our 15 spool volumes. He swore that I had told him to do this, but all the lies in the world did not save this guy's job. C. Todd Burrell Senior MVS Systems Programmer CDC Atlanta 1600 Clifton Rd. Bldg 16 Room 2309 Atlanta GA 30333 Office: (404) 639-7648 Cell: (404) 630-3654 CBURRELL@CDC.GOV
Date: Fri, 5 Jan 2001 09:35:30 -0500 From: Bruce Black Organization: Innovation Data Processing Subject: Re: Learn from Other's Mistakes A long time ago I worked in a data center with a large computer room. Some genius put in a Halon system with nozzles only at one end of the room, assuming that the high-pressure Halon would fill the room. I was in there doing system maint one weekend when the system discharged accidentally. I hid under a keypunch (it was a long time ago) until the maelstrom finished. When I came out, it looked like a war zone; every bit of paper, every piece of equipment not fastened down, was all over the place. 3330 disk covers (remember them) were smashed against the back wall. They replaced the system with a series of low-pressure nozzles all around the room. I wonder why? Why did the system discharge? The Halon system maintenance people had just entered the room, but they swore they never got near the system to discharge it. Right! They tried to bill us for recharging the system. -- Bruce A. Black Senior Software Developer for FDR, CPK, ABR, SOS, UPSTREAM, FATS/FATAR Innovation Data Processing Little Falls, NJ 07424 973-890-7300 personal: bblack@fdrinnovation.com sales info: sales@fdrinnovation.com tech support: support@fdrinnovation.com
Date: Fri, 5 Jan 2001 08:40:55 -0600 From: "Babonas, Tony" Subject: Re: Learn from Other's Mistakes When I was new to MVS (coming from VM/VSE) one of my first assignments was DASD cleanup. My colleague sysprog showed me this wonderful utility called DFDSS that could move files from disk A to disk B. What a wonderful gadget. I started moving files and was making great progress in my first assignment. Suddenly phones began to ring............Seems PANVALET quit working, then other program products, then CICS. My colleague sysprog asked me if I had moved any load libraries, especially any in the linklist. I asked, "what's the linklist?"
Date: Fri, 5 Jan 2001 09:50:06 -0500 From: Sam Knutson Subject: Re: Learn from Other's Mistakes Big Rock Theory: A wise old sysprog taught me to avoid the Big Rock Trap. The Big Rock is the one you hang over someone else's head, and then they beat you to death with it the morning after :-) Never make changes that are left potentially active, i.e. an update to a PARMLIB member or LPALIB that will be used if an IPL occurs. One will when you are least expecting it, and some other poor joe will have to figure out why the xyz stopped working or why the system won't come up. Essential reading for avoiding common pitfalls: MVS Systems Programming by Dave Elder-Vass. Although not updated since 1993, still a wealth of good advice. Buy the newly reprinted copy at iuniverse. Our price: $39.95 Format: Paperback Size: 8.25 x 11 Pages: 552 ISBN: 0-595-00184-X Publication Date: Mar-2000 http://www.iuniverse.com/marketplace/bookstore/book_detail.asp?isbn=0%2D595%2D00184%2DX You can view the pages on-line using the Browse Before you Buy tab on the book details page for MVS Systems Programming. You can also visit his web site http://www.mvsbook.fsnet.co.uk/ which has about 1/3 of the book's content on-line with other useful things. IBM's Redbook series, the ABC's of Systems Programming, at http://www.redbooks.ibm.com: (1) http://www.redbooks.ibm.com/abstracts/sg245597.html (2) http://www.redbooks.ibm.com/abstracts/sg245852.html (3) http://www.redbooks.ibm.com/abstracts/sg245853.html (4) http://www.redbooks.ibm.com/abstracts/sg245654.html (5) http://www.redbooks.ibm.com/abstracts/sg245655.html If you can find it, "The Systems Programmers Problem Solver" by William S. Mosteller, ISBN 0-89435-271-7, has some great nuggets. Thanks, Sam Knutson
Date: Fri, 5 Jan 2001 09:03:45 -0600 From: Ed Billowitz Subject: Re: Learn from Other's Mistakes Several folks described some situations in which they ended up doing a stand alone restore. My story starts when I was called very, very early one morning by a sysprog whose changes didn't work. He went to his backout plan to do a stand alone restore. That also didn't work. It seems that dss maintenance had changed the format of backup tapes and the stand-alone program was never regenerated. I called a friend at an installation with a similar MVS software level, and he arranged for me to pick up a stand-alone dss tape from the operator. Unfortunately, an ipl showed it to be a stand-alone dsf tape, contrary to the external label. Another call to my friend letting him know about his exposure, and our mounting panic. He got dressed and came in to create the tape. (Did I mention he was a newly-wed, and a very good friend.) Fortunately it worked, and we were up just as our scheduled down-time window ended. Ed Billowitz ebillow@mcvh-vcu.edu
Date: Fri, 5 Jan 2001 10:45:19 -0500 From: Carol Srna Subject: Re: Learn from Other's Mistakes Wow!! Now that's what I call a friend. :-)

> Another call to my friend letting him know about his exposure, and our mounting panic. He got dressed and came in to create the tape. (Did I mention he was a newly-wed, and a very good friend.) Fortunately it worked, and we were up just as our scheduled down-time window ended. Ed Billowitz ebillow@mcvh-vcu.edu
Date: Fri, 5 Jan 2001 09:37:25 -0600 From: "McKown, John" Subject: Re: Learn from Other's Mistakes I've thought of a few more mistakes that I've made or have heard of. 1) Many years ago, I did an EXPORT DISCONNECT for the ACF2 high level qualifier. Next IPL, ACF2 wouldn't initialize properly. ACF2 was partially initialized and would not allow anything else to come up at all. Not JES2, not *anything*. No disaster res (too expensive to let tech services have an entire 3350 to themselves, totally wasted). Luckily the ACF2 support people helped me and I just happened to have an alternate COMMNDxx member which did not try to start ACF2. When I IPLed without attempting to start ACF2, I could get things running by replying to a WTOR issued for every resource access. This was way back when installing ACF2 required front-ending a number of IBM modules by relinking (1987 or so, I think). 2) Not me, but I was told that a sysprog wanted to implement a JES3 modification. Too much of a bother to work up an IEBUPDTE deck + SMP/E usermod. The person simply used the ISPF editor to edit the source code. Unfortunately, he had the habit of automatically doing a "RENUM" before doing a SAVE. He destroyed all the sequence numbers. No more JES3 PTFs could be applied. The last backup was from BEFORE the last major SMP/E run, which did massive JES3 maintenance. All gone (including the sysprog). 3) Again not me, but I know from personal experience. We had two systems with shared DASD. The two systems rarely wrote to the other system's volumes, but would on rare occasion. The senior sysprog had heard that genning DASD as SHARED would result in (in his words) extremely poor performance. So he didn't. He never understood why we would end up with a corrupted VTOC on some packs on rare occasion. From what a friend in operations told me, this "problem" had been occurring for years. It took two of us almost 3 months of constant arguments to finally convince this person to gen the DASD as shared.
As an aside, this was back in MVT days. Not a "problem", but this same person did a new MVT gen almost every weekend trying to "fine tune" the parameters. Not that anybody was complaining - everything was running just fine. He also said NO to running HASP since it had "too much overhead". 4) This was mine. Back in the 3330 days, I needed to have a disk initialized. I was at a class, so I left instructions for operations about what to do. I left the card deck + instructions with them. The only step that I left out was "remove old volume & mount new volume". They ran the job and reinitialized a production pack. Luckily for me I had put the new VTOC in a different location from the old VTOC. The DOS/VS sysprog ran DITTO on the DOS system and changed the VTOC pointer in the label to point to the old VTOC. I only destroyed one file by overwriting it with the new VTOC. IOW - get your instructions right! 5) I didn't do this, but I fixed it. A company that I used to work for ran UCC-2 (DUO). At the time, they had 3350 DASD. After I left, they were moving to 3380s. The person who took my place knew even less about DOS than I did. He used HSM to archive a DOS "library" (equivalent to a maclib - I forget the correct term just now) from a 3350 and restore it to a 3380. Unfortunately, that level of library contained MBBCCHHR pointers to members instead of TTRs. And to HSM, this looked like a BDAM file. The library was totally hosed. Luckily, there was a tape backup, made with the proper DOS utility, which I was able to use to recreate the library on the 3380. I don't know what you can learn from this. Does anybody still use DUO or whatever it may be called now?
John McKown HealthAxis All opinions are my own and are not the opinions of my employer.
Date: Fri, 5 Jan 2001 10:43:15 -0500 From: "Metz, Seymour" Subject: Re: Learn from Other's Mistakes

"Opportunity"? I recall installing the TSO Command Package and discovering that a whole bunch of PTFs listed in the SUP keyword were actually prerequisites. No problem, I'll just restore the whole shebang from the backup. Imagine my delight when I got a label check, and a scan revealed that the tape had someone else's data set on it. I naturally assumed that someone had overwritten my dump, until I looked at the date. It seems that the tape library had two tapes with the same volume serial number, and they couldn't find the one with my dump on it. I was not a happy camper.

Why "opportunity"? Because this was one of the incidents that inspired me to write:

The boss said that he understood why I would want to wait
A week or two to run the PUT, but not six months or eight
He made me do a mass APPLY, but later said to me
The reason for the long delay I now begin to see!
Mañana, mañana, mañana is soon enough for me.

I once installed a FUNCTION, with grief it filled my cup
It had a lot of PTFs inside the keyword SUP
But oh my friends and ah my foes, guess what it did to me
When it turned out those PTFs should have been on the PRE!
Mañana, mañana, mañana is soon enough for me.

My system crashed this morning, and would not IPL
I called my friendly PSR and he told me "well
That mandatory PTF I told you to APPLY:
If you run JES2 or JES3, your system she will die!"
Mañana, mañana, mañana is soon enough for me.

The comments mentioned prereqs, I said "Why do I care?",
But it turned out that they forgot to put them on the VER.
Oh, this preventive service would be alright with me,
But I tried to do a LOCATE and it creamed my CVT.
Mañana, mañana, mañana is soon enough for me.

Those folks at Sterling Forest I envy not one bit,
For every single PUT cycle they're certain to get hit;
There's a way they could help themselves and fill my heart with glee:
Take the damn JES2 change team and teach it SMP!
Mañana, mañana, mañana is soon enough for me.

Shmuel (Seymour J.) Metz
Date: Fri, 5 Jan 2001 11:47:00 -0500 From: Carl Sommer Subject: Re: Learn from Other's Mistakes (It appears that I'm a younger pup than most on this list.) My first job out of college was as an MVS systems programmer for IBM, as part of a team that supported 70+ systems across 7 sites. The opportunity for screw-ups was very high, and I certainly had my share. Some of these (or similar) have already been mentioned. My most unique boo-boo would be creating a network storm between a couple of JES2 systems, when I didn't get the CONNECT stuff right. Seems I created a loop between two systems in Florida, each of which thought the other had a connection to Texas. But this thread brings back memories of many, many beeper calls... Carl Sommer, Carl.Sommer@netiq.com, NetIQ Corporation, www.netiq.com, (919) 337-0251, Morrisville, NC
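[Editor's sketch] The loop described above (two nodes each believing the other is the way to Texas) is the kind of thing a simple next-hop walk will catch before the definitions go live. The following is a generic Python illustration, not JES2 configuration or any real JES2 facility; the node names and the shape of the routing table are assumptions made for the example.

```python
# Hypothetical routing check: `next_hop` maps each node to the neighbour it
# would forward to in order to reach `destination`. Node names are invented.

def find_route(next_hop, start, destination):
    """Follow next-hop entries from `start` and return the path taken.
    Raises ValueError if a node is revisited (a loop) or a hop is missing."""
    path = [start]
    seen = {start}
    node = start
    while node != destination:
        if node not in next_hop:
            raise ValueError(f"no route from {node} toward {destination}")
        node = next_hop[node]
        if node in seen:
            raise ValueError("routing loop: " + " -> ".join(path + [node]))
        seen.add(node)
        path.append(node)
    return path


if __name__ == "__main__":
    good = {"FLA1": "FLA2", "FLA2": "TEXAS"}
    print(find_route(good, "FLA1", "TEXAS"))

    bad = {"FLA1": "FLA2", "FLA2": "FLA1"}  # each thinks the other reaches Texas
    try:
        find_route(bad, "FLA1", "TEXAS")
    except ValueError as err:
        print(err)
```

Running a check like this for every source and destination pair is cheap compared with a network storm.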
Date: Fri, 5 Jan 2001 11:48:21 -0500 From: Mike Hall Subject: Re: Learn from Other's Mistakes When I was a brand-new sysprog back in the mid-80's, I did VM & CICS support for a small shop in Atlanta. We had a single 4341, and my boss & I were the only 2 sysprogs. I was doing my first install of a new release of VM, and had been testing it as a virtual machine under the primary VM. Now I was ready to do a stand-alone IPL test. This shop was so small that our system was shut down on the weekend, so one Saturday I went in to do my test. I brought my then 3-year-old son along with me. My test started out great - my VM install IPL'ed without a hitch, and I stood at the master console in triumph, amid the hum of the disk drives, savoring the moment of my first IPL of an OS I had installed myself. Triumph abruptly turned to panic, however, as the hum of the machines suddenly changed pitch. The drives were spinning down! I looked down at the VM console for any error messages; nothing unusual there. I hit the enter key - the console was frozen. Then I looked around, and saw my son walking from around the end of the CPU, pointing at it and looking very pleased with himself. The 4341 was a long, low box with a big red paddle-style power switch at one end, in easy reach of a small child, so naturally he had flipped it down, to the "off" position. We had clear plastic covers on all the disk drive switches and the IML/IPL buttons at the console, but that CPU power switch was just hanging out there, naked and unprotected. Now I really started sweating bullets, as I had recently read in the 4341 manual that once the switch was set to "off", an SE visit was required to reset the machine before it could be powered back on. I was afraid this would be a billable call, and I had been with this company for only a few months. I figured I was knee-deep in it at this point. I crossed my fingers and flipped the switch back to the "on" position. 
To my relief, the system powered right back up. I IPL'ed production VM, brought up VS1 & CICS, and tested the whole configuration. Everything worked fine. I decided not to push my luck, and called it a day. I kept the incident to myself, but a few days later, our system hung solid at mid-day, as it sometimes did. I was in the machine room with my sysprog boss working on the problem. He decided a power-on reset was called for, but instead of pressing the IML button at the console as was normal practice, he walked over to the CPU and flipped the power switch off and back on. Seeing my opening, I innocently asked, "Doesn't doing that require an SE to reset the CPU before you can power it back on?" "Nah," he replied, "I had the SE disable that feature, so we wouldn't have to call him in every time we needed to do a power reset." Apparently this was standard practice in many 4341 shops. A couple of months later, an operator managed to bump that switch accidentally (or so he said) and powered off the CPU in the middle of our big nightly batch run. A switch cover was installed on it soon afterwards.
Date: Fri, 5 Jan 2001 21:54:23 -0600 From: Lance Kopplin Subject: Re: Learn from Other's Mistakes Back in '75, running SVS, had some time before a vacation. Read the manual _carefully_ and made a change, last thing before vacation. When I got back, called the operator to ask how things were going. Got an earful of questions about "had I done anything?" Undid what I did, system straightened out. Eventually learned that the manual was wrong. We had systems located in Omaha, Des Moines and Minneapolis. We were JES2 then, so the NJE connections were numbered. Omaha had several systems, so the operators were used to using the JES2 command to route output from N4 or N6 to another environment that had a printer attached. One night, N4 got typoed to N5. After a while, Des Moines noticed that their output was disappearing. Analysis showed the problem and a simple command shut down the line. So they got smug and took a break or whatever, while JES2 was noticing the path through Minneapolis. One of our System Programmers renumbered the JES2 RMT's. This was at a time in history that we gave him the nickname - Moammar and the suicide JES parms. RACF datasets filled at 8 o'clock on a Friday evening. We had adjourned to the local bar/dive at logical five (it's five o'clock somewhere). The operators knew where to call to find me and I wandered back. I noticed that they seemed very intent on watching me at the keyboard. Then my mind focused well enough to notice that I was typing a few letters, and then pausing. When I paused, my finger stayed on the last letter and the typematic feature stuckkkkkkkkkkkkkkkkkkkkkkkkkkk. Without thinking, back up and type a few more. This was apparently much more entertaining than I normally am. We used those RACF datasets for about six years. Looking at a dump of a catalog abend on a Tuesday morning, the trace table seemed to indicate a serialization problem. Looked at a UCB, no reserve count. And the UCBTYP was unfamiliar.
We had done a major DASD changeout over the weekend, and changed to using HCD panels. We had always done gens by copying the macros, and the share parameter didn't mean anything on the panel. Operations took an afternoon for a team building meeting out in the country, so we filled in on the consoles. Big storm came up, tornado warnings, sirens, the whole nine yards. Decided to stick it out at the console. We got our chilled water from a commercial source, they called from the basement to say that the chilled water was warming up (oh great) and the department head walked in the door with a flunky and said: "Oh, good. Kopplin is here". Didn't do a damn thing, the tornado missed our building, the water cooled off after the tornado went by, and nobody in operations knew there had been a tornado. Ya never know. Back in the '70s, an IMS bug wiped out a VTOC. Operations manager had them start the FDR restore, and after a few blinks on the 3420, had them cancel it (that's enough to get the VTOC back). The techs were busy for a few weeks after that, recovering data. We had a 168 that overheated. The CE figured out there was sand in the chilled water supply and backflushed the 168. Our Operations Manager had been a Machinist Mate in the Navy and watched. Next time it started overheating, he took a look, flipped some valves, and backflushed the 168. He was a _good_ Machinist Mate. Ya know this could become the never ending story.... Lance
Date: Sun, 7 Jan 2001 14:03:42 +1100 From: Ken Brick Organization: Brick Computer Services Pty. Ltd. Subject: Re: Learn from Other's Mistakes Mike Hall's mention of his son turning off a 4341 rang a bell. I used to take my son, aged about 3, into work on the weekend to give the wife some relief. He found power buttons on the 3340 disk drives and also on a printer that I didn't know existed (the power off buttons, not the units). After many years as a DOS/VS sysprog I made the switch to MVS and one day had an IPL scheduled for that evening. First IPL that I had to do. JES2 failed with a JCL error. Apparently during the day my manager had accidentally deleted one of the PROCLIBS, realised and restored it but didn't catalog it. Caused a 20 mile trip to our other site to build a SYSRES with a JES2 proc that worked. We changed the way we coded the proc after that incident. Another I remember, because it involved an IBM mistake, was when we had a 4381 that was being upgraded to a dyadic. I went thru the stage 1 sysgen deck and found a single parameter, ACR from memory, that I thought needed changing and then a full sysgen. At the upgrade planning review meeting IBM told us no, it didn't, and what's more they confirmed it the next day. The upgrade got done, the system IPLed and we only had one CP. A very fast full sysgen had to be performed with red-faced IBMers getting in the way. -- Ken Brick Brick Computer Services Ltd. kbrick@netspace.net.au PH: 613 9817 5506 Mob. 0409009764
Date: Sun, 7 Jan 2001 19:21:24 +0000 From: Simon Pawson Subject: Re: Learn from Other's Mistakes List, Apologies if this topic is dying off, but I had to set up mail from home as I felt that I couldn't post this from where I work now (which is where I normally lurk on the list from). Please note that these events did NOT happen where my current contract is (although at least one other person on the list worked at the shop when these events happened). I confess to only one of these. Problem (1) 'Free Money' Just before Xmas several years ago..... Due to a minor bug, a junior programmer was given the task of making a minor change to the software that ran this bank's ATM network. This was such a minor change to an urgent problem that the change would go in via the 'patch' process rather than with a full system test. It got through unit testing (designed by the programmer himself just to test his change) and went live. Suddenly the branches reported that people were claiming not to be getting the correct amount of money from the ATM. The programmer had tidied up the code (as it comprised very clumsy coding of double negative IF statements) in another part of the same module. What he had logically done was reverse the 10 and 5 note hoppers in the machines across the entire network, so that if you asked for 50, instead of getting 4x10 and 2x5 you got 4x5 and 2x10. Of course if you asked for some amounts you got far more than you had asked for. Interestingly, some customers worked this out and went back many times. Some lucky support people missed the pre-Xmas office party writing a quick and dirty program to reverse the process so customers ended up being debited the correct amounts. Moral (1) Don't let the programmer test his own changes. There is no such thing as a minor change on an important customer-facing system. Don't fix things that aren't broke. This individual also EPOed a centre once. Problem (2) The library was there a minute ago.
Late one afternoon the shift ops phoned to say that IMS jobs that were starting on the International Funds management IMS system were JCLing, even though the IMS control region was up. It appeared that the IMS reslib had disappeared. What had happened is that (in those pre-GRS days) someone had run a job that either deleted or renamed the library on the wrong system, without a hold card. Sadly the rest of the suite that put in the new IMSGEN had hold cards, so they were just sitting there ready to run. Luckily the system was restored before the multi billion pounds of transactions missed their deadlines. Moral (2) Double check your JCL. Again. Problem (3) Disappearing ATMs One morning the ops phoned to say that they had had reports from the branches that their ATMs hadn't opened as normal. Instead of about 1700 cash dispensers there were only about 700 open. The branches with broken ATMs couldn't do any work themselves. We then tried to back out the change (which meant closing down the entire ATM and Branch network). The backout failed as the backup was corrupt, so we had to rebuild the entire gen from scratch. It took hours. What had happened is that the member of staff who had built this system had decided, without reference to anyone else, to reorganise the source so all the terminal definitions were now in alphabetical order. Sadly his work wasn't very accurate and he removed lots of definitions that were required. Interestingly enough, he ignored error messages that came out later in the process and continued without them. Sadly he had also been responsible for the backup process and copied PDS files to the MSS as if the 3330Vs were tapes. He then updated them as PDSes later on, thus corrupting the backups. Moral (3) Get someone to check your work or even write something to do the checking. Test the backups. If the system tells you that you have made a mistake, you never know, you may have... This individual left and became the support manager for a software house.
Problem (4) How to corrupt a database Again, on the last working day before Xmas the shift ops reported lots of error messages stating that the databases that held customer details on the ATM network were possibly corrupt, with lots of transactions abending. There had been soft errors on a pack on this bank of drives the previous day. In the 80s recovering this amount of data would mean that the service would be out for the rest of the day. As you can imagine there was a lot of panic. Strangely enough, though, retrying the transactions seemed to work, and the error messages moved across the packs involved and jumped to other databases on the packs. Suddenly they stopped, with errors on fixed locations on the volumes. It transpired that one of the storage team had run Inspects from another CPU that had access to these volumes. Being keen, I believe he had done the whole string. As his team leader was in the pub he couldn't double check that this was OK, so he went ahead anyway. The ops had spotted the job running and cancelled it - leaving the binary test pattern rather than the business data behind. The team leader was back from the pub very quickly. Moral (4) Think before you cancel that run-away job. Interestingly enough, the team leader got more trouble as he made a total disaster of recovering the databases. Luckily he was in the pub with his senior manager. If you are still reading by now, I confess to number (2). If you want more, such as how dual frame 3990s (or maybe 3880s) really do have a single point of failure and how an IBM engineer found it for us, or how a simple JCL error delayed one of the biggest IBM ESPs ever, let me know. Simon -- Simon Pawson
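[Editor's sketch] The arithmetic behind Problem (1) in the message above is easy to check. The snippet below is a hypothetical Python illustration, not the bank's code: it takes the planned number of 10s and 5s, swaps the hoppers as in the story, and compares the value that actually comes out against the amount requested. The function name and its parameters are invented for the example.

```python
# Hypothetical model of the swapped-hopper fault. `tens` and `fives` are the
# note counts the dispense logic planned; with the hoppers reversed, each
# count is pulled from the other denomination's hopper.

def dispensed_value(tens, fives, hoppers_swapped=False):
    """Return the cash value that physically comes out of the machine."""
    if hoppers_swapped:
        tens, fives = fives, tens
    return 10 * tens + 5 * fives


if __name__ == "__main__":
    # Request for 50 planned as 4x10 + 2x5, as in the story.
    print(dispensed_value(4, 2))                        # correct wiring
    print(dispensed_value(4, 2, hoppers_swapped=True))  # customer is short-changed

    # Request for 50 planned as 1x10 + 8x5: the swap now overpays.
    print(dispensed_value(1, 8, hoppers_swapped=True))
```

An independent test that simply asserts "value dispensed equals value requested" for a handful of amounts would have caught the reversal, which is the practical content of Moral (1).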
Date:         Mon, 8 Jan 2001 08:43:58 +0100
From:         Witold SCISLAK
Subject:      Re: Learn from Other's Mistakes

Some years ago, at my previous job (an industrial enterprise, 1000-1500 mainframe terminals), I went down to the HMC room to re-IPL my test LPAR. Just after I confirmed the IPL process I realized that I had just re-IPLed the p r o d u c t i o n LPAR. I stood motionless, pale, covered with cold sweat, waiting for furious phone calls from the users, but ... there was dead silence. What was going on? I looked at the clock. It was 15:01. In this enterprise, the shift change time for both workers and clerks was 15:00. The first shift had gone home, and the second had not started work yet (at the mainframe terminals, at least).

...............
Pozdrowienia/Regards, Witek.
- If you think education is expensive -- try ignorance -
OS/390 Software Support

Date:         Mon, 8 Jan 2001 11:49:58 -0500
From:         Dave Juraschek
Subject:      Re: Learn from Other's Mistakes

(1) Went to a shop once to upgrade a guest OS they ran under VM. Before going there, insisted that they do two backups of the OS being upgraded. The upgrade process required that you base install the new OS, then apply local mods, files, applications, accounts etc. on top of the new OS code. Thus, the backup of the old OS was critical. For safety's sake, did a DDR backup of everything upon arrival at the site. Installed the new OS, and when I went to add their local stuff back on top of it, we discovered that both backups (as well as all they had been doing for 6+ months) were trash. Seems that there was an error message that nobody thought anything of. And even though this OS had a utility that would read the backup and verify its integrity, they had by policy opted to skip this verification, trusting that the untested backup was o.k. Good thing I had the DDRs.

Had to restore their old system and fix the backup process. Run it again (twice - I always back up twice in case of I/O errors in the media). Run the verify job against the backup. Re-install the OS and re-apply the local stuff on top. Lost 18 hours for what should have been a 4 hour job. If it were not for my DDRs, everything would have been toast. After that, I've never compromised on the double backup discipline.

(2) Same story as already told by others. Made a minor change to SYSRES (JES, I think). System wouldn't come up. Backup was bad - operators had re-used a tape and forced it to be accepted - blowing away the backup. Other backup had an I/O error. SOL and nauseated! Luckily, the night before a full system dump had been done. Had to hunt down its output, figure out what tapes were used, and restore SYSRES. IPLed & re-did the change (checking very carefully this time - I knew what I had screwed up). Had the system back up just inside the maintenance window. Thanked God for His grace.

-Dave

Date:         Mon, 8 Jan 2001 12:46:18 -0600
From:         Craig Otway
Subject:      Re: Learn from Other's Mistakes

How about selecting a test LPAR on the HMC to IPL, and OS/2 leaving the prod icon selected too. :)

Date:         Wed, 10 Jan 2001 18:10:46 -0600
From:         John Ford
Subject:      Re: Learn from Other's Mistakes

Thought you'd seen the last of this thread? I have three, from different shops I've worked at...

1) Male Power

Peter Duffy's Electrical Connectors problem reminded me of this one. We were adding disk drives, and the electrician had run the power under the floor, terminated with the connector specified. I was pulling floor tiles to make sure all the cables were in the right place, and noticed the power connector. Being attracted to shiny metal things, I picked it up, and suddenly realized that what I had in my hand was a MALE plug, with the prongs carrying 200+ volts, just waiting for a place to go. Sure enough, when I finally convinced the electrician to double-check it, they had ordered the wrong part number. The female plug's part number was one off the male's.

Lesson: When an expert contradicts common sense, go with common sense.

2) Transfer to Oblivion

New release of a banking application going in, during the nightly batch window. The normal process is to run a job that deletes the backup load library, copies the current version from production to backup, deletes the production library members, copies members from the test library to production, and finally deletes the members from test. To rehearse the process, we would comment out the deletes and point the "to" libraries to temporary datasets. For the actual cut-over, we'd un-comment the deletes and point the "to" libraries to the real datasets. On this night, the new "tech-support guy in charge of crap jobs" did everything right except for pointing the "to" libraries back to the real datasets -- he left them pointing to the temp datasets. The transfer job completed normally, but all libraries were empty. Yup, we started submitting compile jobs.

Lesson: JCL and THC don't mix.

3) Unlabeled Tapes

Implementing an automated tape management system is a long, manual chore -- cataloging all datasets & volsers, changing JCL, etc., culminating in the "leap of faith" step of removing the dataset labels from the tapes themselves. Tech support assured the operations manager that they weren't needed, and would just cause confusion if left on the tapes. Once we went live, and proved the concept, the ops manager had the night shift remove the labels from all the tapes in the racks. Next morning came the question, "How do we know which tape is which?" Seems the night shift wasn't told to leave the volser labels on the tapes. Fortunately, it was hundreds, not thousands, of tapes in the racks.

Lesson: Instructions should include what NOT to do.

Date:         Thu, 11 Jan 2001 08:31:41 -0600
From:         Cliff Hess
Subject:      Re: Learn from Other's Mistakes

I've resisted this thread for over a week, partially because I thought a similar story would be shared. I got a call late one Friday afternoon that one of the RVAs in the shop was getting moderate cache alerts, which quickly turned into critical alerts. About 45 seconds after getting the severe cache alert, the phone rang. When an RVA hits 90% NCL, you get a call from IBM suggesting that you HTFQ (Hold Those Foolish Queues).

After cleaning up this mess, we got down to the bottom of this issue. A year earlier when this RVA arrived, it had enough back end storage to support 128 addresses. Since the plan was to upgrade this box in a year, 256 addresses were genned to 'save time in the future'. That afternoon another guy in the unit, who had only been there for 6 months, got a call from the DB2 group. They wanted 16 volumes to 'try something' and, not knowing the configuration situation, he obliged them and went off to a meeting without telling anybody what he had done. I'd only been at this shop for a month so I didn't have a clue either. Of course, the genius behind all of this didn't remember this when he 'trained' the two of us. He suddenly remembered a meeting in another building as soon as the trouble started. It made for quite an interesting Friday afternoon that lasted until about 10:30 PM.

Recounting this story has reminded me of a question that I've been meaning to ask. Does anybody know for sure what happens if an RVA hits 100% NCL? I've heard that the microcode gets confused and all data on the box is lost. But then again I also heard that:

Jerry Mathers (the Beaver) was killed in Viet Nam in 1970.
Mel Gibson had to have reconstructive plastic surgery on his face after a bar room fight when he was 18.
If you leave a cat alone in a room with a baby, it will be attracted to the smell of milk on its breath and suck the life out of it.

Date:         Wed, 17 Jan 2001 02:31:24 +0800
From:         Ron & Jenny Hawkins
Subject:      Re: Learn from Other's Mistakes

I tested this years ago on an original 9200 (like, it had HP drives in it). It just stops accepting writes, like when cache is full. In the shop where I set this up I allocated 2 volumes with 16MB of hard-to-compress data. If the box showed signs of going non compos mentis we just deleted the volumes so it had some more room to write.

