Learn from Other's Mistakes
Date: Thu, 4 Jan 2001 15:42:59 -0500
From: Mike Ray Buechlein
Subject: Learn from Other's Mistakes
This may be a bit off topic, but since this is a forum for sharing
information I thought it might be appropriate. Ever had one of those
days where everything seemed to be going fine? You're doing your job,
cruising along and then, "Uh oh", you take the system down and it
requires a Standalone restore to bring it back up? I was just wondering
if anyone would be willing to share some of the "Oopsies" they've
encountered during their careers, whether you've done it or it's
happened to someone else you know. Perhaps it will help some of us
avoid the same type of mistakes in the future.
The one that sticks out most in my mind was when I was a new Systems
Programmer (about 2 months of experience). One other new person and
myself were assigned to do the daily DASD maintenance (this was before
we had SMS and DASD management was a manual process). One of my tasks was
to go Volume to Volume compressing data sets to clear up extents using
3.4. I was going along, thinking I was doing a wonderful job and then I
heard a Captain (I was in the Air Force) from across the room say, "Why
can't I log on?" Then several others said they couldn't either. Then
my cohort asked what I had compressed. I told him, "Some PDS called
SYS1.LINKLIB." Fortunately, he remembered seeing our trainer use "F
LLA,REFRESH" and thought he'd try it. Fortunately, it was fixed before
anything major had happened.
Anyone care to share one?
- Mike Buechlein
Systems Programmer
National Processing Company
Louisville, KY 40213
mbuechlein@npc.net
Date: Thu, 4 Jan 2001 14:50:15 -0600
From: "McKown, John"
Subject: Re: Learn from Other's Mistakes
Oh, "'Fess Up" time, is it?
The worst thing that I ever did was mess up the ATCCON00 member in the VTAM
library. OOPS, VTAM won't come up. No TSO. No other MVS images available
either. Luckily this was many years ago. We still had a card reader, card
punch, and keypunch machine. I punched up a job to unload the ATCCON00
member, corrected it, punched up an IEBUPDTE job to replace ATCCON00 and got
VTAM up.
John McKown
HealthAxis
All opinions are my own and are not the opinions of my employer.
Date: Thu, 4 Jan 2001 15:01:12 -0600
From: Edward Gould
Subject: Re: Learn from Other's Mistakes
John,
Sounds like you were lucky (GRIN). We had one sysprog (not me) screw
it up and because we didn't have a card reader he was SOL. It was a
major outage, can we all say stand alone restore together?
Management was too cheap to get a card reader, what can one say? I
still insist to this day on having a card reader on all systems.
Ed
Date: Fri, 5 Jan 2001 15:49:30 +0100
From: "Vernooy, C.P. - SPLXM"
Subject: Re: Learn from Other's Mistakes
A card reader won't solve all your problems.
A set of IPL-able volumes can save a lot more.
Kees
Date: Thu, 4 Jan 2001 15:06:20 -0600
From: "McKown, John"
Subject: Re: Learn from Other's Mistakes
I don't think that I need a card reader/punch/keypunch any more since I now
have four separate MVS images with totally shared DASD. If one system is
kaput, I'll just use one of the others for recovery (We connect to each LPAR
via TCP/IP on an EMIF'd CISCO, so there is no problem with finding a working
terminal).
John McKown
HealthAxis
All opinions are my own and are not the opinions of my employer.
Date: Thu, 4 Jan 2001 15:51:02 -0500
From: "Metz, Seymour"
Subject: Re: Learn from Other's Mistakes
Well, this didn't affect anyone but me, but it was a classic "I can't
believe that I just did that" snafu. I was developing some applications in
SAS, running under CMS, and wanted to get rid of a file list that I no
longer needed. Right after I hit Enter I realized that I had transposed two
words and typed EXEC CMS ERASE where I meant to type ERASE CMS EXEC. My
finger was heading to PA1 before I saw the first line of output, but I still
lost a few files and had to restore them from backups.
Shmuel (Seymour J.) Metz
Date: Thu, 4 Jan 2001 15:02:52 -0600
From: Rick Fochtman
Organization: Board of Trade Clearing Corporation
Subject: Re: Learn from Other's Mistakes
I once pulled the same "boner" to a different library: it was an OS/360
system and I compressed SYS1.SVCLIB. (A number of SVC modules contained TTR
pointers to the next module in the chain!) MAJOR OOOOOOOPS!
Luckily, I had a single-pack system I could bring up and rerun IEHIOSUP on
my prod system. Lots of unhappy college kids 'cuz we were down for an hour
during finals week!
Date: Thu, 4 Jan 2001 16:13:36 -0500
From: Mike Ray Buechlein
Subject: Re: Learn from Other's Mistakes
I'm glad to know that I'm not the only one that's goofed up that way. *grin*
I've thought of a couple more. These didn't happen to me, but happened to
people I knew.
We had a Captain that was using ICKDSF to initialize a volume. His device
address statement was off by one number and he initialized the SYSRES.
Another was from the same Captain (since he initially installed the system).
We had two separate processors, each with its own SYSRES (using 3380 DASD at
this point). Several years later (after he was gone) we had an HDA crash and
it took both systems down. Turns out that we were using shared DASD between
the two processors and that both SYSRES volumes were on the same stack of
platters. Just different addresses for the HDAs. Was the worst 18 hours of
my life. 8(
- Mike Buechlein
Systems Programmer
National Processing Company
Louisville, KY 40213
mbuechlein@npc.net
Date: Thu, 4 Jan 2001 15:11:21 -0600
From: John Eatherly
Subject: Re: Learn from Other's Mistakes
We had a DASD guy do an FDR dump of our whole site. We just sat and
watched data sets disappear and could not figure out what was happening.
Took several days to restore.
Date: Thu, 4 Jan 2001 16:13:13 -0500
From: "Sitko, Bob"
Subject: Re: Learn from Other's Mistakes
FDR dumps delete the datasets?
Date: Thu, 4 Jan 2001 15:15:00 -0600
From: John Eatherly
Subject: Re: Learn from Other's Mistakes
I am not sure how he did it. It was a finger check in his job. It
happened about 10 years ago.
Date: Thu, 4 Jan 2001 16:21:48 -0500
From: Bruce Black
Organization: Innovation Data Processing
Subject: Re: Learn from Other's Mistakes
"Sitko, Bob" wrote:
>
> FDR dumps delete the datasets?
No, but probably he did an ABR ARCHIVE which does delete them after backing them
up. No doubt he messed up his SELECT statements and selected most of the
datasets in the shop.
--
Bruce A. Black
Senior Software Developer for
FDR, CPK, ABR, SOS, UPSTREAM, FATS/FATAR
Innovation Data Processing
Little Falls, NJ 07424
973-890-7300
personal: bblack@fdrinnovation.com
sales info: sales@fdrinnovation.com
tech support: support@fdrinnovation.com
Date: Thu, 4 Jan 2001 15:25:27 -0600
From: John Eatherly
Subject: Re: Learn from Other's Mistakes
He selected all of them.
Date: Thu, 4 Jan 2001 16:21:38 -0500
From: "Wenger, Joseph"
Subject: Re: Learn from Other's Mistakes
I used to work for a well known software vendor in a distant galaxy many many many years
ago and we were developing a Macro CICS based financial system which at the time was state
of the art both technically and functionally. For many months we would code and test by bringing
up the appropriate releases of CICS, test and then shut down, and then repeat the cycle for various
system configurations. After the QA and Beta were done, a very large client with multistage network sites was to
be our first client.
We flew into their corporate office data center in all our glory, proceeded to unload our system, install
the product, which was a very large and lengthy process, on a large production machine (their choice), fired
up our system, checked it out with flying colors, and being very proud of the great job we did........
proceeded, out of force of habit, to issue an immediate shutdown to CICS.....bringing down their entire
CICS production network without warning throughout the country. I wished I had a camera to get
the look on everyone's face (theirs and ours) the very next moment after we all realized what just happened.
The fan blades got big dents from all the stuff that proceeded to hit them. But it was a shared gotcha,
we shouldn't have been on a production machine (their call), and the shutdown command should have been
secured. Had it not been for that, it would have been a very very long walk back home.
That's my story and I'm sticking to it.
Date: Thu, 4 Jan 2001 15:23:00 -0600
From: Ned Hedrick
Subject: Re: Learn from Other's Mistakes
A number of years ago, working on my first MVS system, I
"allowed" a PROCLIB to migrate. Next time I IPL'ed, I
couldn't bring up JES because the PROCLIB needed to be
recalled, but I couldn't bring up DFHSM to recall it
because JES wasn't active.
It's funny to think about now, but then it caused a lot of
perspiration!!!
Ned Hedrick
Gordmans, Inc. (www.gordmans.com)
Omaha, NE, USA
Date: Thu, 4 Jan 2001 16:45:53 -0500
From: Jim Horne
Subject: Re: Learn from Other's Mistakes
Been there, done that (something very similar). And, when JES couldn't come
back up after the IPL, it gave me the opportunity to do some testing I had
been meaning to do. Two hours later, I walked into my boss's office and
informed him that I had successfully tested my MVS standalone restore
procedure. I'm not sure what I would have done if the "test" hadn't worked.
:)
Jim Horne
Lowe's Companies
Date: Thu, 4 Jan 2001 16:30:25 -0500
From: "Cummings, Jennifer (ECSS)"
Subject: Re: Learn from Other's Mistakes
Ok.....three months after I started my first job I was working on
a test application and was having problems getting it to work.
Unfortunately, I was testing in a production CICS 2.1.2 region.
The region periodically crashed all day. The senior systems
programmers were all over the application programmers. I finally
figured out what was wrong with my program.......I had an out of
bounds array. I asked the CICS systems programmer if an out of
bounds could cause a region to crash and he promptly told me he
would wring my neck (these were not the actual words) if I ever
ran another program in a production CICS region again.....I told
him there should have been better security in the
region........(that was my second mistake).
Date: Thu, 4 Jan 2001 15:35:37 -0600
From: Eric Bielefeld
Subject: Re: Learn from Other's Mistakes
I had several instances over the past 22 years where I have
caused IPLs. This incident was one where I saved the system from
a situation where it couldn't be IPL'd. I happened to be looking
at our production GDG storage packs looking at stuff that was
unneeded so I could free up space. Behold, I find the production
PROCLIB on one of these packs. I didn't think much of it at
first, but then I realized that all PROCLIB datasets should be on
my master catalog pack. This was back in SP1.3.6, even before
XA. The PROCLIB datasets had to be in the master catalog, or
have VOL=SER= & UNIT= on the DD card in the JES2 proc if it
wasn't in the master catalog. Fortunately, I discovered it
before the weekly IPL, or JES2 would never have come up. We
didn't have a RESCUE pack or any other system to make changes on
at that time.
I finally figured out who the likely culprit was, the operations
supervisor, and he confessed. I did get a promise from him to
let me know before he tried doing any other things like that.
Eric Bielefeld
Sr. MVS Systems Programmer
P&H Mining Equipment
Milwaukee, WI
414-671-7849
ebie@hii.com
Date: Thu, 4 Jan 2001 17:46:25 -0500
From: "Mullen, Patrick"
Subject: Re: Learn from Other's Mistakes
I have a similar PROCLIB tale from the days of pre-XA. I was cleaning up the
master catalog and one of the datasets I chose to move to a usercat was a
PROCLIB. Next IPL, no JES2... So no panic, we started a stand alone restore
of the catalog pack. It was at this time that we came to the horrible
realization that our DFDSS backup cycles were a tad on the short side, being
less than the elapsed time between IPLs... My change was already on every
backup we had. Now it was time to panic!
I was saved by finding and restoring a backup that had been gathering dust
at the bottom of the "miscellaneous tapes used by the sysprogs" cupboard
since our initial MVS installation (we had converted from VM/VSE) about 2
years previously.
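
The trap in this story (every backup being younger than the last IPL) can be checked for mechanically. What follows is only a rough sketch in Python under stated assumptions: the function name, the dates, and the 7-day/14-day figures are all hypothetical and do not come from the original post.

```python
from datetime import date, timedelta

def has_pre_ipl_backup(backup_dates, last_good_ipl):
    """Return True if at least one backup was taken before the last good IPL.

    A backup older than the last successful IPL cannot contain a change
    that was made afterwards and has not yet been proven by an IPL.
    """
    return any(b < last_good_ipl for b in backup_dates)

# Hypothetical shop: 7 daily backups are kept, but the last IPL was 14 days ago.
today = date(2001, 1, 4)
backups = [today - timedelta(days=n) for n in range(1, 8)]
last_ipl = today - timedelta(days=14)

print(has_pre_ipl_backup(backups, last_ipl))  # False
```

If the check comes back False, either the backup cycle needs to be longer or one older copy needs to be kept aside on purpose rather than by luck.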
Date: Thu, 4 Jan 2001 15:39:51 -0600
From: Alan Schwartz
Subject: Re: Learn from Other's Mistakes
How many others (I can't have been the only one) omitted a comma
in IEASYS00... and of course it was before the PAGE= parameter.
I wish I could remember how I got around the "insufficient paging
resources" error.
Date: Thu, 4 Jan 2001 17:50:10 -0500
From: Bob Rutledge
Subject: Re: Learn from Other's Mistakes
When I did that to myself I was building an MVS for the first time all by my
lonesome at a remote site 120 miles from home.
What I did after some serious introspection, coffee and nicotine (and what you
probably did) was try to remember (or guess) what the DSNs of three page
datasets were and start IPLing and replying PAGE=... until I got three that
worked.
And immediately thereafter dumped sysres, before trying to fix my blunder.
Bob
Date: Thu, 4 Jan 2001 16:30:47 -0500
From: Lockwood Lyon
Subject: Re: Learn from Other's Mistakes
You won't believe this one ...
At a major site in the Midwest U.S. some time ago they had just
done a major upgrade: h/w to 3090, OpSys to MVS. New operations staff
as well.
Wellsir, as luck would have it, my cubemate's initials were JS.
Our TSO user-Ids were our initials (First, Middle, Last) followed
by a single digit (guess where this is going?!). One day,
operations called my companion's phone. He wasn't there, so I
picked it up. A harried operator said, "Hey, John, one of your
jobs has been running on our system for more than two days! Can
I cancel it?" Not knowing any better, I said, "Sure". After
all, how long can a compile run?
Well, jeepers, things got really hairy after the operator
cancelled "test" job JES2. A red letter day for operations
training, standards, and newbieness.
Told you that you wouldn't believe it ...
- - LL
Lockwood Lyon -- Meijer Technical Support
(616) 735-7553 (office)
(616) 791-5131 (fax)
Copyright (c) 2000 by Lockwood Lyon. All rights reserved. These
opinions are mine and not necessarily those of my employer,
Meijer, Inc.
Date: Thu, 4 Jan 2001 15:45:48 -0600
From: "McKown, John"
Subject: Re: Learn from Other's Mistakes
How can you cancel JES2? The command "c jes2" should be rejected because
JES2 should be non cancelable (force jes2 may work, but I'm not checking to
make sure!).
But this does remind me of a story which I don't know if it is true or not.
Way back in the MVT and HASP days, I was told one operator would constantly
issue "c hasp" for some reason. He wouldn't stop even though he was told
never to do that (I don't know why he wasn't fired.) Well, HASP was written
to intercept the MVT console commands. So the sysprog modified HASP to check
the operand of the CANCEL command. Next time the operator did a "c hasp", he
received a very profane message to do something that is anatomically
impossible.
John McKown
HealthAxis
All opinions are my own and are not the opinions of my employer.
Date: Thu, 4 Jan 2001 16:36:02 -0500
From: Dave Cole
Subject: Re: Learn from Other's Mistakes
>Anyone care to share one?
Sure, why not.
Once upon a time, in the dim and distant past (the mid 70s), JES2
source mods were my passion. I was working for Yale at the time,
and I had spent something like the last 24 hours or so writing
and testing some changes that I was making to the HASPINIT
module.
At the time I had already written some rudimentary breakpointing
and instruction stepping support routines, and so I was sitting
in the machine room at an operator's console, just barely hanging
onto consciousness, running a secondary JES2, stepping through
some checkpoint initialization code.
The particular mod I was testing required me to manually defeat
various checkpoint integrity checks. As I was stepping along, I
remember getting rather irritated at these damned warning WTORs
that JES2 was throwing at me, and so every time one came up, I
would find a branch to zap, then resume to let the code go
merrily on its way. I did this two or three times. The third time
...
Well, do you know how awesomely quiet a machine room sounds when
an entire row of 1403 printers suddenly stops?
It turned out that I was not testing a secondary JES2, but rather
a second copy of the primary JES2! And that I had cold started
the primary checkpoint record! That mistake didn't just crash
the system (a no big deal thing in those days anyway.) What I had
done was wipe out somewhere around a thousand jobs. The effect
was felt university wide.
When what I had done was understood, the operations manager very
gently suggested that I should go home and get some sleep. (Tom
O'Neil was one of the nicest guys I've ever known. He's no longer
with us, and from time to time, I miss him.)
Dave Cole REPLY TO: dbcole@colesoft.com
Cole Software WEB PAGE: http://www.colesoft.com
736 Fox Hollow Road VOICE: 540-456-8536
Afton, VA 22920 FAX: 540-456-6658
Date: Thu, 4 Jan 2001 17:05:00 -0500
From: Martin Strudwick
Subject: Re: Learn from Other's Mistakes
>Fortunately, it was fixed before anything major had happened.
We had a nameless (clueless, whatever) employee reverse the DD's for a RACF
database copy to prep an LPAR, effectively wiping out RACF in one
fell swoop. We had to restore from a full volume backup, 1/2 hour
downtime.
I also remember a systems guy at another shop compressing PROD1.LOADLIB
while CICS had it allocated to DFHRPL. Ouch! Had to bounce Production CICS
at that point. 15 minute outage.
Martin
Date: Thu, 4 Jan 2001 16:30:35 -0600
From: Edward Gould
Subject: Re: Learn from Other's Mistakes
>
>Fortunately, it was fixed before anything major had happened.
>
>
>We had a nameless (clueless, whatever) employee reverse the DD's for a RACF
>database copy to prep an LPAR, effectively wiping out RACF in one
>fell-swoop. We had to restore from a full volume back-up, 1/2 hour
>downtime.
There is a *RUMOR* going around several Chicago shops. An operator at
a ******** IPLed with a date far into the future, causing a lot of
datasets to be deleted. They also had some RACF issues... ask one of
the frequent contributors to this list for the details.
Ed
Date: Thu, 4 Jan 2001 18:04:51 -0500
From: Bob Rutledge
Subject: Re: Learn from Other's Mistakes
We're somewhat east of Chicago, but we once upon a long time ago had an operator
miss the year by one while IPLing for a time change early on a Sunday. Of
course nobody noticed that the system had warped a year into the future and of
course nobody noticed that the TLMS scratch list (this was long enough ago that
it was printed) was many, many times its normal height. Until our users about
mid-morning on the following Monday started wondering just what was going on and
where their tapes had gone.
We took the works down for the better part of a day to put the tape library back
together. The only "good" thing that happened was that none of the IMS log
tapes had been re-used.
Bob
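
The wrong-year IPL in this story is the sort of thing a simple plausibility check on the entered date could catch. The lines below are a minimal sketch in Python, not a description of anything MVS or TLMS actually does; the function name, the 30-day threshold, and the example dates are all hypothetical.

```python
from datetime import date

def ipl_date_looks_sane(entered, last_shutdown, max_gap_days=30):
    """Accept the entered date only if it is not earlier than the last
    recorded shutdown and not more than max_gap_days later than it."""
    gap = (entered - last_shutdown).days
    return 0 <= gap <= max_gap_days

# Hypothetical dates: the system was shut down the day before the IPL.
last_shutdown = date(1987, 6, 6)

print(ipl_date_looks_sane(date(1987, 6, 7), last_shutdown))  # True
print(ipl_date_looks_sane(date(1988, 6, 7), last_shutdown))  # False (year off by one)
```

A check like this only helps if a failed result forces somebody to confirm the date before anything that expires tapes, data sets, or passwords is allowed to run.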
Date: Sun, 7 Jan 2001 17:09:41 -0600
From: Edward Gould
Subject: Re: Learn from Other's Mistakes
>We're somewhat east of Chicago, but we once upon a long time ago had
>an operator
>miss the year by one while IPLing for a time change early on a Sunday. Of
>course nobody noticed that the system had warped a year into the future and of
>course nobody noticed that the TLMS scratch list (this was long
>enough ago that
>it was printed) was many, many times its normal height. Until our users about
>mid-morning on the following Monday started wondering just what was
>going on and
>where their tapes had gone.
>
>We took the works down for the better part of a day to put the tape
>library back
>together. The only "good" thing that happened was that none of the IMS log
>tapes had been re-used.
>
>Bob
----Snip------
Bob,
It wasn't you ... but their first name does begin with R:)
Ed
Date: Fri, 5 Jan 2001 13:27:07 +0100
From: Beate Kawelke
Organization: debis Systemhaus Abtlg. TSA-DI
Subject: Re: Learn from Other's Mistakes
I heard a similar story from somebody here in Germany. Seems like the
operator entered '98 instead of '89 and all their RACF userids were
revoked during startup...
Me? Well, a long time ago I asked the operator for the "main console"
because I wanted to initiate a dump of an address space. This was in
the days when there was a separate JES3 console. I entered "DUMP
COMM=..." and immediately brought down JES3 - it just asked what sort
of dump it should take. Ouch. The best thing was listening to the lady
at the help desk - she told the people on the phone that "the experts
are already looking into the problem". She didn't mention that one
would-be-expert started it ;-)
And then there was the day I installed a new release of our
self-written software. We had some problems, so we went back to the
previous release (which had run smoothly for months). To my horror,
that release didn't work anymore - in fact the database was destroyed
several times. It took some time to find out that I still had the
*newer* release of the administration interface loaded in my TSO
session. It used a new feature which wasn't supported by the started
task on the other end - thus killing the database every time I checked
the system's status...
Beate
Date: Fri, 5 Jan 2001 08:33:10 -0600
From: Eric Bielefeld
Subject: Re: Learn from Other's Mistakes
I had a similar experience around 1987. We were converting some
major systems from DOS to MVS running under VM. On Sunday
morning, the operator had to IPL VM so we could make the V=R
guest for MVS bigger. He IPL'd with next year's date, and when we
IPL'd MVS, it picked up the same date. I finally noticed it
while working in TSO maybe an hour after the IPL. We shut
everything down and re-IPL'd with the correct date. The only
problem we had was any TSO user who had logged on during that
period couldn't log on again, including myself. I finally was
able to reach someone with full RACF authority, and logged on
with their ID and reset everyone. That led to my always having
at least 2 TSO IDs with special authority. The thing that
bothered me most was that Sunday was the day of the CART Indy car
race at State Fair Park in Milwaukee. I missed one of the
support races, but did see the main race.
Eric Bielefeld
Sr. MVS Systems Programmer
P&H Mining Equipment
Milwaukee, WI
414-671-7849
ebie@hii.com
>>> deerhome@IX.NETCOM.COM 01/04/01 05:04PM >>>
We're somewhat east of Chicago, but we once upon a long time ago had an operator
miss the year by one while IPLing for a time change early on a Sunday.
Date: Thu, 4 Jan 2001 16:26:27 -0600
From: "Blaicher, Chris"
Subject: Re: Learn from Other's Mistakes
This did not happen to me, but a friend of mine.
Picture this: Disaster recovery test. Operations isolates a machine from
the very large DASD complex, theoretically. My friend's job is to run a job
to clean up the catalog for a 13,000 data set DB2 system so that they can
re-allocate and load the recovery files.
Operations isolated everything except the catalogs! My friend asked
operations if all was ready. He asked the DASD people if they were ready.
He asked the project manager if he was ready. All said go, so he let the job
go. Everybody wondered why the production DB2 system started to fail a few
seconds later. The CIO wanted him fired, NOW. Luckily, the people who were
really at fault said what really happened and he kept his job.
Oh, and the real data set restores that had to be done? It seems that
people had not been checking the outputs of the backup jobs, and over half
the 13,000 data sets had NO backups. Luckily, they could create the data
from other sources, but it did upset things for about 2 weeks.
One thing that I did do was to compress SYS1.LINKLIB on an OS/MFT system.
Only problem is IEBCOPY was an overlay structure and died when segments got
moved. You only make that mistake once.
Chris Blaicher
Date: Thu, 4 Jan 2001 15:32:02 -0800
From: Bob Richards
Subject: Re: Learn from Other's Mistakes
Several come to mind:
Had a pompous Sr. Sysprog 16 years ago who thought very highly of
himself. Always put me down, especially if the Tech Support Mgr was
within earshot. Well, one weekend, he made simultaneous changes to both
SYS1.PARMLIBs on two CECs AND IOCDS changes also!!!! Guess what
happened? Yup, he took BOTH machines down at the same time, switched
the IOCDS datasets and attempted IPLs on both. Neither came up. He
tried everything he knew and finally called me, near tears. It seems he
was supposed to be at a wedding in 30 minutes (best man). I told him to
go and I'd fix it.
When I got there, I told the operators to leave the room, because there
was no way I was going to let that pompous *ss know how I resolved it.
I calmly switched to an IOCDS that I had saved that he didn't write
over, got my standalone restore tape out, restored from my weekly
backup, IPL'ed one system, corrected the SAME mistake he had made in
both PARMLIBs. Next I took that system down, switched to his IOCDS
datasets, and both systems IPL'ed correctly. It was probably the only
time I have been overtly SMUG on Monday morning.
Bottom line: Always know more than your boss! And ALWAYS, ALWAYS
have a backout for everything you do.
The second mistake WAS my fault. Same time period. Using SMP/E, I
accidentally RESTORED everything that was not ACCEPTed (six months worth
of maintenance, several thousand PTFs). No problem you say? Well, there
would not have been, except Operations cancelled the job, after 20
hours, to IPL even after I had told them not to do it. It took an IBM
PSR and myself "28" hours more to get the SMP/E environment back to
some semblance of where it was before. God bless PSRs! Hated to see
them disappear.
There are other war stories I could tell, but I have promised to
protect the guilty on those!
=====
Bob Richards, OS/390 Consultant Internet: richardsrb@yahoo.com
Date: Thu, 4 Jan 2001 15:48:14 -0800
From: John Donnelly
Subject: Re: Learn from Other's Mistakes
Comments: To: mbuechlein
Content-type: text/plain; charset=us-ascii
Two real happenings....
Model brought into computer room for some publicity photos of a
360/75...this is a few years ago...model sits at console in front of
system...watches flashing lights...is quite amazed...model told to "do
something"...model selects big button and pushes...model has just powered
off the 360/75...
Operator coming on shift has habit of issuing a $DJ1-9999
command just to get grip on what is in system...this is a service
bureau...operator comes on shift one day and issues $CJ1-9999...all output
for all customers just lost...
Date: Thu, 4 Jan 2001 16:01:46 -0600
From: Russell Witt
Subject: Re: Learn from Other's Mistakes
Had a similar experience with the JES2 proc, had someone remove the
"vol=ser= and unit=" since the proclib was cataloged (but not in the master,
which was why the vol=ser and unit= were there to begin with). The real
problem is that we didn't have a rescue pack at the time. So I (remote
emergency sysprog at the time) told them to restore the master-catalog
volume from the last weekly backup (that is the pack where the proclibs
were also stored). They said fine and would call me in a couple of hours
when everything was back up. They called in less than 15 minutes asking what
volsers the backup was on. I asked how I was supposed to know that, how do
they normally look up volsers; "simply do a LISTC or run a TMSGRW report
(UCC-1 back then of course)" was their reply. Turned out they had NEVER done
a printed report of the contents of the tape library, and if the system was
down they couldn't do an online inquiry.
On the upside, it turns out that by simply flipping a switch the string of
dasd for this "dedicated system" could be accessed by another running
system. Once the switch was flipped, the packs varied online, it took an
entire 30-seconds to correct the problem and re-ipl. Still, I always felt
sorry for the operators that spent 2 hours looking at the tapes in the
library one reel at a time to see the label on it (we still used gummed
labels back then) trying to find the last 2-volume backup.
Date: Thu, 4 Jan 2001 18:41:47 -0600
From: Len Rugen
Subject: Re: Learn from Other's Mistakes
How about hardware? We had an IPL pack (3375) crash on a VM system.
No problem, stop the system, call CE. Count down to 3rd box on the string
and replace the HDA. Try to IPL and fail, repeat a few times just to make
sure.
The disk was still toast. It turns out that the first 2 boxes were one
string, then another string was reversed and butted up against the first,
we had swapped the wrong HDA!
Date: Fri, 5 Jan 2001 13:50:46 +1000
From: "Ginnane, Shane"
Subject: Re: Learn from Other's Mistakes
All the best ones are simple:
- the afore-mentioned COMPRESS of linklib
- using IEFBR14 to delete a PDS member
- "losing" LPALIB midway through a stage-1
- cleaning up redundant PAGE datasets while some other system was happily
using them.
- ditto redundant "user" catalogs that someone else just happens to be
using as a master cat
.....
Sometimes you wonder if having trainees is really as beneficial as it's made
out to be.
Shane ...
Date: Fri, 5 Jan 2001 06:09:29 -0600
From: reza heydarpour
Subject: Re: Learn from Other's Mistakes
Not disagreeing, but adding another side:
I've known some BSPs who really get a kick out of this:
jumping up & down & pointing fingers & kicking
raise their image (to 'mngmnt')
boost their EGO
In fact, once the junior SP asked the senior guy to review the exit & the
plan, the senior OK'd it. When the system hung, the senior guy
was running around shouting 'he' did it ...
...Reza
>>
Sometimes you wonder if having trainees is really as beneficial as it's made
out to be.
Date: Thu, 4 Jan 2001 22:17:54 -0600
From: "Joel C. Ewing"
Subject: Re: Learn from Other's Mistakes
I can remember several that made a lasting impression:
(1)My first "near death" experience with MVS, the one that prompted me
to build a one-drive stand alone system, was when I changed something in
the JES2 proc that caused a JCL error and couldn't get JES2 up at the
next IPL--the first time I realized how helpless you are if there is any
failure in getting JES2, VTAM, or TSO functional. I got lucky and had a
volume backup of SYSRES along with a DISKMAP and was able to recover
using a stand alone restore of just the tracks containing SYS1.PROCLIB;
otherwise, a full volume restore would have back leveled some other
installation data sets on SYSRES and caused additional confusion.
(2)In early DFP 3.0 days, had an ICF catalog go belly up, totally
unusable. Happened just an hour after our once a day catalog backup.
We were able to track by various manual techniques what data sets had
changed in the interim, and with 5 people keying furiously we were able
to get things back in sync and resume normal operation in 4 or 5 hours.
We felt incredibly lucky it happened just after, not just before the
backup, and also on a weekend. After that we increased catalog backup
frequency to 4 or 5 a day, added weekly full catalog diagnose runs, and
invested in a product capable of catalog forward recovery from SMF
records (and spent several weeks chasing down ICF catalog bugs with
level 2).
(3) More recently made a "trivial" PROCLIB member change, so trivial I
made it on both the production and test systems, violating my normal rule
of not changing both until surviving an IPL. You guessed -- at next
weekly IPL time neither system would. Had to do a standalone restore of
one-drive test system (the easiest for SA restore) in order to fix the
production system. Learned the hard way to never put same change, no
matter how trivial, on both systems without an intervening IPL test.
(4)Operations called one weekend. Console log filling up with nasty red
error messages indicating I/O errors on RACF database. Asked what had
been running. Defrags. Oops. RACF backup data set was on a volume
recently added to conditional defrag list, system does not enqueue on
this data set, and defrag had moved database out from under RACF.
Fortunately had many backups and functional test system from which to
restore. Have since taken multiple steps, any of which should be
sufficient to prevent reoccurrence.
(5) Then there were the usual number of near misses from hardware
failures, the 3380 and 3390 HDA failures that convinced us that mirroring
and RAID-5 were definitely the only way to go; and the forced cold start
and loss of our JES2 queues that taught us that concurrent RAMAC-2
maintenance isn't always.
--
Joel C. Ewing, Fort Smith, AR jcewing@acm.org
Date: Fri, 5 Jan 2001 07:15:16 -0500
From: Dave Jousma
Subject: Re: Learn from Other's Mistakes
Ok me too....
When I worked for a large outsourcing company a couple of years
ago as an MVS SYSPROG, one of my Storage Admin compatriots was
busy setting up a new customer in our shared datacenter. We had
a central tech support lpar that had access to *all* dasd in the
shop. This is where all SMPE was done, and all other tech
support software maintenance. Well, this Storage guy was busy
initing strings of DASD for this new customer prior to their
restores. Since he was doing several hundred volumes, he had
some automation in place to autoreply to the 'U' for ICKDSF, and
coded noverify on the job. Well guess what, while he thought he
was initing unused disk, it was really used, and several
customers lost real data, and in some cases their systems
too......
Glad I wasn't him....
Dave
Date: Fri, 5 Jan 2001 07:42:25 -0500
From: William Ball
Subject: Re: Learn from Other's Mistakes
I'm sure we ALL have "war stories" or as I have heard them referred to
"learning experiences", "Ah Shits", "chances to excel" (by any other name)
Two come to mind:
The sysprog decided to move ALL of the Proclibs off of the RES pack over
the weekend. It was a time when DASD maint. was semi-automated. We had
developed some procedures that were run daily to clean off work packs etc.
You guessed it, he moved them to a work pack. That wouldn't have been quite
so bad because we DID have an exclusion list of datasets that were allowed
to be on work packs, he just forgot to put the entries in the exclusion
list. He made the move on Saturday and on Sunday night the work pack
cleanup procedure was run. By the time I got in Monday morning, we were
dead in the water. It took a little while to put it all together and
fortunately the weekly backups had been done just prior to him taking the
system and I had the FDR SAR tape in my desk drawer.
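The exclusion-list idea in this story can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name and the data set names are made up for the example and are not the shop's actual cleanup procedure. It shows why a data set moved to a work pack is scratched unless someone also adds it to the exclusion list.

```python
def datasets_to_scratch(datasets_on_pack, exclusion_list):
    """Return the data sets a work-pack cleanup pass would delete.

    Anything on the pack that is not named in the exclusion list is
    treated as scratch data. Names are compared case-insensitively.
    """
    protected = {name.upper() for name in exclusion_list}
    return sorted(d for d in datasets_on_pack if d.upper() not in protected)


# Hypothetical contents of the work pack after the weekend move.
pack = ["SYS1.PROCLIB", "USER.PROCLIB", "TEMP.WORK1"]

# Exclusion list not updated: the proclibs are cleanup candidates too.
print(datasets_to_scratch(pack, []))

# Exclusion list updated as part of the move: only real work data goes.
print(datasets_to_scratch(pack, ["SYS1.PROCLIB", "USER.PROCLIB"]))
```

Running the same function as a dry run right after the move, before the nightly cleanup, would have listed the proclibs as candidates for deletion.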
The second one I've kind of blotted the details from my mind but I had made
a change to something on the RES pack that couldn't easily be backed out
so my back out was going to be SAR of the pack if things went belly up.
They did, so I pulled the current back up tapes and tried to do the SAR and
the tapes were junk. Now I'm reduced to going to the vault for the ONLY
other set of backups we had and just to add a little more pressure, my boss
wasn't aware of the problem and was pressuring me to hurry up so we could
get to that LSU meeting in another town. Things worked out. The second set
was good and there hadn't been anything but a couple of minor changes to
the RES volume but I was starting to sweat bullets.
We've also had people delete SYS1.PROCLIB. Fortunately I had track map
listings and AGAIN FDR let me restore it to the tracks it had come from.
And yes, we've also had the old COMPRESS SYS1.LINKLIB trick played on us. I
got in a habit of making my own copy of IEBCOPY so that if LINKLIB did get
compressed it wouldn't hose up IEBCOPY.
And there was the time the company I was working for had "water sprinklers"
in the computer room and one went off and dumped several thousand gallons
of water. Fortunately (or not) there were holes in the floor and the water
went on through to the basement......where the payroll department was with
the checks and W2's and telephone room. There must have been 15-20
telephone trucks in the parking lot when I got there in the morning.
Everyone had burnishing tools trying to clean up the telephone contacts
that had gotten wet. The telephone system never did work right after that,
they finally had to replace it. The funny thing is the shut off valve
was located in the women's restroom with a chain and padlock around it and
no one could find the key to the lock. They finally put a bolt cutter to it.
Bill Ball
Technical Support
Kent State University
Date: Fri, 5 Jan 2001 08:41:46 -0500
From: William Ball
Subject: Re: Learn from Other's Mistakes
A few more I thought of:
The guard didn't think the red button on the computer room wall did
anything so he pushed it one night. He was looking for a new job before 8am.
We had a "vendor" with THEIR version of OLTEP come in on a DASD problem one
night. When it asked him what he wanted to do, he said all tests and to ALL
drives. Where "big blue" had disabled their code to only write to a pack
named CEPACK, this vendor had not. It promptly walked its way around to
EVERY 3330 that was online and wiped out the vtoc. It took me the better
part of a day to get them all back. The CE was moved to another city before
the next day was out.
Back in the days when the master console was a hardcopy console (can you
spell 360) and syslog wasn't even a gleam in someone's eye yet, we had an
operator that ran out of something to do and decided the "master console"
never gets cleaned, so he decided to do it. And of course you can't get to
those dirty old rods and springs without taking the rods holding the key
bars out.....oooops. So to see how to correct what he had done, he decided
a keypunch (remember those) was put together the same way and then took that
apart.....ooooops. NOW he decides he has to call for service cause he can't
get either of them back together. It was reported to me that the CE took
one look (it was 3am) walked to the phone and called his manager and told
him it WAS BILLABLE. Same operator some months later.....the programming
staff had been working on a new system for over 6 months. The first night
it ran they took their backups but the operator reset ALL of the tapes
AFTER the system had created the label on the tapes....ooooops. The next
night they needed those tapes to do a restore of the database......oooops.
(Three ooops's is DEFINITELY worth one Ah Shit). The operator was
transferred.....to me.....Ah Shit. I worked with him about a year when he
decided moving freight around on a dock was better and quit....."thank you
lord".
Bill Ball
Technical Support
Kent State University
Date: Fri, 5 Jan 2001 08:37:06 -0500
From: "Burrell, C. Todd"
Subject: Re: Learn from Other's Mistakes
The worst screw-up I ever saw was when I worked in a JES3 shop. One of our
two production systems got hung up one night, and around 3 in the morning
the operator IPL'ed. He called me back around 4:30 and said that JES3 had
not initialized yet, but he was seeing some format messages!
The genius had replied 'C' to the startup parm for JES3 and confirmed it,
thus cold starting and wiping out our 15 spool volumes. He swore that I had
told him to do this, but all the lies in the world did not save this guy's
job.
C. Todd Burrell
Senior MVS Systems Programmer
CDC Atlanta
1600 Clifton Rd.
Bldg 16 Room 2309
Atlanta GA 30333
Office: (404) 639-7648
Cell: (404) 630-3654
CBURRELL@CDC.GOV
Date: Fri, 5 Jan 2001 09:35:30 -0500
From: Bruce Black
Organization: Innovation Data Processing
Subject: Re: Learn from Other's Mistakes
Long time ago I worked in a data center with large computer room. Some genius
put in a Halon system with nozzles only at one end of the room, assuming that
the high-pressure Halon would fill the room. I was in there doing system maint
one weekend when the system discharged accidentally. I hid under a keypunch (it
was a long time ago) until the maelstrom finished. When I came out, it looked
like a war zone, every bit of paper, every piece of equipment not fastened down,
was all over the place. 3330 disk covers (remember them) were smashed against
the back wall. They replaced the system with a series of low-pressure nozzles
all around the room. I wonder why?
Why did the system discharge? The Halon system maintenance people had just
entered the room but they swore they never got near the system to discharge it.
Right! They tried to bill us for recharging the system.
--
Bruce A. Black
Senior Software Developer for
FDR, CPK, ABR, SOS, UPSTREAM, FATS/FATAR
Innovation Data Processing
Little Falls, NJ 07424
973-890-7300
personal: bblack@fdrinnovation.com
sales info: sales@fdrinnovation.com
tech support: support@fdrinnovation.com
Date: Fri, 5 Jan 2001 08:40:55 -0600
From: "Babonas, Tony"
Subject: Re: Learn from Other's Mistakes
When I was new to MVS (coming from VM/VSE) one of my first
assignments was DASD cleanup. My colleague sysprog showed me
this wonderful utility called DFDSS that could move files from
disk A to disk B. What a wonderful gadget. I started moving
files and was making great progress in my first assignment.
Suddenly phones began to ring............Seems PANVALET quit
working, then other program products, then CICS. My colleague
sysprog asked me if I had moved any load libraries, especially
any in the linklist. I asked, "what's the linklist?"
Date: Fri, 5 Jan 2001 09:50:06 -0500
From: Sam Knutson
Subject: Re: Learn from Other's Mistakes
Big Rock Theory
Wise old sysprog taught me to avoid the Big Rock Trap.
The Big Rock is the one you hang over someone else's
head, which they then beat you to death with the morning after :-)
Never make changes that are left potentially active i.e. update
PARMLIB member or LPALIB that will be used if an IPL occurs. One
will when you are least expecting it and some other poor joe will
have to figure out why the xyz stopped working or why the system
won't come up.
Essential Reading for avoiding common pitfalls
MVS Systems Programming
by Dave Elder-Vass
Although not updated since 1993 still a wealth of good advice.
Buy the newly reprinted copy at iuniverse
Our price: $39.95
Format: Paperback
Size: 8.25 x 11
Pages: 552
ISBN: 0-595-00184-X
Publication Date: Mar-2000
http://www.iuniverse.com/marketplace/bookstore/book_detail.asp?isbn=0%2D595%2D00184%2DX
You can view the pages on-line using the Browse Before you Buy tab on
the book details page for MVS Systems Programming.
You can also visit his web site
http://www.mvsbook.fsnet.co.uk/
which has about 1/3 of the book's content on-line with other useful things.
IBM's Redbook series the ABC's of Systems Programming
at http://www.redbooks.ibm.com
1 http://www.redbooks.ibm.com/abstracts/sg245597.html
2 http://www.redbooks.ibm.com/abstracts/sg245852.html
3 http://www.redbooks.ibm.com/abstracts/sg245853.html
4 http://www.redbooks.ibm.com/abstracts/sg245654.html
5 http://www.redbooks.ibm.com/abstracts/sg245655.html
If you can find it "The Systems Programmers Problem Solver"
by William S. Mosteller ISBN 0-89435-271-7 has some
great nuggets.
Thanks, Sam Knutson
Date: Fri, 5 Jan 2001 09:03:45 -0600
From: Ed Billowitz
Subject: Re: Learn from Other's Mistakes
Several folks described some situations in which they ended up doing a
stand alone restore. My story starts when I was called very, very early
one morning by a sysprog whose changes didn't work. He went to his backout
plan to do a stand alone restore. That also didn't work. It seems that dss
maintenance changed the format of backup tapes and the stand-alone program
was never regenerated.
I called a friend at an installation with a similar MVS software level, and
he arranged for me to pick up a stand-alone dss tape from the operator.
Unfortunately, an ipl showed it to be a stand-alone dsf tape, contrary to
the external label.
Another call to my friend letting him know about his exposure, and our
mounting panic. He got dressed and came in to create the tape. (Did I
mention he was a newly-wed, and a very good friend.) Fortunately it
worked, and we were up just as our scheduled down-time window ended.
Ed Billowitz
ebillow@mcvh-vcu.edu
Date: Fri, 5 Jan 2001 10:45:19 -0500
From: Carol Srna
Subject: Re: Learn from Other's Mistakes
Wow!! Now that's what I call a friend. :-)
Date: Fri, 5 Jan 2001 09:37:25 -0600
From: "McKown, John"
Subject: Re: Learn from Other's Mistakes
I've thought of a few more mistakes that I've done or have heard of.
1) Many years ago. I did an EXPORT DISCONNECT for the ACF2 high level
qualifier. Next IPL, ACF2 won't initialize properly. ACF2 was partially
initialized and would not allow anything else to come up at all. Not JES2,
not *anything*. No disaster res (too expensive to let tech services have an
entire 3350 to themselves, totally wasted). Luckily the ACF2 support people
helped me and I just happened to have an alternate COMMNDxx member which did
not try to start ACF2. When I IPLed without attempting to start ACF2, I
could get things running by replying to a WTOR issued for every resource
access. This was way back when installing ACF2 required front-ending a
number of IBM modules by relinking (1987 or so - I think)
2) Not me, but I was told that a sysprog wanted to implement a JES3
modification. Too much of a bother to work up an IEBUPDTE deck + SMP/E
usermod. The person simply used the ISPF editor to edit the source code.
Unfortunately, he had the habit of automatically doing a "RENUM" before doing a
SAVE. He destroyed all the sequence numbers. No more JES3 PTFs could be
applied. The last backup was from BEFORE the last major SMP/E run which did
massive JES3 maintenance. All gone (including the sysprog).
3) Again not me, but I know from personal experience. We had two systems
with shared DASD. The two systems rarely wrote to the other system's
volumes, but would on rare occasion. The senior sysprog had heard that
genning DASD as SHARED would result in (in his words) extremely poor
performance. So he didn't. He never understood why we would end up with a
corrupted VTOC on some packs on rare occasion. From what a friend in
operations told me, this "problem" had been occurring for years. It took two
of us almost 3 months of constant arguments to finally convince this person
to gen the DASD as shared. As an aside, this was back in MVT days. Not a
"problem" but this same person did a new MVT gen almost every weekend trying
to "fine tune" the parameters. Not that anybody was complaining - everything
was running just fine. He also said NO to running HASP since it had "too
much overhead".
4) This was mine. Back in the 3330 days. I needed to have a disk
initialized. I was at a class, so I left instructions to operations about
what to do. I left the card deck + instructions with them. The only step
that I left out was "remove old volume & mount new volume". They ran the job
and reinitialized a production pack. Luckily for me I put the new VTOC in a
different location from the old VTOC. The DOS/VS sysprog ran DITTO on the
DOS system and changed the VTOC pointer in the label to point to the old
VTOC. I only destroyed one file by overwriting it with the new VTOC. IOW - get
your instructions right!
5) I didn't do this, but I fixed it. A company that I used to work for ran
UCC-2 (DUO). At the time, they had 3350 DASD. After I left, they were moving
to 3380s. The person who took my place knew even less about DOS than I did.
He used HSM to archive a DOS "library" (equivalent to a maclib - I forget
the correct term just now) from a 3350 and restore it to a 3380.
Unfortunately, that level of library contained MBBCCHHR pointers to members
instead of TTRs. And to HSM, this looked like a BDAM file. The library was
totally hosed. Luckily, there was a tape backup, made with the proper DOS
utility, which I was able to use to recreate the library on the 3380. I
don't know what you can learn from this. Does anybody still use DUO or
whatever it may be called now?
John McKown
HealthAxis
All opinions are my own and are not the opinions of my employer.
Date: Fri, 5 Jan 2001 10:43:15 -0500
From: "Metz, Seymour"
Subject: Re: Learn from Other's Mistakes
"Opportunity"?
I recall installing the TSO Command Package and discovering that a whole
bunch of PTFs listed in the SUP keyword were actually prerequisites. No
problem, I'll just restore the whole shebang from the backup. Imagine my
delight when I got a label check, and a scan revealed that the tape had
someone else's data set on it.
I naturally assumed that someone had overwritten my dump, until I looked at
the date. It seems that the tape library had two tapes with the same volume
serial number, and they couldn't find the one with my dump on it. I was not
a happy camper.
Why "opportunity"? Because this was one of the incidents that inspired me to
write (_ is Enya):
The boss said that he understood why I would want to wait
A week or two to run the PUT, but not six months or eight
He made me do a mass APPLY, but later said to me
The reason for the long delay I now begin to see!
Ma_ana, ma_ana, ma_ana is soon enough for me.
I once installed a FUNCTION, with grief it filled my cup
It had a lot of PTFs inside the keyword SUP
But oh my friends and ah my foes, guess what it did to me
When it turned out those PTFs should have been on the PRE!
Ma_ana, ma_ana, ma_ana is soon enough for me.
My system crashed this morning, and would not IPL
I called my friendly PSR and he told me "well
That mandatory PTF I told you to APPLY:
If you run JES2 or JES3, your system she will die!"
Ma_ana, ma_ana, ma_ana is soon enough for me.
The comments mentioned prereqs, I said "Why do I care?",
But it turned out that they forgot to put them on the VER.
Oh, this preventive service would be alright with me,
But I tried to do a LOCATE and it creamed my CVT.
Ma_ana, ma_ana, ma_ana is soon enough for me.
Those folks at Sterling Forest I envy not one bit,
For every single PUT cycle they're certain to get hit;
There's a way they could help themselves and fill my heart with glee:
Take the damn JES2 change team and teach it SMP!
Ma_ana, ma_ana, ma_ana is soon enough for me.
Shmuel (Seymour J.) Metz
Date: Fri, 5 Jan 2001 11:47:00 -0500
From: Carl Sommer
Subject: Re: Learn from Other's Mistakes
(it appears that I'm a younger pup than most on this list)
My first job out of college was as an MVS systems programmer
for IBM, as part of team that supported 70+ systems across
7 sites. The opportunity for screw ups was very high, and I
certainly had my share. Some of these (or similar) have already
been mentioned.
My unique-most boo-boo would be creating a network storm
between a couple of JES2 systems, when I didn't get the CONNECT
stuff right. Seems I created a loop between two systems in
Florida, all of which thought the other had a connection to Texas.
But this thread brings back memories of many,
many, beeper calls...
+-------------------------------------------+
| Carl Sommer Carl.Sommer@netiq.com |
| NetIQ Corporation www.netiq.com |
| (919) 337-0251 Morrisville, NC |
+-------------------------------------------+
Date: Fri, 5 Jan 2001 11:48:21 -0500
From: Mike Hall
Subject: Re: Learn from Other's Mistakes
When I was a brand-new sysprog back in the mid-80's, I did VM & CICS
support for a small shop in Atlanta. We had a single 4341, and my boss & I
were the only 2 sysprogs. I was doing my first install of a new release of
VM, and had been testing it as a virtual machine under the primary VM. Now
I was ready to do a stand-alone IPL test. This shop was so small that our
system was shut down on the weekend, so one Saturday I went in to do my
test. I brought my then 3-year-old son along with me.
My test started out great - my VM install IPL'ed without a hitch, and I
stood at the master console in triumph, amid the hum of the disk drives,
savoring the moment of my first IPL of an OS I had installed myself.
Triumph abruptly turned to panic, however, as the hum of the machines
suddenly changed pitch. The drives were spinning down! I looked down at
the VM console for any error messages; nothing unusual there. I hit the
enter key - the console was frozen.
Then I looked around, and saw my son walking from around the end of the
CPU, pointing at it and looking very pleased with himself. The 4341 was a
long, low box with a big red paddle-style power switch at one end, in easy
reach of a small child, so naturally he had flipped it down, to the "off"
position. We had clear plastic covers on all the disk drive switches and
the IML/IPL buttons at the console, but that CPU power switch was just
hanging out there, naked and unprotected.
Now I really started sweating bullets, as I had recently read in the 4341
manual that once the switch was set to "off", an SE visit was required to
reset the machine before it could be powered back on. I was afraid this
would be a billable call, and I had been with this company for only a few
months. I figured I was knee-deep in it at this point.
I crossed my fingers and flipped the switch back to the "on" position. To
my relief, the system powered right back up. I IPL'ed production VM,
brought up VS1 & CICS, and tested the whole configuration. Everything
worked fine. I decided not to push my luck, and called it a day.
I kept the incident to myself, but a few days later, our system hung solid
at mid-day, as it sometimes did. I was in the machine room with my sysprog
boss working on the problem. He decided a power-on reset was called for,
but instead of pressing the IML button at the console as was normal
practice, he walked over to the CPU and flipped the power switch off and
back on. Seeing my opening, I innocently asked, "Doesn't doing that
require an SE to reset the CPU before you can power it back on?"
"Nah," he replied, "I had the SE disable that feature, so we wouldn't have
to call him in every time we needed to do a power reset." Apparently this
is standard practice in many 4341 shops.
A couple of months later, an operator managed to bump that switch
accidentally (or so he said) and powered off the CPU in the middle of our
big nightly batch run. A switch cover was installed on it soon afterwards.
Date: Fri, 5 Jan 2001 21:54:23 -0600
From: Lance Kopplin
Subject: Re: Learn from Other's Mistakes
Back in '75, running SVS, had some time before a vacation. Read the manual
_carefully_ and made a change, last thing before vacation. When I got back,
called the operator to ask how things were going. Got an earful of questions
about "had I done anything?" Undid what I did, system straightened out.
Eventually learned that the manual was wrong.
We had systems located in Omaha, Des Moines and Minneapolis. We were JES2 then
so the NJE connections were numbered. Omaha had several systems so the
operators were used to using the JES2 command to route output from N4 or N6 to
another environment that had a printer attached. One night, N4 got typoed to
N5. After a while, Des Moines noticed that their output was disappearing.
Analysis showed the problem and a simple command shut down the line. So they
got smug and took a break or whatever. While JES2 was noticing the path through
Minneapolis.
One of our System Programmers renumbered the JES2 RMT's. This was at a time in
history that we gave him the nickname - Moammar and the suicide JES parms.
RACF datasets filled at 8 o'clock on a Friday evening. We had adjourned to the
local bar/dive at logical five (it's five o'clock somewhere). The operators
knew where to call to find me and I wandered back. I noticed that they
seemed very intent on watching me at the keyboard. Then my mind focused well
enough to notice that I was typing a few letters, and then pausing. When I
paused my finger stayed on the last letter and the typematic feature
stuckkkkkkkkkkkkkkkkkkkkkkkkkkk. Without thinking, back up and type a few
more. This was apparently much more entertaining than I normally am. We used
those RACF datasets for about six years.
Looking at a dump of a catalog abend on a Tuesday morning, the trace table
seemed to indicate a serialization problem. Looked at a UCB, no reserve count.
And the UCBTYP was unfamiliar. We had done a major DASD changeout over the
weekend, and changed to using HCD panels. We had always done gens by copying
the macros, and the share parameter didn't mean anything on the panel.
Operations took an afternoon for a team building meeting out in the country, so
we filled in on the consoles. Big storm came up, tornado warnings, sirens, the
whole nine yards. Decided to stick it out at the console. We got our chilled
water from a commercial source, they called from the basement to say that the
chilled water was warming up (oh great) and the department head walked in the
door with a flunky and said: "Oh, good. Kopplin is here". Didn't do a damn
thing, the tornado missed our building, the water cooled off after the tornado
went by, and nobody in operations knew there had been a tornado. Ya never know.
Back in the '70s, an IMS bug wiped out a VTOC. Operations manager had them
start the FDR restore, and after a few blinks on the 3420, had them cancel it
(that's enough to get the VTOC back). The techs were busy for a few weeks after
that, recovering data.
We had a 168 that overheated. The CE figured out there was sand in the chilled
water supply and backflushed the 168. Our Operations Manager had been a
Machinist Mate in the Navy and watched. Next time it started overheating, he
took a look, flipped some valves, and backflushed the 168. He was a _good_
Machinist Mate.
Ya know this could become the never ending story....
Lance
Date: Sun, 7 Jan 2001 14:03:42 +1100
From: Ken Brick
Organization: Brick Computer Services Pty. Ltd.
Subject: Re: Learn from Other's Mistakes
Mike Hall's mention of his son turning off a 4341 rang a bell. I used to
take my son aged about 3 into work on the weekend to give the wife some
relief. He found power buttons on the 3340 disk drives and also on a
printer I didn't know existed (the power off buttons not the units).
After many years as a DOS/VS sysprog I made the switch to MVS and one
day had an IPL scheduled for that evening. First IPL that I had to do.
JES2 failed with a JCL error. Apparently during the day my manager had
accidentally deleted one of the PROCLIBS, realised and restored it but
didn't catalog it. Caused a 20 mile trip to our other site to build a
SYSRES with a JES2 proc that worked. We changed the way we coded the
proc after that incident.
Another I remember because it involved an IBM mistake was we had a 4381
that was being upgraded to a dyadic. I went thru the stage 1 sysgen deck
and found a single parameter, ACR from memory, that I thought needed
changing and then a full sysgen. At upgrade planning review meeting IBM
told us no it didn't and what's more they confirmed it the next day. The
upgrade got done, the system IPLed and we only had one CP. Very fast
full sysgen had to be performed with red faced IBMers getting in the
way.
--
Ken Brick
Brick Computer Services Ltd.
kbrick@netspace.net.au
PH: 613 9817 5506
Mob. 0409009764
Date: Sun, 7 Jan 2001 19:21:24 +0000
From: Simon Pawson
Subject: Re: Learn from Other's Mistakes
List,
Apologies if this topic is dying off but I had to set up mail from home
as I felt that I couldn't post this from where I work now (which is where
I normally lurk on the list from). Please note that these events did NOT
happen where my current contract is (although at least one other person
on the list worked at the shop when these events happened.)
I confess to only one of these.
Problem (1) 'free Money'
Just before Xmas several years ago.....
Due to a minor bug a junior programmer was given the task of making a
minor change to the software that ran this bank's ATM network. This was
such a minor change to an urgent problem the change would go in via the
'patch' process rather than with a full system test. It got through unit
testing (designed by the programmer himself just to test his change) and
went live. Suddenly the branches reported that people were claiming not
to be getting the correct amount of money from the ATM.
The programmer had tidied up the code (as it comprised of very clumsy
coding of double negative IF statements) in another part of the same
module. What he had logically done is reverse the 10 and 5 note hoppers
in the machines across the entire network so that if you asked for 50
instead of getting 4x10 and 2x5 you got 4x5 and 2x10. Of course if you
asked for some amounts you got far more than you had asked for.
Interestingly some customers worked this out and went back many times.
Some lucky support people missed the pre Xmas office party writing a
quick and dirty program to reverse the process so customers ended up
being debited the correct amounts.
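The effect of swapping the two hoppers can be sketched as a small Python model. Everything here is an assumption made for illustration: the function names, and in particular the rule that at most four 10s go into one withdrawal, which is chosen only so that a request for 50 is planned as 4x10 and 2x5 as in the story. It is not the bank's actual note-selection logic.

```python
def plan_notes(amount, max_tens=4):
    """Plan (tens, fives) for an amount that is a positive multiple of 5.

    max_tens is a hypothetical cap chosen so that 50 -> 4x10 + 2x5.
    """
    if amount <= 0 or amount % 5:
        raise ValueError("amount must be a positive multiple of 5")
    tens = min(amount // 10, max_tens)
    fives = (amount - 10 * tens) // 5
    return tens, fives


def dispensed_value(amount, swapped=False):
    """Value actually handed out for a requested amount."""
    tens, fives = plan_notes(amount)
    if swapped:
        # The bug: the "10" count is pulled from the hopper holding 5s
        # and the "5" count from the hopper holding 10s.
        return tens * 5 + fives * 10
    return tens * 10 + fives * 5


for request in (5, 50, 100):
    print(request, dispensed_value(request), dispensed_value(request, swapped=True))
```

In this model a request for 50 pays out 40, while a request planned mostly in 5s pays out well over the amount asked for, which matches the mix of short-changed and over-paid customers described above.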
Moral (1)
Don't let the programmer test his own changes. There is no such thing as
a minor change on an important customer facing system. Don't fix things
that aren't broke. This individual also EPOed a centre once.
Problem (2). The library was there a minute ago.
Late one afternoon the shift ops phoned to say that IMS jobs that were
starting on the International Funds management IMS system were JCLing,
even though the IMS control region was up. It appeared that the IMS
reslib had disappeared.
What had happened is that (in those pre-GRS days) someone had run a job
that either deleted or renamed the library on the wrong system, without
a hold card. Sadly the rest of the suite that put in the new IMSGEN had
hold cards so were just sitting there ready to run. Luckily the system
was restored before the multi billion pounds of transactions missed
their deadlines.
Moral (2)
Double check your JCL. Again.
Problem (3) Disappearing ATMs
One morning the ops phoned to say that they had had reports from the
branches that their ATMs hadn't opened as normal. Instead of about 1700
cash dispensers there were only about 700 open. The branches with broken
ATMs couldn't do any work themselves. We then tried to back out the
change (which meant closing down the entire ATM and Branch network). The
backout failed as the backup was corrupt so we had to rebuild the entire
gen from scratch. It took hours.
What had happened is that the member of staff who had built this system
had decided without reference to anyone else to reorganise the source so
all the terminal definitions were now in alphabetical order. Sadly his
work wasn't very accurate and he removed lots of definitions that were
required. Interestingly enough he ignored error messages that came out
later in the process and continued without them. Sadly he had also been
responsible for the backup process and copied PDS files to the MSS as if
the 3330Vs were tapes. He then updated them as PDSes later on, thus
corrupting the backups.
Moral (3)
Get someone to check your work or even write something to do the
checking. Test the backups. If the system tells you you have made a
mistake, you never know, you may have...
This individual left and became the support manager for a software
house.
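"Write something to do the checking" could look like the Python sketch below, which compares the set of terminal definitions before and after a reorganisation of the source. The source format is an assumption made for the example (one definition per line, name as the first token, comment lines starting with '*'); the real gen source will differ, and the terminal names are invented.

```python
def definition_names(source_lines):
    """Return the definition names found in a list of source lines.

    Assumed format: the name is the first token on the line; blank
    lines and lines starting with '*' are ignored.
    """
    names = []
    for line in source_lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("*"):
            continue
        names.append(stripped.split()[0])
    return names


def compare_definitions(before, after):
    """Report definitions lost or gained by a reorganisation."""
    old, new = set(definition_names(before)), set(definition_names(after))
    return {"missing": sorted(old - new), "added": sorted(new - old)}


# Hypothetical source before and after sorting into alphabetical order.
before = ["ATM0003 TYPE=ATM", "ATM0001 TYPE=ATM", "* branch terminals", "BR0002 TYPE=BRANCH"]
after = ["ATM0001 TYPE=ATM", "BR0002 TYPE=BRANCH"]
print(compare_definitions(before, after))
```

A pure reordering should report nothing missing and nothing added; any other result is a reason to stop before the gen goes live.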
Problem (4) How to corrupt a database
Again, on the last working day before Xmas the shift ops reported lots
of error messages stating that the databases that held customer details
on the ATM network were possibly corrupt with lots of transactions
abending. There had been soft errors on a pack on this bank of drives
the previous day. In the 80s recovering this amount of data would mean
that the service would be out for the rest of the day. As you can
imagine there was a lot of panic. Strangely enough though retrying the
transactions seemed to work and the error messages moved across the
packs involved and jumped to other databases on the packs. Suddenly they
stopped with errors on fixed locations on the volumes.
It transpired that one of the storage team had run Inspects from another
CPU that had access to these volumes. Being keen I believe he had done
the whole string. As his team leader was in the pub he couldn't double
check that this was OK so he went ahead anyway. The ops had spotted the
job running and cancelled it - leaving the binary test pattern rather than
the business data behind. The team leader was back from the pub very
quickly.
Moral (4).
Think before you cancel that run-away job. Interestingly enough the
team-leader got more trouble as he made a total disaster of recovering
the databases. Luckily he was in the pub with his senior manager.
If you are still reading by now I confess to number (2).
If you want more, such as how dual frame 3990s (or maybe 3880s) really do
have a single point of failure and how an IBM engineer found it for us,
or how a simple JCL error delayed one of the biggest IBM ESPs ever, let
me know.
Simon
--
Simon Pawson
Date: Mon, 8 Jan 2001 08:43:58 +0100
From: Witold SCISLAK
Subject: Re: Learn from Other's Mistakes
Some years ago, at my previous job (an industrial enterprise,
1000-1500 mainframe terminals) I went down to the HMC room to re-IPL
my test LPAR. Just after I confirmed the IPL process I realized
that I had just re-IPLed the p r o d u c t i o n LPAR. I stood
there motionless, pale, covered with cold sweat,
waiting for furious phone calls from the users but ... there was
a dead silence. What is going on? I looked at the clock. It was
15:01. In this enterprise, for both workers and clerks, the
shift change time was 15:00. The first shift had gone home, the
second had not started their work yet (at mainframe terminals at
least.)
...............
Pozdrowienia/Regards, Witek.
- If you think education is expensive -- try ignorance -
OS/390 Software Support
Date: Mon, 8 Jan 2001 11:49:58 -0500
From: Dave Juraschek
Subject: Re: Learn from Other's Mistakes
(1)
Went to a shop once to upgrade a guest OS they ran under VM.
Before going there, insisted that they do two backups of the OS
being upgraded. The upgrade process required that you base
install the new OS, then apply local mods, files, applications,
accounts etc on top of the new OS code. Thus, the backup of
the old OS was critical. For safety's sake, did a DDR backup of
everything upon arrival at the site.
Installed new OS and when I went to add their local stuff back
on top of the new OS, we discovered that both backups (as
well as all they had been doing for 6+ months) were trash. Seems
that there was an error message that nobody thought anything of.
And even though this OS had a utility that would read the backup
and verify its integrity, they had by policy opted to skip this
verification, trusting that the untested backup was o.k.
Good thing I had the DDR's. Had to restore their old system,
fix the backup process. Run it again (twice - I always back up
twice in case of I/O errors in the media). Run the verify job against
the backup. Re-install the OS and re-apply the local stuff on top.
Lost 18 hours for what should have been a 4 hour job. If it were
not for my DDR's, everything would have been toast. After that,
I've never compromised on the double backup discipline.
(2)
Same story as already told by others.
Made a minor change to SYSRES (JES, I think). System
wouldn't come up. Backup was bad - operators had re-used
a tape and forced it to be accepted - blowing away the backup.
Other backup had an I/O error. SOL and nauseated!
Luckily, the night before a full system dump had been done.
Had to hunt down its output, figure out what tapes were used,
restored SYSRES. IPLed & re-did change (checking very carefully
this time - I knew what I had screwed up). Had system back up
just inside maintenance window. Thanked God for His grace.
-Dave
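For story (1) above, where the shop had skipped the verify step by policy,
here is a minimal sketch of what "read the backup back and check it" can
look like for ordinary files. It is not the OS utility Dave mentions; the
directory layout and the function names are assumptions made for the
example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir, backup_dir):
    """Return a list of problems; an empty list means the backup matches.

    Assumption: the backup is a plain copy of the source tree. Extra
    files that exist only in the backup are not reported.
    """
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    problems = []
    for source_file in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = source_file.relative_to(source_dir)
        backup_file = backup_dir / relative
        if not backup_file.is_file():
            problems.append(f"missing from backup: {relative.as_posix()}")
        elif sha256_of(source_file) != sha256_of(backup_file):
            problems.append(f"content differs: {relative.as_posix()}")
    return problems
```

The point is the same as in the story: a backup only counts once something
has read it back and compared it, and any reported problem should stop the
upgrade before the old system is overwritten.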
Date: Mon, 8 Jan 2001 12:46:18 -0600
From: Craig Otway
Subject: Re: Learn from Other's Mistakes
How about selecting a test LPAR on the HMC to IPL, and OS/2 leaving the prod
icon selected too. :)
Date: Wed, 10 Jan 2001 18:10:46 -0600
From: John Ford
Subject: Re: Learn from Other's Mistakes
Thought you'd see the last of this thread?
I have three, from different shops I've worked at...
1) Male Power
Peter Duffy's Electrical Connectors problem reminded me of this one. We were
adding disk drives, and the electrician had run the power under the floor,
terminated with the connector specified. I was pulling floor tiles to make
sure all the cables were in the right place, and noticed the power
connector. Being attracted to shiny metal things, I picked it up, and
suddenly realized what I had in my hand was a MALE plug, with the prongs
carrying 200+ volts, just waiting for a place to go. Sure enough, when I
finally convinced the electrician to double-check it, they had ordered the
wrong part number. The female plug's part number was one-off the male.
Lesson: When an expert contradicts common sense, go with common sense.
2) Transfer to Oblivion
New release of banking application going in, during the nightly batch
window. Normal process is to run a job that deletes the backup load library,
copies the current version from production to backup, deletes the production
library members, copies members from the test library to production, and
finally deletes the members from test. To rehearse the process, we would comment out the
deletes, and point the "to" libraries to temporary datasets. For the actual
cut-over, we'd un-comment the deletes, and point the "to" libraries to the
real datasets. On this night, new "tech-support guy in charge of crap jobs"
did everything right except for pointing the "to" libraries back to the real
datasets -- he left them pointing to the temp datasets. Transfer job
completed normally, but all libraries were empty. Yup, we started submitting
compile jobs. Lesson: JCL and THC don't mix.
3) Unlabeled Tapes
Implementing an automated tape management system is a long, manual chore --
cataloging all datasets & volsers, changing JCL, etc., culminating in the
"leap of faith" step of removing the dataset labels from the tapes
themselves. Tech support assured the operations manager that they weren't
needed, and would just cause confusion if left on the tapes. Once we went
live, and proved the concept, the ops manager had the night shift remove the
labels from all the tapes in the racks. Next morning came the question, "How
do we know which tape is which?" Seems the night shift wasn't told to leave
the volser labels on the tapes. Fortunately, it was hundreds, not thousands,
of tapes in the racks. Lesson: Instructions should include what NOT to do.
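The cut-over in 2) Transfer to Oblivion went wrong because the deletes were
switched back on while the "to" libraries still pointed at temporary
datasets. A minimal sketch of a pre-flight check for that combination
follows. The "TEMP." prefix, the dataset names, and the function itself are
made up for illustration; the original job was plain JCL edited by hand.

```python
def preflight(deletes_enabled, targets, temp_prefix="TEMP."):
    """Return reasons not to run the transfer; an empty list means go ahead.

    targets maps a role ("backup", "production") to the dataset name the
    job will copy INTO. temp_prefix is a hypothetical naming convention
    for rehearsal datasets.
    """
    problems = []
    temp_targets = sorted(
        role for role, name in targets.items() if name.startswith(temp_prefix)
    )
    # Real cut-over: deletes are live, so every target must be a real dataset.
    if deletes_enabled and temp_targets:
        problems.append(
            "deletes are enabled but these targets are still temporary: "
            + ", ".join(temp_targets)
        )
    # Rehearsal: deletes are off, so nothing should land in a real dataset.
    if not deletes_enabled and len(temp_targets) != len(targets):
        real = sorted(set(targets) - set(temp_targets))
        problems.append(
            "rehearsal run would write to real datasets: " + ", ".join(real)
        )
    return problems
```

Run before submitting, a check like this would have flagged that night's
job: deletes enabled, both targets temporary.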
Date: Thu, 11 Jan 2001 08:31:41 -0600
From: Cliff Hess
Subject: Re: Learn from Other's Mistakes
I've resisted this thread for over a week, partially because I thought a
similar story would be shared.
I got a call late one Friday afternoon that one of the RVAs in the shop was
getting moderate cache alerts which quickly turned into critical alerts.
About 45 seconds after getting the severe cache alert, the phone rang. When
an RVA hits 90% NCL, you get a call from IBM suggesting that you HTFQ (Hold
Those Foolish Queues). After cleaning up this mess, we got down to the
bottom of this issue.
A year earlier when this RVA arrived, it had enough back end storage to
support 128 addresses. Since the plan was to upgrade this box in a year,
256 addresses were genned to 'save time in the future'. That afternoon
another guy in the unit, who had only been there for 6 months, got a call
from the DB2 group. They wanted 16 volumes to 'try something' and, not
knowing the configuration situation, he obliged them and went off to a
meeting without telling anybody what he had done. I'd only been at this shop for a month so
I didn't have a clue either.
Of course, the genius behind all of this didn't remember this when he
'trained' the two of us. He suddenly remembered a meeting in another
building as soon as the trouble started. It made for quite an interesting
Friday afternoon that lasted until about 10:30 PM.
Recounting this story has reminded me of a question that I've been meaning
to ask. Does anybody know for sure what happens if an RVA hits 100% NCL?
I've heard that the micro code gets confused and all data on the box is
lost. But then again I also heard that:
Jerry Mathers (the Beaver) was killed in Viet Nam in 1970.
Mel Gibson had to have reconstructive plastic surgery on his face after a
bar room fight when he was 18.
If you leave a cat alone in a room with a baby, it will be attracted to
the smell of milk on its breath and suck the life out of it.
Date: Wed, 17 Jan 2001 02:31:24 +0800
From: Ron & Jenny Hawkins
Subject: Re: Learn from Other's Mistakes
I tested this years ago on an original 9200 (like, it had HP drives in
it). It just stops accepting writes, like when cache is full. In the
shop where I set this up I allocated 2 volumes with 16MB of hard-to-
compress data. If the box showed signs of going non compos mentis we
just deleted the volumes so it had some more room to write.