1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
|
System crash dump analysis report
=================================
Symptom
-------
System behaved as hang, TOC dump be taken for root cause finding.
Root Cause
----------
A contention between beta semaphore shm_lock and vmtotal_lock.
This lead to hang situation.
Action Plan
-----------
Cumulative VM patch PHKL_36133 installation is recommended.
PHKL_36133 (2) [S] Cumulative VM, Psets, Preemption, PRM, MRG
(JAGae20932)
Detail Analysis
---------------
p4> SystemInfo
System Name : HP-UX
Node Name : localhost
Model : 9000/800/SD64000
HP-UX version : B.11.11 (64-bit Kernel)
Number of CPUs : 8
Disabled CPUs : 2
CPU type : PCX-W+ (875 Mhz)
CPU Architecture : PA-RISC 2.0
Physical Memory : 2093056 pages ( 7.98 Gb )
Panic string : , isr.ior = 0'0.0'0
Boot string : disk(12/0/8/0/0.0.0.0.0.0.0;0)/stand/vmunix
Kernel KRS file : /stand/krs/_stand_vmunix.krs
System KRS file : /stand/krs/system.krs
System boot time : Tue Apr 8 17:23:31 2008 (TZ=UTC-8)
System down time : Tue Nov 11 04:50:36 2008 (TZ=UTC-8)
System was up for 1870431197 ticks ( 216 days 11 hours 38 mins 31 secs )
Load averages:
avenrun : 2.000 2.000 2.000
real_run : 2.000 2.000 2.000
disk wait jobs : 22 [hint of 'pwrun' from vmtotal() - 1653648 ticks ago]
mp_avenrun[ 0] : 0.000 0.000 0.000
mp_avenrun[ 1] : 0.000 0.000 0.000
mp_avenrun[ 2] : 0.000 0.000 0.000
mp_avenrun[ 3]* : 0.000 0.000 0.000
mp_avenrun[ 4] : 0.000 0.000 0.000
mp_avenrun[ 5] : 2.000 2.000 2.000
mp_avenrun[ 6]* : 0.048 0.015 0.005
mp_avenrun[ 7] : 0.000 0.000 0.000
( * => disabled cpu )
deactivated cnt : 20
Kthread states:
Number of threads in TSRUN state = 23
Number of threads in TSSLEEP state = 431
Number of threads in TSSTOP state = 10
Number of threads in TSZOMB state = 5
Number of threads in TSIDL state = 6
Number of threads with TSDEACT flag = 21
Number of threads with TSDEACTSELF flag = 1
Processor states:
Processor : Cpu_State Operational_State Config_State
cpu[ 0] : SPUSTATE_IDLE SPU_ENABLED PROC_CONFIGED
cpu[ 1] : SPUSTATE_IDLE SPU_ENABLED PROC_CONFIGED
cpu[ 2] : SPUSTATE_IDLE SPU_ENABLED PROC_CONFIGED
cpu[ 3] : SPUSTATE_IDLE SPU_DISABLED PROC_CONFIGED
cpu[ 4] : SPUSTATE_IDLE SPU_ENABLED PROC_CONFIGED
cpu[ 5] : SPUSTATE_SYSTEM SPU_ENABLED PROC_CONFIGED
cpu[ 6] : SPUSTATE_IDLE SPU_DISABLED PROC_CONFIGED
cpu[ 7] : SPUSTATE_IDLE SPU_ENABLED PROC_CONFIGED
The load averages is very low and most of the processors (except cpu 5)
are idle. The processors are healthy and they are not in a blocked state.
Crashinfo output clearly shows that there is a contention for the beta
semaphore shm_lock and vmtotal_lock. This is not good since this will lead
to hang situation.
Most Common Wait Channels
=========================
ticks since run:
Wait Channel count longest shortest
------------ ----- ---------- ----------
shm_lock 69 1652651 1473696
vx_worklist_thread_sv 25 128 28
async_bufhead 20 1870420709 1870420709
lock_read(0x56c14ed8) 18 1726014 1602477
voliod_sync 10 1870425045 1870425042
per_processor_selects 8 1870419272 4
streams_mp_sync 8 9620508 9620482
memory_sleepers 8 1473679 52665
per_processor_selects+0x80 7 1870415140 1180
vmtotal_lock 7 1629625 1479673
...
And their waiters are ...
Threads waiting on Beta Semaphores
==================================
TICKS
WAITING OWNER SINCE
TID TID RUN CPU PRI SEMA ADDRESS CALLER
------- ------- ---------- --- --- -------------- ----------
133139 312772 1629625 0 646 0xef3148 vmtotal+0x38
3930 312772 1623646 0 571 0xef3148 vmtotal+0x38
133203 312772 1599623 0 646 0xef3148 vmtotal+0x38
133268 312772 1569617 0 646 0xef3148 vmtotal+0x38
133338 312772 1539596 0 646 0xef3148 vmtotal+0x38
133403 312772 1509605 0 646 0xef3148 vmtotal+0x38
133486 312772 1479673 0 646 0xef3148 vmtotal+0x38
133080 133079 1652651 0 646 0xeba700 shmget+0x44
312772 133079 1649495 0 646 0xeba700 shm_unattached_vmtotal+0x24
5367 133079 1644634 0 646 0xeba700 shm_fork+0x20
133087 133079 1641657 0 646 0xeba700 shmget+0x44
133145 133079 1623653 0 646 0xeba700 shmget+0x44
133151 133079 1611649 0 646 0xeba700 shmget+0x44
...
Let's see what these two owner threads (tid 312772 and 133079) are doing;
p4> trace -U tid 312772
proc_list{131} pid=24466 tid=312772 cmd="ora_pmon_ARDR"
p_stat="SINUSE" : kt_stat="TSSLEEP"
Process : p proc_t 0x569cd040
pid=24466 oracle
Kthread : p kthread_t 0x5c08d040
Using PCB: p user_t 0x8619c00.0x400003ffffff0000
SR5=0x8619c00
LVL FUNC ( ARG0, ARG1, ARG2, ARG3, ARG4 )
0) _swtch+0xc4 ( 0x1ec82e, n/a, n/a, n/a, n/a )
1) _mp_b_sema_sleep+0x108 ( 0xeba700, n/a, n/a, n/a, n/a )
2) shm_unattached_vmtotal+0x24 ( 0x400003ffffff0f94, 0x400003ffffff0f98, n/a, n/a, n/a )
3) vmtotal+0x110 ( n/a, n/a, n/a, n/a, 0xffffffffffffffff )
4) pstat_dynamic+0xa4 ( n/a, n/a, n/a, n/a, n/a )
5) pstat+0xf4 ( 0x400003ffffff03a0, 0x8d5aec, n/a, 0xb25510, n/a )
6) syscall+0xaec ( n/a, n/a, n/a, n/a, n/a )
7) syscallinit+0x554 ( n/a, n/a, n/a, n/a, n/a )
p4> trace -U tid 133079
proc_list{202} pid=28294 tid=133079 cmd="oracleARDR
(DESCRIPTION=(LOCAL=YES)(ADDR" p_stat="SINUSE" : kt_stat="TSSLEEP"
Process : p proc_t 0x782a0040
pid=28294 oracle
Kthread : p kthread_t 0x67d53040
Using PCB: p user_t 0x1d3d000.0x400003ffffff0000
SR5=0x1d3d000
LVL FUNC ( ARG0, ARG1, ARG2, ARG3 )
0) _swtch+0xc4 ( 0x1634c4, n/a, n/a, n/a )
1) _sleep_one+0x1a8 ( 0x56c14ede, 0x285, n/a, n/a )
2) lock_write+0x33c ( 0x56c14ec8, n/a, n/a, n/a )
3) shmat+0x184 ( 0x400003ffffff03a0, 0x8d55f8, n/a, 0xb25510 )
4) syscall+0xaec ( n/a, n/a, n/a, n/a )
5) syscallinit+0x554 ( n/a, n/a, n/a, n/a )
Tid 312772 got the vmtotal_lock bsema and is now asleep waiting for
shm_lock. So, all the beta semaphores waiters are due to tid 133079 which
is currently holding shm_lock. This thread has gone to sleep waiting for a
rw_lock.
p4> Kthread -t 312772
Loaded 1 kthread_t entries in 'DefaultView'
p4> pview -V
Kthread tid pri spu grp bind kt_stat lastrun pid comm syscall kt_wchan
0x5c08d040 312772 134 1 -1 no TSSLEEP 1649495 24466 oracle pstat 0
p4> pview -V
Kthread tid pri spu grp bind kt_stat lastrun pid comm syscall kt_wchan
0x67d53040 133079 133 2 -1 no TSSLEEP 1653533 28294 oracle shmat 0x56c14ede
p4> Kthread -t 2
Loaded 1 kthread_t entries in 'DefaultView'
p4> pview -V
Kthread tid pri spu grp bind kt_stat lastrun pid comm syscall kt_wchan
0x515a8040 2 128 5 -1 no TSRUN 1726013 2 vhand KI_vHAND 0
The owner of this rw_lock is responsible for tid 133079 being asleep for so
long is. The rw_lock structure shows who its owner is ie,
p4> p struct rw_lock 0x56c14ec8
0x56c14ec8
0x56c14ec8 :
0x56c14ec8 struct rw_lock {
0x56c14ec8 lock_t *interlock; 0x00000000_1a98f7c0
0x56c14ed0 unsigned int delay; 0x00000000
0x56c14ed4 unsigned int write_waiters; 0x00000002
0x56c14ed8 int read_count; 0x00000000
0x56c14edc char want_write; 0x01
0x56c14edd char want_upgrade; 0x00
0x56c14ede char waiting; 0x09
0x56c14edf char rwl_flags; 0x05
0x56c14ee0 struct kthread *l_kthread; 0x00000000_515a8040
0x56c14ee8 };
p4> trace -U -T 0x00000000_515a8040
proc_list{5} pid=2 tid=2 cmd="vhand"
p_stat="SINUSE" : kt_stat="TSRUN"
Process : p proc_t 0x515a7040
pid=2 vhand
Kthread : p kthread_t 0x515a8040
Running on SPU #5
Processor #5
============== EVENT ============================
= Event #7 is TOC on CPU #5
= p crash_event_t 0xa79150
= p rpb_t 0x37b8ef0
= Using pc from pim.wide.rp_pcoq_head_hi = 0x176b34
============== EVENT ============================
SR5=0x1f04800
LVL FUNC ( ARG0, ARG1, ARG2, ARG3, ARG4, ARG5, ARG6 )
0) superpage_unlock+0x84 ( 0x5ab67, n/a, n/a, n/a, n/a, n/a, n/a )
1) vhand_vfdcheck+0x110 ( 0x5ab67, 0x453b9938, 0x453b993c, 0x400003ffffff0c70, n/a, n/a, n/a )
2) for_val3+0x5c ( 0x56c14e40, n/a, n/a, n/a, 0x400003ffffff0d10, n/a, n/a )
3) for_val2+0x29c ( 0x575c21c0, n/a, 0x40000, n/a, 0x471f30, 0x400003ffffff0d10, 0x4 )
4) for_val2+0x13c ( 0x575c21c0, n/a, 0x40000, n/a, 0x471f30, 0x400003ffffff0d10, 0x3 )
5) for_val2+0x13c ( 0x575c21c0, n/a, 0x40000, n/a, 0x471f30, 0x400003ffffff0d10, 0x2 )
6) for_val2+0x13c ( n/a, n/a, n/a, n/a, 0x471f30, 0x400003ffffff0d10, 0x1 )
7) foreach_valid+0x54 ( 0x56c14e40, 0x40000, 0x40000, 0x18f640, 0x400003ffffff0c70, n/a, n/a )
8) agepages+0x11c ( 0x56649440, 0xef3, n/a, n/a, 0xbd79e0, n/a, n/a )
9) vhand_core+0x294 ( 0x400003ffffff0a50, 0x0, 0x0, n/a, n/a, n/a, n/a )
10) vhand+0x2fc ( n/a, n/a, n/a, n/a, n/a, n/a, n/a )
11) im_vhand+0xd4 ( n/a, n/a, n/a, n/a, n/a, n/a, n/a )
12) DoCalllist+0x3c ( n/a, 0x1, n/a, n/a, n/a, n/a, n/a )
13) main+0x28 ( n/a, n/a, n/a, n/a, n/a, n/a, n/a )
14) $vstart+0x48 ( n/a, n/a, n/a, n/a, n/a, n/a, n/a )
15) $locore+0x94 ( n/a, n/a, n/a, n/a, n/a, n/a, n/a )
Vhand owns the rw_lock and is currently running on cpu 5. It has been
hogging cpu 5 for a long time ie,
p4> ProcInfo 5
Cpu state and thread information :
----------------------------------
cpu hpa uareasid threadp prevthreadp curstate
5 0xfffffffffcc7a000 0x1f04800 0x515a8040 0x515a8040 SPUSTATE_SYSTEM
ICS, idle and kernel stack information :
----------------------------------------
cpu ics ibase topics idlestack ksp
5 0x3806000 0x3806000 0x3811000 0x1a9af000 0
Last idle, last ran a timeshare thread, last entry into kernel :
-----------------=----------------------------------------------
ticks_since_boot = 0x6f7c83dd
cpu last_idletime last_tsharetime preempt_start_time
5 0x6f622da0 (1726013) 0x6f622da0 (1726013) 0x6f622e0b (1725906)
This problem has been resolved by patch PHKL_27825. Its latest replacement
patch is PHKL_36133. I would suggest that the customer install PHKL_36133
to address this problem.
|