EXPDP Streams AQ: enqueue blocked on low memory

有一台11.2.0.4的库最近expdp导出非常慢,总是停留在99%以后还要持续2个多小时,还有另外一台同版本的数据库也碰到了类似的情况。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Job: SYS_EXPORT_SCHEMA_01
Owner: XXX
Operation: EXPORT
Creator Privs: TRUE
GUID: 8A27A5FA09F142F9E055000000000001
Start Time:
Mode: SCHEMA
Instance:
Max Parallelism: 4
EXPORT Job Parameters:
Parameter Name Parameter Value:
CLIENT_COMMAND backup_user/******** dumpfile=
COMPRESSION ALL
State: EXECUTING
Bytes Processed: 45,234,549,312
Percent Done: 99
Current Parallelism: 4
Job Error Count: 0

当前导出的对象都是一些空的分区,查询导出任务是否在等待

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
SQL> select dw.waiting_session,
dw.holding_session,
b.serial#,
w.event,
w.program wprogram,
b.program bprogram,
w.module wmod,
b.module bmod,
LOCK_ID1
from sys.dba_waiters dw, v$session w, v$session b
where dw.waiting_session = w.sid
and dw.holding_session = b.sid
and (w.module like 'Data Pump%' or w.program like '%EXPDP%' or
w.program like '%IMPDP%')
order by dw.holding_session;

no rows selected


SQL> select sw.SID,
sw.SEQ#,
sw.EVENT,
sw.WAIT_TIME,
sw.SECONDS_IN_WAIT,
sw.STATE,
sw.P1TEXT,
sw.P1,
sw.P2TEXT,
sw.P2,
sw.P3TEXT,
sw.P3
from V$SESSION_WAIT sw, v$session s
where sw.wait_class <> 'Idle'
and sw.sid = s.sid
and (s.module like 'Data Pump%' or s.program like '%EXPDP%' or
s.program like '%IMPDP%');

SID SEQ# EVENT WAIT_TIME SECONDS_IN_WAIT STATE P1TEXT P1 P2TEXT P2 P3TEXT P3
---------- ---------- ---------------------------------------------------------------- ---------- --------------- ------------------- ---------- ---------- ---------- ---------- ---------- ----------
1801 31734 Streams AQ: enqueue blocked on low memory 0 1 WAITING 0 0 0
2540 3931 Streams AQ: enqueue blocked on low memory 0 1 WAITING 0 0 0

等待显示导出队列在等待与streams相关的内存,在sga中就是streams pool,查看当前streams pool的使用情况

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
select component
,current_size/1024/1024 "Current MB"
,min_size/1024/1024 "Min MB"
from v$sga_dynamic_components;

COMPONENT Current MB Min MB
---------------------------------------------------------------- ---------- ----------
shared pool 3232 1472
large pool 64 64
java pool 32 32
streams pool 0 0
DEFAULT buffer cache 6624 6432
KEEP buffer cache 0 0
RECYCLE buffer cache 0 0
DEFAULT 2K buffer cache 0 0
DEFAULT 4K buffer cache 0 0
DEFAULT 8K buffer cache 0 0
DEFAULT 16K buffer cache 0 0
DEFAULT 32K buffer cache 0 0
Shared IO Pool 0 0
ASM Buffer Cache 0 0

因为这个库设置了sga_target,相关组件的内存大小分配是由oracle自动来控制的,所以可能碰上了bug。

Workaround:

  1. 重启数据库,通常能解决这个问题
  2. 修改参数,设定streams_pool_size的最小值
1
2
CONNECT / as sysdba
ALTER SYSTEM SET streams_pool_size=150m SCOPE=both;
  1. 如果无法动态调整,则需要重启
1
2
3
4
CONNECT / as sysdba
ALTER SYSTEM SET streams_pool_size=150m SCOPE=spfile;
SHUTDOWN IMMEDIATE
STARTUP
  1. 打补丁Patch 17365043
    或者修改隐含参数
1
alter system set "_disable_streams_pool_auto_tuning"=TRUE;

做完以后重新执行datapump任务。

ORA-15001: diskgroup "FRA" does not exist or is not mounted

一台数据库在启动时候报错,提示一个asm group不存在

1
2
3
4
5
6
7
8
9
10
11
12
ORA-00210: cannot open the specified control file
ORA-00202: control file: '+FRA/STBRAC/CONTROLFILE/current.294.1005908473'
ORA-17503: ksfdopn:2 Failed to open file +FRA/STBRAC/CONTROLFILE/current.294.1005908473
ORA-15001: diskgroup "FRA" does not exist or is not mounted
ORA-15040: diskgroup is incomplete
ORA-205 signalled during: alter database mount...


WARNING: failed to open a disk[/dev/asmdiskd]
ORA-15025: could not open disk "/dev/asmdiskd"
ORA-27041: unable to open file
Linux-x86_64 Error: 13: Permission denied

检查asm的磁盘等信息均正常

1
2
3
4
5
6
7
8
9
10
11
[grid@stbracnode1 ~]$ asmcmd
ASMCMD> lsdg
State Type Rebal Sector Logical_Sector Block AU Total_MB Free_MB Req_mir_free_MB Usable_file_MB Offline_disks Voting_files Name
MOUNTED EXTERN N 512 512 4096 4194304 51200 16784 0 16784 0 Y CRS/
MOUNTED EXTERN N 512 512 4096 4194304 61440 55336 0 55336 0 N DATA/
MOUNTED EXTERN N 512 512 4096 4194304 20480 17404 0 17404 0 N FRA/

[grid@stbracnode1 ~]$ ll /dev/asm*
brw-rw---- 1 grid asmadmin 8, 16 Apr 18 13:22 /dev/asmdiskb
brw-rw---- 1 grid asmadmin 8, 32 Apr 18 13:22 /dev/asmdiskc
brw-rw---- 1 grid asmadmin 8, 48 Apr 18 13:22 /dev/asmdiskd

查询MOS说是权限出现了问题,ORACLE_HOME目录下的oracle二进制文件的属主应该是asmadmin,而不是oinstall。

1
2
[oracle@ ~]$ ll /u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle
-rwsr-s--x 1 oracle oinstall 407944928 Apr 16 15:53 /u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle

查看其它正常的asm节点

1
2
[oracle@ dbs]$ ll /u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle
-rwsr-s--x 1 oracle asmadmin 407944928 Apr 16 15:48 /u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle

这台集群刚刚进行了grid和database的补丁升级,应该是由于这个原因导致属主进行了变更

知道问题原因以后,解决办法就简单了,切换到grid用户下

1
2
3
4
setasmgidwrap o=/u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle

[grid@ ~]$ ll /u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle
-rwsr-s--x 1 oracle asmadmin 407944928 Apr 16 15:56 /u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle

重新启动数据库,问题解决

1
2
3
4
5
6
7
8
9
SQL> startup mount
ORACLE instance started.

Total System Global Area 2516582400 bytes
Fixed Size 8795904 bytes
Variable Size 754977024 bytes
Database Buffers 1744830464 bytes
Redo Buffers 7979008 bytes
Database mounted.

ORA-20200: Database/Instance does not exist in AWR_PDB_DATABASE_INSTANCE

当我在12.2中通过pdb登录生成awr时报错

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
[oracle@xb admin]$ sqlplus sys/oracle@pdb12c as sysdba

sys@PDB12C> @?/rdbms/admin/awrsqrpt.sql

Specify the Report Type
~~~~~~~~~~~~~~~~~~~~~~~
Would you like an HTML report, or a plain text report?
Enter &#039;html&#039; for an HTML report, or &#039;text&#039; for plain text
Defaults to &#039;html&#039;
Enter value for report_type:

Type Specified: html

Specify the location of AWR Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
AWR_ROOT - Use AWR data from root (default)
AWR_PDB - Use AWR data from PDB
Enter value for awr_location: AWR_PDB

Location of AWR Data Specified: AWR_PDB >>>>====这里选择的AWR_PDB

declare
*
ERROR at line 1:
ORA-20200: Database/Instance 3393322654/1 does not exist in
AWR_PDB_DATABASE_INSTANCE
ORA-06512: at line 27

但是如果选择AWR_ROOT时则正常

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
Location of AWR Data Specified: AWR_ROOT        >>>>====

...

Current Instance
~~~~~~~~~~~~~~~~
...

DB Id DB Name Inst Num Instance
----------- ------------ -------- ------------
380590835 ORA12C 1 ora12c

...

Instances in this Workload Repository schema
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
DB Id Inst Num DB Name Instance Host
------------ ---------- --------- ---------- ------
* 380590835 1 ORA12C ora12c xb.oracle.co

Using 380590835 for database Id
Using 1 for instance number

Enter value for num_days:

换成CDB登录一样正常,因为AWR Data默认就是CDB。

这是由于参数Awr_pdb_autoflush_enabled默认为FALSE,阻止了AWR相关视图的数据写入,所以接下来就说一下如何开启PDB级别的awr自动快照。

Oracle 12.2开始允许在CDB或PDB级别的AWR快照,CDB级别的快照包括CDB和PDB的统计信息。

AWR报告会列出所有可用容器库的相关信息,下图为CBD级别awr报告中展示的部分sql统计信息,包含PDB和根CDB

自动快照默认是在CDB级别开启,在PDB级别关闭,oracle建议是在PDB级别进行手工收集,或者只对单独有性能问题的PDB开启自动快照。

可以通过将参数Awr_pdb_autoflush_enabled设为TRUE,如果在CDB级别设置这个参数,则会对所有PDB都进行开启自动快照,如果只在PDB级别设置,则只会影响单个PDB。

当设置完参数以后,登录到pdb里面,通过DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS过程将interval设为大于0的值。

1
2
3
4
5
6
7
8
9
10
sys@ORA12C> show pdbs

CON_ID CON_NAME OPEN MODE RESTRICTED
---------- ------------------------------ ---------- ----------
2 PDB$SEED READ ONLY NO
3 PDB12C READ WRITE NO

--cdb级别开启
ALTER SYSTEM SET awr_pdb_autoflush_enabled= TRUE SID='*' SCOPE=BOTH;
ALTER SYSTEM SET awr_snapshot_time_offset=1000000 SID='*' SCOPE=BOTH;

参数 awr_snapshot_time_offset 默认为0. 这里设成1000000主要是为了对不同pdb在创建awr时能有不同的时间偏移量,这样就能避免cpu瞬间增高

将默认值保持为零可能会在资源较低的系统中导致CPU峰值

登录pdb,执行MODIFY_SNAPSHOT_SETTINGS

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
sys@ORA12C> alter session set container=pdb12c;

Session altered.

sys@ORA12C> select * from awr_pdb_wr_control;

DBID SNAP_INTERVAL RETENTION TOPNSQL CON_ID
---------- --------------------------------------------------------------------------- ------------------------------------------------ ---------- ----------
3393322654 +40150 00:01:00.0 +00008 00:00:00.0 DEFAULT 3


BEGIN
DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS( retention => 11520,
interval => 60,
topnsql => 34,
dbid => 3393322654);
END;
/


sys@ORA12C> select * from awr_pdb_wr_control;

DBID SNAP_INTERVAL RETENTION TOPNSQL CON_ID
---------- --------------------------------------------------------------------------- --------------------------------------------------------------------------- ---------- ----------
3393322654 +00000 01:00:00.0 +00008 00:00:00.0 34 3

当第一个snapshot生成以后,AWR_PDB_DATABASE_INSTANCE则会包含PDB的数据

1
2
3
4
5
6
7
8
9
10
11
sys@ORA12C> show con_name

CON_NAME
------------------------------
PDB12C

sys@ORA12C> select con_id,dbid,INSTANCE_NUMBER,INSTANCE_NAME,CDB,LAST_ASH_SAMPLE_ID from AWR_PDB_DATABASE_INSTANCE;

CON_ID DBID INSTANCE_NUMBER INSTANCE_NAME CDB LAST_ASH_SAMPLE_ID
---------- ---------- --------------- ---------------- --- ------------------
3 3393322654 1 ora12c YES 115051

那重新来执行之前报错的awr脚本

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
sys@ORA12C> @?/rdbms/admin/awrrpt.sql

Specify the Report Type
~~~~~~~~~~~~~~~~~~~~~~~
AWR reports can be generated in the following formats. Please enter the
name of the format at the prompt. Default value is &#039;html&#039;.

&#039;html&#039; HTML format (default)
&#039;text&#039; Text format
&#039;active-html&#039; Includes Performance Hub active report

Enter value for report_type:

Type Specified: html

Specify the location of AWR Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
AWR_ROOT - Use AWR data from root (default)
AWR_PDB - Use AWR data from PDB
Enter value for awr_location: AWR_PDB

Location of AWR Data Specified: AWR_PDB >>>>====选择AWR_PDB并未报错

...

Specify the number of days of snapshots to choose from
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Entering the number of days (n) will result in the most recent
(n) days of snapshots being listed. Pressing >return> without
specifying a number lists all completed snapshots.


Enter value for num_days: 1

oracle无法启动ORA-27154,ORA-27300,ORA-27301,ORA-27302

发现一台实例突然无法启动,版本11.2.0.4

1
2
3
4
5
SQL> startup nomount
ORA-27154: post/wait create failed
ORA-27300: OS system dependent operation:semget failed with status: 28
ORA-27301: OS failure message: No space left on device
ORA-27302: failure occurred at: sskgpcreates

看报错信息大概是sskgpcreates创建失败,资源不足。这个主要是在数据库启动时根据process参数设置的值去分配信号资源,如果内核参数设置的不合理,则会导致报错。

在Linux里,信号量主要是用来控制特定进程的利用率。它是一个可共享的资源,分别由P (wait)和V (signal)函数操作,这两个函数分别递减和递增信号量。当进程需要资源时,发出”wait”,信号量递减。当信号量的值为0时,资源不可用,调用进程会spin或阻塞(视情况而定),直到资源可用为止。当进程释放由信号量控制的资源时,它会增加信号量并通知等待的进程。由下面几个参数控制

信号量 描述 最小值
SEMMSL 每个组的最大信号量 128
SEMMNS 系统级别最大信号数
SEMOPM 每个semop调用的最大操作
SEMMNI 最大组数

计算当前最少需要的信号量总数

1
sum(本机所有实例的process参数值)+本机系统需求+本机其他程序需求
  • 设置semmns大于这个sum值
  • 设置semmsl为256
  • 设置semmni为semmns/semmsl

查看当前系统参数

1
2
3
4
5
6
7
8
$ ipcs -ls

------ Semaphore Limits --------
max number of arrays = 128 // SEMMNI
max semaphores per array = 250 // SEMMSL
max semaphores system wide = 32000 // SEMMNS
max ops per semop call = 100 // SEMOP
semaphore max value = 32767

从这个配置来看理论上系统级别的最大信号数为32000,最大组数为128,每个组最大信号量250。

但是ipcs命令显示每个信号量标识符最多可以容纳156个Oracle信号量

1
2
3
4
5
6
7
8
9
10
11
$ ipcs
...
------ Semaphore Arrays --------
key semid owner perms nsems
0x2ff4a110 5701651 oracle 640 156
0x2ff4a111 5734420 oracle 640 156
0x2ff4a112 5767189 oracle 640 156
0x2ff4a113 5799958 oracle 640 156
0x2ff4a114 5832727 oracle 640 156
0x2ff4a115 5865496 oracle 640 156
...

那么实际上最大可用的信号总量为156 x 128=19968,所以要增加最大可用组数。

1
2
3
4
5
[root@whzdb1 ~]# /sbin/sysctl -a | grep sem
kernel.sem = 250 32000 100 128

-- change to:
kernel.sem = 250 32000 100 200

分析一则enq: HW – contention问题

同事在做压力测试的时候,说数据库很慢,于是登上数据库看了下,发现大量的enq: TX - row lock contentionenq: HW - contention事件。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
%This     SECONDS        AAS WAIT_CHAIN
------ ---------- ---------- ------------------------------------------------------------------------------------------------------------------------------------------------
18% 960 1.6 -> ON CPU
9% 470 .8 -> enq: TX - row lock contention -> ON CPU
8% 460 .8 -> enq: HW - contention -> ON CPU
8% 410 .7 -> enq: TX - row lock contention -> enq: HW - contention -> ON CPU
5% 260 .4 -> enq: TX - row lock contention -> enq: HW - contention -> enq: HW - contention
4% 220 .4 -> enq: HW - contention -> enq: HW - contention
4% 210 .4 -> enq: HW - contention -> enq: HW - contention -> ON CPU
3% 190 .3 -> enq: TX - row lock contention -> enq: HW - contention -> enq: HW - contention -> ON CPU
3% 180 .3 -> enq: TX - row lock contention -> enq: TX - row lock contention -> enq: HW - contention -> ON CPU
3% 160 .3 -> enq: TX - row lock contention -> enq: TX - row lock contention -> ON CPU
2% 120 .2 -> enq: HW - contention -> enq: HW - contention -> enq: HW - contention
2% 110 .2 -> enq: TX - row lock contention -> enq: HW - contention -> enq: HW - contention -> enq: HW - contention -> ON CPU
2% 100 .2 -> log file parallel write
2% 100 .2 -> enq: TX - row lock contention -> enq: HW - contention -> enq: HW - contention -> enq: HW - contention -> enq: HW - contention
2% 90 .2 -> enq: TX - row lock contention -> log file sync -> log file parallel write
2% 90 .2 -> enq: HW - contention -> enq: HW - contention -> enq: HW - contention -> enq: HW - contention -> enq: HW - contention -> enq: HW - contention
1% 80 .1 -> enq: TX - row lock contention -> enq: TX - row lock contention -> enq: HW - contention -> enq: HW - contention
1% 70 .1 -> enq: HW - contention -> enq: HW - contention -> enq: HW - contention -> ON CPU

前者一般情况代表应用程序可能存在不合理的情况,导致大量的行锁争用,而后者表示高水位的移动争用,HW锁通常是用于管理超出高水位的段空间分配。段的高水位线是段中已使用空间和未使用空间的边界,如果出现了这个等待事件也就表示使用的数据达到了高水位线,需要将高水位线提高来满足额外的数据插入,通常是频繁的数据扩展造成的。

通过v$session_wait查出P3值等

1
2
3
4
5
6
select P2TEXT,p2, p3 from v$session_wait where event = 'enq: HW - contention';

P2TEXT P2 P3
-------------------- ---------- ----------
table space 6 21051234

可以看出是表空间的自动扩展导致,然后用于分析是什么对象

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
select DBMS_UTILITY.DATA_BLOCK_ADDRESS_FILE(21051234) FILE#,
DBMS_UTILITY.DATA_BLOCK_ADDRESS_BLOCK(21051234) BLOCK#
from dual;

FILE# BLOCK#
---------- ----------
5 79714


select owner, segment_type, segment_name
from dba_extents
where file_id = 5
and 79714 between block_id and block_id + blocks - 1
and tablespace_name =
(select name from ts$ where ts# = 6);

OWNER SEGMENT_TYPE SEGMENT_NAME
------------------------------ ------------------ ---------------------------------------------------------------------------------
XXX TABLE XXX_T

此外,如果这个等待事件是正在发生的,可以直接查看到使用的段

1
2
3
4
select DBMS_UTILITY.DATA_BLOCK_ADDRESS_FILE(ID2) FILE#,
DBMS_UTILITY.DATA_BLOCK_ADDRESS_BLOCK(ID2) BLOCK#
from v$lock
where type = 'HW';

所以为了解决这个争用,直接调大increment_by的值,或者直接调成非自动扩展模式,即可大幅缓解这个问题。

除了今天碰到的这种情况外,引起enq HW - contention事件还有可能是其他的原因

  • IO性能问题
  • 一个包含繁忙lob段的表需要满足平均分配DML操作到各个分区的分区方式
  • lob段频繁的收缩扩展

针对最后一种情况也有一些常规的解决办法:

  • 手动给lob段增加一些额外的空间,这样lob段就不需要频繁的回收块
1
ALTER TABLE <lob_table> MODIFY LOB (<column_name>) (allocate extent (size <extent size>));
  • 通过shrink space命令和dbms_redefinition来对空间进行回收
1
ALTER TABLE test_lob MODIFY LOB (image) (SHRINK SPACE);

ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below

在做一个rman异机恢复完成后,数据库无法打开,报错提示

1
2
3
4
Oracle Error: 
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '+DG/xxx/datafile/system01.dbf'

用的源库一个rman 0级备份以及后来的部分归档日志,拷贝到这个测试环境,恢复完controlfile之后,进行了基于scn的recover,scn则是restore之后控制文件里显示的scn号

1
recover database until scn 75238614094;

尝试open时报错

1
2
3
4
5
6
16:54:07 sys. >alter database open resetlogs;
alter database open resetlogs
*
ERROR at line 1:
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '+DG/xxx/datafile/system01.dbf'

但是检查scn均一致,有点奇怪,因为一般来说如果这三个视图的checkpoint_change#一致的话,数据库是应该可以正常打开的。

1
2
3
4
5
sys >select (select distinct checkpoint_change# from v$datafile) dfile_ckp,(select checkpoint_change# from v$database) db_ckp,(select distinct checkpoint_change# from v$datafile_header) dh_ckp from dual;

DFILE_CKP DB_CKP DH_CKP
------------------ ------------------ ------------------
75238614094 75238614094 75238614094

检测数据文件的状态

1
2
3
4
5
6
7
8
9
10
set numwidth 30;
set pagesize 50000;
alter session set nls_date_format = 'DD-MON-RRRR HH24:MI:SS';

sys.>select status,checkpoint_change#,checkpoint_time, resetlogs_change#, resetlogs_time, count(*), fuzzy from v$datafile_header group by status,checkpoint_change#,checkpoint_time, resetlogs_change#, resetlogs_time, fuzzy;

STATUS CHECKPOINT_CHANGE# CHECKPOINT_TIME RESETLOGS_CHANGE# RESETLOGS_TIME COUNT(*) FUZ
------- ------------------ ----------------------- ------------------------------ ----------------------- ------------------------------ ---
ONLINE 75238614094 23-FEB-2019 00:15:46 67622732054 02-NOV-2018 22:00:36 45 NO
ONLINE 75238614094 23-FEB-2019 00:15:46 67622732054 02-NOV-2018 22:00:36 8 YES >>>>====有8个文件的状态不一致

fuzzy值为yes表示数据文件在checkpoint以后仍然有些写入的动作,比如有一些比存储在v$datafile_header.checkpoint_change#字段里scn更高的scn事务对数据文件进行了修改操作,所以为了让数据文件保持一致性,则需要前滚应用日志

查看需要哪些归档日志来恢复数据文件

1
2
3
4
5
select min(fhrba_Seq), max(fhrba_Seq) from X$KCVFH;

MIN(FHRBA_SEQ) MAX(FHRBA_SEQ)
------------------------------ ------------------------------
21201 21201

这个查询结果表示所需要的最小和最大日志文件,为了让数据文件保持一致,你必须要应用这之间所有的日志。

查看当前使用的controlfile

1
2
3
4
5
select name, controlfile_type from v$database ;

NAME CONTROL
------------------------------ -------
XXX BACKUP >>>>====使用恢复出来的控制文件

查看当前redo log

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
select     substr(L.GROUP#,1,6)       GROUP#
,substr(L.THREAD#,1,7) THREAD#
,substr(L.SEQUENCE#,1,10) SEQUENCE#
,substr(L.MEMBERS,1,7) MEMBERS
,substr(L.ARCHIVED,1,8) ARCHIVED
,substr(L.STATUS,1,10) STATUS
,substr(L.FIRST_CHANGE#,1,16) FIRST_CHANGE#
,substr(LF.member,1,60) REDO_LOGFILE
from GV$LOG L, GV$LOGFILE LF
where L.GROUP# = LF.GROUP# ;

GROUP# THREAD# SEQUENCE# MEMBERS ARCHIV STATUS FIRST_CHANGE# REDO_LOGFILE
------------ -------------- -------------------- -------------- ------ -------------------- -------------------------------- ----------------------------------
1 1 21201 1 YES ACTIVE 75238614094 /home/oradata/xxx/redo01.log
2 1 21199 1 YES INACTIVE 75238587795 /home/oradata/xxx/redo02.log
11 1 21203 1 NO CURRENT 75238632711 /home/oradata/xxx/redo11.log >>>>====这是当前日志,包含最近的redo
10 1 21202 1 YES ACTIVE 75238624057 /home/oradata/xxx/redo10.log
3 1 21200 1 YES INACTIVE 75238601713 /home/oradata/xxx/redo03.log

通过前面查询的数据文件头的最小checkpoint_change#,用于查询所要从哪个日志开始恢复,最小的日志号一直到current redo中间的所有日志都必须可用,才能保证恢复成功。

1
2
3
4
5
6
7
8
 select thread#, sequence#, substr(name,1,80) from v$Archived_log
where 75238614094 between first_change# and next_change#;


THREAD# SEQUENCE# SUBSTR(NAME,1,80)
------------------------------ ------------------------------ ----------------------------------------------------------------------
1 21200 xxxx
1 21200

根据以上结果表示需要恢复21200到21203之间的日志,我们当前已经到了21201,所以总共需要恢复21201和21202 2个文件。

将备份集中的归档日志恢复出来

1
2
3
4
5
6
7
8
9
10
11
RMAN> RESTORE ARCHIVELOG FROM SEQUENCE 21201 UNTIL SEQUENCE 21202;

channel ORA_DISK_1: starting archived log restore to default destination
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=21201
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=21202
channel ORA_DISK_1: reading from backup piece /bak/backup/arch_u8tqihpn_1_1
channel ORA_DISK_1: piece handle=/bak/backup/arch_u8tqihpn_1_1
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:10:35

这里先处理控制文件里的redo路径问题,原路径是文件系统,新环境是asm,所有对所有的redo日志进行了rename

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
select * from v$logfile;

GROUP# STATUS TYPE MEMBER IS_
---------- ------- ------- ---------------------------------------------------------------------- ---
3 ONLINE +REDO/xxx/redo03.log NO
2 ONLINE +REDO/xxx/redo02.log NO
1 ONLINE +REDO/xxx/redo01.log NO
4 STANDBY +REDO/xxx/stdredo01.log NO
5 STANDBY +REDO/xxx/stdredo02.log NO
6 STANDBY +REDO/xxx/stdredo03.log NO
7 STANDBY +REDO/xxx/stdredo04.log NO
10 ONLINE +REDO/xxx/redo10.log NO
11 ONLINE +REDO/xxx/redo11.log NO

ALTER DATABASE RENAME FILE '/home/oradata/xxx/redo03.log' to '+REDO/xxx/redo03.log';
ALTER DATABASE RENAME FILE '/home/oradata/xxx/redo02.log' to '+REDO/xxx/redo02.log';
ALTER DATABASE RENAME FILE '/home/oradata/xxx/redo01.log' to '+REDO/xxx/redo01.log';

ALTER DATABASE RENAME FILE '/home/oradata/xxx/stdredo01.log' to '+REDO/xxx/stdredo01.log';
ALTER DATABASE RENAME FILE '/home/oradata/xxx/stdredo02.log' to '+REDO/xxx/stdredo02.log';
ALTER DATABASE RENAME FILE '/home/oradata/xxx/stdredo03.log' to '+REDO/xxx/stdredo03.log';
ALTER DATABASE RENAME FILE '/home/oradata/xxx/stdredo04.log' to '+REDO/xxx/stdredo04.log';
ALTER DATABASE RENAME FILE '/home/oradata/xxx/redo10.log' to '+REDO/xxx/redo10.log';
ALTER DATABASE RENAME FILE '/home/oradata/xxx/redo11.log' to '+REDO/xxx/redo11.log';

当使用的是备份控制文件时,可以使用基于CANCEL的恢复

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
22:52:18 sys. xxx>RECOVER DATABASE USING BACKUP CONTROLFILE UNTIL CANCEL ;
ORA-00279: change 75238614094 generated at 02/23/2019 00:15:46 needed for thread 1
ORA-00289: suggestion : +NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21201.664.1001796429
ORA-00280: change 75238614094 for thread 1 is in sequence #21201


22:52:35 Spexxxy log: {>RET>=suggested | filename | AUTO | CANCEL}
+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21201.664.1001796429
ORA-00279: change 75238624057 generated at 02/23/2019 00:19:25 needed for thread 1
ORA-00289: suggestion : +NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21202.663.1001796179
ORA-00280: change 75238624057 for thread 1 is in sequence #21202
ORA-00278: log file '+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21201.664.1001796429' no longer needed for this recovery


22:52:49 Spexxxy log: {>RET>=suggested | filename | AUTO | CANCEL}
+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21202.663.1001796179
ORA-00279: change 75238632711 generated at 02/23/2019 00:22:37 needed for thread 1
ORA-00289: suggestion : +nvmedg
ORA-00280: change 75238632711 for thread 1 is in sequence #21203
ORA-00278: log file '+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21202.663.1001796179' no longer needed for this recovery


22:53:03 Spexxxy log: {>RET>=suggested | filename | AUTO | CANCEL} >>>>====缺少进一步的日志用于恢复
auto
ORA-00308: cannot open archived log '+nvmedg'
ORA-17503: ksfdopn:2 Failed to open file +nvmedg
ORA-15045: ASM file name '+nvmedg' is not in reference form


ORA-00308: cannot open archived log '+nvmedg'
ORA-17503: ksfdopn:2 Failed to open file +nvmedg
ORA-15045: ASM file name '+nvmedg' is not in reference form


ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '+NVMEDG/xxx/datafile/system01.dbf'

当我恢复完21201和21202两个文件以后,发现依然无法打开数据库。

1
2
3
4
5
6
sys. >alter database open resetlogs;
alter database open resetlogs
*
ERROR at line 1:
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '+DG/xxx/datafile/system01.dbf'

那就表示目前数据文件仍然有不一致的地方,最后需要的部分则在current redo当中,因为我这个是测试环境,所以虽然没有current redo,但是我有21203号之后的归档日志,所以直接采用这些归档日志来进行recover,那具体要恢复到哪个scn呢,可以通过下面语句查询出来

1
2
3
4
5
6
7
8
select  min(FHSCN) "LOW FILEHDR SCN"
, max(FHSCN) "MAX FILEHDR SCN"
, max(FHAFS) "Min PITR ABSSCN"
from X$KCVFH ;

LOW FILEHDR SCN MAX FILEHDR SCN Min PITR ABSSCN
---------------- ---------------- ----------------
75238632711 75238632711 75238770594

“LOW FILEHDR SCN” &#8211; 恢复进程开始的scn

“MAX FILEHDR SCN” &#8211; 为了使所有数据文件保持一致必须要恢复到的scn

IF “Min PITR ABSSCN” != 0 AND > “MAX FILEHDR SCN”

THEN “Min PITR ABSSCN” 则是为了使所有数据文件保持一致必须要恢复到的scn

所以根据上述原则必须要恢复到75238770594

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
RMAN> recover database until scn 75238770594;

Starting recover at 2019-03-01 23:16:30
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=2065 device type=DISK

starting media recovery

channel ORA_DISK_1: starting archived log restore to default destination
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=21203
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=21204
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=21205
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=21206
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=21207
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=21208
channel ORA_DISK_1: reading from backup piece /bak/backup/arch_u8tqihpn_1_1
channel ORA_DISK_1: piece handle=/bak/backup/arch_u8tqihpn_1_1
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:13:25
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21203.677.1001805529 thread=1 sequence=21203
channel default: deleting archived log(s)
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21203.677.1001805529 RECID=66515 STAMP=1001805655
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21204.678.1001805657 thread=1 sequence=21204
channel default: deleting archived log(s)
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21204.678.1001805657 RECID=66516 STAMP=1001805790
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21205.679.1001806059 thread=1 sequence=21205
channel default: deleting archived log(s)
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21205.679.1001806059 RECID=66517 STAMP=1001806190
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21206.675.1001805395 thread=1 sequence=21206
channel default: deleting archived log(s)
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21206.675.1001805395 RECID=66513 STAMP=1001805528
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21207.676.1001805395 thread=1 sequence=21207
channel default: deleting archived log(s)
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21207.676.1001805395 RECID=66512 STAMP=1001805524
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21208.674.1001805393 thread=1 sequence=21208
channel default: deleting archived log(s)
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21208.674.1001805393 RECID=66514 STAMP=1001805536
media recovery complete, elapsed time: 00:00:17
Finished recover at 2019-03-01 23:30:16

这次的恢复就正常没有报错,检查数据文件的scn和状态等

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
sys >select (select distinct checkpoint_change# from v$datafile) dfile_ckp,(select checkpoint_change# from v$database) db_ckp,(select distinct checkpoint_change# from v$datafile_header) dh_ckp from dual;

DFILE_CKP DB_CKP DH_CKP
------------------ ------------------ ------------------
75238770595 75238770595 75238770595

set numwidth 30;
set pagesize 50000;
alter session set nls_date_format = 'DD-MON-RRRR HH24:MI:SS';

select status,checkpoint_change#,checkpoint_time, resetlogs_change#, resetlogs_time, count(*), fuzzy from v$datafile_header group by status,checkpoint_change#,checkpoint_time, resetlogs_change#, resetlogs_time, fuzzy;

STATUS CHECKPOINT_CHANGE# CHECKPOINT_TIME RESETLOGS_CHANGE# RESETLOGS_TIME COUNT(*) FUZ
------- ------------------------------ ----------------------- ------------------------------ ----------------------- ------------------------------ ---
ONLINE 75238770595 01-MAR-2019 23:36:12 75238770595 01-MAR-2019 23:36:12 53 NO

尝试打开数据库

1
2
3
4
5
6
7
8
9
10
11
12
13
14
alter database open resetlogs;
alter database open resetlogs
*
ERROR at line 1:
ORA-01139: RESETLOGS option only valid after an incomplete database recovery


alter database open;
alter database open
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 97239
Session ID: 1921 Serial number: 1

重建控制文件

1
2
3
4
5
6
7
8
9
10
11
12
CREATE CONTROLFILE REUSE DATABASE "XXX" NORESETLOGS  ARCHIVELOG
......省略

select name, controlfile_type from v$database ;

NAME CONTROLFILE_TY
------------------------------ --------------
XXX CREATED

alter database open;

Database altered.

至此恢复完毕。

Oracle 11g 清理SYSAUX表空间

收到告警一个11g的库sysaux表空间不足,上去检查看到使用率达到了95%,所以接下来就说下如何处理这种情况。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
SELECT Upper(F.TABLESPACE_NAME)        "表空间名",
D.TOT_GROOTTE_MB "表空间大小(M)",
D.TOT_GROOTTE_MB - F.TOTAL_BYTES "已使用空间(M)",
To_char(Round(( D.TOT_GROOTTE_MB - F.TOTAL_BYTES ) / D.TOT_GROOTTE_MB * 100, 2), '990.99')
|| '%' "使用比",
F.TOTAL_BYTES "空闲空间(M)",
F.MAX_BYTES "最大块(M)"
FROM (SELECT TABLESPACE_NAME,
Round(Sum(BYTES) / ( 1024 * 1024 ), 2) TOTAL_BYTES,
Round(Max(BYTES) / ( 1024 * 1024 ), 2) MAX_BYTES
FROM SYS.DBA_FREE_SPACE
GROUP BY TABLESPACE_NAME) F,
(SELECT DD.TABLESPACE_NAME,
Round(Sum(DD.BYTES) / ( 1024 * 1024 ), 2) TOT_GROOTTE_MB
FROM SYS.DBA_DATA_FILES DD
GROUP BY DD.TABLESPACE_NAME) D
WHERE D.TABLESPACE_NAME = F.TABLESPACE_NAME
and D.tablespace_name = 'SYSAUX';

表空间名 表空间大小(M) 已使用空间(M) 使用比 空闲空间(M) 最大块(M)
SYSAUX 32751 31388.75 95.84% 1362.25 1006

在oracle数据库中SYSAUX被认为是SYSTEM表空间的辅助表空间,作为oracle的默认表空间来支持oracle的各种产品和特性,从而降低对system表空间的压力。

SYSAUX表空间的限制

  • SYSAUX只能由一个数据文件组成
  • 无法对SYSAUX表空间采用类似(PERMANENT, READ WRITE,EXTENT MANAGMENT LOCAL,SEGMENT SPACE MANAGMENT AUTO)的ALTER TABLESPACE操作
  • SYSAUX表空间不能被RENAME或DROP

什么占用了SYSAUX表空间

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
sys@> select occupant_name,occupant_desc,space_usage_kbytes from v$sysaux_occupants;

OCCUPANT_NAME OCCUPANT_DESC SPACE_USAGE_KBYTES
---------------------------------------------------------------- ---------------------------------------------------------------- ------------------
LOGMNR LogMiner 13696
LOGSTDBY Logical Standby 1408
SMON_SCN_TIME Transaction Layer - SCN to TIME mapping 3328
PL/SCOPE PL/SQL Identifier Collection 1600
STREAMS Oracle Streams 1024
AUDIT_TABLES DB audit tables 320
XDB XDB 129984
AO Analytical Workspace Object Table 39104
XSOQHIST OLAP API History Tables 39104
XSAMD OLAP Catalog 5248
SM/AWR Server Manageability - Automatic Workload Repository 30022848
SM/ADVISOR Server Manageability - Advisor Framework 189312
SM/OPTSTAT Server Manageability - Optimizer Statistics History 1424768
SM/OTHER Server Manageability - Other Components 12416
STATSPACK Statspack Repository 0
SDO Oracle Spatial 76032
WM Workspace Manager 3584
ORDIM Oracle Multimedia ORDSYS Components 448
ORDIM/ORDDATA Oracle Multimedia ORDDATA Components 13888
ORDIM/ORDPLUGINS Oracle Multimedia ORDPLUGINS Components 0
ORDIM/SI_INFORMTN_SCHEMA Oracle Multimedia SI_INFORMTN_SCHEMA Components 0
EM Enterprise Manager Repository 47168
TEXT Oracle Text 3712
ULTRASEARCH Oracle Ultra Search 0
ULTRASEARCH_DEMO_USER Oracle Ultra Search Demo User 0
EXPRESSION_FILTER Expression Filter System 3712
EM_MONITORING_USER Enterprise Manager Monitoring User 1408
TSM Oracle Transparent Session Migration User 0
SQL_MANAGEMENT_BASE SQL Management Base Schema 1728
AUTO_TASK Automated Maintenance Tasks 384
JOB_SCHEDULER Unified Job Scheduler 14528

很明显的看到绝大部分是AWR占用了

通过语句可以查看是哪些表占用的表空间最大

1
2
3
4
col segment_name for a50
col partition_name for a30
col segment_type for a30
select * from (select segment_name,PARTITION_NAME,segment_type,bytes/1024/1024 from dba_segments where tablespace_name='SYSAUX' order by 4 desc) where rownum>=10;

查看awr的保存期限

1
2
3
4
5
sys@> select retention from dba_hist_wr_control;

RETENTION
---------------------------------------------------------------------------
+00008 00:00:00.0

也就8天时间,已经比较短了,如果时间太长可以调整时间

1
execute dbms_workload_repository.modify_snapshot_settings (interval => 60,retention => 10080);  >>>>==== 间隔60分钟,保存10080分钟,也就是7*24*60

这里要注意,上面这个语句设置的时间要配合MOVING_WINDOW_SIZE参数

1
2
3
4
5
exec DBMS_WORKLOAD_REPOSITORY.modify_baseline_window_size( window_size =>7);

SELECT moving_window_size
FROM dba_hist_baseline
WHERE baseline_type = 'MOVING_WINDOW';

清除awr历史数据

既然设置的时间是合理的,那就只能清除历史数据了,查看最近和最久的时间

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
sys@TFDW1> SELECT
snap_id, begin_interval_time, end_interval_time
FROM
SYS.WRM$_SNAPSHOT
WHERE
snap_id = ( SELECT MIN (snap_id) FROM SYS.WRM$_SNAPSHOT)
UNION
SELECT
snap_id, begin_interval_time, end_interval_time
FROM
SYS.WRM$_SNAPSHOT
WHERE
snap_id = ( SELECT MAX (snap_id) FROM SYS.WRM$_SNAPSHOT)
/

SNAP_ID BEGIN_INTERVAL_TIME END_INTERVAL_TIME
---------- --------------------------------------------------------------------------- ---------------------------------------------------------------------------
16873 13-2月 -19 10.00.11.981 下午 13-2月 -19 11.00.15.959 下午
16873 13-2月 -19 10.00.12.767 下午 13-2月 -19 11.00.16.763 下午
17083 22-2月 -19 04.00.35.330 下午 22-2月 -19 05.00.41.087 下午
17083 22-2月 -19 04.00.35.988 下午 22-2月 -19 05.00.41.686 下午

清除snap_id介于16873和17083的数据

1
2
3
4
BEGIN
dbms_workload_repository.drop_snapshot_range(low_snap_id => 16873, high_snap_id=>17083);
END;
/

这语句执行了很长时间都没反应,最终放弃。采用另外一个办法

1
2
3
SQL> connect / as sysdba
SQL> @?/rdbms/admin/catnoawr.sql
SQL> @?/rdbms/admin/catawrtb.sql

其实就是把awr相关的对象重建了一遍,如果觉得风险大可以通过truncate的方式

1
2
3
4
5
6
select distinct 'truncate  table  '||segment_name||';',s.bytes/1024/1024
from dba_segments s
where s.segment_name like 'WRH$%'
and segment_type in ('TABLE PARTITION', 'TABLE')
and s.bytes/1024/1024>100
order by s.bytes/1024/1024/1024 desc;

清除logminer数据

还有一种情况是因为logminer数据占用的sysaux表空间,一般是由于执行了一些大的dml操作,导致产生了大量的日志。目前没有一个好的办法去直接收回这些空间,但是可以将这些数据迁移到别的表空间。

1
2
SQL> connect / as sysdba
SQL> EXECUTE DBMS_LOGMNR_D.SET_TABLESPACE ('LOGMNRTS');

重新查看SYSAUX表空间使用情况

1
2
3
4
5
sys@> select * from dba_tablespace_usage_metrics where tablespace_name='SYSAUX';

TABLESPACE_NAME USED_SPACE TABLESPACE_SIZE USED_PERCENT
------------------------------ ---------- --------------- ------------
SYSAUX 267848 4194302 6.385997

ORA-16086: standby database does not contain available standby log files

早上来检查告警邮件,一套10.2.0.5的dg主库一直在报ora-16086错误,根据错误提示大概是备库没有可用的standby log files。在11g以后,这个错误名称换成了ORA-16086 Redo data cannot be written to the standby redo log

1
2
3
4
5
16086, 0000, "standby database does not contain available standby log files"
// *Cause: The primary database is in "no data loss" mode, but the standby
// database does not contain any "standby log files".
// *Action: Add one or more standby log files to the standby database.
// This can be done while the standby database is mounted.

当前主备同步依旧正常,切换日志也能顺利传到备库,备库也能正常应用。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
主:
SQL> archive log list;
Database log mode Archive Mode
Automatic archival Enabled
Archive destination /u01/arch
Oldest online log sequence 384007
Next log sequence to archive 384012
Current log sequence 384012

备:
SQL> archive log list;
Database log mode Archive Mode
Automatic archival Enabled
Archive destination /u01/arch
Oldest online log sequence 383998
Next log sequence to archive 0
Current log sequence 384012

主备库均为最大可用模式

1
2
3
4
5
SQL> select protection_mode,protection_level from v$database;

PROTECTION_MODE PROTECTION_LEVEL
-------------------- --------------------
MAXIMUM AVAILABILITY RESYNCHRONIZATION

而主备库的standby log file均存在,且大小一致,状态正常

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
SQL> select * from v$logfile;

GROUP# STATUS TYPE MEMBER IS_
---------- ------- ------- -------------------------------------------------------------------------------- ---
1 ONLINE /u01/app/oracle/oradata/test/redo01.log NO
2 ONLINE /u01/app/oracle/oradata/test/redo02.log NO
3 ONLINE /u01/app/oracle/oradata/test/redo03.log NO
4 ONLINE /u01/app/oracle/oradata/test/redo04.log NO
5 ONLINE /u01/app/oracle/oradata/test/redo05.log NO
6 ONLINE /u01/app/oracle/oradata/test/redo06.log NO
7 STANDBY /u01/app/oracle/oradata/test/redo07.log NO
8 STANDBY /u01/app/oracle/oradata/test/redo08.log NO
9 STANDBY /u01/app/oracle/oradata/test/redo09.log NO
10 STANDBY /u01/app/oracle/oradata/test/redo10.log NO
11 STANDBY /u01/app/oracle/oradata/test/redo11.log NO
12 STANDBY /u01/app/oracle/oradata/test/redo12.log NO
13 STANDBY /u01/app/oracle/oradata/test/redo13.log NO

MOS上有说可能是因为fra目录满导致的bug,但检查数据库并没有启用fra

1
2
3
4
5
6
7
SQL> show parameter recovery

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
db_recovery_file_dest string
db_recovery_file_dest_size big integer 0
recovery_parallelism integer 0

尝试重建standby log file,主备库均删除重建

1
2
3
4
5
6
7
8
9
10
11
12
13
alter database drop standby logfile group 7;
alter database drop standby logfile group 8;
alter database drop standby logfile group 9;
alter database drop standby logfile group 10;
alter database drop standby logfile group 11;
alter database drop standby logfile group 12;
alter database drop standby logfile group 13;

select 'alter database add standby logfile group '||group#||' '''||member||''' size 1g reuse;' from v$logfile where type='STANDBY';

recover managed standby database cancel;

recover managed standby database using current logfile disconnect from session;

观察了一小时,并未再出现告警。

ORA-04031一例 ("shared pool","…","SQLA","tmp")

今天突然收到告警,一台11.2.0.4的数据库报错ORA-04031。

1
2
SYS@> show sga        
ORA-04031: 无法分配 32 字节的共享内存 ("shared pool","SELECT DECODE(null,'','Total...","SQLA","tmp")

根据错误代码可以判断是共享池使用出了问题

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
[oracle@xxx trace]$ oerr ora 4031
04031, 00000, "unable to allocate %s bytes of shared memory (\"%s\",\"%s\",\"%s\",\"%s\")"
// *Cause: More shared memory is needed than was allocated in the shared
// pool or Streams pool.
// *Action: If the shared pool is out of memory, either use the
// DBMS_SHARED_POOL package to pin large packages,
// reduce your use of shared memory, or increase the amount of
// available shared memory by increasing the value of the
// initialization parameters SHARED_POOL_RESERVED_SIZE and
// SHARED_POOL_SIZE.
// If the large pool is out of memory, increase the initialization
// parameter LARGE_POOL_SIZE.
// If the error is issued from an Oracle Streams or XStream process,
// increase the initialization parameter STREAMS_POOL_SIZE or increase
// the capture or apply parameter MAX_SGA_SIZE.

从smon日志中可以看到大量的sga组件等待扩展的信息

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
Session Wait History:
elapsed time of 0.261076 sec since last wait
0: waited for 'SGA: allocation forcing component growth'
=0x0, =0x0, =0x0
wait_id=2337078 seq_num=43657 snap_id=101
wait times: snap=0.000000 sec, exc=5.069536 sec, total=13.674583 sec
wait times: max=infinite
wait counts: calls=100 os=100
occurred after 0.000000 sec of elapsed time
1: waited for 'SGA: allocation forcing component growth'
=0x0, =0x0, =0x0
wait_id=2337178 seq_num=43656 snap_id=1
wait times: snap=0.003531 sec, exc=0.003531 sec, total=0.003531 sec
wait times: max=infinite
wait counts: calls=2 os=2
occurred after 0.000000 sec of elapsed time
2: waited for 'SGA: allocation forcing component growth'
=0x0, =0x0, =0x0
wait_id=2337078 seq_num=43655 snap_id=100
wait times: snap=0.050112 sec, exc=5.069536 sec, total=13.671052 sec
wait times: max=infinite
wait counts: calls=100 os=100
occurred after 0.000000 sec of elapsed time
...省略

表示sga所分配的内存已经使用完,不足以支撑组件的扩展。这个库使用的sga_target,所以sga的各个组件都能自适应分配内存大小

1
2
3
PARAMETER_NAME                                               TYPE        VALUE
------------------------------------------------------------ ----------- ------------------------------------------------------
sga_target big integer 12G

在sga中有6个子池,每个子池中大部分内存都分配给了”KGH: NO ACCESS”,都在1G左右。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
==============================================
TOP 10 MEMORY USES FOR SGA HEAP SUB POOL 1
----------------------------------------------
"KGH: NO ACCESS " 1118 MB 67%
"free memory " 205 MB 12%
"KGLH0 " 86 MB 5%
"SQLA " 70 MB 4%
"KGLS " 21 MB 1%
"KGLHD " 15 MB 1%
"VIRTUAL CIRCUITS " 15 MB 1%
"db_block_hash_buckets " 13 MB 1%
"kglsim object batch " 10 MB 1%
"private strands " 9443 KB 1%
-----------------------------------------
free memory 205 MB
memory alloc. 1459 MB
Sub total 1664 MB
==============================================
TOP 10 MEMORY USES FOR SGA HEAP SUB POOL 2
----------------------------------------------
"KGH: NO ACCESS " 1277 MB 77%
"free memory " 260 MB 16%
"db_block_hash_buckets " 13 MB 1%
"kglsim object batch " 11 MB 1%
"ASH buffers " 10 MB 1%
"private strands " 9443 KB 1%
"event statistics per sess " 8335 KB 0%
"ksunfy : SSO free list " 7929 KB 0%
"dbktb: trace buffer " 6848 KB 0%
"kglsim heap " 6444 KB 0%
-----------------------------------------
free memory 260 MB
memory alloc. 1404 MB
Sub total 1664 MB
==============================================
TOP 10 MEMORY USES FOR SGA HEAP SUB POOL 3
----------------------------------------------
"KGH: NO ACCESS " 1086 MB 68%
"free memory " 192 MB 12%
"KGLH0 " 94 MB 6%
"SQLA " 39 MB 2%
"FileOpenBlock " 30 MB 2%
"KGLHD " 16 MB 1%
"db_block_hash_buckets " 13 MB 1%
"enqueue " 12 MB 1%
"kglsim object batch " 11 MB 1%
"KGLS " 10 MB 1%
-----------------------------------------
free memory 192 MB
memory alloc. 1408 MB
Sub total 1600 MB
==============================================
TOP 10 MEMORY USES FOR SGA HEAP SUB POOL 4
----------------------------------------------
"KGH: NO ACCESS " 1182 MB 67%
"free memory " 243 MB 14%
"KGLH0 " 81 MB 5%
"SQLA " 48 MB 3%
"KQR M PO " 38 MB 2%
"KGLS " 17 MB 1%
"db_block_hash_buckets " 13 MB 1%
"KGLHD " 12 MB 1%
"kglsim object batch " 10 MB 1%
"ASH buffers " 10 MB 1%
-----------------------------------------
free memory 243 MB
memory alloc. 1517 MB
Sub total 1760 MB
==============================================
TOP 10 MEMORY USES FOR SGA HEAP SUB POOL 5
----------------------------------------------
"KGH: NO ACCESS " 1271 MB 74%
"free memory " 221 MB 13%
"SQLA " 51 MB 3%
"KGLH0 " 45 MB 3%
"db_block_hash_buckets " 13 MB 1%
"ASH buffers " 10 MB 1%
"kglsim object batch " 9622 KB 1%
"private strands " 9443 KB 1%
"event statistics per sess " 8335 KB 0%
"ksunfy : SSO free list " 7942 KB 0%
-----------------------------------------
free memory 221 MB
memory alloc. 1507 MB
Sub total 1728 MB
==============================================
TOP 10 MEMORY USES FOR SGA HEAP SUB POOL 6
----------------------------------------------
"KGH: NO ACCESS " 1182 MB 70%
"free memory " 232 MB 14%
"KGLH0 " 83 MB 5%
"SQLA " 48 MB 3%
"KGLHD " 13 MB 1%
"db_block_hash_buckets " 13 MB 1%
"KGLS " 13 MB 1%
"kglsim object batch " 12 MB 1%
"private strands " 9576 KB 1%
"event statistics per sess " 8335 KB 0%
-----------------------------------------
free memory 232 MB
memory alloc. 1464 MB
Sub total 1696 MB
TOTALS ---------------------------------------
Total free memory 1571 MB
Total memory alloc. 10 GB
Grand total 11 GB
==============================================

在每个子池中,会有4个持续时间部分(durations),分别是”instance”, “session”, “cursor”, 和 “execution”。 主要目的是为了根据不同的事务种类对共享池申请的内存大小和方式的不同,而将会造成大量碎片、或者能重用的任务区分开来,它们之间互相独立、互不干扰。比如cursor所需内存来源于duration2,execution所需内存来源于duration3等等。

12c以前的版本

12c以后的版本

为了能使shared pool里的内存得到充分利用,我们通过隐含参数来去掉durations的限制。

1
2
3
4
       NUM N_HEX NAME                                                   VALUE                          DESCRIPTION
---------- ----- ------------------------------------------------------ ------------------------------ ---------------------------------------------
111 6F _enable_shared_pool_durations TRUE temporary to disable/enable kgh policy

最后解决办法

1
2
SQL> alter system set "_enable_shared_pool_durations"=false scope=spfile;
- restart the database

设置"_enable_shared_pool_durations = false"的主要好处在于所有的durations会合并成一个池,所以就不会出现其中一个duration内存不足而其他duration还有空余的情况。

这个问题将在12c版本以后解决,由于体系的改变导致可以允许子池减少对duration的需求。

MMON进程未自动重启导致awr未生成

问题现象

今天在生成一套rac的awr报告时,发现只有一号节点的数据,这是一套11.2.0.4的rac。

1
2
3
sys@TFDW1> select * from Dba_Hist_Snapshot where instance_number=2;

未选定行

查看mmon进程信息,发现二号节点并没有这个进程

1
2
3
[root@tfdw2 ~]# ps -ef|grep mmon
grid 38361 1 0 Feb24 ? 01:30:51 asm_mmon_+ASM2
root 126466 126429 0 16:04 pts/1 00:00:00 grep mmon

而一号节点是正常的

1
2
3
4
[root@tfdw1 ~]# ps -ef|grep mmon
grid 37050 1 0 Dec10 ? 00:03:04 asm_mmon_+ASM1
oracle 53834 1 0 Dec10 ? 00:12:44 ora_mmon_tfdw1
root 79038 42730 0 16:04 pts/0 00:00:00 grep mmon

要知道mmon进程属于可以被pmon respawned的进程之一,也就是pmon会监控这些进程的运行状况,发现异常则会进行重启,而现在的现象确实进程不存在。

查看alert日志:

1
2
3
Receiver: inst 1 binc 429508292 ospid 52558
IPC Send timeout to 1.4 inc 4 for msg type 65518 from opid 16
IPC Send timeout detected. Sender: ospid 51409 [oracle@tfdw2 (MMON)]

想起来了是之前单独重启过一号实例,二号实例给一号实例发送通信信息时,返回超时(默认300s)。

查看MMON日志

1
2
3
4
5
6
7
8
9
10
11
12
DDE rules only execution for: ORA 12751
----- START Event Driven Actions Dump ----
---- END Event Driven Actions Dump ----
----- START DDE Actions Dump -----
Executing SYNC actions
Executing ASYNC actions
----- START DDE Action: 'ORA_12751_DUMP' (Sync) -----
Runtime exceeded 300 seconds
Time limit violation detected at:
ksedsts()+465>-kspol_12751_dump()+145>-dbgdaExecuteAction()+1065>-dbgerRunAction()+109>-dbgerRunActions()+4134>-dbgexPhaseII()+1873>-dbgexProcessError()+2680>-dbgeExecuteForError()+88>-dbgePostErrorKGE()+2136>-
dbkePostKGE_kgsf()+71>-kgeselv()+276>-ksesecl0()+162>-ksucin()+147>-kjuinq()+2316>-ksiinqi()+245>-ksqinq()+448>-kclriprog()+45>-ktucloRacMasterAction()+479>-ktucloRacAction()+861>-ksb_run_managed_action()+384>-ksbcti()+852>-ksbabs()+1735>-kebm_mmon_main()+209>-k
sbrdp()+1045>-opirip()+623>-opidrv()+603>-sou2o()+103>-opimai_real()+250>-ssthrdmain()+265>-main()+201>-__libc_start_main()+253Current Wait Stack:

看起来就是通信异常,导致在reconfig的时候进程异常被pmon终止了。

现在问题时在通信恢复以后为什么进程没有重新启动。

MOS上可以查到是个bug

Bug 19565533 &#8211; MMON/MMNL CAN NOT BE AUTOMATICALLY RESTARTED AFTER KILLED

The issue only occurs on 11.2.0.4 and is a regression of the fix for Bug 14213856

解决办法

  • 打补丁patch 19565533
  • 绕过办法:
    • 重启实例

    • alter system enable restricted session;

      alter system disable restricted session;

Tips:启用restricted模式后,应用新连接数据库的会话将直接报错ORA-12526无法连接