分类目录归档:Troubleshooting

EXPDP Streams AQ: enqueue blocked on low memory

有一台11.2.0.4的库最近expdp导出非常慢,总是停留在99%以后还要持续2个多小时,还有另外一台同版本的数据库也碰到了类似的情况。

Job: SYS_EXPORT_SCHEMA_01
  Owner: XXX                    
  Operation: EXPORT                         
  Creator Privs: TRUE                           
  GUID: 8A27A5FA09F142F9E055000000000001
  Start Time: 
  Mode: SCHEMA                         
  Instance: 
  Max Parallelism: 4
  EXPORT Job Parameters:
  Parameter Name      Parameter Value:
     CLIENT_COMMAND        backup_user/******** dumpfile=
     COMPRESSION           ALL                                     
  State: EXECUTING                      
  Bytes Processed: 45,234,549,312
  Percent Done: 99
  Current Parallelism: 4
  Job Error Count: 0

当前导出的对象都是一些空的分区,查询导出任务是否在等待

SQL> select dw.waiting_session,
       dw.holding_session,
       b.serial#,
       w.event,
       w.program          wprogram,
       b.program          bprogram,
       w.module           wmod,
       b.module           bmod,
       LOCK_ID1
  from sys.dba_waiters dw, v$session w, v$session b
 where dw.waiting_session = w.sid
   and dw.holding_session = b.sid
   and (w.module like 'Data Pump%' or w.program like '%EXPDP%' or
       w.program like '%IMPDP%')
 order by dw.holding_session;

 no rows selected


SQL> select sw.SID,
       sw.SEQ#,
       sw.EVENT,
       sw.WAIT_TIME,
       sw.SECONDS_IN_WAIT,
       sw.STATE,
       sw.P1TEXT,
       sw.P1,
       sw.P2TEXT,
       sw.P2,
       sw.P3TEXT,
       sw.P3
  from V$SESSION_WAIT sw, v$session s
 where sw.wait_class <> 'Idle'
   and sw.sid = s.sid
   and (s.module like 'Data Pump%' or s.program like '%EXPDP%' or
       s.program like '%IMPDP%');

      SID    SEQ# EVENT                             WAIT_TIME SECONDS_IN_WAIT STATE           P1TEXT         P1 P2TEXT         P2 P3TEXT         P3
---------- ---------- ---------------------------------------------------------------- ---------- --------------- ------------------- ---------- ---------- ---------- ---------- ---------- ----------
      1801  31734 Streams AQ: enqueue blocked on low memory                 0       1 WAITING                 0         0             0
      2540   3931 Streams AQ: enqueue blocked on low memory                 0       1 WAITING                 0         0             0

等待显示导出队列在等待与streams相关的内存,在sga中就是streams pool,查看当前streams pool的使用情况

select component
,current_size/1024/1024 "Current MB"
,min_size/1024/1024 "Min MB"
from v$sga_dynamic_components;

COMPONENT                            Current MB Min MB
---------------------------------------------------------------- ---------- ----------
shared pool                                3232   1472
large pool                               64     64
java pool                                32     32
streams pool                                0        0
DEFAULT buffer cache                               6624   6432
KEEP buffer cache                             0      0
RECYCLE buffer cache                              0      0
DEFAULT 2K buffer cache                           0      0
DEFAULT 4K buffer cache                           0      0
DEFAULT 8K buffer cache                           0      0
DEFAULT 16K buffer cache                          0      0
DEFAULT 32K buffer cache                          0      0
Shared IO Pool                                0      0
ASM Buffer Cache                              0      0

因为这个库设置了sga_target,相关组件的内存大小分配是由oracle自动来控制的,所以可能碰上了bug。

Workaround:

  1. 重启数据库,通常能解决这个问题
  2. 修改参数,设定streams_pool_size的最小值
CONNECT / as sysdba
ALTER SYSTEM SET streams_pool_size=150m SCOPE=both;
  1. 如果无法动态调整,则需要重启
CONNECT / as sysdba
ALTER SYSTEM SET streams_pool_size=150m SCOPE=spfile;
SHUTDOWN IMMEDIATE
STARTUP
  1. 打补丁Patch 17365043
    或者修改隐含参数
alter system set "_disable_streams_pool_auto_tuning"=TRUE;

做完以后重新执行datapump任务。

ORA-15001: diskgroup “FRA” does not exist or is not mounted

一台数据库在启动时候报错,提示一个asm group不存在

ORA-00210: cannot open the specified control file
ORA-00202: control file: '+FRA/STBRAC/CONTROLFILE/current.294.1005908473'
ORA-17503: ksfdopn:2 Failed to open file +FRA/STBRAC/CONTROLFILE/current.294.1005908473
ORA-15001: diskgroup "FRA" does not exist or is not mounted
ORA-15040: diskgroup is incomplete
ORA-205 signalled during: alter database mount...


WARNING: failed to open a disk[/dev/asmdiskd]
ORA-15025: could not open disk "/dev/asmdiskd"
ORA-27041: unable to open file
Linux-x86_64 Error: 13: Permission denied

检查asm的磁盘等信息均正常

[grid@stbracnode1 ~]$ asmcmd
ASMCMD> lsdg
State    Type    Rebal  Sector  Logical_Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N         512             512   4096  4194304     51200    16784                0           16784              0             Y  CRS/
MOUNTED  EXTERN  N         512             512   4096  4194304     61440    55336                0           55336              0             N  DATA/
MOUNTED  EXTERN  N         512             512   4096  4194304     20480    17404                0           17404              0             N  FRA/

[grid@stbracnode1 ~]$ ll /dev/asm*
brw-rw---- 1 grid asmadmin 8, 16 Apr 18 13:22 /dev/asmdiskb
brw-rw---- 1 grid asmadmin 8, 32 Apr 18 13:22 /dev/asmdiskc
brw-rw---- 1 grid asmadmin 8, 48 Apr 18 13:22 /dev/asmdiskd

查询MOS说是权限出现了问题,ORACLE_HOME目录下的oracle二进制文件的属主应该是asmadmin,而不是oinstall。

[oracle@ ~]$ ll /u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle
-rwsr-s--x 1 oracle oinstall 407944928 Apr 16 15:53 /u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle

查看其它正常的asm节点

[oracle@ dbs]$ ll /u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle
-rwsr-s--x 1 oracle asmadmin 407944928 Apr 16 15:48 /u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle

这台集群刚刚进行了grid和database的补丁升级,应该是由于这个原因导致属主进行了变更

知道问题原因以后,解决办法就简单了,切换到grid用户下

setasmgidwrap o=/u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle

[grid@ ~]$ ll /u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle
-rwsr-s--x 1 oracle asmadmin 407944928 Apr 16 15:56 /u01/app/oracle/product/12.2.0/dbhome_1/bin/oracle

重新启动数据库,问题解决

SQL> startup mount
ORACLE instance started.

Total System Global Area 2516582400 bytes
Fixed Size          8795904 bytes
Variable Size         754977024 bytes
Database Buffers     1744830464 bytes
Redo Buffers            7979008 bytes
Database mounted.

ORA-20200: Database/Instance does not exist in AWR_PDB_DATABASE_INSTANCE

当我在12.2中通过pdb登录生成awr时报错

[oracle@xb admin]$ sqlplus sys/oracle@pdb12c as sysdba

sys@PDB12C> @?/rdbms/admin/awrsqrpt.sql

Specify the Report Type
~~~~~~~~~~~~~~~~~~~~~~~
Would you like an HTML report, or a plain text report?
Enter 'html' for an HTML report, or 'text' for plain text
Defaults to 'html'
Enter value for report_type: 

Type Specified:  html

Specify the location of AWR Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
AWR_ROOT - Use AWR data from root (default)
AWR_PDB - Use AWR data from PDB
Enter value for awr_location: AWR_PDB

Location of AWR Data Specified: AWR_PDB          <<<<====这里选择的AWR_PDB

declare
*
ERROR at line 1:
ORA-20200: Database/Instance 3393322654/1 does not exist in
AWR_PDB_DATABASE_INSTANCE
ORA-06512: at line 27

但是如果选择AWR_ROOT时则正常

Location of AWR Data Specified: AWR_ROOT        <<<<====

...

Current Instance
~~~~~~~~~~~~~~~~
...

   DB Id    DB Name      Inst Num Instance
----------- ------------ -------- ------------
  380590835 ORA12C              1 ora12c

...

Instances in this Workload Repository schema
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  DB Id      Inst Num   DB Name      Instance     Host
------------ ---------- ---------    ----------   ------
* 380590835      1      ORA12C       ora12c       xb.oracle.co

Using  380590835 for database Id
Using          1 for instance number

Enter value for num_days: 

换成CDB登录一样正常,因为AWR Data默认就是CDB。

这是由于参数Awr_pdb_autoflush_enabled默认为FALSE,阻止了AWR相关视图的数据写入,所以接下来就说一下如何开启PDB级别的awr自动快照。

Oracle 12.2开始允许在CDB或PDB级别的AWR快照,CDB级别的快照包括CDB和PDB的统计信息。

AWR报告会列出所有可用容器库的相关信息,下图为CBD级别awr报告中展示的部分sql统计信息,包含PDB和根CDB

自动快照默认是在CDB级别开启,在PDB级别关闭,oracle建议是在PDB级别进行手工收集,或者只对单独有性能问题的PDB开启自动快照。
可以通过将参数Awr_pdb_autoflush_enabled设为TRUE,如果在CDB级别设置这个参数,则会对所有PDB都进行开启自动快照,如果只在PDB级别设置,则只会影响单个PDB。

当设置完参数以后,登录到pdb里面,通过DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS过程将interval设为大于0的值。

sys@ORA12C> show pdbs

    CON_ID CON_NAME                       OPEN MODE  RESTRICTED
---------- ------------------------------ ---------- ----------
         2 PDB$SEED                       READ ONLY  NO
         3 PDB12C                         READ WRITE NO

--cdb级别开启
ALTER SYSTEM SET awr_pdb_autoflush_enabled= TRUE SID='*' SCOPE=BOTH;
ALTER SYSTEM SET awr_snapshot_time_offset=1000000 SID='*' SCOPE=BOTH;

参数 awr_snapshot_time_offset 默认为0. 这里设成1000000主要是为了对不同pdb在创建awr时能有不同的时间偏移量,这样就能避免cpu瞬间增高

将默认值保持为零可能会在资源较低的系统中导致CPU峰值

登录pdb,执行MODIFY_SNAPSHOT_SETTINGS

sys@ORA12C> alter session set container=pdb12c;

Session altered.

sys@ORA12C> select * from awr_pdb_wr_control;

      DBID SNAP_INTERVAL                                                               RETENTION                                                                   TOPNSQL        CON_ID
---------- --------------------------------------------------------------------------- ------------------------------------------------ ---------- ----------
3393322654 +40150 00:01:00.0                                                           +00008 00:00:00.0                                                           DEFAULT             3


BEGIN
 DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS( retention => 11520, 
                                                   interval => 60,
                                                   topnsql => 34,
                                                   dbid => 3393322654);
END;
/


sys@ORA12C> select * from awr_pdb_wr_control;

      DBID SNAP_INTERVAL                                                               RETENTION                                                                   TOPNSQL        CON_ID
---------- --------------------------------------------------------------------------- --------------------------------------------------------------------------- ---------- ----------
3393322654 +00000 01:00:00.0                                                           +00008 00:00:00.0                                                                   34          3

当第一个snapshot生成以后,AWR_PDB_DATABASE_INSTANCE则会包含PDB的数据

sys@ORA12C> show con_name

CON_NAME
------------------------------
PDB12C

sys@ORA12C> select con_id,dbid,INSTANCE_NUMBER,INSTANCE_NAME,CDB,LAST_ASH_SAMPLE_ID from AWR_PDB_DATABASE_INSTANCE;

    CON_ID       DBID INSTANCE_NUMBER INSTANCE_NAME    CDB LAST_ASH_SAMPLE_ID
---------- ---------- --------------- ---------------- --- ------------------
         3 3393322654               1 ora12c           YES             115051

那重新来执行之前报错的awr脚本

sys@ORA12C> @?/rdbms/admin/awrrpt.sql

Specify the Report Type
~~~~~~~~~~~~~~~~~~~~~~~
AWR reports can be generated in the following formats.  Please enter the
name of the format at the prompt.  Default value is 'html'.

'html'          HTML format (default)
'text'          Text format
'active-html'   Includes Performance Hub active report

Enter value for report_type: 

Type Specified:  html

Specify the location of AWR Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
AWR_ROOT - Use AWR data from root (default)
AWR_PDB - Use AWR data from PDB
Enter value for awr_location: AWR_PDB

Location of AWR Data Specified: AWR_PDB             <<<<====选择AWR_PDB并未报错

...

Specify the number of days of snapshots to choose from
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Entering the number of days (n) will result in the most recent
(n) days of snapshots being listed.  Pressing <return> without
specifying a number lists all completed snapshots.


Enter value for num_days: 1

oracle无法启动ORA-27154,ORA-27300,ORA-27301,ORA-27302

发现一台实例突然无法启动,版本11.2.0.4

SQL> startup nomount
ORA-27154: post/wait create failed
ORA-27300: OS system dependent operation:semget failed with status: 28
ORA-27301: OS failure message: No space left on device
ORA-27302: failure occurred at: sskgpcreates

看报错信息大概是sskgpcreates创建失败,资源不足。这个主要是在数据库启动时根据process参数设置的值去分配信号资源,如果内核参数设置的不合理,则会导致报错。

在Linux里,信号量主要是用来控制特定进程的利用率。它是一个可共享的资源,分别由P (wait)和V (signal)函数操作,这两个函数分别递减和递增信号量。当进程需要资源时,发出“wait”,信号量递减。当信号量的值为0时,资源不可用,调用进程会spin或阻塞(视情况而定),直到资源可用为止。当进程释放由信号量控制的资源时,它会增加信号量并通知等待的进程。由下面几个参数控制

信号量 描述 最小值
SEMMSL 每个组的最大信号量 128
SEMMNS 系统级别最大信号数
SEMOPM 每个semop调用的最大操作
SEMMNI 最大组数

计算当前最少需要的信号量总数

sum(本机所有实例的process参数值)+本机系统需求+本机其他程序需求
  • 设置semmns大于这个sum值
  • 设置semmsl为256
  • 设置semmni为semmns/semmsl

查看当前系统参数

$ ipcs -ls

------ Semaphore Limits --------
max number of arrays = 128       // SEMMNI
max semaphores per array = 250      // SEMMSL
max semaphores system wide = 32000      // SEMMNS
max ops per semop call = 100        // SEMOP
semaphore max value = 32767

从这个配置来看理论上系统级别的最大信号数为32000,最大组数为128,每个组最大信号量250。

但是ipcs命令显示每个信号量标识符最多可以容纳156个Oracle信号量

$ ipcs
...
------ Semaphore Arrays --------
key        semid      owner      perms      nsems     
0x2ff4a110 5701651    oracle     640        156       
0x2ff4a111 5734420    oracle     640        156       
0x2ff4a112 5767189    oracle     640        156       
0x2ff4a113 5799958    oracle     640        156       
0x2ff4a114 5832727    oracle     640        156       
0x2ff4a115 5865496    oracle     640        156 
...

那么实际上最大可用的信号总量为156 x 128=19968,所以要增加最大可用组数。

[root@whzdb1 ~]# /sbin/sysctl -a | grep sem
kernel.sem = 250    32000   100 128

-- change to:
kernel.sem = 250    32000   100 200

分析一则enq: HW – contention问题

同事在做压力测试的时候,说数据库很慢,于是登上数据库看了下,发现大量的enq: TX - row lock contentionenq: HW - contention事件。

%This     SECONDS        AAS WAIT_CHAIN
------ ---------- ---------- ------------------------------------------------------------------------------------------------------------------------------------------------
  18%         960        1.6 -> ON CPU
   9%         470         .8 -> enq: TX - row lock contention  -> ON CPU
   8%         460         .8 -> enq: HW - contention  -> ON CPU
   8%         410         .7 -> enq: TX - row lock contention  -> enq: HW - contention  -> ON CPU
   5%         260         .4 -> enq: TX - row lock contention  -> enq: HW - contention  -> enq: HW - contention
   4%         220         .4 -> enq: HW - contention  -> enq: HW - contention
   4%         210         .4 -> enq: HW - contention  -> enq: HW - contention  -> ON CPU
   3%         190         .3 -> enq: TX - row lock contention  -> enq: HW - contention  -> enq: HW - contention  -> ON CPU
   3%         180         .3 -> enq: TX - row lock contention  -> enq: TX - row lock contention  -> enq: HW - contention  -> ON CPU
   3%         160         .3 -> enq: TX - row lock contention  -> enq: TX - row lock contention  -> ON CPU
   2%         120         .2 -> enq: HW - contention  -> enq: HW - contention  -> enq: HW - contention
   2%         110         .2 -> enq: TX - row lock contention  -> enq: HW - contention  -> enq: HW - contention  -> enq: HW - contention  -> ON CPU
   2%         100         .2 -> log file parallel write
   2%         100         .2 -> enq: TX - row lock contention  -> enq: HW - contention  -> enq: HW - contention  -> enq: HW - contention  -> enq: HW - contention
   2%          90         .2 -> enq: TX - row lock contention  -> log file sync  -> log file parallel write
   2%          90         .2 -> enq: HW - contention  -> enq: HW - contention  -> enq: HW - contention  -> enq: HW - contention  -> enq: HW - contention  -> enq: HW - contention
   1%          80         .1 -> enq: TX - row lock contention  -> enq: TX - row lock contention  -> enq: HW - contention  -> enq: HW - contention
   1%          70         .1 -> enq: HW - contention  -> enq: HW - contention  -> enq: HW - contention  -> ON CPU

前者一般情况代表应用程序可能存在不合理的情况,导致大量的行锁争用,而后者表示高水位的移动争用,HW锁通常是用于管理超出高水位的段空间分配。段的高水位线是段中已使用空间和未使用空间的边界,如果出现了这个等待事件也就表示使用的数据达到了高水位线,需要将高水位线提高来满足额外的数据插入,通常是频繁的数据扩展造成的。

通过v$session_wait查出P3值等

select P2TEXT,p2, p3 from v$session_wait where event = 'enq: HW - contention';

P2TEXT                       P2         P3
-------------------- ---------- ----------
table space                   6   21051234

可以看出是表空间的自动扩展导致,然后用于分析是什么对象

select DBMS_UTILITY.DATA_BLOCK_ADDRESS_FILE(21051234) FILE#,
       DBMS_UTILITY.DATA_BLOCK_ADDRESS_BLOCK(21051234) BLOCK#
  from dual;

     FILE#     BLOCK#
---------- ----------
         5      79714


select owner, segment_type, segment_name
  from dba_extents
 where file_id = 5
   and 79714 between block_id and block_id + blocks - 1
   and tablespace_name =
       (select name from ts$ where ts# = 6);

OWNER                          SEGMENT_TYPE       SEGMENT_NAME
------------------------------ ------------------ ---------------------------------------------------------------------------------
XXX                            TABLE              XXX_T

此外,如果这个等待事件是正在发生的,可以直接查看到使用的段

select DBMS_UTILITY.DATA_BLOCK_ADDRESS_FILE(ID2) FILE#,
       DBMS_UTILITY.DATA_BLOCK_ADDRESS_BLOCK(ID2) BLOCK#
  from v$lock
 where type = 'HW';

所以为了解决这个争用,直接调大increment_by的值,或者直接调成非自动扩展模式,即可大幅缓解这个问题。

除了今天碰到的这种情况外,引起enq HW - contention事件还有可能是其他的原因

  • IO性能问题
  • 一个包含繁忙lob段的表需要满足平均分配DML操作到各个分区的分区方式
  • lob段频繁的收缩扩展

针对最后一种情况也有一些常规的解决办法:

  • 手动给lob段增加一些额外的空间,这样lob段就不需要频繁的回收块
ALTER TABLE <lob_table> MODIFY LOB (<column_name>) (allocate extent (size <extent size>));
  • 通过shrink space命令和dbms_redefinition来对空间进行回收
ALTER TABLE test_lob MODIFY LOB (image) (SHRINK SPACE);

ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below

在做一个rman异机恢复完成后,数据库无法打开,报错提示

Oracle Error: 
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '+DG/xxx/datafile/system01.dbf'

用的源库一个rman 0级备份以及后来的部分归档日志,拷贝到这个测试环境,恢复完controlfile之后,进行了基于scn的recover,scn则是restore之后控制文件里显示的scn号

recover database until scn 75238614094;

尝试open时报错

16:54:07 sys. >alter database open resetlogs;
alter database open resetlogs
*
ERROR at line 1:
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '+DG/xxx/datafile/system01.dbf'

但是检查scn均一致,有点奇怪,因为一般来说如果这三个视图的checkpoint_change#一致的话,数据库是应该可以正常打开的。

sys >select (select distinct checkpoint_change# from v$datafile) dfile_ckp,(select checkpoint_change# from v$database) db_ckp,(select distinct checkpoint_change# from v$datafile_header) dh_ckp from dual;

     DFILE_CKP         DB_CKP         DH_CKP
------------------ ------------------ ------------------
       75238614094    75238614094        75238614094

检测数据文件的状态

set numwidth 30;
set pagesize 50000;
alter session set nls_date_format = 'DD-MON-RRRR HH24:MI:SS';

sys.>select status,checkpoint_change#,checkpoint_time, resetlogs_change#, resetlogs_time, count(*), fuzzy from v$datafile_header group by status,checkpoint_change#,checkpoint_time, resetlogs_change#, resetlogs_time, fuzzy;

STATUS  CHECKPOINT_CHANGE# CHECKPOINT_TIME          RESETLOGS_CHANGE# RESETLOGS_TIME                COUNT(*) FUZ
------- ------------------ ----------------------- ------------------------------ ----------------------- ------------------------------ ---
ONLINE         75238614094 23-FEB-2019 00:15:46               67622732054 02-NOV-2018 22:00:36                    45 NO
ONLINE         75238614094 23-FEB-2019 00:15:46               67622732054 02-NOV-2018 22:00:36                     8 YES       <<<<====有8个文件的状态不一致

fuzzy值为yes表示数据文件在checkpoint以后仍然有些写入的动作,比如有一些比存储在v$datafile_header.checkpoint_change#字段里scn更高的scn事务对数据文件进行了修改操作,所以为了让数据文件保持一致性,则需要前滚应用日志

查看需要哪些归档日志来恢复数据文件

select min(fhrba_Seq), max(fhrba_Seq) from X$KCVFH;

        MIN(FHRBA_SEQ)             MAX(FHRBA_SEQ)
------------------------------ ------------------------------
             21201              21201

这个查询结果表示所需要的最小和最大日志文件,为了让数据文件保持一致,你必须要应用这之间所有的日志。

查看当前使用的controlfile

select name, controlfile_type from v$database ;

NAME                   CONTROL
------------------------------ -------
XXX                BACKUP               <<<<====使用恢复出来的控制文件

查看当前redo log

select     substr(L.GROUP#,1,6)       GROUP#
        ,substr(L.THREAD#,1,7)         THREAD#
        ,substr(L.SEQUENCE#,1,10)      SEQUENCE#
        ,substr(L.MEMBERS,1,7)         MEMBERS
        ,substr(L.ARCHIVED,1,8)        ARCHIVED
        ,substr(L.STATUS,1,10)         STATUS
        ,substr(L.FIRST_CHANGE#,1,16)  FIRST_CHANGE#
        ,substr(LF.member,1,60)        REDO_LOGFILE
  from GV$LOG L, GV$LOGFILE LF
 where L.GROUP# = LF.GROUP# ;

GROUP#       THREAD#        SEQUENCE#        MEMBERS    ARCHIV STATUS           FIRST_CHANGE#            REDO_LOGFILE
------------ -------------- -------------------- -------------- ------ -------------------- -------------------------------- ----------------------------------
1        1          21201        1      YES    ACTIVE           75238614094              /home/oradata/xxx/redo01.log                
2        1          21199        1      YES    INACTIVE         75238587795              /home/oradata/xxx/redo02.log
11       1          21203        1      NO     CURRENT          75238632711              /home/oradata/xxx/redo11.log               <<<<====这是当前日志,包含最近的redo
10       1          21202        1      YES    ACTIVE           75238624057              /home/oradata/xxx/redo10.log
3        1          21200        1      YES    INACTIVE         75238601713              /home/oradata/xxx/redo03.log

通过前面查询的数据文件头的最小checkpoint_change#,用于查询所要从哪个日志开始恢复,最小的日志号一直到current redo中间的所有日志都必须可用,才能保证恢复成功。

 select thread#, sequence#, substr(name,1,80) from v$Archived_log
  where 75238614094 between first_change# and next_change#;


               THREAD#              SEQUENCE# SUBSTR(NAME,1,80)
------------------------------ ------------------------------ ----------------------------------------------------------------------
                 1              21200 xxxx
                 1              21200

根据以上结果表示需要恢复21200到21203之间的日志,我们当前已经到了21201,所以总共需要恢复21201和21202 2个文件。

将备份集中的归档日志恢复出来

RMAN> RESTORE ARCHIVELOG FROM SEQUENCE 21201 UNTIL SEQUENCE 21202;

channel ORA_DISK_1: starting archived log restore to default destination
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=21201
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=21202
channel ORA_DISK_1: reading from backup piece /bak/backup/arch_u8tqihpn_1_1
channel ORA_DISK_1: piece handle=/bak/backup/arch_u8tqihpn_1_1 
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:10:35

这里先处理控制文件里的redo路径问题,原路径是文件系统,新环境是asm,所有对所有的redo日志进行了rename

select * from v$logfile;

    GROUP# STATUS  TYPE    MEMBER                                 IS_
---------- ------- ------- ---------------------------------------------------------------------- ---
     3     ONLINE  +REDO/xxx/redo03.log                           NO
     2     ONLINE  +REDO/xxx/redo02.log                           NO
     1     ONLINE  +REDO/xxx/redo01.log                           NO
     4     STANDBY +REDO/xxx/stdredo01.log                        NO
     5     STANDBY +REDO/xxx/stdredo02.log                        NO
     6     STANDBY +REDO/xxx/stdredo03.log                        NO
     7     STANDBY +REDO/xxx/stdredo04.log                        NO
    10     ONLINE  +REDO/xxx/redo10.log                           NO
    11     ONLINE  +REDO/xxx/redo11.log                           NO

ALTER DATABASE RENAME FILE '/home/oradata/xxx/redo03.log' to '+REDO/xxx/redo03.log';
ALTER DATABASE RENAME FILE '/home/oradata/xxx/redo02.log' to '+REDO/xxx/redo02.log';
ALTER DATABASE RENAME FILE '/home/oradata/xxx/redo01.log' to '+REDO/xxx/redo01.log';

ALTER DATABASE RENAME FILE '/home/oradata/xxx/stdredo01.log' to '+REDO/xxx/stdredo01.log';
ALTER DATABASE RENAME FILE '/home/oradata/xxx/stdredo02.log' to '+REDO/xxx/stdredo02.log';
ALTER DATABASE RENAME FILE '/home/oradata/xxx/stdredo03.log' to '+REDO/xxx/stdredo03.log';
ALTER DATABASE RENAME FILE '/home/oradata/xxx/stdredo04.log' to '+REDO/xxx/stdredo04.log';
ALTER DATABASE RENAME FILE '/home/oradata/xxx/redo10.log' to '+REDO/xxx/redo10.log';
ALTER DATABASE RENAME FILE '/home/oradata/xxx/redo11.log' to '+REDO/xxx/redo11.log';

当使用的是备份控制文件时,可以使用基于CANCEL的恢复

22:52:18 sys. xxx>RECOVER DATABASE USING BACKUP CONTROLFILE UNTIL CANCEL ;
ORA-00279: change 75238614094 generated at 02/23/2019 00:15:46 needed for thread 1
ORA-00289: suggestion : +NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21201.664.1001796429
ORA-00280: change 75238614094 for thread 1 is in sequence #21201


22:52:35 Spexxxy log: {<RET>=suggested | filename | AUTO | CANCEL}
+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21201.664.1001796429
ORA-00279: change 75238624057 generated at 02/23/2019 00:19:25 needed for thread 1
ORA-00289: suggestion : +NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21202.663.1001796179
ORA-00280: change 75238624057 for thread 1 is in sequence #21202
ORA-00278: log file '+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21201.664.1001796429' no longer needed for this recovery


22:52:49 Spexxxy log: {<RET>=suggested | filename | AUTO | CANCEL}
+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21202.663.1001796179
ORA-00279: change 75238632711 generated at 02/23/2019 00:22:37 needed for thread 1
ORA-00289: suggestion : +nvmedg
ORA-00280: change 75238632711 for thread 1 is in sequence #21203
ORA-00278: log file '+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21202.663.1001796179' no longer needed for this recovery


22:53:03 Spexxxy log: {<RET>=suggested | filename | AUTO | CANCEL}        <<<<====缺少进一步的日志用于恢复
auto
ORA-00308: cannot open archived log '+nvmedg'
ORA-17503: ksfdopn:2 Failed to open file +nvmedg
ORA-15045: ASM file name '+nvmedg' is not in reference form


ORA-00308: cannot open archived log '+nvmedg'
ORA-17503: ksfdopn:2 Failed to open file +nvmedg
ORA-15045: ASM file name '+nvmedg' is not in reference form


ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '+NVMEDG/xxx/datafile/system01.dbf'

当我恢复完21201和21202两个文件以后,发现依然无法打开数据库。

sys. >alter database open resetlogs;
alter database open resetlogs
*
ERROR at line 1:
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '+DG/xxx/datafile/system01.dbf'

那就表示目前数据文件仍然有不一致的地方,最后需要的部分则在current redo当中,因为我这个是测试环境,所以虽然没有current redo,但是我有21203号之后的归档日志,所以直接采用这些归档日志来进行recover,那具体要恢复到哪个scn呢,可以通过下面语句查询出来

select  min(FHSCN) "LOW FILEHDR SCN"
           , max(FHSCN) "MAX FILEHDR SCN"
           , max(FHAFS) "Min PITR ABSSCN"
        from X$KCVFH ;

LOW FILEHDR SCN  MAX FILEHDR SCN  Min PITR ABSSCN
---------------- ---------------- ----------------
75238632711  75238632711      75238770594

“LOW FILEHDR SCN” – 恢复进程开始的scn
“MAX FILEHDR SCN” – 为了使所有数据文件保持一致必须要恢复到的scn
IF “Min PITR ABSSCN” != 0 AND > “MAX FILEHDR SCN”
THEN “Min PITR ABSSCN” 则是为了使所有数据文件保持一致必须要恢复到的scn

所以根据上述原则必须要恢复到75238770594

RMAN> recover database until scn 75238770594;

Starting recover at 2019-03-01 23:16:30
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=2065 device type=DISK

starting media recovery

channel ORA_DISK_1: starting archived log restore to default destination
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=21203
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=21204
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=21205
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=21206
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=21207
channel ORA_DISK_1: restoring archived log
archived log thread=1 sequence=21208
channel ORA_DISK_1: reading from backup piece /bak/backup/arch_u8tqihpn_1_1
channel ORA_DISK_1: piece handle=/bak/backup/arch_u8tqihpn_1_1
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:13:25
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21203.677.1001805529 thread=1 sequence=21203
channel default: deleting archived log(s)
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21203.677.1001805529 RECID=66515 STAMP=1001805655
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21204.678.1001805657 thread=1 sequence=21204
channel default: deleting archived log(s)
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21204.678.1001805657 RECID=66516 STAMP=1001805790
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21205.679.1001806059 thread=1 sequence=21205
channel default: deleting archived log(s)
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21205.679.1001806059 RECID=66517 STAMP=1001806190
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21206.675.1001805395 thread=1 sequence=21206
channel default: deleting archived log(s)
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21206.675.1001805395 RECID=66513 STAMP=1001805528
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21207.676.1001805395 thread=1 sequence=21207
channel default: deleting archived log(s)
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21207.676.1001805395 RECID=66512 STAMP=1001805524
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21208.674.1001805393 thread=1 sequence=21208
channel default: deleting archived log(s)
archived log file name=+NVMEDG/xxx/archivelog/2019_03_01/thread_1_seq_21208.674.1001805393 RECID=66514 STAMP=1001805536
media recovery complete, elapsed time: 00:00:17
Finished recover at 2019-03-01 23:30:16

这次的恢复就正常没有报错,检查数据文件的scn和状态等

sys >select (select distinct checkpoint_change# from v$datafile) dfile_ckp,(select checkpoint_change# from v$database) db_ckp,(select distinct checkpoint_change# from v$datafile_header) dh_ckp from dual;

     DFILE_CKP         DB_CKP         DH_CKP
------------------ ------------------ ------------------
       75238770595    75238770595        75238770595

set numwidth 30;
set pagesize 50000;
alter session set nls_date_format = 'DD-MON-RRRR HH24:MI:SS';

select status,checkpoint_change#,checkpoint_time, resetlogs_change#, resetlogs_time, count(*), fuzzy from v$datafile_header group by status,checkpoint_change#,checkpoint_time, resetlogs_change#, resetlogs_time, fuzzy;

STATUS          CHECKPOINT_CHANGE# CHECKPOINT_TIME              RESETLOGS_CHANGE# RESETLOGS_TIME                    COUNT(*) FUZ
------- ------------------------------ ----------------------- ------------------------------ ----------------------- ------------------------------ ---
ONLINE             75238770595 01-MAR-2019 23:36:12           75238770595 01-MAR-2019 23:36:12                53 NO

尝试打开数据库

alter database open resetlogs;
alter database open resetlogs
*
ERROR at line 1:
ORA-01139: RESETLOGS option only valid after an incomplete database recovery


alter database open;
alter database open
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 97239
Session ID: 1921 Serial number: 1

重建控制文件

CREATE CONTROLFILE REUSE DATABASE "XXX" NORESETLOGS  ARCHIVELOG
......省略

select name, controlfile_type from v$database ;

NAME                   CONTROLFILE_TY
------------------------------ --------------
XXX                CREATED

alter database open;

Database altered.

至此恢复完毕。