An Example Analysis of an NBU Backup Error


This article walks through a real example of diagnosing an NBU (NetBackup) backup error, shared here as a reference.

During a routine system check, I found that the daily backup had failed.

The error message was:
 

RMAN> backup incremental level 0 database;
 

Starting backup at 10-MAR-08
using target database controlfile instead of recovery catalog
allocated channel: ORA_SBT_TAPE_1
channel ORA_SBT_TAPE_1: sid=120 devtype=SBT_TAPE
channel ORA_SBT_TAPE_1: VERITAS NetBackup for Oracle – Release 5.0GA (2003103006)
channel ORA_SBT_TAPE_1: starting incremental level 0 datafile backupset
channel ORA_SBT_TAPE_1: specifying datafile(s) in backupset
input datafile fno=00001 name=/dev/vx/rdsk/maindbdg/lv_main00
input datafile fno=00008 name=/opt/oracle/oradata/oradata/bjdb01/users01.dbf
input datafile fno=00039 name=/opt/oracle/oradata/oradata/bjdb01/xdb02.dbf
input datafile fno=00009 name=/opt/oracle/oradata/oradata/bjdb01/xdb01.dbf
input datafile fno=00003 name=/opt/oracle/oradata/oradata/bjdb01/cwmlite01.dbf
input datafile fno=00004 name=/opt/oracle/oradata/oradata/bjdb01/drsys01.dbf
input datafile fno=00006 name=/opt/oracle/oradata/oradata/bjdb01/odm01.dbf
input datafile fno=00007 name=/opt/oracle/oradata/oradata/bjdb01/tools01.dbf
channel ORA_SBT_TAPE_1: starting piece 1 at 10-MAR-08
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on ORA_SBT_TAPE_1 channel at 03/10/2008 11:31:12
ORA-19506: failed to create sequential file, name= tpjatl1b_1_1 , parms=
ORA-27028: skgfqcre: sbtbackup returned error
ORA-19511: Error received from media manager layer, error text:
VxBSACreateObject: Failed with error:
Server Status: unable to allocate new media for backup, storage unit has none available
 

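Judging from this message, the failure appeared to be caused by a lack of space: the storage unit had no media available. Subsequent backups, however, failed with a different error: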
 

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on ch00 channel at 03/10/2008 05:14:15
ORA-19502: write error on file bk_26552_1_648968690 , blockno 664577 (blocksize=512)
ORA-27030: skgfwrt: sbtwrite2 returned error
ORA-19511: Error received from media manager layer, error text:
VxBSASendData: Failed with error:
Server Status: Communication with the server has not been iniatated or the server status has not been retrieved from the server.
 

This error shows that the problem was not just a matter of space.
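When comparing two error stacks like the ones above, it can help to reduce each to its error codes, the media manager call that failed, and the server status text. The following is a minimal sketch of that idea, not part of the original troubleshooting; the function name `summarize_error_stack` and the regular expressions are assumptions made for illustration, and the sample text is copied from the second error stack.

```python
import re

# Sample text copied from the second error stack above (banner lines omitted).
SAMPLE = """\
RMAN-03009: failure of backup command on ch00 channel at 03/10/2008 05:14:15
ORA-19502: write error on file bk_26552_1_648968690 , blockno 664577 (blocksize=512)
ORA-27030: skgfwrt: sbtwrite2 returned error
ORA-19511: Error received from media manager layer, error text:
VxBSASendData: Failed with error:
Server Status: Communication with the server has not been iniatated or the server status has not been retrieved from the server.
"""

CODE_RE = re.compile(r"^(RMAN|ORA)-(\d{5}):", re.MULTILINE)
VXBSA_RE = re.compile(r"^(VxBSA\w+): Failed with error:", re.MULTILINE)
STATUS_RE = re.compile(r"^Server Status:\s*(.+)$", re.MULTILINE)


def summarize_error_stack(text):
    """Return the error codes, the failing VxBSA call, and the server status."""
    codes = [
        f"{prefix}-{num}"
        for prefix, num in CODE_RE.findall(text)
        if num not in ("00571", "00569")  # skip the RMAN banner lines
    ]
    call = VXBSA_RE.search(text)
    status = STATUS_RE.search(text)
    return {
        "codes": codes,
        "media_manager_call": call.group(1) if call else None,
        "server_status": status.group(1).strip() if status else None,
    }


summary = summarize_error_stack(SAMPLE)
print(summary["codes"])
print(summary["media_manager_call"])
```

Run against the first stack instead, the same function would report `VxBSACreateObject` as the failing call, which makes the difference between "could not start writing" and "failed while writing" easy to see at a glance.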
 

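In the jnbSA graphical interface, many administration options responded very slowly when clicked and mostly never returned a result. I therefore switched to bpadm on the command line and found the following under the Problems report: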
 

03/11/2008 01:45:04 backupcenter240 bpexpdate Could not build host list: client hostname could not be found
03/11/2008 02:13:34 backupcenter240 bjdb01 cannot write image to media id 000013, drive index 0, I/O error
03/11/2008 02:13:48 backupcenter240 bjdb01 backup by oracle on client bjdb01 using policy oracle: media write error
03/11/2008 02:14:04 backupcenter240 bjdb01 backup of client bjdb01 exited with status 6 (the backup failed to back up the requested files)
03/11/2008 02:22:58 backupcenter240 bjdb01 cannot write image to media id 000013, drive index 0, I/O error
03/11/2008 02:23:12 backupcenter240 bjdb01 backup by oracle on client bjdb01 using policy oracle: media write error
03/11/2008 02:23:19 backupcenter240 bjdb01 suspending further backup attempts for client bjdb01, policy oracle, schedule Cumulative-Inc because it has exceeded the configured number of tries
03/11/2008 02:23:19 backupcenter240 bjdb01 backup of client bjdb01 exited with status 6 (the backup failed to back up the requested files)
03/11/2008 02:23:20 backupcenter240 – scheduler exiting – the backup failed to back up the requested files (6)
03/11/2008 09:32:42 backupcenter240 data03 cannot write image to media id 000016, drive index 0, I/O error
03/11/2008 09:32:53 backupcenter240 data03 DOWN'ing drive index 0, it has had at least 3 errors in last 12 hour(s)
03/11/2008 09:32:55 backupcenter240 data03 backup by oracle on client data03 using policy bjdb03-ora: media write error
03/11/2008 09:33:02 backupcenter240 data03 backup of client data03 exited with status 6 (the backup failed to back up the requested files)
03/11/2008 10:48:34 backupcenter240 data03 media manager terminated during mount of media id 000016, possible media mount timeout
03/11/2008 10:48:36 backupcenter240 data03 media manager terminated by parent process
03/11/2008 10:48:37 backupcenter240 data03 backup by oracle on client data03 using policy bjdb03-ora: the backup failed to back up the requested files
03/11/2008 10:48:38 backupcenter240 data03 suspending further backup attempts for client data03, policy bjdb03-ora, schedule diff because it has exceeded the configured number of tries
03/11/2008 10:48:38 backupcenter240 data03 backup of client data03 exited with status 6 (the backup failed to back up the requested files)
03/11/2008 13:55:03 backupcenter240 bpexpdate Could not build host list: client hostname could not be found
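A report like this is easier to reason about once the write errors are tallied by media ID and by drive index. The sketch below is one hypothetical way to do that; the function name `tally` and the regular expressions are assumptions, and the sample lines are a subset of the report above.

```python
import re
from collections import Counter

# A subset of the Problems report above.
SAMPLE = """\
03/11/2008 02:13:34 backupcenter240 bjdb01 cannot write image to media id 000013, drive index 0, I/O error
03/11/2008 02:14:04 backupcenter240 bjdb01 backup of client bjdb01 exited with status 6 (the backup failed to back up the requested files)
03/11/2008 02:22:58 backupcenter240 bjdb01 cannot write image to media id 000013, drive index 0, I/O error
03/11/2008 09:32:42 backupcenter240 data03 cannot write image to media id 000016, drive index 0, I/O error
03/11/2008 09:33:02 backupcenter240 data03 backup of client data03 exited with status 6 (the backup failed to back up the requested files)
"""

# \S+ is used for the word after "write" so that a garbled token still matches.
WRITE_ERR_RE = re.compile(r"cannot write \S+ to media id (\w+), drive index (\d+)")
EXIT_RE = re.compile(r"backup of client (\S+) exited with status (\d+)")


def tally(report):
    """Count write errors per media ID and per drive, and exits per client."""
    media_errors = Counter()
    drive_errors = Counter()
    exits = Counter()
    for line in report.splitlines():
        match = WRITE_ERR_RE.search(line)
        if match:
            media_errors[match.group(1)] += 1
            drive_errors[int(match.group(2))] += 1
            continue
        match = EXIT_RE.search(line)
        if match:
            exits[(match.group(1), int(match.group(2)))] += 1
    return media_errors, drive_errors, exits


media_errors, drive_errors, exits = tally(SAMPLE)
print(dict(media_errors))
print(dict(drive_errors))
```

In this sample the write errors are spread over two media IDs but all land on drive index 0, which may hint that the drive or its path, rather than a single tape, deserves attention.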
 

Digging further into the detailed logs revealed a large number of errors:
 

03/11/2008 18:23:59 backupcenter240 – cleaning job DB
03/11/2008 18:23:59 backupcenter240 – all drives are down for the specified robot number = 0, robot type = TLD and density = hcart
03/11/2008 18:23:59 backupcenter240 – no drives up on storage unit backupcenter240-hcart-robot-tld-0
03/11/2008 18:24:00 bjdb01 – all drives are down for the specified robot number = 0, robot type = TLD and density = hcart
03/11/2008 18:24:00 backupcenter240 – no drives up on storage unit bjdb01-hcart-robot-tld-0
03/11/2008 18:24:31 backupcenter240 – all drives are down for the specified robot number = 0, robot type = TLD and density = hcart
03/11/2008 18:24:31 backupcenter240 – no drives up on storage unit unit_99
03/11/2008 18:24:32 backupcenter240 – all drives are down for the specified robot number = 0, robot type = TLD and density = hcart
03/11/2008 18:24:32 backupcenter240 – no drives up on storage unit unit_data
03/11/2008 18:24:32 backupcenter240 data03 skipping backup of client data03, policy bjdb03-ora, schedule diff because it has exceeded the configured number of tries
 

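From this information, the robot appeared to be at fault. If the robot really was the problem, that would also explain why the two backup attempts failed with different messages. When a tape filled up, the robot tried to load a new one and failed at that point, so the backup running at the time could not write and reported that no media was available. Later backups had no tape to write to at all because of the robot failure, so they reported that communication with the NetBackup server had not been initiated.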
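The log lines behind this reasoning can also be summarized mechanically, listing which robot and which storage units were reported as having no usable drives. This is only an illustrative sketch: the function name `drives_down_summary` and the patterns are assumptions, and the sample is adapted from the log excerpt above.

```python
import re

# Adapted from the detailed log excerpt above.
SAMPLE = """\
03/11/2008 18:23:59 backupcenter240 - all drives are down for the specified robot number = 0, robot type = TLD and density = hcart
03/11/2008 18:23:59 backupcenter240 - no drives up on storage unit backupcenter240-hcart-robot-tld-0
03/11/2008 18:24:00 backupcenter240 - no drives up on storage unit bjdb01-hcart-robot-tld-0
03/11/2008 18:24:31 backupcenter240 - no drives up on storage unit unit_99
03/11/2008 18:24:32 backupcenter240 - no drives up on storage unit unit_data
"""

ROBOT_RE = re.compile(
    r"all drives are down for the specified robot number = (\d+), "
    r"robot type = (\w+) and density = (\w+)"
)
UNIT_RE = re.compile(r"no drives up on storage unit (\S+)")


def drives_down_summary(log_text):
    """Collect the robots and storage units reported without usable drives."""
    robots = set()
    units = set()
    for line in log_text.splitlines():
        robot = ROBOT_RE.search(line)
        if robot:
            robots.add((int(robot.group(1)), robot.group(2), robot.group(3)))
        unit = UNIT_RE.search(line)
        if unit:
            units.add(unit.group(1))
    return robots, units


robots, units = drives_down_summary(SAMPLE)
print(sorted(robots))
print(sorted(units))
```

Seeing every storage unit point back to the same robot number and density supports the idea that a single device, not the individual policies or clients, is the common factor.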
 

Continuing with the media reports, the summary showed:
 

Number of ACTIVE media that, as of now:
There are no ACTIVE media present in the media database
 

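This further supported the earlier diagnosis: the robot failure prevented usable tapes from being loaded into the drive, so the system had no available media.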
 

Checking the robot and drive status with tpconfig:
 

Index DriveName DrivePath Type Shared Status
***** ********* ********** **** ****** ******
0 IBMULTRIUM-TD10 /dev/rmt/1cbn hcart Yes DOWN
TLD(0) Definition DRIVE=1
 

Currently defined robotics are:
TLD(0) robotic path = /dev/sg/c2t4l1,
volume database host = backupcenter240
 

The drive attached to robot TLD(0) is in the DOWN state, so the problem seemed to be largely pinned down.
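For routine checks, the drive table can be scanned for any row whose status is not UP. The sketch below assumes the whitespace-separated layout shown above, which may not match the exact column layout of every tpconfig version; the function name `down_drives` is hypothetical.

```python
# Drive table copied from the tpconfig output above.
SAMPLE = """\
Index DriveName DrivePath Type Shared Status
***** ********* ********** **** ****** ******
0 IBMULTRIUM-TD10 /dev/rmt/1cbn hcart Yes DOWN
TLD(0) Definition DRIVE=1
"""


def down_drives(output):
    """Return (index, name, path, status) for every drive that is not UP."""
    drives = []
    for line in output.splitlines():
        fields = line.split()
        # Assumption: a drive row has six fields and starts with a numeric index.
        if len(fields) == 6 and fields[0].isdigit():
            index, name, path, _dtype, _shared, status = fields
            if status.upper() != "UP":
                drives.append((int(index), name, path, status))
    return drives


print(down_drives(SAMPLE))
```

The header and separator rows also have six fields, which is why the numeric check on the first column is needed to skip them.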
 

Next, I tried checking the robot with robtest:
 

bash-2.03# robtest
Configured robots with local control supporting test utilities:
TLD(0) robotic path = /dev/sg/c2t4l1
 

Robot Selection
---------------
1) TLD 0
2) none/quit
Enter choice: 1
 

Robot selected: TLD(0) robotic path = /dev/sg/c2t4l1
 

Invoking robotic test utility:
/usr/openv/volmgr/bin/tldtest -r /dev/sg/c2t4l1 -d1 /dev/rmt/1cbn
 

Opening /dev/sg/c2t4l1
MODE_SENSE complete
Enter tld commands (? returns help information)
?
 

To exit the utility, type q or Q.
 

init – Initialize element status
initrange d#|s#|p#|t [#]- Init element status range
allow – Allow media removal
prevent – Prevent media removal
extend – Extend media access port
retract – Retract media access port
mode – Mode sense
m from to – Move medium
pos to – Position to drive or slot
s [d|p|t|s [n]] [raw] – Read element status
inquiry – Display vendor and product ID
rezero – Rezero unit
inport – Ready inport (media access port)
debug – Toggle debug mode for this utility
test_ready – Send a TEST UNIT READY to the device
 

from to specifies drive (d#), slot (s#), media access port (p#),
or transport (t#)
d#|s#|p#|t# is drive #, slot #, media access port #, or transport #
[#] is number of elements for d, s, p, or t
NOTE – drive # is 1 – Number of drives
slot # is 1 – Number of slots
media access port # is 1 – Number of media access port elements
transport # is 1 – Number of transports
type = (d)rive, (s)lot, media access (p)ort, or (t)ransport
 

unload drive – Issue SCSI unload
drive = d1 or 1, d2 or 2, d3 or 3 … d648 or 648
 

inquiry
Inquiry_data: STK L40 0213
test_ready
Unit is ready
q
 

Robot Selection
---------------
1) TLD 0
2) none/quit
Enter choice:
 

After issuing the test_ready command and waiting a while, tpconfig showed that the status had returned to normal:
 

Index DriveName DrivePath Type Shared Status
***** ********* ********** **** ****** ******
0 IBMULTRIUM-TD10 /dev/rmt/1cbn hcart Yes UP
TLD(0) Definition DRIVE=1
 

Currently defined robotics are:
TLD(0) robotic path = /dev/sg/c2t4l1,
volume database host = backupcenter240
 

Next, I tried a backup:
 

$ rman target /
 

Recovery Manager: Release 9.2.0.4.0 – 64bit Production
 

Copyright (c) 1995, 2002, Oracle Corporation. All rights reserved.
 

connected to target database: BJDB01 (DBID=3255963758)
 

RMAN> backup current controlfile;
 

Starting backup at 11-MAR-08
using target database controlfile instead of recovery catalog
allocated channel: ORA_SBT_TAPE_1
channel ORA_SBT_TAPE_1: sid=19 devtype=SBT_TAPE
channel ORA_SBT_TAPE_1: VERITAS NetBackup for Oracle – Release 5.0GA (2003103006)
channel ORA_SBT_TAPE_1: starting full datafile backupset
channel ORA_SBT_TAPE_1: specifying datafile(s) in backupset
including current controlfile in backupset
channel ORA_SBT_TAPE_1: starting piece 1 at 11-MAR-08
channel ORA_SBT_TAPE_1: finished piece 1 at 11-MAR-08
piece handle=ttjb17ur_1_1 comment=API Version 2.0,MMS Version 5.0.0.0
channel ORA_SBT_TAPE_1: backup set complete, elapsed time: 00:04:56
Finished backup at 11-MAR-08
 

Starting Control File Autobackup at 11-MAR-08
piece handle=c-3255963758-20080311-00 comment=API Version 2.0,MMS Version 5.0.0.0
Finished Control File Autobackup at 11-MAR-08
 

The backup attempt finally succeeded.
 

Unfortunately, while small backups seemed to work, larger backups still failed with the earlier error:
 

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on ch00 channel at 03/10/2008 05:14:15
ORA-19502: write error on file bk_26552_1_648968690 , blockno 664577 (blocksize=512)
ORA-27030: skgfwrt: sbtwrite2 returned error
ORA-19511: Error received from media manager layer, error text:
VxBSASendData: Failed with error:
Server Status: Communication with the server has not been iniatated or the server status has not been retrieved from the server.
 

The background logs also contained a large number of I/O errors:
 

03/12/2008 09:42:51 backupcenter240 bjdb01 cannot write image to media id 000016, drive index 0, I/O error
03/12/2008 09:42:51 backupcenter240 bjdb01 FREEZING media id 000016, it has had at least 3 errors in the last 12 hour(s)
03/12/2008 09:43:08 backupcenter240 bjdb01 CLIENT bjdb01 POLICY oracle SCHED Default-Application-Backup EXIT STATUS 84 (media write error)
03/12/2008 09:43:08 backupcenter240 bjdb01 backup by oracle on client bjdb01: media write error
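The FREEZING and DOWN'ing messages both describe a threshold rule: at least 3 errors within the last 12 hours. The sketch below illustrates such a sliding-window rule; it is not NetBackup's actual implementation. The function name `media_to_freeze` is hypothetical, the threshold and window are taken from the log message, and the timestamps in the example are invented for illustration.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def media_to_freeze(events, threshold=3, window=timedelta(hours=12)):
    """Return media IDs that reach `threshold` errors inside `window`.

    `events` is an iterable of (timestamp, media_id) sorted by timestamp.
    """
    recent = defaultdict(deque)  # media_id -> timestamps still inside the window
    flagged = []
    for timestamp, media_id in events:
        queue = recent[media_id]
        queue.append(timestamp)
        # Drop errors that have aged out of the window.
        while timestamp - queue[0] > window:
            queue.popleft()
        if len(queue) >= threshold and media_id not in flagged:
            flagged.append(media_id)
    return flagged


# Hypothetical timestamps: three errors on 000016 within two hours,
# and a single error on 000013.
events = [
    (datetime(2008, 3, 12, 8, 0, 0), "000016"),
    (datetime(2008, 3, 12, 8, 30, 0), "000013"),
    (datetime(2008, 3, 12, 9, 0, 0), "000016"),
    (datetime(2008, 3, 12, 9, 42, 51), "000016"),
]
print(media_to_freeze(events))
```

A rule like this cannot tell a bad tape from a bad drive, so tapes written through a faulty drive may end up frozen even though the media itself is fine.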
 

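At this point it was clearly no longer just a software problem. The vendor ultimately confirmed that the tape library's read/write head was faulty, and replacing the part resolved the issue.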
 

Copyright notice: original article from this site, published by 丸趣 on 2023-08-25.