Understanding MySQL group commit and two-phase commit (2PC)


This article explains MySQL group commit and two-phase commit (2PC) in detail.

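Group commit is an optimization MySQL applies to log writes, mainly to reduce how often the log has to be flushed to disk. It has been refined as MySQL evolved: at first only the redo log supported group commit, while the official 5.6 release supports it for both the redo log and the binlog. Group commit greatly improves MySQL's transaction throughput. The sections below use the InnoDB storage engine as the example and describe how group commit works at each stage.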

Redo log group commit

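WAL (Write-Ahead Logging) is a common technique for making transactions durable. To avoid random writes of data pages at commit time, only the transaction's redo log has to reach disk. Sequential redo log writes replace random page writes, durability is still guaranteed, and the database performs better. Even so, every commit still needs one log flush, and because of disk I/O limits that flush remains the bottleneck for concurrent transactions.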

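The idea of group commit is to merge the redo log flushes of several transactions so that fewer sequential disk writes are needed. In InnoDB's log system every redo log record has an LSN (Log Sequence Number), and LSNs increase monotonically. Each update a transaction performs produces one or more redo log records. When a transaction copies its records into log_sys_buffer (which is protected by log_mutex), it obtains the current maximum LSN, so the LSNs of different transactions never overlap. Suppose three transactions Trx1, Trx2 and Trx3 have maximum LSNs LSN1, LSN2 and LSN3 (LSN1 < LSN2 < LSN3) and commit at the same time. If Trx3 acquires log_mutex first and flushes, it flushes the whole range [LSN1, LSN3] along the way, so Trx1 and Trx2 need no further disk I/O. The basic flow of group commit is: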

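1. Acquire log_mutex.

2. If flushed_to_disk_lsn >= lsn, the log has already been flushed; go to step 5.

3. If current_flush_lsn >= lsn, the log is being flushed right now; go to step 5 and then wait for that flush to finish.

4. Flush the log up to lsn (flush and sync).

5. Release log_mutex.

Note: lsn is the transaction's LSN; flushed_to_disk_lsn and current_flush_lsn are the LSN already flushed to disk and the LSN currently being flushed.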
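The LSN checks in the flow above can be sketched with a small single-threaded model. This is only an illustration under simplifying assumptions: the class RedoLog and its methods are hypothetical names rather than InnoDB code, the mutex is omitted, and step 3 (waiting on a flush that is already in progress) cannot occur because nothing runs concurrently.

```python
class RedoLog:
    """Toy model of the LSN checks (hypothetical names, not InnoDB code)."""

    def __init__(self):
        self.lsn = 0                  # highest LSN copied into the log buffer
        self.flushed_to_disk_lsn = 0  # everything up to this LSN is durable
        self.fsync_count = 0          # number of simulated disk syncs

    def append(self, nbytes):
        """Copy a transaction's redo into the buffer and return its max LSN."""
        self.lsn += nbytes
        return self.lsn

    def flush_up_to(self, lsn):
        """Make the log durable up to lsn; return True if a sync was issued."""
        if self.flushed_to_disk_lsn >= lsn:
            return False              # step 2: an earlier flush already covered us
        self.flushed_to_disk_lsn = lsn  # step 4: one sync covers the whole range
        self.fsync_count += 1
        return True


log = RedoLog()
lsn1, lsn2, lsn3 = log.append(100), log.append(50), log.append(70)
# Trx3 wins the mutex first, so Trx1 and Trx2 find their LSNs already flushed.
results = [log.flush_up_to(lsn3), log.flush_up_to(lsn1), log.flush_up_to(lsn2)]
print(results, log.fsync_count)
```

In this run only the first call issues a sync; the other two transactions return immediately, which is the saving that group commit is after.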

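Optimizing redo log group commit

With the binlog enabled, the prepare phase flushes the redo log once (innodb_flush_log_at_trx_commit=1), so that the redo records covering the changes to data pages and undo pages are on disk. The commit phase flushes the binlog (sync_binlog=1) and switches the transaction's undo log from the prepared state to the committed (purgeable) state. Two-phase commit (innodb_support_xa=1) keeps the order of transactions in the binlog and in the redo log consistent. In the two-phase protocol mysql_binlog is the coordinator, and the storage engines together with mysql_binlog are the participants. During crash recovery the server scans the last binlog file (when the binlog is rotated, the transactions in the old file are guaranteed to be committed) and extracts the xids it contains. InnoDB replays the redo log after the checkpoint, reads the undo segment information, and collects the list of transactions in the prepared state. Each such transaction's xid is compared with the xids from the binlog: if it is present the transaction is committed, otherwise it is rolled back.

As described above, every transaction commit triggers a redo flush, and because disk I/O is slow this hurts throughput badly. Engineers at Taobao made an optimization: they moved the redo flush from the prepare phase to just before the flush stage of commit (flush-sync-commit). The redo log is still guaranteed to be flushed before the binlog, so the existing crash recovery logic is not violated. The benefit of moving the flush into the commit phase is that not every transaction has to flush; the leader thread flushes a batch of redo on behalf of the group. The implementation is simple: log_sys->lsn always holds the current maximum LSN, so flushing redo up to the current log_sys->lsn guarantees that the redo of every transaction about to flush its binlog is already on disk. Delaying the redo write in this way achieves redo log group commit and also reduces contention on log_sys->mutex. Official MySQL adopted this strategy in 5.7.6.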

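Two-phase commit

On a single machine, redo log group commit solves the log flushing problem well. Once the binlog is enabled, can the binlog use group commit too, like the redo log? The first problem to solve with the binlog enabled is keeping the binlog and the redo log consistent. The binlog is the bridge between master and slave, so if the two logs disagree on order, master and slave may diverge. MySQL solves this with two-phase commit. In the prepare phase InnoDB flushes the redo log and sets the rollback segment to the Prepared state, while the binlog does nothing. In the commit phase the binlog is flushed, and InnoDB releases locks, releases the rollback segment and marks the transaction committed. If a failure occurs and recovery finds a transaction in the prepared state, the transaction is committed when its binlog entry exists and rolled back otherwise. Two-phase commit therefore keeps the redo log and the binlog consistent in every case.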
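To see why the rule keeps the two logs in agreement, one can inject a crash after each step of the protocol and apply the recovery rule. The sketch below is a toy model: the step names, the state strings and the functions recover and crash_after are all hypothetical and chosen only for illustration.

```python
def recover(engine_state, xid, binlog_xids):
    """Decide a transaction's fate after a crash, following the rule above."""
    if engine_state == "committed":
        return "committed"
    if engine_state == "prepared" and xid in binlog_xids:
        return "committed"    # the binlog has it, so finish the commit
    return "rolled back"      # still active, or prepared without a binlog entry


def crash_after(step, xid=42):
    """Run the 2PC steps in order, crash right after `step`, then recover."""
    engine_state, binlog_xids = "active", set()
    steps = ["start", "engine_prepare", "binlog_write", "engine_commit"]
    for name in steps[: steps.index(step) + 1]:
        if name == "engine_prepare":
            engine_state = "prepared"   # redo flushed, undo marked prepared
        elif name == "binlog_write":
            binlog_xids.add(xid)        # binlog flushed together with the xid
        elif name == "engine_commit":
            engine_state = "committed"
    return recover(engine_state, xid, binlog_xids), xid in binlog_xids


outcomes = {step: crash_after(step)
            for step in ["start", "engine_prepare", "binlog_write", "engine_commit"]}
for step, (fate, in_binlog) in outcomes.items():
    print(step, fate, in_binlog)
```

At every crash point the transaction ends up committed exactly when its xid reached the binlog, so a slave replaying the binlog sees the same set of transactions as the recovered master.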

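Binlog group commit

Back to the question from the previous section: with the binlog enabled, how can group commit be implemented while keeping the redo log and the binlog consistent? Because of this problem, MySQL before 5.6 could not do group commit when the binlog was enabled. It used the notorious prepare_commit_mutex to serialize the flushing of the redo log and the binlog. The only purpose of the serialization was to keep the two logs consistent, but it sacrificed performance. That was clearly unacceptable, so the MySQL branches (MariaDB, Facebook, Percona and others) released patches one after another, and official MySQL finally fixed the problem in 5.6. Since the branches solve it in similar ways, the description below follows the 5.6 implementation.

The basic idea of binlog group commit is to introduce queues that keep the InnoDB commit order identical to the binlog flush order, and to group transactions so that one transaction performs the binlog flush for the whole group. Binlog commit is split into three stages: FLUSH, SYNC and COMMIT. Each stage has a queue protected by a mutex. By convention the first thread to enter a queue is the leader and the others are followers. The leader does all the work and, once it is done, notifies the followers that the flush has finished. The basic flow of binlog group commit is: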

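FLUSH stage

1) Take the Lock_log mutex [the leader holds it, followers wait]
2) Fetch a group of binlogs from the queue (all transactions in the queue)
3) Write the binlog buffers to the I/O cache
4) Notify the dump thread to dump the binlog

SYNC stage

1) Release the Lock_log mutex and take the Lock_sync mutex [the leader holds it, followers wait]
2) Sync the group of binlogs to disk (the sync is the most expensive step; this assumes sync_binlog is 1)

COMMIT stage

1) Release the Lock_sync mutex and take the Lock_commit mutex [the leader holds it, followers wait]
2) Walk the transactions in the queue and perform the InnoDB commit for each one in turn
3) Release the Lock_commit mutex
4) Wake up the threads waiting in the queue

Note: there are several queues, each protected by its own mutex, and the queues are processed in order. Because the first thread to enter a queue becomes its leader, the leader of the FLUSH stage may be a follower in the SYNC stage, but a follower always remains a follower.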
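The leader and follower convention and the effect of grouping on the number of syncs can be sketched with a single-threaded model. This is a simplified illustration, not the server's implementation: StageQueue and group_commit are hypothetical names, only one queue is modeled instead of three, and each input batch stands for the transactions that enrolled before the leader fetched the queue.

```python
from collections import deque


class StageQueue:
    """A stage queue: the first thread to enroll becomes the leader."""

    def __init__(self):
        self.items = deque()

    def enroll(self, trx):
        is_leader = not self.items    # an empty queue means this trx leads
        self.items.append(trx)
        return is_leader

    def fetch_all(self):
        group, self.items = list(self.items), deque()
        return group


def group_commit(arrival_batches):
    """Run FLUSH, SYNC and COMMIT once per batch of enrolled transactions."""
    flush_queue = StageQueue()
    binlog, engine_commits, fsyncs, leaders = [], [], 0, []
    for batch in arrival_batches:
        for trx in batch:
            if flush_queue.enroll(trx):
                leaders.append(trx)
        group = flush_queue.fetch_all()  # FLUSH: the leader takes the whole queue
        binlog.extend(group)             # FLUSH: write the group to the I/O cache
        fsyncs += 1                      # SYNC: one sync for the whole group
        engine_commits.extend(group)     # COMMIT: engine commit in queue order
    return binlog, engine_commits, fsyncs, leaders


binlog, commits, fsyncs, leaders = group_commit(
    [["t1", "t2", "t3"], ["t4"], ["t5", "t6"]])
print(binlog == commits, fsyncs, leaders)
```

Six transactions need three syncs here instead of six, and the engine commit order equals the binlog order because both follow the queue order.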

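The analysis above shows that MySQL's current group commit addresses both consistency and performance: two-phase commit provides consistency, and group commit of the redo log and the binlog reduces the disk I/O cost. The original article provides a framework diagram of the prepare and commit phases for reference.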


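MySQL's transaction commit logic lives mainly in the function ha_commit_trans. Committing a transaction involves both the binlog and the commit of the specific storage engine, so MySQL uses 2PC to keep the transaction intact across them. The 2PC process is as follows: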

T@4 : | | | | >trans_commit
T@4 : | | | | | enter: stmt.ha_list: , all.ha_list:
T@4 : | | | | | debug: stmt.unsafe_rollback_flags:
T@4 : | | | | | debug: all.unsafe_rollback_flags:
T@4 : | | | | | >trans_check
T@4 : | | | | | <trans_check 49
T@4 : | | | | | info: clearing SERVER_STATUS_IN_TRANS
T@4 : | | | | | >ha_commit_trans
T@4 : | | | | | | info: all=1 thd->in_sub_stmt=0 ha_info=0x0 is_real_trans=1
T@4 : | | | | | | >MYSQL_BIN_LOG::commit
T@4 : | | | | | | | enter: thd: 0x2b9f4c07beb0, all: yes, xid: 0, cache_mngr: 0x0
T@4 : | | | | | | | >ha_commit_low
T@4 : | | | | | | | | >THD::st_transaction::cleanup
T@4 : | | | | | | | | | >free_root
T@4 : | | | | | | | | | | enter: root: 0x2b9f4c07d660 flags: 1
T@4 : | | | | | | | | | <free_root 396
T@4 : | | | | | | | | <THD::st_transaction::cleanup 2521
T@4 : | | | | | | | <ha_commit_low 1535
T@4 : | | | | | | <MYSQL_BIN_LOG::commit 6383
T@4 : | | | | | | >THD::st_transaction::cleanup
T@4 : | | | | | | | >free_root
T@4 : | | | | | | | | enter: root: 0x2b9f4c07d660 flags: 1
T@4 : | | | | | | | <free_root 396
T@4 : | | | | | | <THD::st_transaction::cleanup 2521
T@4 : | | | | | <ha_commit_trans 1458
T@4 : | | | | | debug: reset_unsafe_rollback_flags
T@4 : | | | | <trans_commit 233
T@4 : | | | | >MDL_context::release_transactional_locks
T@4 : | | | | | >MDL_context::release_locks_stored_before
T@4 : | | | | | <MDL_context::release_locks_stored_before 2771
T@4 : | | | | | >MDL_context::release_locks_stored_before
T@4 : | | | | | <MDL_context::release_locks_stored_before 2771
T@4 : | | | | <MDL_context::release_transactional_locks 2926
T@4 : | | | | >set_ok_status
T@4 : | | | | <set_ok_status 446
T@4 : | | | | THD::enter_stage: /usr/src/mysql-5.6.28/sql/sql_parse.cc:4996
T@4 : | | | | >PROFILING::status_change
T@4 : | | | | <PROFILING::status_change 354
T@4 : | | | | >trans_commit_stmt
T@4 : | | | | | enter: stmt.ha_list: , all.ha_list:
T@4 : | | | | | enter: stmt.ha_list: , all.ha_list:
T@4 : | | | | | debug: stmt.unsafe_rollback_flags:
T@4 : | | | | | debug: all.unsafe_rollback_flags:
T@4 : | | | | | debug: add_unsafe_rollback_flags: 0
T@4 : | | | | | >MYSQL_BIN_LOG::commit

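(1) First the prepare methods of binlog_hton and innobase_hton are called to complete phase one. The prepare method of binlog_hton actually does nothing. InnoDB's prepare sets the transaction state to TRX_PREPARED and flushes the redo log to disk (innobase_xa_prepare -> trx_prepare_for_mysql -> trx_prepare_off_kernel).

(2) If prepare succeeds in every storage engine involved in the transaction, TC_LOG_BINLOG::log_xid is called to write the SQL statements to the binlog. From this point on the transaction is certain to commit. Otherwise ha_rollback_trans is called to roll back the transaction, and the SQL statements are never written to the binlog.

(3) Finally the engines' commit methods are called to complete the commit. binlog_hton->commit actually does nothing (because step (2) has already written the binlog to disk), while innobase_hton->commit clears the undo information, flushes the redo log and sets the transaction to TRX_NOT_STARTED (innobase_commit -> innobase_commit_low -> trx_commit_for_mysql -> trx_commit_off_kernel).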

//ha_innodb.cc
static
int
innobase_commit(
/*============*/
            /* out: 0 */
    THD*    thd,    /* in: MySQL thread handle of the user for whom
            the transaction should be committed */
    bool    all)    /* in: TRUE - commit transaction
            FALSE - the current SQL statement ended */
{
    …
    trx->mysql_log_file_name = mysql_bin_log.get_log_fname();
    trx->mysql_log_offset =
        (ib_longlong)mysql_bin_log.get_log_file()->pos_in_file;
    …
}

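The function innobase_commit commits the transaction: it first obtains the current binlog position and then writes it to the transaction system page (trx_commit_off_kernel -> trx_sys_update_mysql_binlog_offset).

InnoDB records the MySQL binlog position in the trx system header: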

//trx0sys.h
/* The offset of the MySQL binlog offset info in the trx system header */
#define TRX_SYS_MYSQL_LOG_INFO          (UNIV_PAGE_SIZE - 1000)
#define TRX_SYS_MYSQL_LOG_MAGIC_N_FLD   0   /* magic number which shows
                        if we have valid data in the
                        MySQL binlog info; the value
                        is …_MAGIC_N if yes */
#define TRX_SYS_MYSQL_LOG_OFFSET_HIGH   4   /* high 4 bytes of the offset
                        within that file */
#define TRX_SYS_MYSQL_LOG_OFFSET_LOW    8   /* low 4 bytes of the offset
                        within that file */
#define TRX_SYS_MYSQL_LOG_NAME          12  /* MySQL log file name */

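5.3.2 Transaction recovery flow

During recovery InnoDB handles transactions differently depending on their state (see the function trx_rollback_or_clean_all_without_sess):

1. TRX_COMMITTED_IN_MEMORY: clear the rollback segment, then set the transaction to TRX_NOT_STARTED;
2. TRX_NOT_STARTED: the transaction has already committed, so skip it;
3. TRX_PREPARED: the binlog decides the transaction's fate, so skip it for now;
4. TRX_ACTIVE: roll back.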

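When MySQL opens the binlog it checks the binlog's state (TC_LOG_BINLOG::open). If the binlog was not closed cleanly (LOG_EVENT_BINLOG_IN_USE_F is 1), recovery runs, roughly as follows:

1. Scan the binlog and read the XID_EVENT events to obtain the list of all committed XA transactions (inside InnoDB these transactions may be either prepared or committed);
2. For each XA transaction, call handlerton::recover to check whether the storage engine holds that transaction in the prepared state (see innobase_xa_recover), that is, check the state of the XA transaction inside the engine;
3. If the engine holds the XA transaction in the prepared state and its xid is in the binlog list, call handlerton::commit_by_xid to commit it;
4. Otherwise (the transaction is prepared in the engine but its xid is missing from the binlog), call handlerton::rollback_by_xid to roll it back.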
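The two recovery passes can be put together in a short sketch. It is a toy model under stated assumptions: the state strings reuse the InnoDB names from the text, but the functions innodb_recover and binlog_recover, the decision labels and the event tuples are hypothetical and do not correspond to server APIs.

```python
def innodb_recover(trx_states):
    """First pass inside the engine: decide by transaction state alone."""
    decisions = {}
    for xid, state in trx_states.items():
        if state == "TRX_COMMITTED_IN_MEMORY":
            decisions[xid] = "cleanup"    # free the rollback segment
        elif state == "TRX_NOT_STARTED":
            decisions[xid] = "skip"       # already committed
        elif state == "TRX_PREPARED":
            decisions[xid] = "pending"    # the binlog decides later
        elif state == "TRX_ACTIVE":
            decisions[xid] = "rollback"
        else:
            raise ValueError(f"unknown state {state!r}")
    return decisions


def binlog_recover(binlog_events, decisions):
    """Second pass: resolve prepared transactions against the XID events."""
    committed_xids = {xid for kind, xid in binlog_events if kind == "XID_EVENT"}
    resolved = dict(decisions)
    for xid, decision in decisions.items():
        if decision == "pending":
            resolved[xid] = "commit" if xid in committed_xids else "rollback"
    return resolved


states = {1: "TRX_PREPARED", 2: "TRX_PREPARED", 3: "TRX_ACTIVE", 4: "TRX_NOT_STARTED"}
events = [("QUERY_EVENT", None), ("XID_EVENT", 1)]
resolved = binlog_recover(events, innodb_recover(states))
print(resolved)
```

Transaction 1 is prepared and present in the binlog, so it is committed; transaction 2 is prepared but absent from the binlog, so it is rolled back.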

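5.3.3 Discussion of several parameters

(1) sync_binlog

When committing a transaction MySQL calls MYSQL_LOG::write to write the binlog, and sync_binlog decides whether to sync it to disk. In the versions discussed here the default is 0, meaning no sync, which leaves control to the OS (MySQL 5.7.7 changed the default to 1). With a value of 1, every transaction commit triggers a sync. This costs performance (5.6 already supports binlog group commit), so many people set it to 100.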

bool MYSQL_LOG::flush_and_sync()
{
  int err=0, fd=log_file.file;
  safe_mutex_assert_owner(&LOCK_log);
  if (flush_io_cache(&log_file))
    return 1;
  /* sync only every sync_binlog_period-th call; 0 disables syncing */
  if (++sync_binlog_counter >= sync_binlog_period && sync_binlog_period)
  {
    sync_binlog_counter= 0;
    err=my_sync(fd, MYF(MY_WME));
  }
  return err;
}
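The counter logic in flush_and_sync can be replayed in a few lines to see how many syncs a given setting produces. This is a sketch of the condition only; count_syncs is a hypothetical helper, and it assumes one call to flush_and_sync per commit.

```python
def count_syncs(commits, sync_binlog_period):
    """Count how many syncs happen over `commits` calls to flush_and_sync."""
    counter = syncs = 0
    for _ in range(commits):
        counter += 1
        # Mirrors: ++sync_binlog_counter >= sync_binlog_period && sync_binlog_period
        if sync_binlog_period and counter >= sync_binlog_period:
            counter = 0
            syncs += 1
    return syncs


for period in (0, 1, 100):
    print(period, count_syncs(1000, period))
```

With sync_binlog=100 a thousand commits cost ten syncs, which is why that value is popular, at the price of possibly losing up to the last 100 binlog writes if the OS crashes.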

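(2) innodb_flush_log_at_trx_commit

This parameter controls how InnoDB flushes the redo log when a transaction commits. The default is 1, meaning the log is flushed to disk at every commit. To reduce the performance impact, many production systems set it to 2, or even 0.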

static
void
trx_flush_log_if_needed_low(
/*========================*/
    lsn_t   lsn)    /*!< in: lsn up to which logs are to be
            flushed. */
{
    switch (srv_flush_log_at_trx_commit) {
    case 0:
        /* Do nothing */
        break;
    case 1:
        /* Write the log and optionally flush it to disk */
        log_write_up_to(lsn, LOG_WAIT_ONE_GROUP,
                srv_unix_file_flush_method != SRV_UNIX_NOSYNC);
        break;
    case 2:
        /* Write the log but do not flush it to disk */
        log_write_up_to(lsn, LOG_WAIT_ONE_GROUP, FALSE);
        break;
    default:
        ut_error;
    }
}

If the value of innodb_flush_log_at_trx_commit is 0, the log buffer is written out to the log file once per second and the flush to disk operation is performed on the log file, but nothing is done at a transaction commit. When the value is 1 (the default), the log buffer is written out to the log file at each transaction commit and the flush to disk operation is performed on the log file. When the value is 2, the log buffer is written out to the file at each commit, but the flush to disk operation is not performed on it. However, the flushing on the log file takes place once per second also when the value is 2. Note that the once-per-second flushing is not 100% guaranteed to happen every second, due to process scheduling issues.

The default value of 1 is required for full ACID compliance. You can achieve better performance by setting the value different from 1, but then you can lose up to one second worth of transactions in a crash. With a value of 0, any mysqld process crash can erase the last second of transactions. With a value of 2, only an operating system crash or a power outage can erase the last second of transactions.
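The three settings can be summarized as a small decision table. The sketch below is a simplification under stated assumptions: commit_actions and survives are hypothetical helpers, the once-per-second background flush is ignored, and the flush method is assumed not to be SRV_UNIX_NOSYNC.

```python
def commit_actions(flush_log_at_trx_commit):
    """Return (written_to_os, synced_to_disk) at commit for each setting."""
    table = {0: (False, False), 1: (True, True), 2: (True, False)}
    return table[flush_log_at_trx_commit]


def survives(flush_log_at_trx_commit, crash):
    """Is the redo of a just-committed transaction safe after `crash`?

    crash is "mysqld" (the process dies but the OS keeps its file cache)
    or "os" (OS crash or power loss, only synced data remains).
    """
    written, synced = commit_actions(flush_log_at_trx_commit)
    return written if crash == "mysqld" else synced


for value in (0, 1, 2):
    print(value, survives(value, "mysqld"), survives(value, "os"))
```

The table matches the description above: 0 can lose commits on any mysqld crash, 2 only on an OS crash or power outage, and 1 on neither.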

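(3) innodb_support_xa

Controls whether InnoDB supports 2PC for XA transactions. The default is TRUE. If it is turned off, InnoDB does nothing in the prepare phase. The binlog order may then differ from the InnoDB commit order (for example, transaction A is written to the binlog before transaction B, but inside InnoDB A commits after B), which can produce different data during recovery or on a slave.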

int
innobase_xa_prepare(
/*================*/
            /* out: 0 or error number */
    THD*    thd,    /* in: handle to the MySQL thread of the user
            whose XA transaction should be prepared */
    bool    all)    /* in: TRUE - commit transaction
            FALSE - the current SQL statement ended */
{
    …
    if (!thd->variables.innodb_support_xa) {

        return(0);
    }
    …
}

MySQL 5.7 version:

bool trans_xa_commit(THD *thd)
{
  bool res= TRUE;
  enum xa_states xa_state= thd->transaction.xid_state.xa_state;
  DBUG_ENTER("trans_xa_commit");

  if (!thd->transaction.xid_state.xid.eq(thd->lex->xid))
  {
    /*
      xid_state.in_thd is always true beside of xa recovery procedure.
      Note, that there is no race condition here between xid_cache_search
      and xid_cache_delete, since we always delete our own XID
      (thd->lex->xid == thd->transaction.xid_state.xid).
      The only case when thd->lex->xid != thd->transaction.xid_state.xid
      and xid_state->in_thd == 0 is in the function
      xa_cache_insert(XID, xa_states), which is called before starting
      client connections, and thus is always single-threaded.
    */
    XID_STATE *xs= xid_cache_search(thd->lex->xid);
    res= !xs || xs->in_thd;
    if (res)
      my_error(ER_XAER_NOTA, MYF(0));
    else
    {
      res= xa_trans_rolled_back(xs);
      ha_commit_or_rollback_by_xid(thd, thd->lex->xid, !res);
      xid_cache_delete(xs);
    }
    DBUG_RETURN(res);
  }

  if (xa_trans_rolled_back(&thd->transaction.xid_state))
  {
    xa_trans_force_rollback(thd);
    res= thd->is_error();
  }
  else if (xa_state == XA_IDLE && thd->lex->xa_opt == XA_ONE_PHASE)
  {
    int r= ha_commit_trans(thd, TRUE);
    if ((res= MY_TEST(r)))
      my_error(r == 1 ? ER_XA_RBROLLBACK : ER_XAER_RMERR, MYF(0));
  }
  else if (xa_state == XA_PREPARED && thd->lex->xa_opt == XA_NONE)
  {
    MDL_request mdl_request;

    /*
      Acquire metadata lock which will ensure that COMMIT is blocked
      by active FLUSH TABLES WITH READ LOCK (and vice versa COMMIT in
      progress blocks FTWRL).
      We allow FLUSHer to COMMIT; we assume FLUSHer knows what it does.
    */
    mdl_request.init(MDL_key::COMMIT, "", "", MDL_INTENTION_EXCLUSIVE,
                     MDL_TRANSACTION);

    if (thd->mdl_context.acquire_lock(&mdl_request,
                                      thd->variables.lock_wait_timeout))
    {
      ha_rollback_trans(thd, TRUE);
      my_error(ER_XAER_RMERR, MYF(0));
    }
    else
    {
      DEBUG_SYNC(thd, "trans_xa_commit_after_acquire_commit_lock");

      if (tc_log)
        res= MY_TEST(tc_log->commit(thd, /* all */ true));
      else
        res= MY_TEST(ha_commit_low(thd, /* all */ true));

      if (res)
        my_error(ER_XAER_RMERR, MYF(0));
    }
  }
  else
  {
    my_error(ER_XAER_RMFAIL, MYF(0), xa_state_names[xa_state]);
    DBUG_RETURN(TRUE);
  }

  thd->variables.option_bits&= ~OPTION_BEGIN;
  thd->transaction.all.reset_unsafe_rollback_flags();
  thd->server_status&=
    ~(SERVER_STATUS_IN_TRANS | SERVER_STATUS_IN_TRANS_READONLY);
  DBUG_PRINT("info", ("clearing SERVER_STATUS_IN_TRANS"));
  xid_cache_delete(&thd->transaction.xid_state);
  thd->transaction.xid_state.xa_state= XA_NOTR;

  DBUG_RETURN(res);
}

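5.3.4 Safety versus performance

Different values of the three parameters above have different effects. Data is truly safe only when all three are set to 1 (TRUE). If sync_binlog is not 1, binlog entries may be lost when the OS crashes, leaving the binlog inconsistent with the data in InnoDB. If innodb_flush_log_at_trx_commit is not 1, committed data may be lost in InnoDB (on an OS crash, and with a value of 0 even on a mysqld crash), leaving InnoDB inconsistent with the binlog.

For performance analysis, see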

http://www.mysqlperformanceblog.com/2011/03/02/what-is-innodb_support_xa/

http://www.mysqlperformanceblog.com/2009/01/21/beware-ext3-and-sync-binlog-do-not-play-well-together/

When a transaction commits, InnoDB calls innobase_commit in ha_innodb.cc. innobase_commit calls trx_commit_complete_for_mysql (trx0trx.c), which in turn calls log_write_up_to (log0log.c). In other words, whenever InnoDB commits a transaction it calls log_write_up_to to write the redo log.

In innobase_commit:

    if (all    /* this is a transaction commit */
        || (!thd_test_options(thd, OPTION_NOT_AUTOCOMMIT | OPTION_BEGIN))) {

        /* The code below limits how many threads may commit at the
        same time (innobase_commit_concurrency) */
    retry:
        if (innobase_commit_concurrency > 0) {
            pthread_mutex_lock(&commit_cond_m);
            commit_threads++;
            if (commit_threads > innobase_commit_concurrency) {
                commit_threads--;
                pthread_cond_wait(&commit_cond, &commit_cond_m);
                pthread_mutex_unlock(&commit_cond_m);
                goto retry;
            } else {
                pthread_mutex_unlock(&commit_cond_m);
            }
        }

        /* Do not write + flush the redo log during the commit itself */
        trx->flush_log_later = TRUE;
        innobase_commit_low(trx);
        trx->flush_log_later = FALSE;

Skipping the innobase_commit_low call for now, trx_commit_complete_for_mysql is called next to write the log:

    trx_commit_complete_for_mysql(trx);    /* start flushing the log */
    trx->active_trans = 0;
trx_commit_complete_for_mysql mainly checks the system variable srv_flush_log_at_trx_commit and calls log_write_up_to accordingly, either only writing the redo log file or writing it and flushing it to disk:

    if (!trx->must_flush_log_later) {
        /* Do nothing */
    } else if (srv_flush_log_at_trx_commit == 0) {
        /* flush_log_at_trx_commit=0: commit does not write the redo log */
        /* Do nothing */
    } else if (srv_flush_log_at_trx_commit == 1) {
        /* flush_log_at_trx_commit=1: commit writes the log and flushes it
        to disk, unless the flush method is SRV_UNIX_NOSYNC */
        if (srv_unix_file_flush_method == SRV_UNIX_NOSYNC) {
            /* Write the log but do not flush it to disk */
            log_write_up_to(lsn, LOG_WAIT_ONE_GROUP, FALSE);
        } else {
            /* Write the log to the log files AND flush them to
            disk */
            log_write_up_to(lsn, LOG_WAIT_ONE_GROUP, TRUE);
        }
    } else if (srv_flush_log_at_trx_commit == 2) {
        /* with 2, only write to the redo log file */
        /* Write the log but do not flush it to disk */
        log_write_up_to(lsn, LOG_WAIT_ONE_GROUP, FALSE);
    } else {
        ut_error;
    }
Now look at log_write_up_to:

    /* When flushing to disk: return at once if the log is already
    flushed up to this commit's lsn */
    if (flush_to_disk
        && ut_dulint_cmp(log_sys->flushed_to_disk_lsn, lsn) >= 0) {
        mutex_exit(&(log_sys->mutex));
        return;
    }

    /* When not flushing to disk: return at once if this commit's lsn is
    already written to all redo log files, or to some redo log file when
    we only wait for one group */
    if (!flush_to_disk
        && (ut_dulint_cmp(log_sys->written_to_all_lsn, lsn) >= 0
            || (ut_dulint_cmp(log_sys->written_to_some_lsn, lsn) >= 0
                && wait != LOG_WAIT_ALL_GROUPS))) {
        mutex_exit(&(log_sys->mutex));
        return;
    }

    /* Check whether a log write is running; if so, wait for it */
    if (log_sys->n_pending_writes > 0) {
        /* A flush is needed and the lsn being flushed covers this
        commit's lsn: just wait for that operation to complete */
        if (flush_to_disk
            && ut_dulint_cmp(log_sys->current_flush_lsn, lsn) >= 0) {
            goto do_waits;
        }
        /* Only a write is needed and the lsn being written covers this
        commit's lsn: waiting is enough as well */
        if (!flush_to_disk
            && ut_dulint_cmp(log_sys->write_lsn, lsn) >= 0) {
            goto do_waits;
        }
    ......

    /* No I/O is running and no flush is needed: if the next write
    position has reached buf_free, every write has completed, so return */
    if (!flush_to_disk
        && log_sys->buf_free == log_sys->buf_next_to_write) {
        mutex_exit(&(log_sys->mutex));
        return;
    }
Next the group is fetched, the write and flush fields are set, and the start and end block positions are computed:

    log_sys->n_pending_writes++;

    group = UT_LIST_GET_FIRST(log_sys->log_groups);
    group->n_pending_writes++;    /* We assume here that we have only
                    one log group! */

    os_event_reset(log_sys->no_flush_event);
    os_event_reset(log_sys->one_flushed_event);

    start_offset = log_sys->buf_next_to_write;
    end_offset = log_sys->buf_free;

    area_start = ut_calc_align_down(start_offset, OS_FILE_LOG_BLOCK_SIZE);
    area_end = ut_calc_align(end_offset, OS_FILE_LOG_BLOCK_SIZE);

    ut_ad(area_end - area_start > 0);

    log_sys->write_lsn = log_sys->lsn;

    if (flush_to_disk) {
        log_sys->current_flush_lsn = log_sys->lsn;
    }
log_block_set_checkpoint_no sets LOG_BLOCK_CHECKPOINT_NO of the block containing end_offset to the next checkpoint number in log_sys:

    /* mark the first block as the start of this flush write segment */
    log_block_set_flush_bit(log_sys->buf + area_start, TRUE);
    log_block_set_checkpoint_no(
        log_sys->buf + area_end - OS_FILE_LOG_BLOCK_SIZE,
        log_sys->next_checkpoint_no);

The last block, which contains end_offset and may be only partly filled, is then copied into the next free block:

    ut_memcpy(log_sys->buf + area_end,
          log_sys->buf + area_end - OS_FILE_LOG_BLOCK_SIZE,
          OS_FILE_LOG_BLOCK_SIZE);
For each group, log_group_write_buf is called to write the redo log buffer:

    while (group) {
        log_group_write_buf(
            group, log_sys->buf + area_start,
            area_end - area_start,
            ut_dulint_align_down(log_sys->written_to_all_lsn,
                         OS_FILE_LOG_BLOCK_SIZE),
            start_offset - area_start);

        /* compute the lsn and offset of this write and store them in
        group->lsn and group->lsn_offset */
        log_group_set_fields(group, log_sys->write_lsn);

        group = UT_LIST_GET_NEXT(log_groups, group);
    }
    ......
    if (srv_unix_file_flush_method == SRV_UNIX_O_DSYNC) {
        /* O_DSYNC means the OS did not buffer the log file at all:
        so we have also flushed to disk what we have written */
        log_sys->flushed_to_disk_lsn = log_sys->write_lsn;
    } else if (flush_to_disk) {
        group = UT_LIST_GET_FIRST(log_sys->log_groups);
        /* finally fil_flush performs the flush to disk */
        fil_flush(group->space_id);
        log_sys->flushed_to_disk_lsn = log_sys->write_lsn;
    }
Next, what log_group_write_buf does.

The offset calculation takes the LSN position last recorded in the group (note that the log files together form one circular buffer) and computes how far this LSN is from it:

    /* log_group_calc_size_offset converts group->lsn_offset into a size
    with the LOG_FILE headers removed; for example, if lsn_offset falls in
    the third log file, 3 * LOG_FILE_HDR_SIZE has to be subtracted */
    gr_lsn_size_offset = (ib_longlong)
        log_group_calc_size_offset(group->lsn_offset, group);

    /* size of the group's data area, without any LOG_FILE_HDR_SIZE */
    group_size = (ib_longlong) log_group_get_capacity(group);

    /* the usual distance calculation on a ring */
    if (ut_dulint_cmp(lsn, gr_lsn) >= 0) {
        difference = (ib_longlong) ut_dulint_minus(lsn, gr_lsn);
    } else {
        difference = (ib_longlong) ut_dulint_minus(gr_lsn, lsn);
        difference = difference % group_size;
        difference = group_size - difference;
    }

    offset = (gr_lsn_size_offset + difference) % group_size;

    /* finally add the log file headers back and return the real offset */
    return(log_group_calc_real_offset((ulint)offset, group));
Back in log_group_write_buf, the length of the write is determined:

    /* if the data to write extends past the end of the current file */
    if ((next_offset % group->file_size) + len > group->file_size) {
        write_len = group->file_size    /* write up to the end of the file */
            - (next_offset % group->file_size);
    } else {
        write_len = len;    /* otherwise write all of len */
    }

    /* Then the buffer itself is written; when the write starts a new
    file, the log file header has to be written as well */
    if ((next_offset % group->file_size == LOG_FILE_HDR_SIZE)
        && write_header) {
        /* We start to write a new log file instance in the group */
        log_group_file_header_flush(group,
                        next_offset / group->file_size,
                        start_lsn);
        srv_os_log_written+= OS_FILE_LOG_BLOCK_SIZE;
        srv_log_writes++;
    }

    /* call fil_io to perform the buffer write */
    if (log_do_write) {
        log_sys->n_log_ios++;
        srv_os_log_pending_writes++;
        fil_io(OS_FILE_WRITE | OS_FILE_LOG, TRUE, group->space_id,
               next_offset / UNIV_PAGE_SIZE,
               next_offset % UNIV_PAGE_SIZE, write_len, buf, group);
        srv_os_log_pending_writes--;
        srv_os_log_written+= write_len;
        srv_log_writes++;
    }
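The circular offset arithmetic used when locating an LSN inside the log group is easy to check with plain integers. The sketch below mirrors the branch structure of the excerpt, but calc_lsn_offset is a hypothetical stand-alone function: it works on the data area only and leaves out the file header adjustment done by log_group_calc_size_offset and log_group_calc_real_offset.

```python
def calc_lsn_offset(lsn, gr_lsn, gr_lsn_size_offset, group_size):
    """Offset of `lsn` inside the circular data area of a log group.

    (gr_lsn, gr_lsn_size_offset) is a known LSN and its offset in the data
    area; group_size is the capacity of the data area in bytes.
    """
    if lsn >= gr_lsn:
        difference = lsn - gr_lsn
    else:
        # lsn lies behind the reference point: walk backwards on the ring.
        difference = gr_lsn - lsn
        difference = difference % group_size
        difference = group_size - difference
    return (gr_lsn_size_offset + difference) % group_size


# Reference point: LSN 5000 is stored at offset 200 of a 1000-byte ring.
for lsn in (5300, 5900, 4900, 3000):
    print(lsn, calc_lsn_offset(lsn, 5000, 200, 1000))
```

LSN 5900 wraps around the end of the ring to offset 100, and LSN 4900, which lies 100 bytes before the reference point, also maps to offset 100, one full lap earlier.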

Now consider the following sequence, however (copied from the worklog…):

Trx1 ------------P----------C--------------------------------
                            |
Trx2 ----------------P------+---C----------------------------
                            |   |
Trx3 -------------------P---+---+-----C----------------------
                            |   |     |
Trx4 -----------------------+-P-+-----+----C-----------------
                            |   |     |    |
Trx5 -----------------------+---+-P---+----+---C-------------
                            |   |     |    |   |
Trx6 -----------------------+---+---P-+----+---+---C----------
                            |   |     |    |   |   |
Trx7 -----------------------+---+-----+----+---+-P-+--C-------
                            |   |     |    |   |   |  |

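Under the earlier logic, trx5 and trx6 can run in parallel because they carry the same sequence number. Trx4 cannot run in parallel with trx5 because their sequence numbers differ, and for the same reason trx6 and trx7 cannot run in parallel. Whenever the slave meets a transaction that cannot run in parallel, it has to wait for the preceding transactions to finish before continuing, which lowers the slave's TPS.

In theory, however, trx4 should be able to run in parallel with trx5 and trx6: trx4 prepared before both of them, so if trx5 and trx6 were able to enter the prepare phase, they cannot conflict with trx4.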

Solution:

0. Add two global variables:

/* Committed transactions timestamp */
Logical_clock max_committed_transaction;
/* Prepared transactions timestamp */
Logical_clock transaction_counter;

Each transaction carries two counters: last_committed and sequence_number.

1. Whenever the binlog is rotated or a new binlog is opened

MYSQL_BIN_LOG::open_binlog:

max_committed_transaction.update_offset(transaction_counter.get_timestamp());
transaction_counter.update_offset(transaction_counter.get_timestamp());

This sets the offset of max_committed_transaction and transaction_counter to the current state value (in other words, to the largest transaction counter value of the previous binlog file).

2. When each DML statement finishes, set the session's last_committed = mysql_bin_log.max_committed_transaction

See the function binlog_prepare (with the argument all set to false).

3. At transaction commit, before writing to the binlog

binlog_cache_data::flush:

trn_ctx->sequence_number= mysql_bin_log.transaction_counter.step();

Here transaction_counter is incremented by 1.

4. Write to the binlog

sequence_number and last_committed are written to the binlog:

MYSQL_BIN_LOG::write_gtid

The sequence number and last committed values recorded in the binlog file have max_committed_transaction.get_offset() subtracted from them, so the numbering in every binlog file starts from (last_committed, sequence_number) = (0, 1).

5. Update max_committed_transaction before each transaction commits in the engine

If the current transaction's sequence_number is greater than max_committed_transaction, max_committed_transaction is updated to that value.

MYSQL_BIN_LOG::process_commit_stage_queue -> MYSQL_BIN_LOG::update_max_committed

6. Parallelism check on the slave

Function: Mts_submode_logical_clock::schedule_next_event

Assume the initial state is transaction_counter=1 and max_committed_transaction=1. For the sequence shown above, the last_committed and sequence_number values of each transaction are:

Trx1 prepare: last_committed = max_committed_transaction = 1;

Trx2 prepare: last_committed = max_committed_transaction = 1;

Trx3 prepare: last_committed = max_committed_transaction = 1;

Trx1 commit: sequence_number = ++transaction_counter = 2, (transaction_counter=2, max_committed_transaction=2), write(1, 2)

Trx4 prepare: last_committed = max_committed_transaction = 2;

Trx2 commit: sequence_number = ++transaction_counter = 3, (transaction_counter=3, max_committed_transaction=3), write(1, 3)

Trx5 prepare: last_committed = max_committed_transaction = 3;

Trx6 prepare: last_committed = max_committed_transaction = 3;

Trx3 commit: sequence_number = ++transaction_counter = 4, (transaction_counter=4, max_committed_transaction=4), write(1, 4)

Trx4 commit: sequence_number = ++transaction_counter = 5, (transaction_counter=5, max_committed_transaction=5), write(2, 5)

Trx5 commit: sequence_number = ++transaction_counter = 6, (transaction_counter=6, max_committed_transaction=6), write(3, 6)

Trx7 prepare: last_committed = max_committed_transaction = 6;

Trx6 commit: sequence_number = ++transaction_counter = 7, (transaction_counter=7, max_committed_transaction=7), write(3, 7)

Trx7 commit: sequence_number = ++transaction_counter = 8, (transaction_counter=8, max_committed_transaction=8), write(6, 8)
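The numbering can be reproduced by replaying the prepare and commit events in the order they occur in the diagram. This is a sketch of the bookkeeping only: assign_logical_timestamps is a hypothetical function, and it models the two counters as plain integers rather than the Logical_clock class.

```python
def assign_logical_timestamps(events, start=1):
    """Replay (trx, "P" or "C") events; return {trx: (last_committed, sequence_number)}."""
    transaction_counter = start
    max_committed_transaction = start
    last_committed, result = {}, {}
    for trx, kind in events:
        if kind == "P":
            # Prepare: remember the newest commit visible to this transaction.
            last_committed[trx] = max_committed_transaction
        else:
            # Commit: take the next sequence number, then publish it.
            transaction_counter += 1
            result[trx] = (last_committed[trx], transaction_counter)
            max_committed_transaction = max(max_committed_transaction,
                                            transaction_counter)
    return result


events = [(1, "P"), (2, "P"), (3, "P"), (1, "C"), (4, "P"), (2, "C"), (5, "P"),
          (6, "P"), (3, "C"), (4, "C"), (5, "C"), (7, "P"), (6, "C"), (7, "C")]
stamps = assign_logical_timestamps(events)
for trx, pair in sorted(stamps.items()):
    print(f"trx{trx} write{pair}")
```

The pairs produced are the same (last_committed, sequence_number) values as in the write(...) entries of the list above.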

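Parallelism rule:

The intervals (last_committed to sequence_number) for the sequence above are therefore:

trx1: 1...2
trx2: 1.......3
trx3: 1...........4
trx4:     2...........5
trx5:         3...........6
trx6:         3...............7
trx7:                     6.......8

Slave parallelism rule: a transaction may be dispatched for execution when its last_committed is smaller than the smallest sequence_number among the transactions currently executing.

Therefore:

a) while trx1 executes, transactions with last_committed < 2 can run in parallel, so trx2 and trx3 can be dispatched
b) once trx1 finishes, transactions with last_committed < 3 can execute, so trx4 can be dispatched
c) once trx2 finishes, transactions with last_committed < 4 can execute, so trx5 and trx6 can be dispatched
d) once trx3, trx4 and trx5 finish, transactions with last_committed < 7 can execute, so trx7 can be dispatched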
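The dispatch rule can be exercised on the same numbers with a small greedy scheduler. The sketch makes two assumptions that are not taken from the server code: transactions are dispatched in sequence_number order, and the running transaction with the smallest sequence_number always finishes first. The names can_dispatch and dispatch_waves are hypothetical.

```python
def can_dispatch(last_committed, executing_sequence_numbers):
    """True if nothing is running, or last_committed is below the smallest
    sequence_number among the transactions still executing."""
    if not executing_sequence_numbers:
        return True
    return last_committed < min(executing_sequence_numbers)


def dispatch_waves(stamps):
    """Start every transaction the rule allows, then finish the oldest one."""
    pending = sorted(stamps.items(), key=lambda item: item[1][1])
    executing, waves = {}, []
    while pending:
        wave = []
        while pending and can_dispatch(pending[0][1][0], list(executing.values())):
            trx, (_, sequence_number) = pending.pop(0)
            executing[trx] = sequence_number
            wave.append(trx)
        if wave:
            waves.append(wave)
        # Assumption: the oldest running transaction is the next to finish.
        oldest = min(executing, key=executing.get)
        del executing[oldest]
    return waves


stamps = {1: (1, 2), 2: (1, 3), 3: (1, 4), 4: (2, 5),
          5: (3, 6), 6: (3, 7), 7: (6, 8)}
waves = dispatch_waves(stamps)
print(waves)
```

The waves come out as trx1 to trx3, then trx4, then trx5 and trx6, then trx7, which is the grouping listed in a) through d).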


2023-07-19
