SBR Replication, Replicate_Do_DB, and Database Qualifiers: A DANGEROUS MIX!

Replicação SBR, Replicate_Do_DB e qualificadores de banco: MISTURA PERIGOSA!
replication

No, that’s not me! Uncle is much cuter (a good definition for a “tidy ugly guy”). But this dumbfounded look fits me well, as I bring this topic to light. Honestly, I can’t say whether it’s a byta-bug or a configuration issue.

Despite the light-hearted introduction, the topic is serious and warrants a review of your replication environment configuration.

Over time, I’ve heard occasional and sporadic complaints about replication issues, where sometimes “it stops replicating.” However, in most cases, it involved only a “handful” of commands, mostly maintenance ones, which could be manually replicated, and thus, never justified a more in-depth forensic investigation.

Recently, one of my Padawans approached me with an intriguing statement: “A command was executed on the master server, but it wasn’t replicated to the slaves.”

Say what? Naturally, I replied with the most relevant question: “Check if the command was actually recorded in the Binlog (binary log), and also in the Relay Log.”

I was horrified by the answer! Yes, the command (an INSERT) was there! In other words, let’s lay out the scenario more clearly:

– 1 Master server and 3 Slave servers
– INSERT command on the Master (successfully applied), recorded in the Binlog (a requirement for replication)
– Command received by the SLAVES, properly recorded in the Relay Log (the Binlog counterpart)

The mystery: If the command was correctly applied on the Master (the row was indeed inserted into the table), recorded in the Binlog, and replicated to the Slaves’ Relay Logs, why wasn’t the row inserted into the table on the Slaves?

Clearly, there was no issue on the Master server. MySQL did what was expected: it inserted the row into the table and recorded it in the Binlog.

Could the problem be with the Slaves? All three?? Had some Lord Sith taken over the unsuspecting MySQL Slave servers?

I was intrigued and decided to put an end to this mess once and for all. A Jedi, but in love with Darth Vader, I decided to turn to the Force to solve this mystery. And many surprises would unfold!

I set up a simple environment with two MySQL instances running on the same virtual machine. If you want to replicate this experience, spin up two MySQL instances (one on port 3306 and the other on port 3307), configured with one-way Master-Slave replication, paying attention to the following configurations, in line with our case study:

– Additional configuration for the MySQL Master server:
binlog_format = STATEMENT

– Additional configuration for the MySQL Slave server:
replicate-do-db = bug_da_replicacao_1

replication

Next, let’s connect to the MySQL Master server using your preferred SQL client to create the database bug_da_replicacao_1, which will be the target of our experiment:

sqlCopiarEditarCREATE TABLE `tabela_1` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `dt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `rnd` int(11) DEFAULT NULL,
  UNIQUE KEY `id` (`id`)
) ENGINE=InnoDB;

Let’s populate the tabela_1 table through the MySQL Master server:

sqlCopiarEditarUSE mysql;

INSERT INTO bug_da_replicacao_1.tabela_1 VALUES (null, now(), rand()*1000000);
INSERT INTO bug_da_replicacao_1.tabela_1 VALUES (null, now(), rand()*1000000);
INSERT INTO bug_da_replicacao_1.tabela_1 VALUES (null, now(), rand()*1000000);

The intention of the USE command is to switch out of the bug_da_replicacao_1 database, which is necessary for this study.

Now, let’s connect to the MySQL Slave server instance to check whether the data was replicated:

sqlCopiarEditarmysql> SELECT * FROM bug_da_replicacao_1.tabela_1;
Empty set (0.01 sec)

It wasn’t! You could check the Binlog and Relay Log, because those commands will be there. This is where the Imperial March (famous Star Wars theme song) begins! Rurrrrzz. Stop everything! You don’t know it? Then why are you reading this article?

Upon analysis, possible failure points:

  • Binlog: On the Master side, no failure—command is recorded in the Binlog
  • IO_Thread and Relay Log: On the Slave side, also fine—the IO_Thread fetched the command from the Master
  • SQL_Thread: Aha! Here lies the issue! The SQL_Thread is failing; it’s not executing the SQL command, even though it’s perfectly recorded in the Relay Log.

Indeed, the problem is with the SQL_Thread, but because of a combination of factors:

  • The most significant is that replication is SBR (Statement-Based Replication)
  • Just as important, the Slave is configured to replicate only a specific database using: replicate-do-db = bug_da_replicacao_1
  • And finally, the assumption that qualifying the database (i.e., database.table, in our case: bug_da_replicacao_1.tabela_1) would ensure that the command would be interpreted correctly on the Slave
replication

Now, I propose another test. On the MySQL Master:

sqlCopiarEditarUSE bug_da_replicacao_1;

INSERT INTO bug_da_replicacao_1.tabela_1 VALUES (null, now(), 1234);
INSERT INTO tabela_1 VALUES (null, now(), 4321);

Let’s check replication on the MySQL Slave:

sqlCopiarEditarmysql> SELECT * FROM bug_da_replicacao_1.tabela_1;
+----+---------------------+------+
| id | dt                  | rnd  |
+----+---------------------+------+
|  4 | 2013-03-17 17:39:45 | 1234 |
|  5 | 2013-03-17 17:39:46 | 4321 |
+----+---------------------+------+
2 rows in set (0.00 sec)

“Et voilà”! When connected directly to the database, with or without qualification, the command is correctly replicated!

That’s why I keep insisting: the problem is not with SBR, or with database qualification, or whether you’re connected to the DB, or with filtering via replicate-do-db, or with the SQL_Thread. It’s not in any of these factors individually. The issue lies in the combination of them.

SBR ruled supreme in MySQL versions 3 and 4—robust and strong. In version 5.0, RBR (Row-Based Replication) was introduced—much more precise. In MySQL 5.1, MySQL gained the ability to decide which replication type is more efficient and safe: SBR or RBR. This is configured with binlog_format = MIXED. There is no such thing as MBR (Mixed-Based Replication). The MIXED parameter lets MySQL decide the best replication type to use.

You might ask:

“Uncle, should I never use SBR again?”

No, dear Padawan. You should use SBR! Especially for large maintenance operations like massive UPDATEs and DELETEs—RBR is completely inefficient for those cases. I ask that you understand the limitations and differences between SBR and RBR before using SBR. Once you’ve determined it’s suitable, please, dear Padawan: You must connect to the database!

“Uncle, we use SBR a lot in the company and never had this problem. Why?”

Simple, future Jedi. The issue only surfaced because this was a manual maintenance, not regular application use. Normally, applications connect to the database with: Host/IP, Port, Username, Password, and—Database! Got it? Developers are used to specifying the database, which prevents this kind of failure.

So, we must advise our developers to always provide the database name in the connection object (the connector).

Again! I don’t want you to stop using SBR, especially for large-scale maintenance. But if you’re using MySQL 5.1 or later, your MySQL’s health is much better off with binlog_format = MIXED.

Dear students, former students, and friends, have a great rainy Sunday. Enjoy what you love—regardless of what “others” say.

Schedule a meeting here

Visit our Blog

Learn more about databases

Learn about monitoring with advanced tools

replication

Have questions about our services? Visit our FAQ

Want to see how we’ve helped other companies? Check out what our clients say in these testimonials!

Discover the History of HTI Tecnologia

Compartilhar: