With minimal knowledge of replication, and especially of replication lag, I started investigating serious issues with our data analytics. The replication lag was more than 18,000 seconds and the team members couldn't access the data they needed. Here's how we fixed it.
1. Understand the problem
At the beginning, it was very hard to understand the root cause of the replication lag. It had been a long-standing issue, and we had been using periodic replica reboots as a stopgap fix.
There is a quite informative article in the AWS docs - How can I troubleshoot high replica lag with Amazon RDS for MySQL?. It wasn't very clear at first, but it is genuinely valuable for understanding the problem.
It's important to verify the type of replication lag:
- IO thread lag
- SQL thread lag
Run the following command on the primary database:
SHOW MASTER STATUS;
***** example output *****
Master_Log_File: mysql-bin.066548
Read_Master_Log_Pos: 10050480
Relay_Master_Log_File: mysql-bin.066548
Exec_Master_Log_Pos: 10050300
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
And run the following command on the replica database:
SHOW SLAVE STATUS;
***** example output *****
Master_Log_File: mysql-bin.066552
Read_Master_Log_Pos: 430
Relay_Master_Log_File: mysql-bin.066530
Exec_Master_Log_Pos: 50360
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
We had the same problem the documentation described, but on our replica the Relay_Master_Log_File was behind the Master_Log_File by a huge number of binlog files. It was a bit confusing, but it was a good sign that we were on the right track and that our problem was related to SQL thread lag.
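If you want a more granular view than SHOW SLAVE STATUS, the performance_schema replication tables show the state of the IO (connection) thread and of each SQL (applier) worker separately. This is only a complementary sketch, assuming MySQL 5.7+ with performance_schema enabled; it isn't something the AWS article asks for:
-- IO (connection) thread: is the replica still receiving events from the primary?
SELECT CHANNEL_NAME, SERVICE_STATE, LAST_ERROR_MESSAGE
FROM performance_schema.replication_connection_status;
-- SQL (applier) thread(s): is the replica keeping up with applying them?
SELECT CHANNEL_NAME, WORKER_ID, SERVICE_STATE, LAST_ERROR_MESSAGE
FROM performance_schema.replication_applier_status_by_worker;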
The documentation also lists the following possible root causes of the delay:
- Long-running queries on the primary DB instance
- Insufficient DB instance class size or storage
- Parallel queries run on the primary DB instance
- Binary logs synced to the disk on the replica DB instance
- Binlog_format on the replica is set to ROW
- Replica creation lag
Long-running queries on the primary DB instance
There were definitely some long-running queries, but we weren't sure whether they were the cause of such a huge lag. We can view the slow log on the replica with the query SELECT * FROM slow_log ORDER BY query_time DESC; however, I didn't find any long-running queries there.
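A slightly more readable variant of that query, shown as a sketch; it assumes slow_query_log is enabled and log_output includes TABLE, otherwise mysql.slow_log stays empty and the entries go to a file instead:
-- Longest queries first; LIMIT and LEFT() just keep the output readable.
SELECT start_time, query_time, rows_examined, LEFT(sql_text, 100) AS sql_snippet
FROM mysql.slow_log
ORDER BY query_time DESC
LIMIT 20;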
Insufficient DB instance class size or storage
We increased the storage size and the instance class size. At first it looked promising, but the lag soon caught up again; the temporary decrease was only due to the replica reboot/restart that happens during the upgrade process. We scaled the replica back down and didn't observe any change. (This might be a marketing strategy by AWS to sell more resources, but I'm not sure 🙂 )
Parallel queries run on the primary DB instance
We didn’t have any parallel queries running on the primary DB instance. We had a lot of queries running, but they were executed sequentially.
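A quick way to sanity-check this is to look at the processlist on the primary; this is just a sketch showing what is executing concurrently at a given moment, not part of the AWS checklist:
-- Ignore idle connections and show the longest-running statements first.
SELECT id, user, db, time, state, LEFT(info, 80) AS query_snippet
FROM information_schema.processlist
WHERE command <> 'Sleep'
ORDER BY time DESC;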
Binary logs synced to the disk on the replica DB instance
This was very confusing, because checking the parameter log_bin or running the command SHOW BINARY LOGS didn't show any results. The replica didn't seem to use binary logs at all.
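For completeness, these are the checks in one place; if log_bin reports OFF and SHOW BINARY LOGS returns an error or nothing, the replica isn't writing its own binary logs and this cause doesn't apply:
SHOW VARIABLES LIKE 'log_bin';      -- OFF means the replica is not writing binary logs
SHOW VARIABLES LIKE 'sync_binlog';  -- only relevant when the replica does write binlogs
SHOW BINARY LOGS;                   -- errors or returns nothing when binary logging is disabled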
Binlog_format on the replica is set to ROW
Another, even more confusing part. The parameter binlog_format in the AWS RDS parameter group was set to MIXED on both the primary and the replica. But when I checked SHOW VARIABLES LIKE 'binlog_format' on the replica, it reported ROW. I didn't find any information about this behavior in the AWS documentation. I tried changing the parameter group value to MIXED and restarting the replica, and even created a new parameter group with the correct value, but it didn't help. I wasn't able to modify this parameter directly on the replica due to missing permissions. This was my initial guess at the root cause.
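A small check that makes the mismatch visible; just a sketch comparing the value the server actually reports against what the parameter group claims:
-- Run on both the primary and the replica.
SELECT @@GLOBAL.binlog_format  AS global_binlog_format,
       @@SESSION.binlog_format AS session_binlog_format;
SHOW VARIABLES LIKE 'binlog_format';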
Replica creation lag
We created a new replica with a new parameter group, but it didn't help. The replication lag was "replicated" to the new replica as well.
Eureka moment
After a lot of research and testing, I found an interesting inconsistency when running SHOW SLAVE STATUS on the replica and SHOW MASTER STATUS on the primary.
The difference was in the position of the mysql-bin-changelog: the replica showed a higher value than the primary.
This indicates that the replica is waiting for an event that the master has not yet written to the binary log file; either the replication process is delayed, or the master has not yet executed the SQL statement that would generate the next binary log event.
The most common reason is a delay in the replication process caused by a high workload on the master database, slow network connections, or other factors that slow replication down. In some cases, the SQL statements being executed on the master can themselves delay the generation of binary log events.
To improve performance and reduce the likelihood of this error, you can consider adjusting the following parameters on the replica (a quick way to inspect their current values is shown after the list):
slave_net_timeout: This parameter specifies the number of seconds that the replica waits for a response from the master before timing out. Increasing this value can help improve performance if the network connection between the master and replica is slow. Our default value was 120.
slave_parallel_workers: This parameter specifies the number of worker threads that the replica can use to process events in parallel. Increasing this value can help improve performance if the replica has a high workload. Our default value was empty.
slave_transaction_retries: This parameter specifies the number of times that the replica should retry a failed transaction before giving up. Increasing this value can help improve performance if there are intermittent issues with the replication process. Our default value was empty.
slave_pending_jobs_size_max: This parameter specifies the maximum amount of memory that can be used to store pending replication jobs. Increasing this value can help improve performance if there is a large number of pending replication jobs, but it's not recommended to raise it above the value set on the primary. Our default value was 268435456.
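Before touching anything, it's worth capturing the current effective values on the replica. A minimal sketch; note that on RDS these parameters are changed through the DB parameter group, not with SET GLOBAL:
-- Show the current values of the four parameters discussed above.
SHOW VARIABLES WHERE Variable_name IN
  ('slave_net_timeout',
   'slave_parallel_workers',
   'slave_transaction_retries',
   'slave_pending_jobs_size_max');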
Solution
I was sure the problem was related to SQL thread lag. Our RDS binlog disk usage metric on the primary was rising up to 20 GB, and the read latency of the replica was more than 50% higher than on the primary. It looked like the replica couldn't catch up with the primary and couldn't read and apply the binlog data efficiently.
I set the parameter slave_parallel_workers to 4 to enable parallel execution of replication events on the replica, using up to four worker threads. This means that multiple events from the binary log can be applied at the same time, potentially reducing the replication lag on the replica. And it worked: the lag dropped to 0 seconds, occasionally rising to a few minutes during heavy loads.
I recommend testing the parameter slave_parallel_workers with different values to find the optimum for your use case. The default value is empty (the engine default of 0), which means that parallel execution of replication events is disabled.
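After the parameter group change is applied, a quick verification on the replica can look roughly like this; slave_parallel_type is included as an extra, assumption-level check, because the DATABASE vs LOGICAL_CLOCK policy affects how much the workers can actually run in parallel:
SHOW VARIABLES LIKE 'slave_parallel_workers';  -- should now report 4
SHOW VARIABLES LIKE 'slave_parallel_type';     -- DATABASE or LOGICAL_CLOCK
SHOW SLAVE STATUS;                             -- watch Seconds_Behind_Master trend toward 0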
Important metrics to monitor
I recommend watching the metrics below to monitor the replication lag and the performance of the replica.
- RDS binlog disk usage on the primary. After changing the parameter, the binlog disk usage dropped from 20 GB to less than 1 GB.
- RDS read latency on the primary and the replica. The replica's read latency went from roughly 50% higher than the primary to about 15% higher.
- RDS write IOPS on the replica. There was a 50-60% improvement in write IOPS.
- Replication lag on the replica. The replication lag dropped to 0 seconds almost immediately.