
Troubleshooting a Linux Process Stuck in the D State

The RK3588 Linux 6.1 kernel recently gained redroid support, so I planned to upgrade my kernel to 6.1. After the upgrade, while cleaning up disk space, I noticed that rm had become extremely slow: files that used to be deleted in a few seconds now took one to two minutes.

Since deletion was slow, my first suspicion was a disk problem. But htop showed no process doing heavy disk reads or writes, and a follow-up test writing a large amount of data to the disk with dd showed no stall either.
Having ruled out the disk, the kernel was the only remaining suspect. I rolled the kernel back to 5.10 and deletion speed immediately returned to a few seconds; upgrading to 6.1 again brought the fault back.
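The throughput test mentioned above can be reproduced with something like the following; the path and size are illustrative, not the exact command used during the investigation:

```shell
# Crude sequential-write test: write 64 MiB to a scratch file,
# force it out to the device with fsync, then clean up.
dd if=/dev/zero of=/tmp/ddtest bs=1M count=64 conv=fsync
rm -f /tmp/ddtest
```

If this completes at normal speed while rm stalls, the raw write path is healthy and the problem lies elsewhere.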

Deleting a file once more and watching htop carefully, I saw that the rm process was in state D. Some reading revealed that D stands for disk sleep (uninterruptible sleep), the state the kernel assigns to a process while it waits for the disk to respond.
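To find processes currently stuck in D state without scanning htop by eye, ps can filter on the state column (a generic sketch, not specific to this machine):

```shell
# List processes whose state code starts with D (uninterruptible sleep).
# NR==1 keeps the header row so the output is readable even when no
# process happens to be in D state at that moment.
ps -eo pid,stat,comm | awk 'NR==1 || $2 ~ /^D/'
```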

Check rm's kernel call stack:

```bash
sudo cat /proc/`pidof rm`/stack
[<0>] submit_bio_wait+0x64/0x98
[<0>] blkdev_issue_discard+0x84/0xd8
[<0>] ext4_issue_discard.constprop.0+0x9c/0xb0
[<0>] ext4_free_blocks+0x7cc/0x888
[<0>] ext4_ext_remove_space+0xa0c/0xe84
[<0>] ext4_ext_truncate+0x94/0xb8
[<0>] ext4_truncate+0x254/0x324
[<0>] ext4_evict_inode+0x2f8/0x438
[<0>] evict+0xbc/0x15c
[<0>] iput+0x184/0x1c4
[<0>] do_unlinkat+0x190/0x234
[<0>] __arm64_sys_unlinkat+0x64/0x70
[<0>] invoke_syscall+0x8c/0x128
[<0>] el0_svc_common.constprop.0+0x9c/0x13c
[<0>] do_el0_svc+0xa8/0xb8
[<0>] el0_svc+0x64/0xc4
[<0>] el0t_64_sync_handler+0xac/0x13c
[<0>] el0t_64_sync+0x19c/0x1a0
```

Reading the implementation of ext4_free_blocks shows that it calls ext4_mb_clear_bb near the end. That function:

```c
static void ext4_mb_clear_bb(handle_t *handle, struct inode *inode,
			       ext4_fsblk_t block, unsigned long count,
			       int flags)
{
	struct buffer_head *bitmap_bh = NULL;
	struct super_block *sb = inode->i_sb;
	...
	if (ext4_handle_valid(handle) &&
	    ((flags & EXT4_FREE_BLOCKS_METADATA) ||
	     !ext4_should_writeback_data(inode))) {
		struct ext4_free_data *new_entry;
		/*
		 * We use __GFP_NOFAIL because ext4_free_blocks() is not allowed
		 * to fail.
		 */
		new_entry = kmem_cache_alloc(ext4_free_data_cachep,
				GFP_NOFS|__GFP_NOFAIL);
		new_entry->efd_start_cluster = bit;
		new_entry->efd_group = block_group;
		new_entry->efd_count = count_clusters;
		new_entry->efd_tid = handle->h_transaction->t_tid;

		ext4_lock_group(sb, block_group);
		mb_clear_bits(bitmap_bh->b_data, bit, count_clusters);
		ext4_mb_free_metadata(handle, &e4b, new_entry);
	} else {
		/* need to update group_info->bb_free and bitmap
		 * with group lock held. generate_buddy look at
		 * them with group lock_held
		 */
		if (test_opt(sb, DISCARD)) {
			err = ext4_issue_discard(sb, block_group, bit,
						 count_clusters, NULL);
			if (err && err != -EOPNOTSUPP)
				ext4_msg(sb, KERN_WARNING, "discard request in"
					 " group:%u block:%d count:%lu failed"
					 " with %d", block_group, bit, count,
					 err);
		}
	}
	...
}
```

Since ext4_issue_discard appears in the call stack, test_opt(sb, DISCARD) must have evaluated to true. Expanding the test_opt macro shows that it checks whether a given mount option is enabled, returning true if it is.
Check the mount state:

```bash
mount | grep /dev/mmcblk0p1
/dev/mmcblk0p1 on / type ext4 (rw,noatime,errors=remount-ro,commit=600)
```

The discard option is clearly not among the mount options, so why does the condition evaluate to true?
A global search for the string DISCARD under fs/ext4 finally turned up the answer in super.c:

```c
	/* discard enabled by default for Rockchip; disable with nodiscard */
	if (IS_ENABLED(CONFIG_ARCH_ROCKCHIP) ||
	    (def_mount_opts & EXT4_DEFM_DISCARD))
		set_opt(sb, DISCARD);
```

After commenting out these three lines and rebuilding the kernel, the problem was gone.
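Patching the kernel works, but the comment in super.c itself points at a lighter fix: the Rockchip default can be overridden per mount with the nodiscard option, no rebuild required, e.g. `mount -o remount,nodiscard /` (requires root). The snippet below mimics how you would then confirm whether discard is among the effective options; the sample string stands in for the real output of `grep ' / ' /proc/mounts`:

```shell
# Check whether "discard" appears in a comma-separated mount-option list.
# Wrapping in commas keeps "nodiscard" from matching as "discard".
opts='rw,noatime,errors=remount-ro,commit=600'
case ",$opts," in
  *,discard,*) echo "discard enabled" ;;
  *)           echo "discard disabled" ;;
esac
```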