linux>3.19.x fails to mount root ZFS on NVMe during stage 1 #11003
Comments
Works for me. I have tried ZFS on root and ZFS on encrypted root. I had similar errors to yours, but most of the time they turned out to be configuration issues on my side. Please show your config for disks/grub/zfs.
So, I just rebooted and can give some more information.
@spinus: Not using grub. This machine boots through UEFI with gummiboot, so the kernel sits in an EFI system partition. I've posted the rest above. Those mountpoints sure sound bogus, though. I had created those filesystems using … EDIT: Right. It's just the pools themselves that seem to have a mountpoint set. Maybe I'm missing something, but I don't think that pools should require a mountpoint. :)
I think the problem here is that ZFS is not even finding your pool when you boot with a newer kernel: …
This usually happens when the kernel cannot see your storage devices (e.g. due to missing drivers, or due to a race condition with udev). Can you try booting with kernel 3.19.x, but this time adding:
... to your … Just as an FYI, I'm also using a ZFS root pool with kernel 4.1.x, with ZFS on top of an encrypted LUKS device, and also in UEFI mode (with gummiboot, currently). As you can see, it's supposed to work fine :-)
I'll report back when the machine becomes idle (and I can reboot without users clamoring) later today. Yes, the only things that change in …
I was able to sneak in a reboot. Here's what I found: NVMe device nodes do show up in …
After that, the pool seems to be online. Nevertheless, … Anything else I should try? EDIT: I've rebooted the machine to the old kernel for the time being. EDIT: Could this be a result of the bogus mountpoints defined for the pool itself? EDIT: I just tried to import my pools using the …
I imported my other pool as well, though it's not essential for booting. As one can see, no error is reported. EDIT: I've gone through my journal after an old-kernel boot and found a few things related to ZFS that may provide hints: …
Posting a new comment because all that mountpoint jazz turns out to be a diversion from the main issue. I rebooted into a new kernel and decided to look into dmesg (why I didn't think of this earlier, I don't know) and found something interesting: …
This indicates that NVMe is initialized almost five seconds after almost everything else during that boot has finished. Of course, when I start a rescue shell and import the pool manually, everything looks fine, since five seconds is not that much time - for a human. I found a commit in the kernel git repository that may be pertinent: torvalds/linux@2e1d844. This commit has been in the kernel since 4.0-rc1, which lines up with when my problems started.
Sounds like we need to wait for the NVMe driver to finish enumerating all devices before starting the zpool import. This is not a problem unique to NixOS; other distros have needed to resolve this. The ordering can be done with systemd dependencies. I'll research how to have the NVMe driver raise an objection while it is enumerating, and lower it when the devices are available.
All of this happens during stage 1, and AFAICS systemd does not run yet at that point. The stage 1 shell script could simply wait in a loop until the pool containing the root FS becomes available; a ten-second timeout wouldn't hurt, either. The latter seems to be what other distributions do when … In an ideal world, one could wait for the relevant block device's driver module to signal when it has completely finished probing. It may even be possible for NVMe, but I suspect that it's not possible in the general case, because from … Another thought: if mounting the root (Z)FS was managed through ZFS (…) EDIT: But I suspect this may not be as easy, since the mountpoint shifts between stages (…) EDIT: I also found two interesting NixOS tidbits surrounding the core issue of slowly initialized root FS block devices: …
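To make the wait-loop idea above concrete, here is a minimal sketch, not the stage-1 code NixOS actually ships: it polls until the pool shows up in a `zpool import` scan before the real import is attempted. `boot.initrd.postDeviceCommands` is an existing NixOS option; the pool name `rpool` and the roughly ten-second budget are illustrative assumptions.

```nix
{ lib, ... }:
{
  boot.initrd.postDeviceCommands = lib.mkBefore ''
    # Sketch only: give the root pool up to ~10 seconds to become visible
    # to a "zpool import" scan before stage 1 attempts the real import.
    # "rpool" is a placeholder pool name; adjust to your own.
    for _ in $(seq 10); do
      zpool import 2> /dev/null | grep -q rpool && break
      sleep 1
    done
  '';
}
```

Using `lib.mkBefore` is meant to order the loop ahead of whatever other commands are merged into the same hook.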
Hrm. Is there no …? /me wanders off to look.
There is a … If that fails, we might hack a workaround and inject a …
The fix recommended by the NVMe driver maintainers is to pass …
I'm not familiar with using ZFS with NVMe, so I might be off base here. If the "udev" params are not helping (which I thought were meant for systemd), try adding "rootdelay=120" instead.
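In NixOS terms, that suggestion would look roughly like the following. `boot.kernelParams` is the real option; whether `rootdelay` actually influences the stage-1 ZFS import path is exactly what is in question here, so treat this as a sketch.

```nix
{
  # Sketch: pass the suggested kernel command-line parameter. rootdelay
  # normally only delays mounting of a conventional root= device, so its
  # effect on the ZFS import in stage 1 is an open question.
  boot.kernelParams = [ "rootdelay=120" ];
}
```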
Thanks for the suggestions. I tried booting a new kernel with the first suggested set of arguments, then with the second, single parameter. Unfortunately, neither of these had any observable effect - I didn't see any delay during boot, and, of course, NVMe still added the needed device nodes way after … Is the first set really intended for …? EDIT: Looks like at least one of the parameters is being read here: https://github.com/systemd/systemd/blob/master/src/udev/udevd.c#L1360 EDIT: My kernel command line does not even define …
It's been a while since I hacked on the udev boot stuff, but I wonder if the …
Let's raise this blocker from the dead: according to openzfs/zfs#2199 (comment), …
As openzfs/zfs#330 implies, this would actually need to wait for all constituent devices of a zpool to show up. I really don't know what kernel devs think userspace is supposed to do in this case. A variation on this problem (this time with USB) is discussed in https://bugs.archlinux.org/task/11571; they go for the two workarounds described above. Here's where I first read that …
Heh, I guess we can't please everyone. :) The driver used to complete device discovery serially, blocking initialization until complete. Then user-space software decided to kill the init process because it takes "too long" with enough devices present. I filed https://bugzilla.redhat.com/show_bug.cgi?id=1191726 for the user-space devs to explain why, and the conclusion was that the kernel driver needs to discover NVMe faster. The only way to go faster is to parallelize, so the driver completes device "probe" before the storage is surfaced, so that it can move on to the next controller. You say USB has similar behavior and problems, but I assume other storage devices under the SCSI stack work correctly. Not sure if or how they're synchronizing with user space, but I will look into it.
Looks like SCSI uses an async schedule whereas NVMe uses a work queue. I'm guessing that's the key difference, based on how "wait_for_device_probe()" synchronizes.
@evujumenuk what's the status of this?
Well, since no one seems to know (or maybe no one is inclined to tell) how userspace is supposed to wait for the kernel to finish probing all devices, the status is exactly as it was in November. I can try … The machine continues to run 3.18.x, for now.
After spending a few more brain cycles on this problem, I think that a flag to … In an ideal world, the init script would simply block until the root filesystem magically appeared, because …
Thoughts?
I have the same issue right now, though I don't have it on 3 other physical machines (laptop, PC).
If I boot the machine with the kernel param `boot.shell_on_fail`, log in, wait a moment, and then press Ctrl+D, the system loads properly. So this might be some async/race condition, as mentioned.
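For reference, that parameter can also be baked into the configuration so the manual workaround is always available. A sketch only: this merely makes the Ctrl+D dance easier to trigger, it does not address the underlying race.

```nix
{
  # Sketch: always allow dropping to a stage-1 shell on failure, so the
  # pool can be imported by hand and boot resumed with Ctrl+D.
  boot.kernelParams = [ "boot.shell_on_fail" ];
}
```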
Same issue here. I can't add very much, but here's the workaround I'm using: …
There's no hook in between udevadm settle and zpool import, otherwise that last line wouldn't be needed. By all appearances, it takes 3-5 seconds for nvme0n1 to appear after nvme0 appears. A slightly more principled solution would be to retry the import until it works, with 0.1 s in between or so. It's not like there's any point in continuing if root isn't there.
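A sketch of a workaround along these lines, assuming the pool lives on `/dev/nvme0n1`; the device path and the whole-second polling are illustrative assumptions, not the commenter's exact snippet.

```nix
{ lib, ... }:
{
  boot.initrd.postDeviceCommands = lib.mkBefore ''
    # Sketch only: the nvme0n1 namespace node can show up a few seconds
    # after nvme0 itself, so poll for it before the pool import runs.
    # /dev/nvme0n1 is a placeholder for the actual pool member.
    for _ in $(seq 10); do
      [ -e /dev/nvme0n1 ] && break
      sleep 1
    done
  '';
}
```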
...sorry about the spam!
Works around NixOS#11003.
Since #16901 has been merged and backported to the stable release (thanks @Baughn!), this problem should no longer happen in most cases, so I'm closing.
Works around NixOS#11003. (cherry picked from commit 98b213a) Reason: several people cannot boot with ZFS on NVMe
When using any kernel strictly newer than `pkgs.linuxPackages_3_19` as `boot.kernelPackages`, the entire configuration fails to mount the root-on-ZFS filesystem during stage 1. Here is a transcript of a failing boot: …
(Yes, I use Hungarian notation for pool and volume names. Stop judging!)
Pressing the `R` and `Enter` keys results in a few dozen newlines being printed and a subsequent reboot. Simply pressing, e.g., just `Enter` results in the following: … I guess switch_root is being called with too few arguments.
I can boot just fine with 3.19.x and 3.18.x. I have reverted to using 3.18.x for now since 3.19.x has fallen out of nixpkgs.
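In configuration terms, the working versus failing selection boils down to roughly this. A sketch only: `linuxPackages_4_1` stands in for whichever newer kernel attribute is selected.

```nix
{ pkgs, ... }:
{
  # Boots and mounts the ZFS root fine:
  boot.kernelPackages = pkgs.linuxPackages_3_18;

  # Fails to mount the root-on-ZFS filesystem during stage 1 (assumption:
  # any attribute newer than linuxPackages_3_19 behaves the same way):
  # boot.kernelPackages = pkgs.linuxPackages_4_1;
}
```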
I'd love to include the generated `/etc/fstab`; however, one of the disadvantages of 3.18.x prevents me from feeding any input to that machine's console. Right now, I am posting this from a virtual machine on that host which has had one of the host's USB controllers passed through to it via VFIO. The one that is still assigned to the host proper has had its driver die: … Which simply means that I'll need to reboot the entire machine to do something like `cat /etc/fstab`. (sigh)

So, yeah. Newer kernels won't mount the root (Z)FS whereas older ones just do it :)

NixOS unstable channel revision 7ae05ed is being used.