Discussion:
[systemd-devel] "bootctl install" on mdadm raid 1 fails
Bjørn Forsman
2017-12-08 22:33:01 UTC
Permalink
Hi all,

I assumed bootctl would be able to install onto an mdadm raid 1 array
(mirror). But this happens:

$ bootctl --path=/mnt/boot install
Failed to probe partition scheme "/mnt/boot": Input/output error

The raid array is created with --metadata=0.90 (superblock at the end
of device). systemd v234 was used for testing.
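(Roughly like this, with example device names:)

mdadm --create /dev/md127 --level=1 --metadata=0.90 \
    --raid-devices=2 /dev/sda1 /dev/sdb1
mkfs.vfat -F 32 /dev/md127
mount /dev/md127 /mnt/boot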

I see people online who have worked around this by setting up the ESP
(/boot) manually and finalizing the install with two calls to
efibootmgr. But I'm hoping for bootctl to handle this for me :-)

Any ideas?

Best regards,
Bjørn Forsman
Andrei Borzenkov
2017-12-09 05:56:31 UTC
Permalink
Post by Bjørn Forsman
Hi all,
I assumed bootctl would be able to install onto a mdadm raid 1 array
$ bootctl --path=/mnt/boot install
Failed to probe partition scheme "/mnt/boot": Input/output error
The raid array is created with --metadata=0.90 (superblock at the end
of device). systemd v234 was used for testing.
I see people online that have worked around this by setting up the ESP
(/boot) manually, and finalizing the install with 2x calls to
efibootmgr. But I'm hoping for bootctl to handle this for me :-)
Any ideas?
Firmware is unaware of MD RAID and each partition is individually and
independently writable by firmware. Pretending that you can mirror them
on OS level is simply wrong.

Also, each ESP requires its own boot menu entry to be usable.
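For example, roughly (sketch only; device names and the loader path
are just examples):

efibootmgr --create --disk /dev/sda --part 1 \
    --label "Boot entry (disk A)" \
    --loader '\EFI\systemd\systemd-bootx64.efi'
efibootmgr --create --disk /dev/sdb --part 1 \
    --label "Boot entry (disk B)" \
    --loader '\EFI\systemd\systemd-bootx64.efi'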

Having a common cross-bootloader standard for representing and dealing
with a redundant ESP would be a good thing, though. But mdadm is not
one of them :)
Bjørn Forsman
2017-12-10 12:16:33 UTC
Permalink
Post by Andrei Borzenkov
[...]
Firmware is unaware of MD RAID and each partition is individually and
independently writable by firmware.
1. "Firmware is unaware of MD RAID". I agree.
2. "... independently writable by firmware". I don't expect firmware
to _write_ to the ESP (does it?!). As long as it only reads, nothing
will get out of sync.
Post by Andrei Borzenkov
Pretending that you can mirror them
on OS level is simply wrong.
I think that statement is correct for all md raid setups _except_
read-only access to raid 1 with metadata 0.90 or 1.0 (superblock at
the end of the device). In that case, a filesystem written on the md
array lines up exactly with the underlying block device, so when the
system boots, the EFI firmware can read /dev/sda1 and see the very
same filesystem that the OS put on /dev/md127.
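A crude way to check that (example names from my setup; the byte
count is arbitrary) is to compare the start of a member partition
with the start of the assembled array:

cmp -n 1048576 /dev/sda1 /dev/md127 && echo "first MiB identical"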

Having the ESP on an mdadm raid 1 array really works. (I now run
such a setup myself.) But because of

$ bootctl --path=/mnt/boot install
Failed to probe partition scheme "/mnt/boot": Input/output error

which my OS installer runs, it takes jumping through a few hoops to
get it working.

The hoops are:

1. Install the OS with /dev/sda1 on /boot (no raid).
2. Set up /dev/md127 as raid 1 on /dev/sdb1 with the 2nd device missing.
(May have to copy the filesystem UUID from /dev/sda1 to /dev/md127.)
3. rsync the filesystem contents from /dev/sda1 to /dev/md127.
4. Repurpose /dev/sda1 as the missing device in the /dev/md127 array.
5. Use efibootmgr to create the 2nd boot entry, for /dev/sdb1 (roughly
as sketched below).
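
Roughly, in shell (untested sketch; /dev/sda1, /dev/sdb1 and
/dev/md127 as above; step 1 is done by the OS installer):

# step 2: degraded raid 1 on /dev/sdb1, superblock at the end (0.90 or 1.0)
mdadm --create /dev/md127 --level=1 --metadata=0.90 \
    --raid-devices=2 /dev/sdb1 missing
# mkfs.fat -i <id> can be used to reuse the old volume ID if needed
mkfs.vfat -F 32 /dev/md127

# step 3: copy the ESP contents over
mkdir -p /mnt/newboot
mount /dev/md127 /mnt/newboot
rsync -a /boot/ /mnt/newboot/

# step 4: repurpose /dev/sda1 as the missing member
umount /boot
mdadm --add /dev/md127 /dev/sda1

# step 5: boot entry for the 2nd disk
efibootmgr --create --disk /dev/sdb --part 1 \
    --label "Boot entry (disk B)" \
    --loader '\EFI\systemd\systemd-bootx64.efi'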

I think these steps could be simplified/eliminated if "bootctl"
learned about mdadm (level 1) arrays.

Best regards,
Bjørn Forsman
Michael Chapman
2017-12-14 11:17:46 UTC
Permalink
Post by Bjørn Forsman
Post by Andrei Borzenkov
[...]
Firmware is unaware of MD RAID and each partition is individually and
independently writable by firmware.
1. "Firmware is unaware of MD RAID". I agree.
2. "... independently writable by firmware". I don't expect firmware
to _write_ to the ESP (does it?!). As long as it only reads, nothing
will get out of sync.
It's perhaps unlikely for firmware itself to write to the ESP, but
certainly anything launched from the firmware can. One of my boot entries
is an EFI shell, and it can move, copy, read and write files within the
ESP.

I think it's probably wise to avoid software RAID for the ESP.
Lennart Poettering
2017-12-14 11:22:36 UTC
Permalink
Post by Bjørn Forsman
Post by Andrei Borzenkov
[...]
Firmware is unaware of MD RAID and each partition is individually and
independently writable by firmware.
1. "Firmware is unaware of MD RAID". I agree.
2. "... independently writable by firmware". I don't expect firmware
to _write_ to the ESP (does it?!). As long as it only reads, nothing
will get out of sync.
It's perhaps unlikely for firmware itself to write to the ESP, but certainly
anything launched from the firmware can. One of my boot entries is an EFI
shell, and it can move, copy, read and write files within the ESP.
I think it's probably wise to avoid software RAID for the ESP.
I think so too. There has been work to teach sd-boot "boot attempt
counting", to make Chrome-OS-like automatic upgrading available, with
a safe fallback when the system continuously fails to boot. That too
would store the counts in the file system.

Lennart
--
Lennart Poettering, Red Hat
Bjørn Forsman
2017-12-23 16:46:38 UTC
Permalink
Post by Lennart Poettering
It's perhaps unlikely for firmware itself to write to the ESP, but certainly
anything launched from the firmware can. One of my boot entries is an EFI
shell, and it can move, copy, read and write files within the ESP.
I think it's probably wise to avoid software RAID for the ESP.
I think so too. There has been work to teach sd-boot "boot attempt
counting", to make Chrome-OS-like automatic upgrading available, with
a safe fallback when the system continuously fails to boot. That too
would store the counts in the file system.
Ok, there are things to look out for, but I don't think it's an
unreasonable setup. I want protection against a disk crash, and HW
raid is not available. What better option is there? (I've never had
firmware write to my boot disk / ESP, at least to my knowledge, so I
consider the risk of firmware messing up the SW raid to be very
small.)

Would bootctl patches be considered for inclusion?

Best regards,
Bjørn Forsman
Lennart Poettering
2018-01-23 19:22:05 UTC
Permalink
Post by Bjørn Forsman
Post by Lennart Poettering
It's perhaps unlikely for firmware itself to write to the ESP, but certainly
anything launched from the firmware can. One of my boot entries is an EFI
shell, and it can move, copy, read and write files within the ESP.
I think it's probably wise to avoid software RAID for the ESP.
I think so too. There has been work to teach sd-boot "boot attempt
counting", to make Chrome-OS-like automatic upgrading available, with
a safe fallback when the system continuously fails to boot. That too
would store the counts in the file system.
Ok, there are things to look out for, but I don't think it's an
unreasonable setup. I want protection against a disk crash, and HW
raid is not available. What better option is there? (I've never had
firmware write to my boot disk / ESP, at least to my knowledge, so I
consider the risk of firmware messing up the SW raid to be very
small.)
Would bootctl patches be considered for inclusion?
Doing what precisely?

Lennart
--
Lennart Poettering, Red Hat
Bjørn Forsman
2018-01-23 20:19:06 UTC
Permalink
Hi Lennart,
Post by Lennart Poettering
Post by Bjørn Forsman
Would bootctl patches be considered for inclusion?
Doing what precisely?
Whatever is needed to support "bootctl install" on mdadm raid 1. What
that is, precisely, I don't know yet.

The idea of an ESP on mdadm raid 1 was not received well in this
thread, so I figured I'd check whether there is any chance of such a
feature being merged upstream before I spend any time trying to
implement it.

Best regards,
Bjørn Forsman
Lennart Poettering
2018-01-23 20:30:14 UTC
Permalink
Post by Bjørn Forsman
Hi Lennart,
Post by Lennart Poettering
Post by Bjørn Forsman
Would bootctl patches be considered for inclusion?
Doing what precisely?
Whatever is needed to support "bootctl install" on mdadm raid 1. What
that is, precisely, I don't know yet.
The idea of an ESP on mdadm raid 1 was not received well in this
thread, so I figured I'd check whether there is any chance of such a
feature being merged upstream before I spend any time trying to
implement it.
Well, it all depends on what exactly you want to add there and
whether it's small and isolated enough.

Lennart
--
Lennart Poettering, Red Hat
Lennart Poettering
2017-12-11 22:59:11 UTC
Permalink
Post by Bjørn Forsman
Hi all,
I assumed bootctl would be able to install onto a mdadm raid 1 array
$ bootctl --path=/mnt/boot install
Failed to probe partition scheme "/mnt/boot": Input/output error
The raid array is created with --metadata=0.90 (superblock at the end
of device). systemd v234 was used for testing.
I see people online that have worked around this by setting up the ESP
(/boot) manually, and finalizing the install with 2x calls to
efibootmgr. But I'm hoping for bootctl to handle this for me :-)
Any ideas?
Hmm, we simply use libblkid on the block device, and validate that
everything is in order, i.e. has a GPT disk label, and all the right
UUIDs and so on. It's very simple code. If that doesn't work, then
either your setup is borked or most likely the bug is in libblkid.

We ultimately don't care much what the backing block device really is,
as long as it exposes a GPT partition table and the kernel exposes
proper per-partition block devices.

You can check if blkid works properly by running:

# blkid -p /dev/sda1
/dev/sda1: LABEL="SYSTEM" UUID="1234-5678" VERSION="FAT32" TYPE="vfat"
USAGE="filesystem" PART_ENTRY_SCHEME="gpt" PART_ENTRY_NAME="EFI System
Partition" PART_ENTRY_UUID="12345678-1234-1234-1234-123456789abc"
PART_ENTRY_TYPE="c12a7328-f81f-11d2-ba4b-00a0c93ec93b"
PART_ENTRY_FLAGS="0x1" PART_ENTRY_NUMBER="1" PART_ENTRY_OFFSET="2048"
PART_ENTRY_SIZE="532480" PART_ENTRY_DISK="8:0"

We need at least the fields PART_ENTRY_TYPE=, PART_ENTRY_SIZE=,
PART_ENTRY_OFFSET=, PART_ENTRY_NUMBER=, PART_ENTRY_UUID=,
PART_ENTRY_SCHEME= and TYPE= of these. If they are missing, then
either your setup is bad, or blkid is confused.
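(A quick way to look at just those fields, as a sketch:)

blkid -p -o export /dev/sda1 | \
    grep -E '^(TYPE|PART_ENTRY_(SCHEME|UUID|TYPE|NUMBER|OFFSET|SIZE))='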

That all said, unless mdadm operates with exactly zero header and
footer on disk I doubt this will ever work and be compatible with
EFI. But then again, I have no clue about mdadm...

Good luck,

Lennart
--
Lennart Poettering, Red Hat
D.S. Ljungmark
2017-12-14 10:44:13 UTC
Permalink
I seem to have the same problem, and here's the output:

[***@spring ~]# blkid -p /dev/sda1
/dev/sda1:
UUID="01c0c70f-9204-8e4e-f2a7-aa8ec14c4a41"
UUID_SUB="2a820238-597c-bfd4-aa2e-19425f7e8fa4"
LABEL="spring.skuggor.se:0" VERSION="1.0"
TYPE="linux_raid_member" USAGE="raid"
PART_ENTRY_SCHEME="gpt"
PART_ENTRY_NAME="EFI System Partition"
PART_ENTRY_UUID="98b45cd9-11f8-402a-8c89-a4e833581446"
PART_ENTRY_TYPE="c12a7328-f81f-11d2-ba4b-00a0c93ec93b"
PART_ENTRY_NUMBER="1"
PART_ENTRY_OFFSET="2048"
PART_ENTRY_SIZE="409600"
PART_ENTRY_DISK="8:0"


The machine has 6 disks, with raid1 on /boot in mdadm format 1.0
(metadata at the end of the device), which allows each partition to
be mounted (read-only) as itself.
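That is, something like this works against any single member (mount
point is just an example):

mkdir -p /mnt/esp
mount -o ro /dev/sda1 /mnt/esp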



On Mon, Dec 11, 2017 at 11:59 PM, Lennart Poettering
Post by Lennart Poettering
Post by Bjørn Forsman
Hi all,
I assumed bootctl would be able to install onto a mdadm raid 1 array
$ bootctl --path=/mnt/boot install
Failed to probe partition scheme "/mnt/boot": Input/output error
The raid array is created with --metadata=0.90 (superblock at the end
of device). systemd v234 was used for testing.
I see people online that have worked around this by setting up the ESP
(/boot) manually, and finalizing the install with 2x calls to
efibootmgr. But I'm hoping for bootctl to handle this for me :-)
Any ideas?
Hmm, we simply use libblkid on the block device, and validate that
everything is in order, i.e. has a GPT disk label, and all the right
UUIDs and so on. It's very simple code. If that doesn't work, then
either your setup is borked or most likely the bug is in libblkid.
We ultimately don't care much what the backing block device really is,
as long as it exposes a GPT partition table and the kernel exposes
proper per-partition block devices.
# blkid -p /dev/sda1
/dev/sda1: LABEL="SYSTEM" UUID="1234-5678" VERSION="FAT32" TYPE="vfat"
USAGE="filesystem" PART_ENTRY_SCHEME="gpt" PART_ENTRY_NAME="EFI System
Partition" PART_ENTRY_UUID="12345678-1234-1234-1234-123456789abc"
PART_ENTRY_TYPE="c12a7328-f81f-11d2-ba4b-00a0c93ec93b"
PART_ENTRY_FLAGS="0x1" PART_ENTRY_NUMBER="1" PART_ENTRY_OFFSET="2048"
PART_ENTRY_SIZE="532480" PART_ENTRY_DISK="8:0"
We need at least the fields PART_ENTRY_TYPE=, PART_ENTRY_SIZE=,
PART_ENTRY_OFFSET=, PART_ENTRY_NUMBER=, PART_ENTRY_UUID=,
PART_ENTRY_SCHEME= and TYPE= of these. If they are missing, then
either your setup is bad, or blkid is confused.
That all said, unless mdadm operates with exactly zero header and
footer on disk I doubt this will ever work and be compatible with
EFI. But then again, I have no clue about mdadm...
Good luck,
Lennart
--
Lennart Poettering, Red Hat
_______________________________________________
systemd-devel mailing list
https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
8362 CB14 98AD 11EF CEB6 FA81 FCC3 7674 449E 3CFC
Lennart Poettering
2017-12-14 10:53:50 UTC
Permalink
Well, I presume you are mounting /mnt/boot through the MD layer,
right? Hence you need to check /dev/md0p1 or something like that
instead. /dev/sda1 is the name of the partition without MD...

Lennart
--
Lennart Poettering, Red Hat
D.S. Ljungmark
2017-12-14 10:55:41 UTC
Permalink
Right, and _that_ causes a far less interesting output from blkid.

/dev/md0: UUID="1492-1FE7" VERSION="FAT32" TYPE="vfat" USAGE="filesystem"

On Thu, Dec 14, 2017 at 11:53 AM, Lennart Poettering
Post by Lennart Poettering
Well, I presume you are mounting /mnt/boot through the MD layer,
right? Hence you need to check /dev/md0p1 or something like that
instead. /dev/sda1 is the name of the partition without MD...
Lennart
--
Lennart Poettering, Red Hat
--
8362 CB14 98AD 11EF CEB6 FA81 FCC3 7674 449E 3CFC
Lennart Poettering
2017-12-14 11:11:29 UTC
Permalink
Post by D.S. Ljungmark
Right, and _that_ causes a far less interesting output from blkid.
/dev/md0: UUID="1492-1FE7" VERSION="FAT32" TYPE="vfat"
USAGE="filesystem"
Uh, why does it output "/dev/md0" as the device for this? Did you
invoke it for /dev/md0p1 or for /dev/md0?

Lennart
--
Lennart Poettering, Red Hat
Andrei Borzenkov
2017-12-15 13:26:24 UTC
Permalink
On Thu, Dec 14, 2017 at 2:11 PM, Lennart Poettering
Post by Lennart Poettering
Post by D.S. Ljungmark
Right, and _that_ causes a far less interesting output from blkid.
/dev/md0: UUID="1492-1FE7" VERSION="FAT32" TYPE="vfat"
USAGE="filesystem"
uh, why does it output "/dev/md0" as device for this? did you invoke
it for /dev/md0p1 or for /dev/md0?
This thread is about pure software MD RAID over partitions.
Partitioned MD RAID is usually associated with a firmware array and
is likely handled correctly by the firmware as a single logical disk,
so for practical purposes it should be fine.
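To illustrate the difference (example commands using native metadata;
an actual firmware array would use an IMSM or DDF container instead):

# MD RAID over partitions: the members are partitions and the array
# itself carries the FAT filesystem (no partition table on /dev/md0).
mdadm --create /dev/md0 --level=1 --metadata=1.0 \
    --raid-devices=2 /dev/sda1 /dev/sdb1

# Partitioned MD RAID: the members are whole disks, the array carries
# the GPT, and the ESP shows up as /dev/md0p1.
mdadm --create /dev/md0 --level=1 --metadata=1.0 \
    --raid-devices=2 /dev/sda /dev/sdb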