Fail to deploy with Bcache when using cache multiple times

Bug #1514094 reported by Mark Shuttleworth
This bug affects 1 person
Affects   Status         Importance   Assigned to   Milestone
MAAS      Invalid        Critical     Unassigned
curtin    Fix Released   Critical     Unassigned

Bug Description

Hello, I wanted to try out some of the new bcache love and got this error:

UUID: 0819d505-f01a-4866-8b4c-c406dfa30693
Set UUID: b080425e-7091-4531-a04d-245f8b9c1753
version: 0
nbuckets: 488390
block_size: 1
bucket_size: 1024
nr_in_set: 1
nr_this_dev: 0
first_bucket: 1
UUID: 311421db-a1f0-4c51-b3eb-082843c4c185
Set UUID: b080425e-7091-4531-a04d-245f8b9c1753
version: 1
block_size: 1
data_offset: 16
Error: /dev/bcache0: unrecognised disk label
Error: /dev/bcache0: unrecognised disk label
Can't open dev /dev/sdd1: Device or resource busy
An error occured handling 'bcache1': ProcessExecutionError - Unexpected error while running command.
Command: ['make-bcache', '-B', '/dev/sda2', '-C', '/dev/sdd1']
Exit code: 1
Reason: -
Stdout: ''
Stderr: ''
Unexpected error while running command.
Command: ['make-bcache', '-B', '/dev/sda2', '-C', '/dev/sdd1']
Exit code: 1
Reason: -
Stdout: ''
Stderr: ''
Stderr: ''

I was trying to do a /boot partition as sda1 and then have / as a 1TB bcache0 made of sda2, with bcache1 made of 2TB sda3, bcache2 out of 3TB sdb, and bcache3 out of 3TB sdc.

Changed in maas:
importance: Undecided → Critical
milestone: none → 1.9.0
Revision history for this message
Blake Rouse (blake-rouse) wrote :

The issue here is that curtin does not handle bcache using the same cache device for multiple backing devices.

The command "make-bcache -B /dev/sda2 -C /dev/sdd1" failed because "/dev/sdd1" was already made a cache device. Curtin should check to see if "/dev/sdd1" was already made a cache device in a previous call. If so then it should not add the "-C /dev/sdd1" to the command.

Then curtin needs to take the created bcache device and attach it to the already created bcache cache set for that device.
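
A minimal sketch of the behaviour described above, in Python. The function name plan_bcache_commands and its input shape are hypothetical and are not curtin's actual code. It only builds the command lists: the cache device is passed with -C the first time it is seen, and later backing devices are created on their own and flagged for attachment.

```python
def plan_bcache_commands(pairs):
    """Build make-bcache invocations for (backing, cache) device pairs.

    Hypothetical helper: the cache device is formatted only the first
    time it is seen; later backing devices are created on their own and
    flagged for attachment to the already existing cache set.
    """
    formatted_caches = set()
    plan = []
    for backing, cache in pairs:
        if cache not in formatted_caches:
            # First use of this cache device: create both in one call.
            cmd = ['make-bcache', '-B', backing, '-C', cache]
            formatted_caches.add(cache)
            needs_attach = False
        else:
            # Cache device already in use: create the backing device
            # only and attach it to the existing cache set afterwards.
            cmd = ['make-bcache', '-B', backing]
            needs_attach = True
        plan.append({'cmd': cmd, 'cache': cache,
                     'needs_attach': needs_attach})
    return plan
```

For entries with needs_attach set, the cache set UUID (the "Set UUID" shown in the output in the bug description) would then be written to the new device's /sys/block/bcacheN/bcache/attach file.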

Changed in maas:
status: New → Incomplete
status: Incomplete → Opinion
status: Opinion → Triaged
Changed in curtin:
status: New → Triaged
importance: Undecided → Critical
Revision history for this message
Mark Shuttleworth (sabdfl) wrote : Re: [Bug 1514094] Re: bcache setup fails in gmaas

Right - to be clear using "one or two bcache devices to cache multiple
disks" is our standard recommended approach, so yes Curtin should
support this perfectly.

Mark

Revision history for this message
Ryan Harper (raharper) wrote : Re: bcache setup fails in gmaas

Linked to MR with fix.

Changed in curtin:
status: Triaged → In Progress
Changed in maas:
status: Triaged → Invalid
summary: - bcache setup fails in gmaas
+ Fail to deploy with Bcache when using cache multiple times
Revision history for this message
Scott Moser (smoser) wrote :

fix-committed in 304. 309 or later should be good.

Changed in curtin:
status: In Progress → Fix Committed
Revision history for this message
Blake Rouse (blake-rouse) wrote :

I am using bzr314 and I still have the same issue where it will not deploy.

Here is the storage configuration:
http://paste.ubuntu.com/13480912/

Here is the installation output:
http://paste.ubuntu.com/13480985/

Changed in curtin:
status: Fix Committed → New
Revision history for this message
Blake Rouse (blake-rouse) wrote :

I think the issue here is that curtin uses the id field as an identifier for the created object without actually checking the created object's name. The "id" in the preseed should just refer to that operation and not to the actual object. Just because the "id" is "bcache1" doesn't mean the created bcache path is "bcache1"; it might be "bcache0". In that regard the "id" could just as well be "mydisk".

Curtin should keep an internal map of "id" to created device on the machine.

1. Perform "make-bcache".
2. Find created bcache name. (ex. /dev/bcache0)
3. Update internal map with [id] = "/dev/bcache0"

Now other operations can reference "id" and curtin will use "/dev/bcache0". Right now it looks like curtin assumes "/dev/{id}", which is very wrong.
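
The three steps above could be sketched roughly as follows. This is an illustration only: DeviceMap and find_bcache_device are hypothetical names, and the lookup assumes the kernel lists the bcacheN device under the backing device's "holders" directory in sysfs. The sysfs root is a parameter so the lookup can be exercised without real hardware.

```python
import os


def find_bcache_device(backing_name, sysfs_root='/sys/class/block'):
    """Return the /dev path of the bcache device holding a backing device.

    Assumption: the bcacheN device shows up under the backing device's
    "holders" directory in sysfs.
    """
    holders_dir = os.path.join(sysfs_root, backing_name, 'holders')
    holders = sorted(os.listdir(holders_dir))
    if len(holders) != 1:
        raise ValueError('Expected exactly one holder, got: %s' % holders)
    return '/dev/' + holders[0]


class DeviceMap(object):
    """Hypothetical map from storage-config "id" to the created device."""

    def __init__(self):
        self._paths = {}

    def record(self, config_id, dev_path):
        self._paths[config_id] = dev_path

    def resolve(self, config_id):
        # Never fall back to "/dev/{id}": an unknown id raises KeyError.
        return self._paths[config_id]
```

After make-bcache runs, the caller would record the result, for example record('bcache1', find_bcache_device('sda2')), and every later operation would go through resolve() instead of formatting the id into a path.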

Revision history for this message
Mark Shuttleworth (sabdfl) wrote : Re: [Bug 1514094] Re: Fail to deploy with Bcache when using cache multiple times

The use of an id which is close, but not the same as, a device name is
just begging for confusion on all fronts.

So, either we need to specify the actual device identity ("bcache1") or
we need to use something totally different in order to avoid confusion.

Anyhow, guys, this suggests that curtin & maas contributors are not
working closely enough together to spot and catch problems; this was
marked fixed but clearly never tested in anger.

Mark

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Taking over from Scott/Ryan as we are less "thankfully giving" these days in
Europe.

Looking at the test case we had for bcache, the description blake_r gave
us, and some "mean tester thinking", I ended up adding:
- non-ordered ids 1, 0 (based on blake_r)
- caching with a partition-less device (based on blake_r)
- backing a partition-less device
- use of all the different cache modes
I also thought about stacking, but upstream discourages it, so I
skipped these for now:
- backing with a bcache device (stacking)
- caching with a bcache device (stacking)

We might also add this later on:
- putting / on a bcached device (based on blake_r)

Recreating the issue worked. Commented serial install log:
http://paste.ubuntu.com/13495607/
(look for CE: for my comments)

With the debugging commented in the log, the fix looked rather simple
(as always when you know what's going on):
- '/sys/block/{}/bcache/cache_mode'.format(info.get('id'))
+ '/sys/block/{}/bcache/cache_mode'.format(bcache_dev)
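
To illustrate the one-line change above, here is a small sketch that derives the sysfs path from the real device rather than from the config id. The helper names and the sysfs_root parameter are hypothetical and are not taken from curtin; the listed cache modes are the ones bcache accepts.

```python
import os

VALID_CACHE_MODES = ('writethrough', 'writeback', 'writearound', 'none')


def cache_mode_path(bcache_dev, sysfs_root='/sys/block'):
    """Build the cache_mode path from the real device, e.g. /dev/bcache0."""
    return os.path.join(sysfs_root, os.path.basename(bcache_dev),
                        'bcache', 'cache_mode')


def set_cache_mode(bcache_dev, mode, sysfs_root='/sys/block'):
    """Write the requested cache mode for an existing bcache device."""
    if mode not in VALID_CACHE_MODES:
        raise ValueError('Unknown cache mode: %s' % mode)
    with open(cache_mode_path(bcache_dev, sysfs_root), 'w') as fp:
        fp.write(mode)
```

With an id of "bcache1" but a created device of /dev/bcache0, cache_mode_path('/dev/bcache0') points at the bcache0 entry, which is the one that actually exists.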

The enhanced test I defined fully ran through installation with this fix.
I was also able to adapt all the assertions of the test to cover the new
definitions properly - and they work even in this more complex setup.

Later I also added the "/ on bcache" to further enhance the testcase, but
this is failing on grub install now.
That means this is not final yet, but at some point I have to sleep :-)

MAAS team, if you can, feel free to help me test using this branch
=> lp:~paelzer/curtin/fix-1514094-v1
If you stop at:
"Command: ['install-grub', '/tmp/tmp8mjp2u0k/target', '/dev/bcache2']"
this is what I'm debugging atm anyway - anything else please let me know.

If I hear nothing I'll let you all know once I'm happy with the extended
testing.

Revision history for this message
Blake Rouse (blake-rouse) wrote :

Christian,

I will test your branch and provide an update. As for the grub install, grub should not be installed to the bcache device, it should be installed to the backing device. The backing device also requires that a partition table be on that backing device. All of this is handled in MAAS.

1. If you select the boot disk as the backing device MAAS will automatically create the partition table and partition and use the partition as the backing device. (This is because grub-install should be performed on /dev/sda instead of /dev/bcache0, and /dev/sda requires a partition table.)
2. MAAS will not allow you to deploy a node that has a bcache as the "/" unless you have a "/boot" on a non-bcache device. This is because grub cannot load the kernel from a bcache backing device.

This is all handled in MAAS, and I don't think you need to handle the case for "grub-install /dev/bcache2". That just seems wrong, as it should be performed on the backing device (e.g. /dev/sda).
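
The second rule above can be expressed as a small check. This is a hypothetical sketch and not MAAS code: check_boot_layout and its input (a dict of mount point to device path) are assumptions made for illustration, and a device is treated as bcache purely by its name.

```python
def check_boot_layout(mounts):
    """Validate a mapping of mount point -> device path.

    Returns a list of problems; the list is empty when "/" on bcache is
    accompanied by a "/boot" on a non-bcache device. Hypothetical helper.
    """
    def on_bcache(dev):
        return dev.split('/')[-1].startswith('bcache')

    problems = []
    root_dev = mounts.get('/')
    boot_dev = mounts.get('/boot')
    if root_dev and on_bcache(root_dev):
        if boot_dev is None:
            problems.append('"/" is on bcache but there is no separate "/boot"')
        elif on_bcache(boot_dev):
            problems.append('"/boot" must be on a non-bcache device')
    return problems
```

The layout from the bug description, with "/boot" on /dev/sda1 and "/" on /dev/bcache0, passes this check.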

Revision history for this message
Blake Rouse (blake-rouse) wrote :

Christian,

I tested your branch and got the same error.

http://paste.ubuntu.com/13498705/

Here is the config:

http://paste.ubuntu.com/13498710/

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

So while a test that formerly triggered the bug was fixed for me, for you it essentially turned from:
  An error occured handling 'bcache1': IOError - [Errno 2] No such file or directory: '/sys/block/bcache1/bcache/cache_mode'
into
  An error occured handling 'bcache1': IOError - [Errno 22] Invalid argument

As Ryan suggested, it would be great if you could add "-vvv" when running in debug mode.
It will help identify what exactly failed, for example by reporting the actual failing command.
*just discussed with Blake on IRC - will be possible*

I made the area around the issue a bit more verbose in my latest push to lp:~paelzer/curtin/fix-1514094-v1 which will trigger enabling -vvv.

Revision history for this message
Blake Rouse (blake-rouse) wrote :

I have updated to your latest branch and performed the installation with verbose output, using the same curtin config from my previous comment. It still failed; I don't know whether that was expected or not.

Output: http://paste.ubuntu.com/13502877/

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi Blake,
as discussed on IRC it was meant to fail to provide more insight via better debugging info.
I think I found the root cause of your last reported bug. It only triggered on older kernel bcache implementations which was why it was harder to spot at first.

I tested the new code on vivid/wily.
I was also able to extend the vmtest to trusty (still shaky). I might hit unrelated issues with that.
(extending coverage to various release/hwe combinations is an ongoing effort of the last weeks, but often has to be postponed for other more urgent things).

Therefore I need you for the MAAS/Trusty environment that you are running in.
Starting with version 325 in branch lp:~paelzer/curtin/fix-1514094-v1 the fix is in.

The former issue you had should be fixed, please test it and let me know the results in your environment.
If you run into any of these issues, that is "good", as those are the ones I can currently recreate:
 * bcache: __cached_dev_store() Can't attach a49890b1-2103-4e2e-a47d-cc56558e8f7d: cache set not found
 * Stderr: u'Already a bcache device on /dev/vdb6, overwrite with --wipe-bcache\n'
 * An error occured handling 'bcache_normal': ValueError - Invalid number of holding devices: []

Revision history for this message
Blake Rouse (blake-rouse) wrote :

Tested the new code. As expected it still failed. Output is below:

http://paste.ubuntu.com/13505467/

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thanks Blake for the test in the real MAAS environment.

As mentioned yesterday I was able to reproduce those and was working on them this morning.
- I found an issue which didn't show up before, as part of it is based on a race with older levels of the userspace toolchain.
- There was an issue, again only with older userspace toolchains, which only triggered if the backing device was set up when the caching device was not yet fully initialized.
- I also found a potential issue where a YAML config changes the cache mode on a cache device (cache modes are set per backing device), which would have caused an uninformative "IOError". This now fails more gracefully with a proper message.
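
The actual changes are in the branch and are not shown in this report. As a generic illustration of guarding against this kind of race, one could wait for the cache set to appear in sysfs before touching the backing device. The helper below is a hypothetical sketch, and the choice of path to wait for (for example /sys/fs/bcache/<cset-uuid>) is an assumption.

```python
import os
import time


def wait_for_path(path, timeout=10.0, interval=0.1):
    """Poll until path exists; return True on success, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would only write the cache set UUID to the backing device's attach file once wait_for_path() returned True, and report a clear error otherwise.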

All those issues were fixed in my branch (lp:~paelzer/curtin/fix-1514094-v1) as of revision 328, and along the way I further extended the debug output of these code paths.

Since it has shown races, I ran my tests 18 times over lunch and all worked.
It would be great if you could put it to the test in your environment once more to see if there is more to do.

Revision history for this message
Mark Shuttleworth (sabdfl) wrote :

On 26/11/15 12:06, ChristianEhrhardt wrote:
> - I found an issues which didn't show up before as part of it is based on a race with older levels of the userspace toolchain.
> - There was an issue - again only with older userspace toolchains - which only triggered if the backing device was set up when the caching device was not yet fully initialized

Does it make sense to backport a clean and reliable version of these
userspace tools to 14.04?

Mark

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

> Does it make sense to backport a clean and reliable version of these
> userspace tools to 14.04?
>
> Mark

Hi Mark,
I thought about it, and I guess it would be reasonable if it were just the bcache userspace tools.
But the bcache documentation states "... If util-linux's libblkid is sufficiently recent (2.24) ...".
And we have only 2.20.1 in trusty (.25/.26/.27 in V/W/X).

So even at first glance it appears to involve too many dependencies for an SRU "just for that".
And obviously there could be even more dependencies.

Revision history for this message
Mark Shuttleworth (sabdfl) wrote :

I think it's worth looking into.

Bcache is a key piece of infrastructure we want everyone to use. It is
easier for us to support it if it is consistent everywhere. I suspect
the needed updates to things like libblkid are provably backwards
compatible, and supported already, so worth doing.

Mark

Changed in curtin:
status: New → Fix Committed
Revision history for this message
Zoltan Arnold Nagy (zoltan) wrote :

Should I see this fix in 1.9.0? A new deployment is failing with "modprobe bcache" errors...

Changed in maas:
milestone: 1.9.0 → none
Revision history for this message
Scott Moser (smoser) wrote : Fixed in Curtin 17.1

This bug is believed to be fixed in curtin in 17.1. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

Changed in curtin:
status: Fix Committed → Fix Released