When MAAS Boot Images are Superseded, Disk Space is not Reclaimed

Bug #1459876 reported by Andres Rodriguez
This bug affects 2 people
Affects: MAAS
Status: Fix Released
Importance: Critical
Assigned to: Mike Pontillo

Bug Description

1G ./var/lib/postgresql/9.3/main/pg_multixact/members
1G ./var/lib/postgresql/9.3/main/pg_multixact/offsets
1G ./var/lib/postgresql/9.3/main/pg_multixact
1G ./var/lib/postgresql/9.3/main/pg_xlog/archive_status
1G ./var/lib/postgresql/9.3/main/pg_xlog
1G ./var/lib/postgresql/9.3/main/pg_stat_tmp
1G ./var/lib/postgresql/9.3/main/pg_tblspc
1G ./var/lib/postgresql/9.3/main/pg_serial
1G ./var/lib/postgresql/9.3/main/pg_clog
1G ./var/lib/postgresql/9.3/main/base/12061
56G ./var/lib/postgresql/9.3/main/base/16385
1G ./var/lib/postgresql/9.3/main/base/1
1G ./var/lib/postgresql/9.3/main/base/12066
56G ./var/lib/postgresql/9.3/main/base
1G ./var/lib/postgresql/9.3/main/global
1G ./var/lib/postgresql/9.3/main/pg_stat
1G ./var/lib/postgresql/9.3/main/pg_twophase
1G ./var/lib/postgresql/9.3/main/pg_notify
1G ./var/lib/postgresql/9.3/main/pg_snapshots
1G ./var/lib/postgresql/9.3/main/pg_subtrans
56G ./var/lib/postgresql/9.3/main
56G ./var/lib/postgresql/9.3
56G ./var/lib/postgresql


Changed in maas:
milestone: none → 1.8.0
importance: Undecided → Critical
Andres Rodriguez (andreserl) wrote :

maasdb=# SELECT pg_database_size('maasdb');
 pg_database_size
------------------
      59554113720
(1 row)

Andres Rodriguez (andreserl) wrote :

maasdb=# SELECT pg_size_pretty(pg_database_size('maasdb'));
 pg_size_pretty
----------------
 55 GB
(1 row)

Changed in maas:
status: New → Incomplete
Mike Pontillo (mpontillo) wrote :

Marking this incomplete because I don't think we have enough information to diagnose the problem in this particular case.

From my investigation, this seems like a postgresql issue rather than a MAAS issue. On my system, postgresql was using ~2.5GB of space. But if I run "maas-region-admin dbshell --installed" and then use the "vacuum full verbose;" postgresql command, the disk usage drops to a few hundred megabytes.

If you haven't vacuumed this database yet, can you run "vacuum full verbose;" at the SQL prompt and attach the output to this bug?

It would also be useful to get the output of the following SQL commands:

select sum(total_size) from maasserver_largefile;
select bsf.created, filename, total_size from maasserver_bootresourcefile bsf, maasserver_largefile lf where bsf.largefile_id=lf.id;

Recommendation: install a cron job to periodically vacuum the postgresql database.

Mike Pontillo (mpontillo) wrote :

Looking into this a bit more, it seems that postgresql has an "autovacuum" process which is supposed to handle this. Unfortunately, it doesn't appear to do "full" vacuums.

The need for "full" vacuums seems to be a side effect of storing huge files (frequently updated, in the case of daily images?) in the database.

It would also be good to find out how long a "vacuum full" takes; we might need to schedule a maintenance period for doing this if the time is significant.

If this becomes an issue, I think it might be worth revisiting the design decision to store these large files in the database. 'rsync' might do the job just as well. ;-)

Mike Pontillo (mpontillo) wrote :

See also:

http://www.postgresql.org/docs/9.3/static/sql-vacuum.html

I have not tested this, but we *should* be able to run "vacuum full verbose maasserver_largefile" to do the full vacuum without locking the entire database (though it would still lock the maasserver_largefile table). Please try this and post the output if you have not already; I want to see if it helps in your case.

Mike Pontillo (mpontillo) wrote :

I was thinking it might be possible to hack as follows:

http://paste.ubuntu.com/11426475/

But, as was pointed out earlier:

(1) The "VACUUM FULL" command could take a long time
(2) It isn't clear whether "VACUUM FULL" can be used within this context. (We would also need to hack it to get ourselves out of the transaction [1].)
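
On point (2): PostgreSQL refuses to run VACUUM inside a transaction block, so one possible way out is to switch the connection to autocommit for the duration of the command. The following is only a minimal sketch, assuming a psycopg2-style connection (a writable `autocommit` attribute and a `cursor()` usable as a context manager). The helper name `vacuum_full_table` is hypothetical and this is not code that MAAS ships.

```python
import re

# Hypothetical helper, not MAAS code. Assumes a psycopg2-style connection:
# a writable `autocommit` attribute and `cursor()` usable as a context manager.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def vacuum_full_table(conn, table):
    """Run VACUUM FULL VERBOSE on one table outside a transaction block."""
    # VACUUM does not accept bind parameters, so validate the table name
    # before interpolating it into the statement.
    if not _IDENTIFIER.fullmatch(table):
        raise ValueError("unexpected table name: %r" % (table,))
    previous = conn.autocommit
    # With autocommit enabled the driver does not open an implicit
    # transaction, which is what VACUUM requires.
    conn.autocommit = True
    try:
        with conn.cursor() as cursor:
            cursor.execute("VACUUM FULL VERBOSE %s" % table)
    finally:
        # Restore whatever mode the caller was using.
        conn.autocommit = previous
```

As far as I know, psycopg2 only allows `autocommit` to be changed while no transaction is in progress, so the connection would need to be idle when this is called.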

As an alternative, we could shell out to run a CLI command such as (tested while running as the 'maas' user):

$ vacuumdb -U maas -d maasdb -t maasserver_largefile --full -v
INFO: vacuuming "public.maasserver_largefile"
INFO: "maasserver_largefile": found 0 removable, 5 nonremovable row versions in 1 pages
DETAIL: 0 dead row versions cannot be removed yet.
CPU 0.00s/0.00u sec elapsed 0.00 sec.

[1]: https://stackoverflow.com/questions/1017463/postgresql-how-to-run-vacuum-from-code-outside-transaction-block
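
The shell-out alternative could be wrapped roughly as follows. This is only a sketch: `build_vacuumdb_command` and `run_vacuumdb` are hypothetical names, the user, database, and table defaults are copied from the command shown above, and only the flags already used there appear here.

```python
import subprocess


def build_vacuumdb_command(user="maas", database="maasdb",
                           table="maasserver_largefile", full=True):
    """Return the argv list for the vacuumdb invocation shown above."""
    command = ["vacuumdb", "-U", user, "-d", database, "-t", table]
    if full:
        command.append("--full")
    command.append("-v")
    return command


def run_vacuumdb(**kwargs):
    """Run vacuumdb and return its combined output; raises on failure."""
    # Passing a list (not a shell string) avoids shell quoting problems.
    return subprocess.check_output(
        build_vacuumdb_command(**kwargs), stderr=subprocess.STDOUT)
```

Keeping the argument list in its own function makes it possible to check the command without a running PostgreSQL server.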

Changed in maas:
status: Incomplete → Triaged
Gavin Panella (allenap) wrote : Re: [Bug 1459876] Re: MAAS image syncing does not seem to be deleting old images from DB

> If this becomes an issue, I think it might be worth revisiting the
> design decision to store these large files in the database. 'rsync'
> might do the job just as well. ;-)

Just on this point: if we were running MAAS in a more HA-like
configuration, where maas-regiond runs on multiple machines, then we'd
have the problem of where to put these large files. We could do
something like:

  1. A maas-regiond process wakes up.

  2. If a sync has not been done in more than N minutes, it grabs an
  i-am-syncing lock.

  3. If that fails, sleep for a while and try again.

  4. Use rsync to update the local copy of all image files.

  At this point there's an old set of images and a new set. They'll
  probably intersect to a large extent. However, clusters may be syncing
  the old set at this time so we must not delete any files from the old
  set yet.

  5. Use rsync to push the local image files to all of the neighbouring
  machines running maas-regiond.

  6. Update the database so that clusters can see the new image files.

  7. Wait until no clusters are syncing the old set of image files.

  8. Delete the old images that are not part of the new set.

That's quite complex. Gracefully handling failure conditions would make
it more so. For example, what to do when adding a new machine on which
you're going to run maas-regiond, or when re-adding a machine that has
been out of service for a while.

IIRC, the above is roughly the thought process that brought us to using
the database originally. If we can't resolve the usage issues within the
database, and we're forced to consider alternatives, a shared filesystem
(like NFS) or an object store (like Swift or Ceph) would, I posit, be
the direction we should go in.
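
To make the ordering constraint in steps 2 to 8 concrete, here is a toy, in-memory model of the rsync-based scheme. Everything in it is hypothetical: image sets are plain Python sets, the "i-am-syncing" lock is a `threading.Lock`, and waiting in step 7 is modelled by returning early so that the caller can retry. It says nothing about how MAAS actually syncs images.

```python
import threading

# Toy model only; every name here is hypothetical.
sync_lock = threading.Lock()  # stands in for the "i-am-syncing" lock


def sync_images(local, upstream, peers, published, clusters_syncing_old):
    """Run one pass of the scheme and return the set of deleted images.

    local, upstream and published are sets of image names; peers is a list
    of sets, one per neighbouring maas-regiond; clusters_syncing_old is a
    callable returning True while any cluster still reads the old set.
    Returns None if another process holds the lock.
    """
    if not sync_lock.acquire(blocking=False):  # steps 2-3: retry later
        return None
    try:
        old = set(local)
        local |= upstream                      # step 4: add, delete nothing
        for peer in peers:                     # step 5: push to neighbours
            peer |= local
        published.clear()                      # step 6: expose the new set
        published |= upstream
        if clusters_syncing_old():             # step 7: old set still in use
            return set()
        stale = old - upstream                 # step 8: drop what is left
        local -= stale
        for peer in peers:
            peer -= stale
        return stale
    finally:
        sync_lock.release()
```

Even this simplified version has to carry the old set past the point where the new set is published, which is the source of the complexity described above.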

Gavin Panella (allenap) wrote : Re: MAAS image syncing does not seem to be deleting old images from DB

I don't think organising vacuums is MAAS's responsibility, but we ought to document this.

Something else to consider is the vacuumlo utility in the postgresql-contrib-9.4 package: "vacuumlo removes unreferenced large objects from databases".

Mike Pontillo (mpontillo) wrote :

Perhaps we should consider running the "vacuum full" as part of the packaging scripts, so that any time you upgrade, these problems go away. ;-)

It's worth noting that the 'releases' channel shouldn't change very often, so the only people who should see this regularly are those using the 'daily' images, which probably doesn't include people using MAAS in production.

Changed in maas:
status: Triaged → Fix Committed
Mike Pontillo (mpontillo) wrote :

This needs to be in the release notes.

Changed in maas:
assignee: nobody → Mike Pontillo (mpontillo)
Mike Pontillo (mpontillo) wrote :

For the record, it seems to consistently take ~10 seconds per stale image to vacuum the database.

A database with ~20 GB of stale images took ~5.5 minutes to vacuum.

So we might want to shy away from doing this on an automated basis.

summary: - MAAS image syncing does not seem to be deleting old images from DB
+ When MAAS Boot Images are Superseded, Disk Space is not Reclaimed
Mike Pontillo (mpontillo) wrote :

Also, for the record: we tried the 'vacuumlo' utility from the 'postgresql-contrib' package, and there seems to be a bug in it: it had no effect.

Changed in maas:
status: Fix Committed → Fix Released
Stuart Bishop (stub) wrote :

Normal vacuum (and autovacuum) does not reclaim disk space. It just flags the space held by deleted rows as reusable, and future writes to the table will reuse that space.

You don't want to do regular VACUUM FULLs if you can avoid it. It's a band-aid, and it is never recommended for use on production systems. If VACUUM FULL is reclaiming a large amount of space (20-30% is normal bloat), then the solution is to stop the table getting bloated in the first place.

The usual culprit for bloat is long-running transactions: if a psql process or a badly behaved script/app/metrics/monitoring tool holds a transaction open for a few days, none of the data deleted in that period can be flagged for reuse by autovacuum.

autovacuum can be tuned to be more aggressive. By default, it doesn't even kick in until about 20% of the table plus 50 rows have been updated. On most systems I normally tune this to 2% (autovacuum_vacuum_scale_factor = 0.02). This could be very significant: if you insert the minimum of 50 large objects before an automatic vacuum can kick in, that is 200GB worth of 4GB images. If I'm correct here, we either need to run a normal 'vacuum pg_largeobject' regularly (should be fast, so even every few minutes would be OK), or tune the default vacuum settings in postgresql.conf. Unfortunately, we can't tune the parameters on just the pg_largeobject table because it is a system table.
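
For reference, the PostgreSQL documentation describes the trigger condition as: a table is vacuumed once its dead rows exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * (number of rows). A small sketch of that formula, using the stock defaults (50 and 0.2) and the 0.02 value mentioned above. The function name and the example row count are made up for illustration.

```python
def autovacuum_trigger_point(live_rows, base_threshold=50, scale_factor=0.2):
    """Number of dead rows that must be exceeded before autovacuum runs."""
    return base_threshold + scale_factor * live_rows


# Illustrative table size only; not measured on a real MAAS database.
rows = 1000000
default_point = autovacuum_trigger_point(rows)
tuned_point = autovacuum_trigger_point(rows, scale_factor=0.02)
print(default_point, tuned_point)
```

One caveat on the 200GB estimate: as far as I know, pg_largeobject stores each large object as many rows of roughly 2 kB each rather than one row per object, so the row counts in this formula are per chunk, not per image. On a large table it is the scale factor, not the 50-row base threshold, that dominates.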

Gavin Panella (allenap) wrote :

When rapidly syncing and unsyncing images in MAAS I observed the
following:

- Sync 16.04. Database (/var/lib/postgresql) is 1.6GB.

- Sync 16.10. Database grows to 2.0GB.

- Unsync 16.10. Database stays at 2.0GB.

- Sync 16.10 again. Database grows to 2.4GB.

- Unsync 16.10 again. Database stays at 2.4GB.

- Vacuum, NON-full. Database shrinks back to 1.6GB.

- Do the sync/unsync thing to get the database up to 2.5GB (it appears
  to have accumulated ~100MB of fat).

- Sync 17.04. Database grows to 2.8GB.

- Unsync *16.10*. Database stays at 2.8GB.

- Vacuum, NON-full. Database STAYS at 2.8GB.

- Sync 16.10 again. Database stays at 2.8GB.

- FULL vacuum. Database DROPS to 2.4GB.

To summarise:

- With non-full vacuuming, PostgreSQL is not reclaiming disk space for
  deleted rows in the _middle_ of the large object store. It will reuse
  that space for new objects, but will not free it to the OS until a
  FULL vacuum is performed.

- A non-full vacuum DOES reclaim disk space for deleted rows at the
  _end_ of the object store.

- Without any vacuuming, the large object store will grow without bound.
  This seems at odds with our experience in MAAS, and that's because,
  contrary to our discussions in the team recently, autovacuum _is_
  enabled by default.
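
The pattern in these observations can be reproduced with a toy model in which a table is a list of fixed-size pages. This is a deliberate simplification for illustration, not a description of PostgreSQL's storage code, and all names are hypothetical.

```python
def plain_vacuum(pages):
    """Model a non-full VACUUM.

    Dead rows become reusable free slots (None). Only pages at the END of
    the file that are completely free are returned to the OS.
    """
    pages = [[None if row == "dead" else row for row in page]
             for page in pages]
    while pages and all(row is None for row in pages[-1]):
        pages.pop()
    return pages


def full_vacuum(pages, rows_per_page):
    """Model VACUUM FULL: rewrite the table with live rows packed together."""
    live = [row for page in pages for row in page
            if row is not None and row != "dead"]
    return [live[i:i + rows_per_page]
            for i in range(0, len(live), rows_per_page)]


# Deleted rows in the middle: the file keeps its size after a plain vacuum.
middle = [["16.04", "16.04"], ["dead", "dead"], ["17.04", "17.04"]]
print(len(plain_vacuum(middle)), len(full_vacuum(middle, 2)))

# Deleted rows at the end: a plain vacuum is enough to shrink the file.
tail = [["16.04", "16.04"], ["dead", "dead"]]
print(len(plain_vacuum(tail)))
```

In the first case the plain vacuum leaves a free page in the middle that later syncs can reuse, which matches the "Sync 16.10 again. Database stays at 2.8GB" step.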

This essentially confirms what Stuart has said, but I'll add some more
MAAS-specific flesh to his suggestions:

- Most MAAS installations should be _okay_ without modification, but
  there's evidence that the autovacuum process is too timid for very
  active MAAS installations. On installation, MAAS could tweak the
  autovacuum settings to make it more aggressive, and/or execute
  non-full vacuums from time to time on pg_largeobject.

- Give admins a button to "Reclaim disk space" on the images page. We
  may be able to calculate the amount of disk space we can release and
  only show the button when it's more than, say, 25%.

  This would run a full vacuum on pg_largeobject and therefore be
  potentially disruptive.

  In the meantime we can direct admins to the db_vacuum_lobjects command
  that comes with MAAS.

- Automatically run a non-full vacuum on pg_largeobject as the last step
  in an image sync, to ensure that unused storage will be reused on the
  very next sync, which might occur before the autovacuumer gets to it.

- Automatically run a full vacuum on pg_largeobject just after a sync,
  to always release unused storage to the OS. An exclusive advisory lock
  is held by the controlling process as part of MAAS's sync logic — i.e.
  we know when pg_largeobject (which we use only for image storage) will
  be quiet and safe to be fully vacuumed.
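
The 25% rule suggested for the "Reclaim disk space" button could be computed roughly as follows. This is a sketch under assumptions: the two inputs would have to come from somewhere like pg_total_relation_size('pg_largeobject') and the sum of maasserver_largefile.total_size, and whether those two figures are directly comparable is not established in this thread. The function names are hypothetical.

```python
def reclaimable_fraction(on_disk_bytes, live_bytes):
    """Fraction of the on-disk size that a full vacuum might release."""
    if on_disk_bytes <= 0:
        return 0.0
    return max(0.0, (on_disk_bytes - live_bytes) / on_disk_bytes)


def should_offer_reclaim(on_disk_bytes, live_bytes, threshold=0.25):
    """Show the button only when more than `threshold` could be released."""
    return reclaimable_fraction(on_disk_bytes, live_bytes) > threshold
```

With the figures from the experiment above (2.8GB on disk, 2.4GB after the full vacuum), about 14% would be reclaimable, so the button would stay hidden at a 25% threshold.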

Gavin Panella (allenap) wrote :

As suggested elsewhere, we could/should also run `vacuumlo` periodically
to delete any orphaned large objects, or a MAAS-specific equivalent:

  DELETE FROM pg_largeobject
   WHERE loid NOT IN (SELECT content FROM maasserver_largefile)

Mike Pontillo (mpontillo) wrote :

Thanks Gavin; nice sleuthing.

I would warn against executing that last SQL statement in comment #15 though; that would be hazardous and destructive if any non-MAAS applications were also using postgresql.

Mike Pontillo (mpontillo) wrote :

One thing I think we should look into is changing PostgreSQL to use fallocate() with FALLOC_FL_PUNCH_HOLE, in order to free the space on the filesystem without strictly making the file smaller.

I did some initial investigation on this, and found where large objects are dropped in PostgreSQL[1], but haven't figured out if this is possible given the mechanism PostgreSQL uses to synchronize to disk. (I couldn't figure out the relationship between the storage subsystem in `md.c` and the backend code which seems to update the in-memory heap; I'm not sure how the heap updates ultimately translate to file I/O.)

[1]:
https://doxygen.postgresql.org/pg__largeobject_8c_source.html
