sbackup is very slow when working on many files

Bug #102577 reported by Oliver Gerlich
Affects: sbackup (Ubuntu)
Status: Fix Released
Importance: Undecided
Assigned to: Oumar Aziz OUATTARA

Bug Description

Binary package hint: sbackup

When backing up a large directory structure (in my case: ~280000 files, probably many directories, ~10 GB of data in total), sbackup seems to get slower and slower while creating the list of files (i.e. while creating the fprops file)... You can watch the fprops file grow by a few KB every few seconds.
After the fprops file has been created, the actual tarring and gzipping shows no unusual performance (most CPU time goes to gzip, and some to I/O wait, as expected).

Details:
Creating the fprops file took 280 minutes in the end; that is an average of about 17 files per second. At the beginning of the backup I had measured a speed of about 60 files per second. During fprops creation the HDD LED didn't light up often, and top showed that all CPU load went to the sbackupd process - so the slowdown is apparently not disk-bound. Also, when testing a backup with just the default config (i.e. backing up the system files), the first run took maybe a minute and an incremental backup took around 15 seconds... The machine is a Pentium 1 at 400 MHz, with 128 MB RAM, an 8 GB system disk and a 40 GB data disk.

This happened with sbackup from Dapper (0.9-1 IIRC) and also with the newest release (0.10.3-0.1).

Maybe there is an O(n^2) operation in the code? I've seen a line like "if not parent in dirs_in ..." which seems to search the whole list (map?) of known files for every file name - so this operation probably gets slower as more files are added. Is there a way to profile this?
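
For illustration only (this is not sbackup's code), a small benchmark along these lines shows how an "in" test against a growing list can become the bottleneck, while the same test against a set or dict stays cheap:

    # Hypothetical micro-benchmark, not taken from sbackup: compares a
    # membership test on a list with the same test on a set. With hundreds
    # of thousands of entries the list scan dominates.
    import timeit

    names = ["/data/dir%04d/file%06d" % (i // 700, i) for i in range(280000)]
    as_list = list(names)
    as_set = set(names)
    probe = names[-1]  # worst case for the list: the entry sits at the very end

    print("list lookup:", timeit.timeit(lambda: probe in as_list, number=100))
    print("set lookup: ", timeit.timeit(lambda: probe in as_set, number=100))

Repeating the list lookup once per file gives roughly O(n^2) work overall, which would fit the measured drop from 60 to 17 files per second.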

Revision history for this message
Oumar Aziz OUATTARA (wattazoum) wrote :

It might be interesting to look at http://www.cs.uoregon.edu/research/tau/home.php

Revision history for this message
Oumar Aziz OUATTARA (wattazoum) wrote :

Could you please send the list of directories to include in the backup, exactly as you specified them in simple-backup-config? If the names are confidential, just replace them with other names. What I would like to see is the directory structure.

Revision history for this message
Oliver Gerlich (ogerlich) wrote :

Unfortunately I indeed can't send you the directory names, and I don't know a way to create an anonymized directory listing. But attached is a script which should create a similar structure: 420 directories in total (over multiple levels), with 280000 files in the lowest-level directories. The files are only 13 bytes each (about 1.1 GB on disk), while the original files were around 2 KB each, I think... But I guess the most important factor is the number of files and directories.
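
The attached script itself is not reproduced here; a rough, hypothetical equivalent that builds a comparable tree (names, counts and the output path are just assumptions) might look like this:

    # Hypothetical test-data generator, not the attached script: creates a
    # nested directory tree with many tiny files, roughly matching the
    # structure described above (about 420 directories, 280000 files).
    import os

    ROOT = "sbackup-test/root"   # assumed output location
    TOP_DIRS = 20
    SUB_DIRS = 20                # 20 x 20 = 400 leaf directories
    FILES_PER_LEAF = 700         # 400 * 700 = 280000 files

    for top in range(TOP_DIRS):
        for sub in range(SUB_DIRS):
            leaf = os.path.join(ROOT, "dir%02d" % top, "sub%02d" % sub)
            os.makedirs(leaf, exist_ok=True)
            for n in range(FILES_PER_LEAF):
                with open(os.path.join(leaf, "file%04d.txt" % n), "w") as f:
                    f.write("13 bytes here")   # 13-byte payload per file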

I will run sbackup on this directory hierarchy tonight and see how long it takes and if this faithfully reproduces the problem.

Regarding profiling: isn't there some very simple profiling module to start with? Just to see roughly where the time is spent? Maybe even some fine-grained debug messages (more like trace messages) with timestamps would be useful.

Revision history for this message
Oumar Aziz OUATTARA (wattazoum) wrote :

Regarding the profiling, see this page: http://docs.python.org/lib/profile.html
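
For a quick start, the standard-library profiler described on that page can be wrapped around the main entry point. A minimal sketch (run_backup is a placeholder, not sbackup's actual API):

    # Minimal profiling sketch using the standard library; 'run_backup' stands
    # in for whatever function actually drives the backup.
    import cProfile
    import pstats

    def run_backup():
        pass  # placeholder for the real work

    profiler = cProfile.Profile()
    profiler.enable()
    run_backup()
    profiler.disable()
    profiler.dump_stats("backup.prof")

    pstats.Stats("backup.prof").sort_stats("cumulative").print_stats(20)  # top 20 offenders

Alternatively, an unmodified script can be profiled from the command line with "python -m cProfile -o backup.prof script.py", assuming a Python version that ships cProfile (the older profile module works the same way, just more slowly).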

Since Aigars is not currently free to work on this, I don't know if he can take care of that.
Basically I think this kind of problem could be avoided if we could retrieve the flist and fprops information from the backup archive (files.tgz); TAR has an option for that. That means:
 * make the backup first
 * then create the list of backed-up files (from the files.tgz file)
The current behavior is:
 * sbackup walks the files to create the flist and fprops
 * then makes the backup.

The problem with the approach I propose is remote backup; see Bug #89457. That issue might show up while creating the flist.
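
As a hedged illustration of the proposed order of operations (the file name here is an assumption, and this is not sbackup's actual code), the list of backed-up files and their properties can be read back from the archive after the fact:

    # Sketch: derive the file list plus properties from an existing files.tgz
    # instead of walking the filesystem beforehand. The output format below is
    # only illustrative.
    import tarfile

    archive = tarfile.open("files.tgz", "r:gz")
    for member in archive:
        # name, size, mtime and mode are available without extracting the data
        print("%s\t%d\t%d\t%o" % (member.name, member.size, member.mtime, member.mode))
    archive.close()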

Revision history for this message
Oliver Gerlich (ogerlich) wrote :

Attached is a rough log from backing up the 280000 files (it's the output of "ls -l" on the target directory, taken once a minute). It shows that creating the flist file took around 70 minutes, which is much faster than the original case; but this time it was run on a much newer machine (K7 2600+, nForce2, 1 GB RAM). The actual tarring apparently took only a few minutes. So I guess it still spends too much time on flist creation. One could also plot the flist file growth over time (and would probably see the growth slow down - that's what I saw on the original machine).

Regarding the idea of creating the flist from the tgz: I thought the flist is also created to see which files actually changed and so need to be included in the incremental copy... Would that still work if only tar is used?

I guess the basic algorithm in sbackup is good, but there are probably a few operations in it which get very expensive when dealing with a lot of data.

Revision history for this message
Oumar Aziz OUATTARA (wattazoum) wrote :

Hi,

Actually the TAR tool has a built-in backup system (with incremental support). It creates a file that looks like a mix of the flist and fprops files. See the documentation here: http://www.gnu.org/software/tar/manual/tar.html#Backups.
The formats of both files are quite similar:
- TAR: xxxxx.file_name[separator]xxxxx.file_name[separator] ... where xxxxx are the properties
- Sbackup:
  - flist: file_name[separator]file_name[separator]file_name[separator]....
  - fprops: xxxxx.xxxxx.xxxxx.....
   and the separators are used to make the correspondence between the two files.
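
As a purely illustrative sketch of that correspondence (the separator and file names are assumptions, not sbackup's real on-disk details), the two files could be matched up positionally like this:

    # Illustration only: pair up the Nth name in flist with the Nth property
    # record in fprops. SEP is an assumed record separator.
    SEP = "\n"

    with open("flist") as f:
        names = f.read().split(SEP)
    with open("fprops") as f:
        props = f.read().split(SEP)

    for name, prop in zip(names, props):
        print(name, "->", prop)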

The point is to know whether there was a special reason for sbackup to use this kind of formatting.

Revision history for this message
Oliver Gerlich (ogerlich) wrote :

So after a longer pause I looked into this (and into Python) again and noticed that dirs_in is indeed a list rather than a hash (dictionary), so do_add_file() inefficiently searches through all 200,000+ entries instead of doing a fast hash lookup. Changing the three occurrences of dirs_in to a hash speeds up the first full backup of the test case from about 75 minutes to about 5 minutes; fprops creation now takes only about one minute instead of 70. Attached is a patch - could you review it?
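
The patch itself is attached to the bug rather than quoted here; a hedged sketch of the idea (do_add_file and dirs_in are named after the descriptions in this thread, the real signatures differ) looks like this:

    # Sketch of the dict-based approach, not the literal patch: dirs_in maps a
    # directory path to a dummy value, so the membership test is an O(1) hash
    # lookup instead of an O(n) scan of a list.
    dirs_in = {}

    def do_add_file(path, parent):
        if parent not in dirs_in:   # hash lookup instead of searching a list
            dirs_in[parent] = 1     # remember the directory only once
            # ... record the directory in flist/fprops here ...
        # ... record the file itself in flist/fprops here ...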

The incremental backup still takes very long. Maybe there's yet another problem (or the hash patch has made the incremental case worse :)

Revision history for this message
Aigars Mahinovs (aigarius) wrote :

I love your results there. Aziz and I will check that patch, and it will surely go into the next release if no negative side effects are found. You could take a look at the code block that starts with "if listing == []:". It fills prev with the information about previously backed-up files. The other culprit could be the use of prev.count() in the do_backup() function. I am not sure why I used it there; it could be that I forgot about the "object in list" syntax and the one I tried to use did not cope very well with things missing from the list.
Try changing that, look for anything bad in the prev code block, and see if that helps your use case.

Changed in sbackup:
status: Unconfirmed → In Progress
Revision history for this message
Oliver Gerlich (ogerlich) wrote :

Yup - already working on that :-D The prev.count() check indeed seems to take a lot of time, as it appears to scan the whole list of previous files for every file. Attached is a patch that turns this into a hash lookup: the prev list is copied to a hash (prevHash) after it has been completely filled, and then prevHash.has_key() is used for the lookup. This avoids meddling with the creation of the prev list itself :-) but I guess the prev list (which is not used after prevHash has been filled) still occupies memory afterwards - might be another thing to look into. Anyway, creating the prev list itself seems to be pretty quick (it takes a few seconds).
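
The gist of the change, as a hedged sketch (prev and prevHash follow the description above; the actual patch differs in detail):

    # Sketch of the prevHash idea: once the list of previously backed-up files
    # ("prev") is complete, copy it into a dict so each lookup is O(1) instead
    # of prev.count(name) scanning the whole list for every file.
    prev = ["/data/a.txt", "/data/b.txt"]   # in reality filled from the last backup

    prevHash = {}
    for name in prev:
        prevHash[name] = 1

    def was_backed_up(name):
        return name in prevHash     # same effect as the has_key() call in the patch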

The patched sbackupd has now just finished five incremental backups, with some changes to the test directory structure in between. Times range from 20 seconds for an incremental backup of an unchanged directory structure to 8 minutes for an incremental backup after changing all file attributes in the tree with `chmod -R u-w sbackup-test/root/` ... I guess that's fast enough for me at the moment (I will try this on the real system in the next few days).

Revision history for this message
Oumar Aziz OUATTARA (wattazoum) wrote :

Hi,

I have tested your patches and the speed is very impressive. I have integrated your patches as they are for now.

Thank you

Revision history for this message
Oumar Aziz OUATTARA (wattazoum) wrote :

Fixed upstream

Changed in sbackup:
assignee: nobody → wattazoum
status: In Progress → Fix Committed
Revision history for this message
Oumar Aziz OUATTARA (wattazoum) wrote :

If you're interested in testing the current version of sbackup, here is an attachment:

Format: 1.7
Date: Mon, 7 May 2007 22:56:34 +0200
Source: sbackup
Binary: sbackup
Architecture: source all
Version: 0.10.4beta2
Distribution: feisty
Urgency: low
Maintainer: Aigars Mahinovs <email address hidden>
Changed-By: Ouattara Oumar Aziz (alias wattazoum) <email address hidden>
Description:
 sbackup - Simple Backup Suite for desktop use
Changes:
 sbackup (0.10.4beta2) feisty; urgency=low
 .
   * Bug #112540 fix : sbackup now runs under root:admin. directories are created with read access for admins
   * Bug #102577 fix : optimizing the backup process ( thanks to Oliver Gerlich )

Revision history for this message
Oumar Aziz OUATTARA (wattazoum) wrote :

Please upgrade to the version given here: https://bugs.launchpad.net/ubuntu/+source/sbackup/+bug/112540/comments/8

There is a big bug in beta2.

Revision history for this message
Luca Falavigna (dktrkranz) wrote :

This should be fixed in version 0.10.4.

Changed in sbackup:
status: Fix Committed → Fix Released