Symspell Triggers Appear to Slow Record Ingest

Bug #1968602 reported by Jason Stephenson
This bug affects 2 people
Affects: Evergreen
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

OpenSRF version: N/A
Evergreen version: rel_3_7 (with patch from bug 1931737 applied)
PostgreSQL version: 10.20 (Ubuntu 10.20-1.pgdg20.04+1)

While the patch from bug 1947173 appears to have sped up the initial setup of symspell/Did You Mean, the feature itself seems to have slowed the overall speed of ingesting records.

I started a parallel ingest using pingest.pl with 5 child processes on Friday 4/8. After 75.5 hours, only 5 of the 228 batches of 10,000 records had been processed. Based on the runtimes of the processes, including the main process, it took approximately 40 hours to process the first 5 batches. Based on the runtimes of the current processes, it looks like the current 5 batches will also take about 40 hours in total. A back-of-the-envelope calculation (228 batches at roughly 40 hours per group of 5) indicates that the entire reingest would take about 76 days.
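The estimate above can be reproduced with a short calculation. This is only a sketch: the function name is made up for illustration, and it assumes every group of 5 batches takes the same 40 hours, which is an assumption of the estimate rather than something the ingest guarantees.

```python
def estimate_total_days(total_batches, batches_per_group, hours_per_group):
    """Extrapolate total ingest time, assuming a constant rate per group.

    Hypothetical helper for illustration; not part of pingest.pl.
    """
    total_hours = total_batches * hours_per_group / batches_per_group
    return total_hours / 24


# Figures from the report: 228 batches, 5 at a time, about 40 hours per group.
with_symspell_days = estimate_total_days(228, 5, 40)
print(with_symspell_days)  # 76.0

# The run with the symspell triggers disabled took approximately 5 days.
baseline_days = 5
print(with_symspell_days / baseline_days)  # 15.2
```

On these numbers, the ingest with the triggers enabled would be roughly 15 times slower than the run with them disabled.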

A recent reingest of more or less the same data with the symspell triggers disabled took approximately 5 days to complete 228 batches of 10,000 records on the same hardware.

Tags: didyoumean
Jason Stephenson (jstephenson) wrote :

Performance seems to degrade as time goes on. The second group of 5 files is still being processed, and the longest-running one has been running for 55 hours and 5 minutes at this point. My previous estimates are way off.

If anyone else wants to investigate this, I did the following steps:

1. I made a branch based on the latest rel_3_7 and added the rel_3_7 branch from bug 1931737.

2. I upgraded a copy of production data from Evergreen 3.5.3 to this branch using a custom database upgrade generated with this script: https://gist.github.com/Dyrcona/00bd6b6290b6fbbb579c7f93b360ab0d

3. I ran through the steps to initialize the symspell/Did You Mean feature as outlined in the 1306 DB upgrade script.

4. I ran pingest.pl on a virtual machine that can connect to the database used for the above steps. The only option used with pingest.pl was --max-child=5.

I have also done the above, skipping step 3 and disabling the symspell triggers, and this runs much more quickly with the entire pingest finishing in 4 or 5 days on the same data. I use the following to disable the symspell triggers: https://pastebin.com/PBmHVJ1q.

I will continue to do more experimentation as time permits, perhaps doing the ingest with the symspell triggers enabled and skipping step 3 to see if that makes any difference.

Jason Stephenson (jstephenson) wrote :

Running a test of pingest.pl with the symspell triggers enabled, but without doing the setup, starts out faster. After 70 hours, 27 batches of 10,000 and the browse ingest have finished. However, subsequent batches seem to take longer as time goes on. Only 5 batches have finished in the past 20 hours. The longest-running batch so far has run for 16 hours, and it looks like the batch that it replaced ran for approximately 14 hours.

Jason Stephenson (jstephenson) wrote :

I'm making this a duplicate of bug 1931737 because there's a fix for this bug included in the branch for that one.

Jason Stephenson (jstephenson) wrote :

Further testing of bug 1931737 reveals that this bug should not be a duplicate of that one. Record ingest is still slowed greatly by the symspell/Did You Mean code. I'll update this bug in a few days with my findings.

Jason Stephenson (jstephenson) wrote :

Further testing shows that the --delay-symspell option added to pingest.pl in the latest changes for bug 1931737 does help with the performance issues. Using that option with the patch prevents the exponential performance degradation that parallel ingest shows without it.

In my testing, it takes about 6 hours now to process a batch of 10,000 records while running 8 batches on my test hardware. With symspell completely disabled, it takes about 4 hours to do the same. Performance appears to be roughly constant for batches of 10,000 records with only fractional variation in the duration.
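To put the two per-batch timings side by side, the remaining cost of symspell can be expressed as a relative overhead. This is a minimal sketch using only the approximate figures above; the function name is hypothetical.

```python
def relative_overhead(hours_with_feature, hours_baseline):
    """Return the extra time as a fraction of the baseline time.

    Hypothetical helper for illustration only.
    """
    return (hours_with_feature - hours_baseline) / hours_baseline


# About 6 hours per batch with --delay-symspell, about 4 hours with
# symspell completely disabled (both figures are approximate).
overhead = relative_overhead(6, 4)
print(f"{overhead:.0%}")  # 50%
```

So with --delay-symspell, a batch takes about half again as long as it does with symspell disabled, and that overhead stays roughly constant instead of growing over time.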
