Symspell Triggers Appear to Slow Record Ingest
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Evergreen | New | Undecided | Unassigned |
Bug Description
OpenSRF version: N/A
Evergreen version: rel_3_7 (with patch from bug 1931737 applied)
PostgreSQL version: 10.20 (Ubuntu 10.20-1.
While the patch from bug 1947173 appears to have sped up the initial setup of symspell/Did You Mean, the feature itself seems to have slowed the overall speed of ingesting records.
I started a parallel ingest using pingest.pl with 5 child processes on Friday 4/8. After 75.5 hours, only 5 of 228 batches of 10,000 records had been processed. Based on the runtimes of the processes, including the main process, the first 5 batches took approximately 40 hours, and the 5 batches currently in progress look like they will also take about 40 hours in total. A back-of-the-envelope calculation (228 batches / 5 per group x 40 hours per group = 1,824 hours) indicates that the entire reingest will take about 76 days.
A recent reingest of more or less the same data with the symspell triggers disabled took approximately 5 days to complete 228 batches of 10,000 records on the same hardware.
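For anyone who wants to check the numbers, the estimate and the comparison against the trigger-disabled run can be reproduced with a few lines of Python. This is only a rough sketch: it assumes every group of 5 batches takes the same roughly 40 hours as the first group, and it takes 5 days as the baseline for the run without the symspell triggers. The function and variable names are made up for illustration.

```python
def estimate_days(total_batches, batches_per_group, hours_per_group):
    """Estimate total ingest time in days, assuming a constant rate per group."""
    # Multiply before dividing so the integer inputs stay exact as long as possible.
    total_hours = total_batches * hours_per_group / batches_per_group
    return total_hours / 24


# Figures taken from the report above.
TOTAL_BATCHES = 228        # batches of 10,000 records
BATCHES_PER_GROUP = 5      # one group per set of 5 pingest.pl child processes
HOURS_PER_GROUP = 40       # observed for the first group (assumed constant)
BASELINE_DAYS = 5          # approximate runtime with symspell triggers disabled

with_triggers_days = estimate_days(TOTAL_BATCHES, BATCHES_PER_GROUP, HOURS_PER_GROUP)
slowdown = with_triggers_days / BASELINE_DAYS

print(f"Estimated runtime with triggers: {with_triggers_days:.1f} days")
print(f"Approximate slowdown vs. triggers disabled: {slowdown:.1f}x")
```

With these inputs the estimate comes out at 76 days, or roughly 15 times the trigger-disabled run. Since the rate is assumed constant, this is a lower bound if performance degrades over time.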
tags added: didyoumean
Performance seems to degrade as time goes on. The second group of 5 files is still being processed, and the longest-running one has been running for 55 hours and 5 minutes at this point, well past the roughly 40 hours I expected. My previous estimates are way off.
If anyone else wants to investigate this, I did the following steps:
1. I made a branch based on the latest rel_3_7 and added the rel_3_7 branch from bug 1931737.
2. I upgraded a copy of production data from Evergreen 3.5.3 to this branch using a custom database upgrade generated with this script: https://gist.github.com/Dyrcona/00bd6b6290b6fbbb579c7f93b360ab0d
3. I ran through the steps to initialize the symspell/Did You Mean feature as outlined in the 1306 DB upgrade script.
4. I ran pingest.pl on a virtual machine that can connect to the database used for the above steps. The only option used with pingest.pl was --max-child=5.
I have also done the above, skipping step 3 and disabling the symspell triggers, and this runs much more quickly, with the entire pingest finishing in 4 or 5 days on the same data. I use the following to disable the symspell triggers: https://pastebin.com/PBmHVJ1q
I will continue to do more experimentation as time permits, perhaps doing the ingest with the symspell triggers enabled and skipping step 3 to see if that makes any difference.