Wish List - Enhanced Concerto dataset

Bug #1901932 reported by Ruth Frasur Davis
This bug affects 9 people
Affects: Evergreen
Status: Fix Released
Importance: Wishlist
Assigned to: Unassigned
Milestone: (not shown)

Bug Description

Concerto should be enhanced to include a more diverse and robust dataset for testing and training real world scenarios.

Discussion Doc at https://docs.google.com/document/d/1S0b8A6oLFd3TmmThrcSoaX4CJLBHkIue7QLxQJziZ5I/edit?usp=sharing

description: updated
Changed in evergreen:
importance: Undecided → Wishlist
Revision history for this message
Bill Erickson (berick) wrote :

I have several exports of the collected data living here:

https://github.com/berick/evergreen-datasets

The current dataset is compatible with EG DB stamp 1326 (EG master as of this moment).

To see the new data in action:

1. Checkout & install a version of EG where the highest DB stamp is 1326.
2. git clone https://github.com/berick/evergreen-datasets
3. cd evergreen-datasets
4. ./create-database.sh

This will create a new database called "eg_1326" with all the data. To use it with Evergreen, either rename the database to match your config (e.g. "evergreen") or modify your EG configs to point at the new database name.

Ruth Frasur Davis (redavis) wrote :

ECDI has entered into an agreement with MOBIUS to deliver the following:

1. Insert commands for adding the dataset to Evergreen
2. New repository to house the dataset and code
3. Documentation on installing and using the new dataset and code

Blake GH (bmagic)
Changed in evergreen:
assignee: nobody → Blake GH (bmagic)
Blake GH (bmagic) wrote :
Changed in evergreen:
assignee: Blake GH (bmagic) → nobody
tags: added: pullrequest
Blake GH (bmagic) wrote :

Squashed and force pushed. Same link

Blake GH (bmagic)
tags: removed: pullrequest
Blake GH (bmagic) wrote :

I'm going to work on an improvement to automate the upgrade process through Evergreen versions. In the meantime, anyone who's interested could take this branch, install the enhanced concerto set (on top of master), and give it a test drive.

Blake GH (bmagic) wrote :

This is finally done. I've force pushed a two-commit branch. Same link. The last commit augments our build process to incorporate the new steps for upgrading the enhanced concerto dataset. It's fairly comprehensive; most of the heavy lifting is done in the Perl script. As far as I know, the last place that this needs documenting is our release build wiki page:

https://wiki.evergreen-ils.org/doku.php?id=dev:release_process:evergreen:2.8

That documentation update might want to wait until the branch is merged.

tags: added: pullrequest
Blake GH (bmagic)
Changed in evergreen:
milestone: none → 3.11-beta
Galen Charlton (gmc)
Changed in evergreen:
status: New → Confirmed
assignee: nobody → Galen Charlton (gmc)
Galen Charlton (gmc) wrote :

After testing and discussion with interested parties, I have pushed this for inclusion in 3.11 with some follow-ups.

While the dataset itself looks OK on superficial testing, I have various concerns about the script used to generate updates to the dataset and have marked it as experimental:

- It's sensitive to the timezone where the most recent update is run, unlike original Concerto, which appears to give you a database with timestamps in the installer's local time zone
- Speaking of time, as far as I can tell it bakes in _exact_ due dates; that means that it _has_ to be maintained, whereas original Concerto makes its sample circ data be relative to the date of installation
- It's much more sensitive to exact column names and column changes than original Concerto is
- The update process, when run today, turned up a couple of errors that will need to be dealt with
- Because of the way original Concerto was designed, it has needed little care and attention in response to Evergreen schema changes. Specifically, the process for installing original Concerto rarely needed changes unless DB tests started failing or errors were encountered during installation of a new sample database. The update mechanism for Enhanced Concerto thus far requires more work during a release to review its output.
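
To illustrate the point about exact versus relative due dates, here is a minimal Python sketch of sample circulation data stored as offsets from the installation day. It is not taken from either Concerto dataset or its loaders; the SAMPLE_CIRCS structure and the materialize and is_overdue helpers are hypothetical, and the xact_start and due_date keys are used only as illustrative column names.

```python
from datetime import date, timedelta

# Hypothetical sample circulations, stored as offsets rather than exact dates.
SAMPLE_CIRCS = [
    {"id": 1, "checkout_days_ago": 7, "loan_days": 21},
    {"id": 2, "checkout_days_ago": 30, "loan_days": 14},
]


def materialize(circs, install_day):
    """Turn relative offsets into concrete dates for a given install day."""
    rows = []
    for circ in circs:
        start = install_day - timedelta(days=circ["checkout_days_ago"])
        rows.append({
            "id": circ["id"],
            "xact_start": start,
            "due_date": start + timedelta(days=circ["loan_days"]),
        })
    return rows


def is_overdue(row, on_day):
    """A circulation is overdue once its due date is in the past."""
    return row["due_date"] < on_day


if __name__ == "__main__":
    for install_day in (date(2024, 3, 15), date(2030, 1, 1)):
        rows = materialize(SAMPLE_CIRCS, install_day)
        print(install_day, [is_overdue(r, install_day) for r in rows])
```

Because the dates are computed at load time, the mix of open and overdue loans is the same whenever the data is installed. With baked-in exact dates, every loan eventually reads as overdue unless the dataset is regenerated.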

Consequently, I'm merging this with some conditions:

- Enhanced Concerto is not hooked into the automated tests at present (and doesn't pass them). In particular, the DB-dependent t/ and live_t/ tests are still _only_ expected to run under original Concerto
- Community discussion needs to happen to decide upon a maintenance process and maintainer for this dataset - this can't be left to a last-minute run of make_release during the rush of creating a tarball.
- The dataset, or more specifically, the generation and update process, should be considered experimental until the maintenance process is sorted out, which at minimum will require a deeper review of make_concerto_from_evergreen_db.pl than anybody was able to commit to this cycle.

Thanks, ECDI and Blake!

Changed in evergreen:
status: Confirmed → Fix Committed
assignee: Galen Charlton (gmc) → nobody
Blake GH (bmagic) wrote :

All,

I promptly edited this code based upon the feedback.

Highlights:

- A date carry-forward feature that carries the various date columns forward by the difference between today's date and the create_date for asset.call_number in the dataset. This is the default behavior; it can be skipped with: psql -v skip_date_carry='1' -f load_all.sql

- Expansion of special cases for certain tables: config.metabib_class, config.org_unit_setting_type, config.global_flag.

- Dropping these tables from consideration: acq.acq_lineitem_history, acq.acq_purchase_order_history, permission.perm_list
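
The date carry-forward idea from the first highlight can be sketched in Python as follows. This is only an illustration of the arithmetic, not the logic in load_all.sql or the Perl script; the carry_offset and carry_forward helpers, the sample row, and the column names in it are assumptions made for the example.

```python
from datetime import date


def carry_offset(reference_create_date, today):
    """Gap between the dataset's reference date and the load date."""
    return today - reference_create_date


def carry_forward(rows, date_columns, offset):
    """Return copies of the rows with each named date column shifted by offset.

    NULL (None) values are left alone, e.g. a loan that was never checked in.
    """
    shifted = []
    for row in rows:
        new_row = dict(row)
        for col in date_columns:
            if new_row.get(col) is not None:
                new_row[col] = new_row[col] + offset
        shifted.append(new_row)
    return shifted


if __name__ == "__main__":
    # Hypothetical values: the reference create_date stored in the dataset
    # and the day the dataset is loaded.
    reference = date(2023, 1, 10)
    load_day = date(2023, 6, 9)
    offset = carry_offset(reference, load_day)

    circs = [{
        "id": 1,
        "xact_start": date(2023, 1, 3),
        "due_date": date(2023, 1, 24),
        "checkin_time": None,
    }]
    moved = carry_forward(circs, ["xact_start", "due_date", "checkin_time"], offset)
    print(offset.days, moved[0])
```

Every date keeps its distance from the reference date, so a loan that was due two weeks after the reference is still due two weeks after the load day.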

I couldn't think of a good way to deal with the timezone issue. My first thought was to check each column for the timestamptz type and write the data to disk without the timezone offset (-06). It seems, though, that the dates land in the database OK, and with the date carry-forward in place, the result looks sound. The drawback is git churn when the dataset is upgraded through versions of Evergreen: the files appear to change simply because they were restored to your local database and then dumped out again.
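
The churn described here can be reproduced with a small Python sketch: the same instant, written out under two different UTC offsets, gives different text even though the underlying value is identical. Normalizing to UTC before writing is shown as one possible way to get stable output; it is an alternative, not what the script in this branch does, and the dump_local and dump_utc helpers are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def dump_local(ts, utc_offset_hours):
    """Render a timestamp the way a dump made in a given local zone might."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return ts.astimezone(tz).isoformat(sep=" ")


def dump_utc(ts):
    """Render a timestamp in UTC so the text does not depend on who dumps it."""
    return ts.astimezone(timezone.utc).isoformat(sep=" ")


if __name__ == "__main__":
    instant = datetime(2023, 6, 9, 18, 30, tzinfo=timezone.utc)

    central = dump_local(instant, -6)
    eastern = dump_local(instant, -5)
    print(central)  # text differs from the line below ...
    print(eastern)  # ... although both describe the same instant

    # Parsing either string and re-rendering in UTC gives identical text.
    print(dump_utc(datetime.fromisoformat(central)))
    print(dump_utc(datetime.fromisoformat(eastern)))
```

A diff tool only sees the text, so the first two lines count as a change in git even though no data changed.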

On a note unrelated to timezones:

Working on this may have revealed an issue with some of our config.org_unit_setting_type rows. The seed (950) doesn't appear to insert some of these rows, whereas the upgrade scripts do:
opac.did_you_mean.low_result_threshold
opac.did_you_mean.max_suggestions
search.symspell.keyboard_distance.weight
search.symspell.min_suggestion_use_threshold
search.symspell.pg_trgm.weight
search.symspell.soundex.weight

Therefore, the enhanced concerto set, today, includes them.

I've created a new branch, since this branch was merged already.

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/blake/lp1901932_enhanced_concerto_dataset_tweak

Changed in evergreen:
status: Fix Committed → Fix Released
Blake GH (bmagic) wrote :

This bug is continued in bug 2023690.
