Evergreen

relevance adjustment makes searches slow in 2.0

Bug #844374 reported by James Fournie on 2011-09-08

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	Evergreen	Won't Fix	Undecided	Unassigned

Bug Description

Evergreen 2.0.5
Postgres 8.4

adding values to search.relevance_adjustment really slows down searches, far more than it should. investigating, I've found this relates to naco_normalize running on all of the comparisons. Here is the SQL from our postgres log for a relevance-adjusted query (keyword for dinosaur):

EXPLAIN ANALYZE SELECT m.source AS id,
ARRAY_ACCUM(DISTINCT m.source) AS records,
(AVG(
(rank(x80878b0_keyword.index_vector, x80878b0_keyword.tsq) * x80878b0_keyword.weight
* COALESCE(NULLIF( (naco_normalize(x80878b0_keyword.value) ~ naco_normalize(E'dinosaur')), FALSE )::INT * 2, 1)
* COALESCE(NULLIF( (naco_normalize(x80878b0_keyword.value) ~ ('^'||naco_normalize(E'dinosaur'))), FALSE )::INT * 5, 1)
* COALESCE(NULLIF( (naco_normalize(x80878b0_keyword.value) ~ ('^'||naco_normalize(E'dinosaur')||'$')), FALSE )::INT * 5, 1))
) * COALESCE( NULLIF( FIRST(mrd.item_lang) = $_981$eng$_981$ , FALSE )::INT * 5, 1))::NUMERIC AS rel,
(AVG(
(rank(x80878b0_keyword.index_vector, x80878b0_keyword.tsq) * x80878b0_keyword.weight
* COALESCE(NULLIF( (naco_normalize(x80878b0_keyword.value) ~ (naco_normalize(E'dinosaur'))), FALSE )::INT * 2, 1)
* COALESCE(NULLIF( (naco_normalize(x80878b0_keyword.value) ~ ('^'||naco_normalize(E'dinosaur'))), FALSE )::INT * 5, 1)
* COALESCE(NULLIF( (naco_normalize(x80878b0_keyword.value) ~ ('^'||naco_normalize(E'dinosaur')||'$')), FALSE )::INT * 5, 1))
) * COALESCE( NULLIF( FIRST(mrd.item_lang) = $_981$eng$_981$ , FALSE )::INT * 5, 1))::NUMERIC AS rank,
FIRST(mrd.date1) AS tie_break
FROM metabib.metarecord_source_map m
JOIN metabib.rec_descriptor mrd ON (m.source = mrd.record)
LEFT JOIN (
SELECT fe.index_vector, naco_normalize(fe.value) as value, fe.id, fe.source, fe_weight.weight, x.tsq /* search */
  FROM metabib.keyword_field_entry AS fe
JOIN config.metabib_field AS fe_weight ON (fe_weight.id = fe.field)
JOIN (SELECT
to_tsquery('keyword', COALESCE(NULLIF( '(' || btrim(regexp_replace(split_date_range(naco_normalize(E'dinosaur')),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) AS tsq ) AS x ON (fe.index_vector @@ x.tsq)
) AS x80878b0_keyword ON (m.source = x80878b0_keyword.source)
  WHERE 1=1
        AND ((x80878b0_keyword.id IS NOT NULL))
  GROUP BY 1
  ORDER BY 4 DESC NULLS LAST, 5 DESC NULLS LAST, 3 DESC
  LIMIT 10000
;

this takes about 8 seconds on our server, but 500 ms when the naco_normalizes on the left side of the regex comparisons are removed. so something is seriously wrong there -- I have tried a few different things but can't get the query planner to cooperate, it seems the naco_normalize is not being immutable.

not sure what the best course of action would be to correct this performance issue, my only thoughts are either to remove the normalization (and have slightly poorer relevancy) or to materialize the normalized values.

I have also tried normalizing within the SELECT AS x80878b0_keyword, but that doesn't help either,

Revision history for this message

Dan Wells (dbw2) wrote on 2011-09-08:

Hello James,

Thank you for this detailed report. This is a known issue which was discussed a few months ago, but I'll be darned if I can find that discussion. The current status is that newer versions of EG have switched from rank() to rank_cd() for the relevance ranking, and this change greatly reduces or eliminates the need for the internal EG adjustments. We made this change to our 2.0 instance manually about a week after upgrading, as some searches were unbearably long (2+ minutes), and are satisfied with the relevance supplied by rank_cd() alone. I am unsure if this change got into any official 2.0 releases, or if it is only 2.1+, but it is ultimately only a small code change if you wish to apply it yourself.

I did spend a few hours one day back in April trying to see if this could be fixed by indexing changes alone (rather than a materialized table), but ultimately had to shelve the idea in favor of more pressing concerns (which is probably a good summary of how many of us view this issue). There is also the thought of creating our own "rank_eg()" type function, but that is not likely to happen short-term.

Dan W.

Revision history for this message

James Fournie (jfournie) wrote on 2011-09-08:

Thanks Dan,

I went through this same investigation process yesterday, I wasn't able to find any record of a public conversation (although I find it's hard to accurately search IRC)

For anyone playing along at home, I've dug up the commit and it is here:

http://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=307a4371

I think the rank_cd looks about as functional as the previous relevancy adjustment table, so based on glancing at the code I'd suggest that the old table be deprecated and if possible, some kind of advisory sent out that it significantly slows things down -- I am fairly certain that there are other sites that are getting 2+ minute searches and they don't know about this problem.

~James

Revision history for this message

Dan Scott (denials) wrote on 2011-09-08:

Were you looking for https://bugs.launchpad.net/evergreen/+bug/788379 perhaps?

Revision history for this message

Mike Rylander (mrylander) wrote on 2012-05-17:

For investigative purposes, I'm bringing back James' original idea of removing naco_normalize (now spelled search_normalize) from the left side of the comparison.

In order to support this we need to apply all normalizers to the value column in the field entry tables. That's done in a straight-forward way be just pushing the data into the column at the appropriate time in the appropriate trigger. There is a small loss of accuracy, because some fields don't get that normalizations when being stored, but the user-supplied value will still be normalized. But, these are just for relevance adjustment and not matching, so it's not too bad IMO.

Next, we make QueryParser forget about applying search_normalize to the column (but still to the value).

Testing appreciated! (basically untested at this point)

You'll need to rewrite the the field entry tables. The following should be enough:

UPDATE metabib.author_field_entry SET id = id;
UPDATE metabib.title_field_entry SET id = id;
UPDATE metabib.subject_field_entry SET id = id;
UPDATE metabib.series_field_entry SET id = id;
UPDATE metabib.keyword_field_entry SET id = id;
UPDATE metabib.identifier_field_entry SET id = id;

Here's the WIP branch: http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/reduce-normalization

Revision history for this message

Mike Rylander (mrylander) wrote on 2012-05-17:

As requested by Dan Scott: http://pastebin.com/FSW9EDem

Even on a very small data set (47k bibs) shows an enormous improvement. More work and thought is needed, but it's a start.

Jason Stephenson (jstephenson) on 2012-12-12

Changed in evergreen:
status:	New → Incomplete

Jason Stephenson (jstephenson) on 2012-12-12

Changed in evergreen:
status:	Incomplete → Triaged

Revision history for this message

Michele Morgan (mmorgan) wrote on 2019-03-08:

Marking Won't Fix due to age and lack of activity.

Changed in evergreen:
status:	Triaged → Won't Fix

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.