relevance adjustment makes searches slow in 2.0

Bug #844374 reported by James Fournie
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Evergreen
Won't Fix
Undecided
Unassigned

Bug Description

Evergreen 2.0.5
Postgres 8.4

adding values to search.relevance_adjustment really slows down searches, far more than it should. investigating, I've found this relates to naco_normalize running on all of the comparisons. Here is the SQL from our postgres log for a relevance-adjusted query (keyword for dinosaur):

EXPLAIN ANALYZE SELECT m.source AS id,
ARRAY_ACCUM(DISTINCT m.source) AS records,
(AVG(
(rank(x80878b0_keyword.index_vector, x80878b0_keyword.tsq) * x80878b0_keyword.weight
* COALESCE(NULLIF( (naco_normalize(x80878b0_keyword.value) ~ naco_normalize(E'dinosaur')), FALSE )::INT * 2, 1)
* COALESCE(NULLIF( (naco_normalize(x80878b0_keyword.value) ~ ('^'||naco_normalize(E'dinosaur'))), FALSE )::INT * 5, 1)
* COALESCE(NULLIF( (naco_normalize(x80878b0_keyword.value) ~ ('^'||naco_normalize(E'dinosaur')||'$')), FALSE )::INT * 5, 1))
) * COALESCE( NULLIF( FIRST(mrd.item_lang) = $_981$eng$_981$ , FALSE )::INT * 5, 1))::NUMERIC AS rel,
(AVG(
(rank(x80878b0_keyword.index_vector, x80878b0_keyword.tsq) * x80878b0_keyword.weight
* COALESCE(NULLIF( (naco_normalize(x80878b0_keyword.value) ~ (naco_normalize(E'dinosaur'))), FALSE )::INT * 2, 1)
* COALESCE(NULLIF( (naco_normalize(x80878b0_keyword.value) ~ ('^'||naco_normalize(E'dinosaur'))), FALSE )::INT * 5, 1)
* COALESCE(NULLIF( (naco_normalize(x80878b0_keyword.value) ~ ('^'||naco_normalize(E'dinosaur')||'$')), FALSE )::INT * 5, 1))
) * COALESCE( NULLIF( FIRST(mrd.item_lang) = $_981$eng$_981$ , FALSE )::INT * 5, 1))::NUMERIC AS rank,
FIRST(mrd.date1) AS tie_break
FROM metabib.metarecord_source_map m
JOIN metabib.rec_descriptor mrd ON (m.source = mrd.record)
LEFT JOIN (
SELECT fe.index_vector, naco_normalize(fe.value) as value, fe.id, fe.source, fe_weight.weight, x.tsq /* search */
  FROM metabib.keyword_field_entry AS fe
JOIN config.metabib_field AS fe_weight ON (fe_weight.id = fe.field)
JOIN (SELECT
to_tsquery('keyword', COALESCE(NULLIF( '(' || btrim(regexp_replace(split_date_range(naco_normalize(E'dinosaur')),E'(?:\\s+|:)','&','g'),'&|') || ')', '()'), '')) AS tsq ) AS x ON (fe.index_vector @@ x.tsq)
) AS x80878b0_keyword ON (m.source = x80878b0_keyword.source)
  WHERE 1=1
        AND ((x80878b0_keyword.id IS NOT NULL))
  GROUP BY 1
  ORDER BY 4 DESC NULLS LAST, 5 DESC NULLS LAST, 3 DESC
  LIMIT 10000
;

this takes about 8 seconds on our server, but 500 ms when the naco_normalizes on the left side of the regex comparisons are removed. so something is seriously wrong there -- I have tried a few different things but can't get the query planner to cooperate, it seems the naco_normalize is not being immutable.

not sure what the best course of action would be to correct this performance issue, my only thoughts are either to remove the normalization (and have slightly poorer relevancy) or to materialize the normalized values.

I have also tried normalizing within the SELECT AS x80878b0_keyword, but that doesn't help either,

Revision history for this message
Dan Wells (dbw2) wrote :

Hello James,

Thank you for this detailed report. This is a known issue which was discussed a few months ago, but I'll be darned if I can find that discussion. The current status is that newer versions of EG have switched from rank() to rank_cd() for the relevance ranking, and this change greatly reduces or eliminates the need for the internal EG adjustments. We made this change to our 2.0 instance manually about a week after upgrading, as some searches were unbearably long (2+ minutes), and are satisfied with the relevance supplied by rank_cd() alone. I am unsure if this change got into any official 2.0 releases, or if it is only 2.1+, but it is ultimately only a small code change if you wish to apply it yourself.

I did spend a few hours one day back in April trying to see if this could be fixed by indexing changes alone (rather than a materialized table), but ultimately had to shelve the idea in favor of more pressing concerns (which is probably a good summary of how many of us view this issue). There is also the thought of creating our own "rank_eg()" type function, but that is not likely to happen short-term.

Dan W.

Revision history for this message
James Fournie (jfournie) wrote :

Thanks Dan,

I went through this same investigation process yesterday, I wasn't able to find any record of a public conversation (although I find it's hard to accurately search IRC)

For anyone playing along at home, I've dug up the commit and it is here:

http://git.evergreen-ils.org/?p=Evergreen.git;a=commitdiff;h=307a4371

I think the rank_cd looks about as functional as the previous relevancy adjustment table, so based on glancing at the code I'd suggest that the old table be deprecated and if possible, some kind of advisory sent out that it significantly slows things down -- I am fairly certain that there are other sites that are getting 2+ minute searches and they don't know about this problem.

~James

Revision history for this message
Dan Scott (denials) wrote :
Revision history for this message
Mike Rylander (mrylander) wrote :

For investigative purposes, I'm bringing back James' original idea of removing naco_normalize (now spelled search_normalize) from the left side of the comparison.

In order to support this we need to apply all normalizers to the value column in the field entry tables. That's done in a straight-forward way be just pushing the data into the column at the appropriate time in the appropriate trigger. There is a small loss of accuracy, because some fields don't get that normalizations when being stored, but the user-supplied value will still be normalized. But, these are just for relevance adjustment and not matching, so it's not too bad IMO.

Next, we make QueryParser forget about applying search_normalize to the column (but still to the value).

Testing appreciated! (basically untested at this point)

You'll need to rewrite the the field entry tables. The following should be enough:

UPDATE metabib.author_field_entry SET id = id;
UPDATE metabib.title_field_entry SET id = id;
UPDATE metabib.subject_field_entry SET id = id;
UPDATE metabib.series_field_entry SET id = id;
UPDATE metabib.keyword_field_entry SET id = id;
UPDATE metabib.identifier_field_entry SET id = id;

Here's the WIP branch: http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/miker/reduce-normalization

Revision history for this message
Mike Rylander (mrylander) wrote :

As requested by Dan Scott: http://pastebin.com/FSW9EDem

Even on a very small data set (47k bibs) shows an enormous improvement. More work and thought is needed, but it's a start.

Changed in evergreen:
status: New → Incomplete
Changed in evergreen:
status: Incomplete → Triaged
Revision history for this message
Michele Morgan (mmorgan) wrote :

Marking Won't Fix due to age and lack of activity.

Changed in evergreen:
status: Triaged → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.