relevance adjustment makes searches slow in 2.0
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Evergreen |
Won't Fix
|
Undecided
|
Unassigned |
Bug Description
Evergreen 2.0.5
Postgres 8.4
adding values to search.
EXPLAIN ANALYZE SELECT m.source AS id,
ARRAY_ACCUM(
(AVG(
(rank(x80878b0_
* COALESCE(NULLIF( (naco_normalize
* COALESCE(NULLIF( (naco_normalize
* COALESCE(NULLIF( (naco_normalize
) * COALESCE( NULLIF( FIRST(mrd.
(AVG(
(rank(x80878b0_
* COALESCE(NULLIF( (naco_normalize
* COALESCE(NULLIF( (naco_normalize
* COALESCE(NULLIF( (naco_normalize
) * COALESCE( NULLIF( FIRST(mrd.
FIRST(mrd.date1) AS tie_break
FROM metabib.
JOIN metabib.
LEFT JOIN (
SELECT fe.index_vector, naco_normalize(
FROM metabib.
JOIN config.
JOIN (SELECT
to_tsquery(
) AS x80878b0_keyword ON (m.source = x80878b0_
WHERE 1=1
AND ((x80878b0_
GROUP BY 1
ORDER BY 4 DESC NULLS LAST, 5 DESC NULLS LAST, 3 DESC
LIMIT 10000
;
this takes about 8 seconds on our server, but 500 ms when the naco_normalizes on the left side of the regex comparisons are removed. so something is seriously wrong there -- I have tried a few different things but can't get the query planner to cooperate, it seems the naco_normalize is not being immutable.
not sure what the best course of action would be to correct this performance issue, my only thoughts are either to remove the normalization (and have slightly poorer relevancy) or to materialize the normalized values.
I have also tried normalizing within the SELECT AS x80878b0_keyword, but that doesn't help either,
Changed in evergreen: | |
status: | New → Incomplete |
Changed in evergreen: | |
status: | Incomplete → Triaged |
Hello James,
Thank you for this detailed report. This is a known issue which was discussed a few months ago, but I'll be darned if I can find that discussion. The current status is that newer versions of EG have switched from rank() to rank_cd() for the relevance ranking, and this change greatly reduces or eliminates the need for the internal EG adjustments. We made this change to our 2.0 instance manually about a week after upgrading, as some searches were unbearably long (2+ minutes), and are satisfied with the relevance supplied by rank_cd() alone. I am unsure if this change got into any official 2.0 releases, or if it is only 2.1+, but it is ultimately only a small code change if you wish to apply it yourself.
I did spend a few hours one day back in April trying to see if this could be fixed by indexing changes alone (rather than a materialized table), but ultimately had to shelve the idea in favor of more pressing concerns (which is probably a good summary of how many of us view this issue). There is also the thought of creating our own "rank_eg()" type function, but that is not likely to happen short-term.
Dan W.