E.41. rum

E.41.1. Introduction
E.41.2. Common Operators and Functions
E.41.3. Operator Classes

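E.41.1. Introduction

The rum module provides an access method for working with RUM indexes. It is based on the GIN access method code.

A GIN index allows fast full-text search using the tsvector and tsquery types. However, full-text search with a GIN index has several problems:

Slow ranking. Ranking requires positional information about lexemes. A GIN index does not store lexeme positions, so after the index scan an additional heap scan is needed to retrieve them.
Slow phrase search. This problem is related to the previous one: phrase search also requires positional information.
Slow ordering by timestamp. A GIN index cannot store related information alongside lexemes, so an additional heap scan is needed.

RUM solves these problems by storing additional information in the posting tree, for example the positional information of lexemes or timestamps.

The drawback of RUM is that its build and insert times are slower than those of GIN. This is because additional information must be stored besides the keys, and because RUM uses generic WAL records.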

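E.41.2. Common Operators and Functions

The rum module provides the following operators.

Operator	Returns	Description
tsvector <=> tsquery	float4	Returns the distance between a tsvector and a tsquery.
timestamp <=> timestamp	float8	Returns the distance between two timestamps.
timestamp <=| timestamp	float8	Returns the distance only for left timestamps.
timestamp |=> timestamp	float8	Returns the distance only for right timestamps.

The last three operators also work with the following types: timestamptz, int2, int4, int8, float4, float8 and oid.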
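
As a minimal sketch, the distance operators can be tried directly on literal values once the extension is installed. The literals and column aliases below are arbitrary choices for illustration; results are not shown here.

```sql
-- Assumes CREATE EXTENSION rum; has already been run in this database.

-- Two-sided distance between two timestamps.
SELECT '2016-05-16 14:21:25'::timestamp <=> '2016-05-16 14:21:22'::timestamp AS dist;

-- One-sided variants of the same comparison.
SELECT '2016-05-16 14:21:22'::timestamp <=| '2016-05-16 14:21:25'::timestamp AS left_dist,
       '2016-05-16 14:21:22'::timestamp |=> '2016-05-16 14:21:25'::timestamp AS right_dist;

-- The same operator applied to one of the other listed types.
SELECT 10::int4 <=> 25::int4 AS int_dist;
```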

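E.41.3. Operator Classes

rum provides the following operator classes.

E.41.3.1. rum_tsvector_ops

For types: tsvector

This operator class stores tsvector lexemes with positional information. It supports ordering by the <=> operator and prefix search. An example follows.

Suppose we have the following table: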

CREATE TABLE test_rum(t text, a tsvector);

CREATE TRIGGER tsvectorupdate
BEFORE UPDATE OR INSERT ON test_rum
FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger('a', 'pg_catalog.english', 't');

INSERT INTO test_rum(t) VALUES ('The situation is most beautiful');
INSERT INTO test_rum(t) VALUES ('It is a beautiful');
INSERT INTO test_rum(t) VALUES ('It looks like a beautiful place');

To create a rum index, we first need to create the extension:

CREATE EXTENSION rum;

Then we can create the new index:

CREATE INDEX rumidx ON test_rum USING rum (a rum_tsvector_ops);

And we can execute the following queries:

SELECT t, a <=> to_tsquery('english', 'beautiful | place') AS rank
    FROM test_rum
    WHERE a @@ to_tsquery('english', 'beautiful | place')
    ORDER BY a <=> to_tsquery('english', 'beautiful | place');
                t                |  rank
---------------------------------+---------
 It looks like a beautiful place | 8.22467
 The situation is most beautiful | 16.4493
 It is a beautiful               | 16.4493
(3 rows)

SELECT t, a <=> to_tsquery('english', 'place | situation') AS rank
    FROM test_rum
    WHERE a @@ to_tsquery('english', 'place | situation')
    ORDER BY a <=> to_tsquery('english', 'place | situation');
                t                |  rank
---------------------------------+---------
 The situation is most beautiful | 16.4493
 It looks like a beautiful place | 16.4493
(2 rows)
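
Since rum_tsvector_ops supports prefix search, a prefix query should be able to use the same index. The following is a sketch using the standard tsquery prefix syntax (:*); the prefix 'beaut' is an arbitrary choice for illustration.

```sql
-- Prefix search against the test_rum table defined above.
-- 'beaut:*' matches any lexeme that starts with "beaut".
SELECT t, a <=> to_tsquery('english', 'beaut:*') AS rank
    FROM test_rum
    WHERE a @@ to_tsquery('english', 'beaut:*')
    ORDER BY a <=> to_tsquery('english', 'beaut:*');
```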

E.41.3.2. rum_tsvector_hash_ops

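For types: tsvector

This operator class stores hashes of tsvector lexemes with positional information. It supports ordering by the <=> operator, but it does not support prefix search.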
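
A minimal sketch of using this operator class with the test_rum table from the previous section. The index name rumidx_hash is hypothetical.

```sql
-- rumidx_hash is a hypothetical index name chosen for this example.
CREATE INDEX rumidx_hash ON test_rum USING rum (a rum_tsvector_hash_ops);

-- Ranking by <=> works as with rum_tsvector_ops; prefix queries
-- such as 'beaut:*' are not supported by this operator class.
SELECT t, a <=> to_tsquery('english', 'beautiful | place') AS rank
    FROM test_rum
    WHERE a @@ to_tsquery('english', 'beautiful | place')
    ORDER BY a <=> to_tsquery('english', 'beautiful | place');
```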

E.41.3.3. rum_TYPE_ops

For types: int2, int4, int8, float4, float8, oid, time, timetz, date, interval, macaddr, inet, cidr, text, varchar, char, bytea, bit, varbit, numeric, timestamp, timestamptz

Supported operations: <, <=, =, >=, > for all types, and additionally <=>, <=| and |=> for the int2, int4, int8, float4, float8, oid, timestamp and timestamptz types.

Supports ordering by the <=>, <=| and |=> operators. Can be used together with the rum_tsvector_addon_ops, rum_tsvector_hash_addon_ops and rum_anyarray_addon_ops operator classes.
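
As a sketch of the rum_TYPE_ops naming pattern, the following uses rum_timestamp_ops on a plain timestamp column. The table events, its columns and the index name are hypothetical and serve only as an illustration.

```sql
-- "events", its columns, and the index name are hypothetical.
CREATE TABLE events (id int, created timestamp);

INSERT INTO events VALUES
    (1, '2016-05-16 13:21:22'),
    (2, '2016-05-16 14:21:22'),
    (3, '2016-05-17 06:21:22');

CREATE INDEX events_created_idx ON events USING rum (created rum_timestamp_ops);

-- Order rows by their distance from a reference timestamp.
SELECT id, created
    FROM events
    ORDER BY created <=> '2016-05-16 14:21:25'
    LIMIT 2;
```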

E.41.3.4. rum_tsvector_addon_ops

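For types: tsvector

This operator class stores tsvector lexemes together with a value from any field type supported by the module. An example follows.

Suppose we have the following table: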

CREATE TABLE tsts (id int, t tsvector, d timestamp);

\copy tsts from 'rum/data/tsts.data'

CREATE INDEX tsts_idx ON tsts USING rum (t rum_tsvector_addon_ops, d)
    WITH (attach = 'd', to = 't');

Now we can execute the following queries:

EXPLAIN (costs off)
    SELECT id, d, d <=> '2016-05-16 14:21:25' FROM tsts WHERE t @@ 'wr&qh' ORDER BY d <=> '2016-05-16 14:21:25' LIMIT 5;
                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Limit
   ->  Index Scan using tsts_idx on tsts
         Index Cond: (t @@ '''wr'' & ''qh'''::tsquery)
         Order By: (d <=> 'Mon May 16 14:21:25 2016'::timestamp without time zone)
(4 rows)

SELECT id, d, d <=> '2016-05-16 14:21:25' FROM tsts WHERE t @@ 'wr&qh' ORDER BY d <=> '2016-05-16 14:21:25' LIMIT 5;
 id  |                d                |   ?column?
-----+---------------------------------+---------------
 355 | Mon May 16 14:21:22.326724 2016 |      2.673276
 354 | Mon May 16 13:21:22.326724 2016 |   3602.673276
 371 | Tue May 17 06:21:22.326724 2016 |  57597.326724
 406 | Wed May 18 17:21:22.326724 2016 | 183597.326724
 415 | Thu May 19 02:21:22.326724 2016 | 215997.326724
(5 rows)

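Warning: RUM currently behaves incorrectly when an index is created with ordering over pass-by-reference additional information. This is because posting trees have a fixed-length right bound and fixed-length non-leaf posting items. Creating such indexes is not allowed.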

E.41.3.5. rum_tsvector_hash_addon_ops

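For types: tsvector

This operator class stores hashes of tsvector lexemes together with a value from any field type supported by the module.

It does not support prefix search.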
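
A minimal sketch, reusing the tsts table from the rum_tsvector_addon_ops section and swapping in the hash variant. The index name tsts_hash_idx is hypothetical.

```sql
-- tsts_hash_idx is a hypothetical index name chosen for this example.
CREATE INDEX tsts_hash_idx ON tsts USING rum (t rum_tsvector_hash_addon_ops, d)
    WITH (attach = 'd', to = 't');

-- Full-text match on t, ordered by distance of the attached timestamp d.
SELECT id, d, d <=> '2016-05-16 14:21:25' AS dist
    FROM tsts
    WHERE t @@ 'wr&qh'
    ORDER BY d <=> '2016-05-16 14:21:25'
    LIMIT 5;
```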

E.41.3.6. rum_tsquery_ops

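For types: tsquery

This operator class stores branches of the query tree in the additional information. For example, suppose we have the following table: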

CREATE TABLE query (q tsquery, tag text);

INSERT INTO query VALUES ('supernova & star', 'sn'),
    ('black', 'color'),
    ('big & bang & black & hole', 'bang'),
    ('spiral & galaxy', 'shape'),
    ('black & hole', 'color');

CREATE INDEX query_idx ON query USING rum(q);

Now we can execute the following fast query:

SELECT * FROM query
    WHERE to_tsvector('black holes never exists before we think about them') @@ q;
        q         |  tag
------------------+-------
 'black'          | color
 'black' & 'hole' | color
(2 rows)

E.41.3.7. rum_anyarray_ops

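For types: anyarray

This operator class stores anyarray elements together with the length of the array. It supports the &&, @>, <@, = and % operators, and ordering by the <=> operator. For example, suppose we have the following table: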

CREATE TABLE test_array (i int2[]);

INSERT INTO test_array VALUES ('{}'), ('{0}'), ('{1,2,3,4}'), ('{1,2,3}'), ('{1,2}'), ('{1}');

CREATE INDEX idx_array ON test_array USING rum (i rum_anyarray_ops);

Now we can execute the query using an index scan:

SET enable_seqscan TO off;

EXPLAIN (COSTS OFF) SELECT * FROM test_array WHERE i && '{1}' ORDER BY i <=> '{1}' ASC;
                QUERY PLAN
------------------------------------------
 Index Scan using idx_array on test_array
   Index Cond: (i && '{1}'::smallint[])
   Order By: (i <=> '{1}'::smallint[])
(3 rows)

SELECT * FROM test_array WHERE i && '{1}' ORDER BY i <=> '{1}' ASC;
     i
-----------
 {1}
 {1,2}
 {1,2,3}
 {1,2,3,4}
(4 rows)

E.41.3.8. rum_anyarray_addon_ops

For types: anyarray

This operator class stores anyarray elements together with a value from any field type supported by the module.
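
By analogy with the rum_tsvector_addon_ops example, a sketch of attaching a timestamp to an array column might look as follows. The table test_array_order, the column add_info, the index name and the sample rows are all hypothetical.

```sql
-- test_array_order, add_info, and idx_array_order are hypothetical names.
CREATE TABLE test_array_order (i int2[], add_info timestamp);

INSERT INTO test_array_order VALUES
    ('{1,2,3}', '2016-05-16 14:21:22'),
    ('{1,2}',   '2016-05-17 06:21:22'),
    ('{4}',     '2016-05-18 17:21:22');

-- Attach the timestamp column to the array column inside the index.
CREATE INDEX idx_array_order ON test_array_order
    USING rum (i rum_anyarray_addon_ops, add_info)
    WITH (attach = 'add_info', to = 'i');

-- Filter on the array, order by distance of the attached timestamp.
SELECT *
    FROM test_array_order
    WHERE i @> '{1,2}'
    ORDER BY add_info <=> '2016-05-16 14:21:25'
    LIMIT 10;
```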
