Lucene 的倒排索引设计（一）

倒排索引 #

倒排索引（Inverted Index）是一种数据库索引，存储着从文档中的分词（Term）到文档（Document）的映射。它的主要用途是允许进行快速的全文搜索。

倒排索引的基本结构如下所示：

term1 -> <doc_1, doc_2, ..., doc_n>
term2 -> <doc_x, doc_y, ...>
...

Lucene 在实际构建倒排索引时，需要考虑支持一些复杂的搜索场景，因此存储的信息会更丰富。例如，它不仅记录分词在哪些文档中出现，还会记录分词在文档中出现的频度（Frequency）及位置（Position），以支持高效的短语搜索和相关性评分功能。

Posting 数据结构 #

在 Lucene 中，一个文档（Document）由若干个域（Field）组成，每个域通过分词解析后输出一串分词（Term）。

Lucene 为每个分词初始化一个倒排列表（Posting），用来表达分词在当前段索引下的信息。

倒排列表结构

private final static class Posting {
    int textStart;       // Address into char[] blocks where our text is stored
    int docFreq;         // # times this term occurs in the current doc
    int freqStart;       // Address of first byte[] slice for freq
    int freqUpto;        // Next write address for freq
    int proxStart;       // Address of first byte[] slice for prox
    int proxUpto;        // Next write address for prox
    int lastDocID;       // Last docID where this term occurred
    int lastDocCode;     // Code for prior doc
    int lastPosition;    // Last position where this term occurred
    PostingVector vector; // Corresponding PostingVector instance
}

倒排列表中存储了每个分词在当前段中的关键信息。例如：

当前分词的字符串值（由 CharBlockPool 存储缓冲管理）
当前段中哪些文档包含当前的分词
当前分词在每个文档中出现的具体位置（由 ByteBlockPool 存储缓冲管理）

倒排索引中主要包含如下信息：

docID：用于唯一定位文档。
词频（Term Frequency, TF）：某分词在该文档中出现的次数。
位置（Position）：某分词在文档中的位置。
偏移（Offset）：某分词开始和结束的位置。

上述信息存储在一段动态扩展的内存空间中，分别由 ByteBlockPool 或者 CharBlockPool 来组织与管理。倒排列表中大多数字段指向 ByteBlockPool 缓存的地址偏移量。从设计角度来看，将其称为 PostingIndex 可能会比 ByteBlockPool 更加准确。

标签