---
title: "Lucene 的倒排索引设计（一）"
date: 2026-02-23
lastmod: 2026-02-23
description: "深入讲解 Lucene 倒排索引的核心数据结构，详细分析 Posting 数据结构中的各个字段含义和作用，包括 docFreq、freqStart、proxStart 等，以及倒排索引中存储的关键信息如 docID、词频、位置、偏移等，介绍 ByteBlockPool 和 CharBlockPool 的内存管理机制，帮助开发者理解 Lucene 高效检索的底层实现。"
tags: ["Lucene", "倒排索引", "Posting"]
summary: "倒排索引 #  倒排索引（Inverted Index）是一种数据库索引，存储着从文档中的分词（Term）到文档（Document）的映射。它的主要用途是允许进行快速的全文搜索。
倒排索引的基本结构如下所示：
term1 -&gt; &lt;doc_1, doc_2, ..., doc_n&gt; term2 -&gt; &lt;doc_x, doc_y, ...&gt; ... Lucene 在实际构建倒排索引时，需要考虑支持一些复杂的搜索场景，因此存储的信息会更丰富。例如，它不仅记录分词在哪些文档中出现，还会记录分词在文档中出现的频度（Frequency）及位置（Position），以支持高效的短语搜索和相关性评分功能。
Posting 数据结构 #  在 Lucene 中，一个文档（Document）由若干个域（Field）组成，每个域通过分词解析后输出一串分词（Term）。
Lucene 为每个分词初始化一个倒排列表（Posting），用来表达分词在当前段索引下的信息。
倒排列表结构
private final static class Posting { int textStart; // Address into char[] blocks where our text is stored  int docFreq; // # times this term occurs in the current doc  int freqStart; // Address of first byte[] slice for freq  int freqUpto; // Next write address for freq  int proxStart; // Address of first byte[] slice for prox  int proxUpto; // Next write address for prox  int lastDocID; // Last docID where this term occurred  int lastDocCode; // Code for prior doc  int lastPosition; // Last position where this term occurred  PostingVector vector; // Corresponding PostingVector instance } 倒排列表中存储了每个分词在当前段中的关键信息。例如："
---


### 倒排索引

倒排索引（Inverted Index）是一种数据库索引，存储着从文档中的分词（Term）到文档（Document）的映射。它的主要用途是允许进行快速的全文搜索。

倒排索引的基本结构如下所示：

```bash
term1 -> <doc_1, doc_2, ..., doc_n>
term2 -> <doc_x, doc_y, ...>
...
```

Lucene 在实际构建倒排索引时，需要考虑支持一些复杂的搜索场景，因此存储的信息会更丰富。例如，它不仅记录分词在哪些文档中出现，还会记录分词在文档中出现的频度（Frequency）及位置（Position），以支持高效的短语搜索和相关性评分功能。

### Posting 数据结构

在 Lucene 中，一个文档（Document）由若干个域（Field）组成，每个域通过分词解析后输出一串分词（Term）。

Lucene 为每个分词初始化一个倒排列表（Posting），用来表达分词在当前段索引下的信息。

**倒排列表结构**

```java
private final static class Posting {
    int textStart;       // Address into char[] blocks where our text is stored
    int docFreq;         // # times this term occurs in the current doc
    int freqStart;       // Address of first byte[] slice for freq
    int freqUpto;        // Next write address for freq
    int proxStart;       // Address of first byte[] slice for prox
    int proxUpto;        // Next write address for prox
    int lastDocID;       // Last docID where this term occurred
    int lastDocCode;     // Code for prior doc
    int lastPosition;    // Last position where this term occurred
    PostingVector vector; // Corresponding PostingVector instance
}
```

倒排列表中存储了每个分词在当前段中的关键信息。例如：

- 当前分词的字符串值（由 `CharBlockPool` 存储缓冲管理）
- 当前段中哪些文档包含当前的分词
- 当前分词在每个文档中出现的具体位置（由 `ByteBlockPool` 存储缓冲管理）

倒排索引中主要包含如下信息：

- **docID**：用于唯一定位文档。
- **词频（Term Frequency, TF）**：某分词在该文档中出现的次数。
- **位置（Position）**：某分词在文档中的位置。
- **偏移（Offset）**：某分词开始和结束的位置。

上述信息存储在一段动态扩展的内存空间中，分别由 `ByteBlockPool` 或者 `CharBlockPool` 来组织与管理。倒排列表中大多数字段指向 `ByteBlockPool` 缓存的地址偏移量。从设计角度来看，将其称为 `PostingIndex` 可能会比 `ByteBlockPool` 更加准确。