Index encoding format for Kvrocks Search

Different from the encoding method of other data structures in Kvrocks (e.g. String, Hash, ZSet ...), Apache Kvrocks™ Search (a.k.a. Kvrocks Search) uses an independent column family (named search) and a separately designed encoding format to store indexing-related metadata and data.

Kvrocks Search is released since 2.11.0, see its docs.

Common encoding

Key types

key type	enum value
INDEX_META	0
PREFIXES	1
FIELD_META	2
FIELD	3
FIELD_ALIAS	4

The common encoding format of key is as follows:

+-------------+-------------+-------------+----------------+-------------+-------------------+
|  ns size    |  namespace  |  key type   |  idx name size |  index name |  other fields...  |
| (1byte: X)  |   (Xbyte)   |  (1byte)    |   (4bytes: Y)  |  (Y bytes)  |   (variable)      |
+-------------+-------------+-------------+----------------+-------------+-------------------+

Field types and flags

field type	enum value
tag	1
numeric	2
vector	3

The common encoding format of a field flag is:

|              8 bit                                 |
|----------------------------------------------------|
|  noindex: 1bit | field type: 4bit | reserved: 3bit |

Metadata encoding

In Kvrocks Search, metadata refers to the metadata of an index (also known as a schema), including some properties of the index, which fields are included in this index, what type each field is, and what properties they have.

Index metadata

| namespace | INDEX_META | index name |      | index flag | on data type  |
|-----------|------------|------------|  ->  |------------|---------------|
| 1+X bytes |  1 byte    | 4+Y bytes  |  ->  |   1 byte   |    1 byte     |

where index flag is currently 8-bit all reserved (equals to 0), and on data type is one of:

on data type	enum value
HASH	2
JSON	10

Index prefixes

| namespace | PREFIXES   | index name |      | prefix strings... |
|-----------|------------|------------|  ->  |-------------------|
| 1+X bytes |  1 byte    | 4+Y bytes  |  ->  |  (4+Zi)*N bytes   |

Index prefixes are used to determine which keys belong to the tracking scope of this index. It consists of an array of strings, where each string is a key prefix.

Tag field metadata

| namespace | FIELD_META | index name | field name |      | field flag | separator  | case sensitive |
|-----------|------------|------------|------------|  ->  |------------|------------|----------------|
| 1+X bytes |  1 byte    | 4+Y bytes  | 4+Z bytes  |  ->  |   1 byte   |  1 byte    |     1 byte     |

where separator currently can only be an ASCII character, and case sensitive can be 0 (false) or 1 (true).

Numeric field metadata

| namespace | FIELD_META | index name | field name |      | field flag |
|-----------|------------|------------|------------|  ->  |------------|
| 1+X bytes |  1 byte    | 4+Y bytes  | 4+Z bytes  |  ->  |   1 byte   |

HNSW Vector Field Metadata

This metadata format is specifically designed to support efficient vector search using the HNSW (Hierarchical Navigable Small World) algorithm. The encoding captures various parameters and settings relevant for managing the vector index properties and optimizing vector search operations.

| namespace | FIELD_META | index name | field name |      | field flag | vector type | dimension | distance metric | initial cap |     m     | ef construction | ef runtime | epsilon | number of levels |
|-----------|------------|------------|------------|  ->  |------------|-------------|-----------|-----------------|-------------|-----------|-----------------|------------|---------|------------------|
| 1+X bytes |  1 byte    | 4+Y bytes  | 4+Z bytes  |  ->  |   1 byte   |   1 byte    |  2 bytes  |     1 byte      |   4 bytes   |  2 bytes  |     4 bytes     |  4 bytes   | 8 bytes |     2 bytes      |

Required attributes

vector type: Specifies the type of vectors stored (e.g., FLOAT32, FLOAT64); Now Kvrocks only supports FLOAT64.
dimension: The dimensionality of the vectors (number of elements in each vector).
distance metric: Metric used for distance calculation between vectors (i.e. L2, IP, COSINE).

Optional attributes

initial cap: Initial capacity of the HNSW graph, indicating the initial number of elements; Default is 500000.
m: Maximum number of edges per node in the HNSW graph; Default is 16.
ef construction: Size of the dynamic candidate list during the index construction phase; Default is 200.
ef runtime: Size of the dynamic candidate list during the search phase; Default is 10.
epsilon: Epsilon value for approximate search, controlling the trade-off between search precision and speed; Default is 0.01.
number of levels: Number of levels in the HNSW graph, affecting the hierarchical structure of the graph.

Index data encoding

Index data refers to the information stored after indexing the real data, which is used to quickly get corresponding data in subsequent query processes.

Tag field

| namespace | FIELD   | index name | field name | tag value  | user key   |      |    null    |
|-----------|---------|------------|------------|------------|------------|  ->  |------------|
| 1+X bytes | 1 byte  | 4+Y bytes  | 4+Z bytes  | 4+A bytes  | 4+B bytes  |  ->  |   0 byte   |

Numeric field

| namespace | FIELD   | index name | field name | floating number | user key   |      |    null    |
|-----------|---------|------------|------------|-----------------|------------|  ->  |------------|
| 1+X bytes | 1 byte  | 4+Y bytes  | 4+Z bytes  |    8 bytes      | 4+B bytes  |  ->  |   0 byte   |

HNSW Vector field

HNSW graph entry types

hnsw type	enum value
NODE	1
EDGE	2

HNSW node index encoding

| namespace | FIELD   | index name | field name |   level   |    hnsw type   | user key   |      | num of neighbours | vector dimension |      vector data      |
|-----------|---------|------------|------------|-----------|----------------|------------|  ->  |-------------------|------------------|-----------------------|
| 1+X bytes | 1 byte  | 4+Y bytes  | 4+Z bytes  |  2 bytes  |  NODE (1 byte) | 4+B bytes  |  ->  |       2 bytes     |      2 bytes     |  dimension * 8 bytes  |

HNSW edge index encoding

| namespace | FIELD   | index name | field name |   level   |    hnsw type   | user key 1 | user key 2 |      |    null    |
|-----------|---------|------------|------------|-----------|----------------|------------|------------|  ->  |------------|
| 1+X bytes | 1 byte  | 4+Y bytes  | 4+Z bytes  |  2 bytes  |  EDGE (1 byte) | 4+B bytes  | 4+B bytes  |  ->  |   0 byte   |

where user key 1 and user key 2 represent the endpoints of an edge at a specific level within the HNSW graph.

Index encoding format for Kvrocks Search

Common encoding​

Key types​

Field types and flags​

Metadata encoding​

Index metadata​

Index prefixes​

Tag field metadata​

Numeric field metadata​

HNSW Vector Field Metadata​

Required attributes​

Optional attributes​

Index data encoding​

Tag field​

Numeric field​

HNSW Vector field​

HNSW graph entry types​

HNSW node index encoding​

HNSW edge index encoding​