How to Compact Data in Milvus


The official release of Milvus 2.0 GA brings a list of new features. Among them, compaction is one that can help you save storage space.

Compaction refers to the process of merging small segments into large ones and cleaning logically deleted data. In other words, compaction reduces the usage of disk space by purging the deleted or expired entities in binlogs. It is a background task that is triggered by data coord and executed by the data node in Milvus.

This article dissects the concept and implementation of compaction in Milvus.

What Is Compaction?

Before going deep into the details of how to implement compaction in Milvus 2.0, it is critical to figure out what compaction is in Milvus.

More often than not, as a Milvus user, you might have been bothered by the increasing usage of hard disk space. Another issue is that a segment with fewer than 1,024 rows is not indexed and only supports brute-force search to process queries. Small segments caused by auto-flush or user-invoked flush might hamper query efficiency.

Therefore, to solve the two issues mentioned above and help reduce disk usage and improve query efficiency, Milvus supports compaction.

Databases like LevelDB and RocksDB append data to sorted string tables (SSTables). The average number of disk reads per query increases with the number of SSTables, leading to inefficient queries. To reduce read amplification and release hard drive space, these databases compact multiple SSTables into one. Compaction processes run in the background automatically.

Similarly, Milvus appends inserted and deleted data to binlogs. As the number of binlogs increases, more hard disk space is used. To release hard disk space, Milvus compacts binlogs of deleted and inserted data. If an entity is inserted but later deleted, it no longer exists in the binlogs that record data insertion or deletion once they are compacted. In addition, Milvus also compacts segments – data files automatically created by Milvus for holding inserted data.
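To make the idea concrete, here is a toy model of what compacting insert and delete records achieves. It is only a sketch and not Milvus code: it assumes binlogs are plain Python lists, and `compact_binlogs` is a hypothetical helper invented for this illustration.

```python
def compact_binlogs(insert_logs, delete_logs):
    """Merge insert logs into one and purge entities that were deleted.

    insert_logs: list of logs, each a list of (primary_key, vector) tuples.
    delete_logs: list of logs, each a list of deleted primary keys.
    """
    # Collect every primary key that appears in any delete log.
    deleted = {pk for log in delete_logs for pk in log}
    merged = []
    for log in insert_logs:
        for pk, vector in log:
            if pk not in deleted:  # keep only entities that still exist
                merged.append((pk, vector))
    return merged


insert_logs = [[(1, [0.1, 0.2]), (2, [0.3, 0.4])], [(3, [0.5, 0.6])]]
delete_logs = [[2]]
compacted = compact_binlogs(insert_logs, delete_logs)
print(compacted)  # entity 2 is purged; entities 1 and 3 end up in one log
```

After compaction, neither the insert record nor the delete record of entity 2 needs to be stored, which is where the disk savings come from.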

How Do You Configure Compaction?

Configuration of compaction in Milvus mainly involves two parameters: dataCoord.enableCompaction and common.retentionDuration.

dataCoord.enableCompaction specifies whether to enable compaction. Its default value is true.

common.retentionDuration specifies a retention period, in seconds, during which data is not compacted. Once data is compacted, deleted entities are no longer available for search with Time Travel. Therefore, if you plan to search with Time Travel, you have to specify a period during which compaction does not touch deleted data. To ensure accurate results of searches with Time Travel, Milvus retains data operated on within the period specified by common.retentionDuration. That is, data operated on within this period is not compacted. For more details, see Search with Time Travel.
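The effect of the retention period can be sketched in a few lines. This is an illustration under stated assumptions rather than the Milvus implementation: timestamps are taken to be plain seconds, the boundary comparison is a guess, and `purgeable_deletes` is a hypothetical helper.

```python
def purgeable_deletes(delete_timestamps, now, retention_duration):
    """Return the delete timestamps that fall outside the retention period.

    All values are in seconds (an assumption made for this sketch).
    Deletes newer than the cutoff are retained so that Time Travel
    searches can still see the affected entities.
    """
    cutoff = now - retention_duration
    return [ts for ts in delete_timestamps if ts < cutoff]


# With a retention period of 300 seconds and "now" at 1000,
# only deletes made before second 700 may be purged.
eligible = purgeable_deletes([100, 650, 800, 990], now=1000, retention_duration=300)
print(eligible)  # [100, 650]
```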

Compaction is enabled in Milvus by default. If you disabled compaction but later want to compact data manually, follow the steps below:

  1. Call the collection.compact() method to trigger a global compaction process manually. Note that this operation might take a long time.
  2. The call returns a compaction ID. View the compaction status by calling the collection.get_compaction_state() method.
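The two steps above can be wrapped in a small helper. This is a minimal sketch: `trigger_compaction` is a hypothetical name, and it assumes `collection` is a pymilvus Collection connected to a running Milvus instance. Only the two methods named in the steps are called.

```python
def trigger_compaction(collection):
    """Start a manual compaction and return the state reported for it.

    `collection` is expected to behave like a pymilvus Collection.
    """
    collection.compact()                      # step 1: request compaction
    return collection.get_compaction_state()  # step 2: check its status


# Against a running Milvus instance this might be used as follows
# (host, port, and the collection name "book" are placeholders):
#
#   from pymilvus import connections, Collection
#   connections.connect(host="localhost", port="19530")
#   print(trigger_compaction(Collection("book")))
```

Because compaction runs asynchronously, the state returned right after the request may show that the job is still executing, so you might need to check it again later.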

After compaction is enabled, it runs in the background automatically. Since the compaction process might take a long time, compaction requests are processed asynchronously to save time.

How to Implement Compaction

In Milvus, compaction can be invoked either manually or automatically.

Manual compaction of binlogs or segments does not require meeting any trigger conditions. Therefore, if you manually invoke compaction, the binlogs or segments will be compacted no matter what.

However, if you want to enable automatic compaction, certain compaction trigger conditions need to be met in order for the system to compact your segments or binlogs.

Generally, there are two types of objects that can be compacted in Milvus: binlogs and segments.

Binlog Compaction

A binlog is a binary log, a smaller unit within a segment, that records the updates and changes made to data in the Milvus vector database. Data from a segment is persisted in multiple binlogs. Binlog compaction involves two types of binlogs in Milvus: insert binlogs and delta binlogs.

Delta binlogs are generated when data is deleted while insert binlogs are generated under the following three circumstances.

  • As inserted data is being appended, the segment reaches the upper limit of size and is automatically flushed to the disk.
  • DataCoord automatically flushes segments that stay unsealed for a long time.
  • Some APIs, such as collection.num_entities and collection.load(), automatically invoke flush to write segments to disk.

Therefore, binlog compaction, as its name suggests, refers to compacting binlogs within a segment. More specifically, during binlog compaction, all delta binlogs and insert binlogs that are not retained are compacted.

When a segment is flushed to disk, or when Milvus requests global compaction because compaction has not run for a long time, at least one of the following two conditions needs to be met to trigger automatic compaction:

  1. Rows in delta binlogs are more than 20% of the total rows.
  2. The size of delta binlogs exceeds 10 MB.
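The two conditions above can be expressed as a simple check. The sketch below is illustrative only: the function and constant names are hypothetical, and 10 MB is interpreted as 10 * 1024 * 1024 bytes, which is an assumption.

```python
DELTA_ROW_RATIO = 0.2                # "more than 20% of the total rows"
DELTA_SIZE_LIMIT = 10 * 1024 * 1024  # "exceeds 10 MB", taken here as bytes


def should_compact_binlogs(delta_rows, total_rows, delta_bytes):
    """Return True if either automatic trigger condition is met."""
    # Condition 1: deleted rows make up more than 20% of all rows.
    if total_rows > 0 and delta_rows / total_rows > DELTA_ROW_RATIO:
        return True
    # Condition 2: the delta binlogs are larger than the size limit.
    return delta_bytes > DELTA_SIZE_LIMIT


print(should_compact_binlogs(delta_rows=300, total_rows=1000, delta_bytes=0))  # True
```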

Segment Compaction

A segment is a data file automatically created by Milvus for holding inserted data. There are two types of segments in Milvus: growing segment and sealed segment.

A growing segment keeps receiving newly inserted data until it is sealed. A sealed segment no longer receives any new data and is flushed to the object storage, while new data is inserted into a newly created growing segment.

Therefore, segment compaction refers to compacting multiple sealed segments. More specifically, during segment compaction, small segments are compacted into bigger ones.

Each segment generated after compaction cannot exceed the upper limit of a segment size, which is 512 MB by default. Read system configurations to learn how to modify the upper limit of segment size.
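One way to picture how small segments might be grouped without exceeding the size limit is a greedy packing pass. This is a hypothetical strategy written for illustration, not the plan generation that Milvus actually uses. Sizes are in MB, and `plan_merges` is an invented name.

```python
MAX_SEGMENT_SIZE = 512  # default upper limit of a segment, in MB


def plan_merges(segment_sizes, max_size=MAX_SEGMENT_SIZE):
    """Greedily group segments so each merged segment stays within max_size.

    A segment that is already larger than max_size ends up in a group
    of its own (an assumption of this sketch).
    """
    plans, current, current_size = [], [], 0
    for size in sorted(segment_sizes):
        # Close the current group if adding this segment would overflow it.
        if current and current_size + size > max_size:
            plans.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        plans.append(current)
    return plans


plans = plan_merges([100, 60, 200, 300, 40, 120])
print(plans)  # [[40, 60, 100, 120], [200, 300]]
```

In this example six sealed segments become two, of 320 MB and 500 MB, both under the 512 MB limit.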

When a segment flushes to disk, or when Milvus requests global compaction as compaction has not run for a long time, the following condition needs to be met to trigger automatic compaction:

  • The number of segments smaller than 0.5 * MaxSegmentSize exceeds 10.
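The condition above translates into a short check. As before, this is a sketch with hypothetical names; sizes are in MB and the default limit of 512 MB is taken from the text.

```python
def should_compact_segments(segment_sizes, max_segment_size=512, min_small_segments=10):
    """Return True if more than min_small_segments segments are 'small'.

    A segment counts as small when it is below half of max_segment_size.
    """
    small = [size for size in segment_sizes if size < 0.5 * max_segment_size]
    return len(small) > min_small_segments


print(should_compact_segments([100] * 11))  # True: 11 small segments
print(should_compact_segments([100] * 10))  # False: exactly 10 is not enough
```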

What’s Next?

What’s next after learning the basics of compaction in Milvus? Currently, not all parameters for configuring compaction are in the milvus.yaml file, and plan generation strategies are relatively basic. If you are interested, come and contribute to Milvus, the open-source project!

Also, this is one of the articles in the blog series introducing the new features in Milvus 2.0.

One more article about load balance will be coming soon. Please stay tuned!

