s
I'm a bit disappointed with my zstd test. I expected better results, but perhaps it's just not well-suited for compressing small protobuf files. πŸ€”πŸ§΅
From my ONI Seed Browser
### CER-C-343676989-0-0-0
 -> JSON           = 14.713 KB
 -> JSON (GZIP  9) =  3.995 KB
 -> JSON (ZSTD 19) =  3.735 KB
 -> Protobuf       =  5.881 KB
--- --- --- --- ---
 -> Protobuf (GZIP  0) = 5.904 KB in 510.9us
 -> Protobuf (GZIP  1) = 3.141 KB in 186.1us
 -> Protobuf (GZIP  2) = 3.090 KB in 128.7us
 -> Protobuf (GZIP  3) = 3.056 KB in 143.4us
 -> Protobuf (GZIP  4) = 3.000 KB in 167.9us
 -> Protobuf (GZIP  5) = 2.949 KB in 216.1us
 -> Protobuf (GZIP  6) = 2.943 KB in 232.0us
 -> Protobuf (GZIP  7) = 2.943 KB in 242.5us
 -> Protobuf (GZIP  8) = 2.942 KB in 250.9us
 -> Protobuf (GZIP  9) = 2.942 KB in 252.0us
--- --- --- --- ---
 -> Protobuf (ZSTD  0) = 3.052 KB in 155.2us
 -> Protobuf (ZSTD  1) = 3.058 KB in 105.4us
 -> Protobuf (ZSTD  2) = 3.084 KB in  93.7us
 -> Protobuf (ZSTD  3) = 3.052 KB in 114.5us
 -> Protobuf (ZSTD  4) = 2.986 KB in 146.5us
 -> Protobuf (ZSTD  5) = 2.960 KB in 172.4us
 -> Protobuf (ZSTD  6) = 2.958 KB in 177.2us
 -> Protobuf (ZSTD  7) = 2.958 KB in 180.0us
 -> Protobuf (ZSTD  8) = 2.957 KB in 183.4us
 -> Protobuf (ZSTD  9) = 2.951 KB in 303.3us
 -> Protobuf (ZSTD 10) = 2.951 KB in 278.4us
 -> Protobuf (ZSTD 11) = 2.935 KB in 411.2us
 -> Protobuf (ZSTD 12) = 2.901 KB in 493.9us
 -> Protobuf (ZSTD 13) = 2.865 KB in 602.0us
 -> Protobuf (ZSTD 14) = 2.865 KB in 597.2us
 -> Protobuf (ZSTD 15) = 2.865 KB in 606.3us
 -> Protobuf (ZSTD 16) = 2.848 KB in 1.059700ms
 -> Protobuf (ZSTD 17) = 2.848 KB in 1.041300ms
 -> Protobuf (ZSTD 18) = 2.848 KB in 1.047400ms
 -> Protobuf (ZSTD 19) = 2.848 KB in 1.043600ms
 -> Protobuf (ZSTD 20) = 2.848 KB in 1.049100ms
 -> Protobuf (ZSTD 21) = 2.848 KB in 1.047400ms
 -> Protobuf (ZSTD 22) = 2.848 KB in 1.042800ms
For gzip the default compression level is 5 (which is what the maps use right now); for zstandard the default is 3 (which I plan to change). gzip above 6 and zstandard above 12 may not be worth it. 2.848 KB is the minimum size I can achieve, but at about 4x the CPU of the highest gzip level. That saves 2.942 - 2.848 = 0.094 KB per map, which is about 122 MB across the 1.3 million maps I have on https://stefan-oltmann.de/oni-seed-browser. To be honest: I expected to save more
So gzip doesn't look as bad in this comparison. I know that I didn't measure memory usage here, just time and output size.
I guess zstd is all about the efficiency (bytes/us) and not so much about compact storage size.
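(For anyone who wants to reproduce a comparison like this on the JVM: a minimal sketch, assuming the zstd-jni library for zstd and java.util.zip for gzip. This is not the exact benchmark code used above, just one way to do it.)

import com.github.luben.zstd.Zstd
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream
import kotlin.system.measureNanoTime

// Gzip at an explicit level. GZIPOutputStream hides the Deflater level,
// so we reach into the protected `def` field via a small subclass.
fun gzip(data: ByteArray, level: Int): ByteArray {
    val bos = ByteArrayOutputStream()
    object : GZIPOutputStream(bos) {
        init { def.setLevel(level) }
    }.use { it.write(data) }
    return bos.toByteArray()
}

// Compress the same protobuf bytes at every level and print size + time.
fun compareLevels(protobufBytes: ByteArray) {
    for (level in 0..9) {
        var out = ByteArray(0)
        val nanos = measureNanoTime { out = gzip(protobufBytes, level) }
        println(" -> Protobuf (GZIP  $level) = ${out.size / 1000.0} KB in ${nanos / 1000.0}us")
    }
    for (level in 0..22) {
        var out = ByteArray(0)
        val nanos = measureNanoTime { out = Zstd.compress(protobufBytes, level) }
        println(" -> Protobuf (ZSTD $level) = ${out.size / 1000.0} KB in ${nanos / 1000.0}us")
    }
}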
e
the decompressor is possibly the best part of zstd: it's fast (faster than zlib) and uses constant memory regardless of the compression level
πŸ‘ 1
s
You're right... Decompression time is quite constant and independent of the compression level. So the server only needs to spend CPU time once and I get long-term savings on S3 space (even if not big). πŸ€”
e
the compressor is pretty flexible but it doesn't really have obvious advantages over other modern compression methods
s
I somehow expected that after reading so many benchmark blogs ^^
So it's all about the decompression?
e
well the compressor is needed so that you can use the decompressor of course, and it can achieve pretty good results. but the decompression side is where it really takes the win over brotli for example
s
Thank you. Now I understand! πŸ™‚
I added decompression times to my test: 30us on average for zstd, 70us on average for gzip, so zstd decompression is more than twice as fast.
y
I have no clue what the format of your data is, but that'd be the first thing I'd personally attack. Figure out if your data can be expressed more compactly, removing any redundancies and such. Also, I think compression usually gets better the more data you have, so maybe compress chunks of 50 maps or so and see what those times are like
s
Yes, that was the first thing I attacked. I improved data structures, switched to protobuf, dropped unnecessary data, used polybool to combine paths, etc. Maybe there is more I can do, but it's not obvious to me right now. Chunking of maps is not really possible, because the seed browser only requests one map at a time. People can open a URL like https://stefanoltmann.de/oni-seed-browser/#FRST-C-509222068-0-0-0 and then a call to S3 is made where that exact protobuf file is downloaded from https://oni-data.stefanoltmann.de/FRST-C-509222068-0-0-0
y
Hmm, the question is then what matters to you here? Storing these few gigabytes is likely no issue. What probably matters is decompression time and the size of the file that the client has to download. It'd be worth it then to go for more aggressive compression, although tbh on most connections that little bit of a difference won't change much.
s
I want to stay within free tiers like Cloudflare R2's 10 GB to avoid paying for S3 storage. Previously the gzipped JSON data for 1.2 million maps was 5.5 GB. Switching to gzipped Protobuf plus some format optimizations got me back down to 4 GB. Now I'm looking into saving more space with more aggressive compression, just so I know I did everything possible on that front, too.
People add 5000 new maps daily and I want to avoid hitting the limit for as long as possible. ^^
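(Back-of-the-envelope with the numbers above: ~4 GB for ~1.2 million maps is roughly 3.3 KB per stored map, so 5,000 new maps per day adds on the order of 16-17 MB/day, and the remaining ~6 GB of a 10 GB free tier would last roughly a year at that rate.)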
y
ohhhhh your data is increasing... yeah then you're doing the right thing
s
You'd think the 1000 people using my site daily would donate something to cover hosting costs, but they don't 😬
My workmate also said that for all the hours I spent optimizing the data format and looking for free hosting options, I could have just paid AWS $50 and been good for some years... I know... ^^ But this is also a learning opportunity. πŸ˜„
If you happen to have ideas how to optimize the data format further, any feedback is welcome. πŸ™‚ https://github.com/StefanOltmann/oni-seed-browser-model/blob/main/oni-seed-browser[…]el/src/commonMain/kotlin/de/stefan_oltmann/oni/model/Cluster.kt
p
If you're compressing lots of small pieces of similar data you may get some traction out of a custom dictionary
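(A rough sketch of that idea, assuming the zstd-jni library; the trainer and context class names are that library's as I remember them, and the sample set, dictionary size and level are arbitrary here, so treat it as a starting point rather than working code.)

import com.github.luben.zstd.ZstdCompressCtx
import com.github.luben.zstd.ZstdDictTrainer

// Train a shared dictionary from a sample of existing protobuf maps,
// then compress each map against it. The dictionary has to be shipped
// to (or bundled with) the client once so it can decompress.
fun buildDictionaryAndCompress(sampleMaps: List<ByteArray>, oneMap: ByteArray): ByteArray {
    val trainer = ZstdDictTrainer(
        sampleMaps.sumOf { it.size }, // total size of the sample buffer
        64 * 1024                     // target dictionary size, e.g. 64 KB
    )
    sampleMaps.forEach { trainer.addSample(it) }
    val dictionary: ByteArray = trainer.trainSamples()

    return ZstdCompressCtx().use { ctx ->
        ctx.setLevel(19)
        ctx.loadDict(dictionary)
        ctx.compress(oneMap)
    }
}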
s
Thank you. I'm thinking about it. I'm not sure what values could be similar. They are mostly map coordinates and no two things are at the same place, so they are mostly unique.
But I was thinking about making this a single byte array, where everything is stored in a pattern of three bytes (ordinal, q and r) as one proto-packed ByteArray (rough sketch below, after the current class):
@Serializable
@OptIn(ExperimentalSerializationApi::class)
data class StarMapEntrySpacedOut(

    @ProtoNumber(1)
    val id: SpacedOutSpacePOI,

    @ProtoNumber(2)
    val q: Byte,

    @ProtoNumber(3)
    val r: Byte
)
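(Sketch of that flattening, assuming SpacedOutSpacePOI is an enum with fewer than 256 entries and that q/r always fit into a signed byte; the helper names are made up.)

// Flatten the entries into (ordinal, q, r) byte triplets. A single ByteArray
// is one length-delimited protobuf field, so the per-element tag overhead of
// a repeated message disappears.
fun packEntries(entries: List<StarMapEntrySpacedOut>): ByteArray {
    val bytes = ByteArray(entries.size * 3)
    entries.forEachIndexed { index, entry ->
        bytes[index * 3 + 0] = entry.id.ordinal.toByte()
        bytes[index * 3 + 1] = entry.q
        bytes[index * 3 + 2] = entry.r
    }
    return bytes
}

fun unpackEntries(bytes: ByteArray): List<StarMapEntrySpacedOut> =
    (bytes.indices step 3).map { i ->
        StarMapEntrySpacedOut(
            id = SpacedOutSpacePOI.values()[bytes[i].toInt() and 0xFF],
            q = bytes[i + 1],
            r = bytes[i + 2]
        )
    }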
y
You could store lower-res coordinates if it's intended for human use. You could even compress them down into 1 number by using a Hilbert curve or something (though I don't know if that's useful in practice). You could then have a button for higher-res coords when necessary
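(For reference, the standard xy-to-Hilbert-index mapping, ported from the well-known pseudo-C; n must be a power of two covering the coordinate range, e.g. 1024 for coords up to ~800. Whether it actually helps compression here is untested.)

// Map an (x, y) cell in an n x n grid onto its distance along the Hilbert
// curve. Nearby cells tend to get nearby indices, which keeps deltas small.
fun hilbertIndex(n: Int, xIn: Int, yIn: Int): Int {
    var x = xIn
    var y = yIn
    var d = 0
    var s = n / 2
    while (s > 0) {
        val rx = if ((x and s) > 0) 1 else 0
        val ry = if ((y and s) > 0) 1 else 0
        d += s * s * ((3 * rx) xor ry)
        // Rotate/flip the quadrant so the lower bits are interpreted correctly.
        if (ry == 0) {
            if (rx == 1) {
                x = n - 1 - x
                y = n - 1 - y
            }
            val t = x; x = y; y = t
        }
        s /= 2
    }
    return d
}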
s
Yeah, coordinates range from 1 to usually 800... They are int32, but that's mostly a waste. Too bad they don't fit into a byte (a signed byte maxes out at 127). I could divide them by 8 or 10 to make them fit, but I don't know if that will look too different when going back.
728 / 10 = 72, and 72 * 10 = 720 ... that doesn't sound like a big difference. Maybe worth a try.
y
Idk what the theory looks like on this, but maybe there are some clustering algorithms or something you can apply to get further compression if things happen to be grouped together. The Hilbert curve idea would work reasonably well here because it preserves locality between 2D and 1D, hence you could maybe use it and then use Huffman coding or something. Another interesting data analysis thing you can do is figure out if your coords happen to be around a specific range. Huffman coding can then do its magic here.
p
If you can sort the coordinates you could delta-encode the set, assuming the deltas are smaller and fit into a single-byte varint
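(Sketch of that idea, independent of protobuf: sort, store only the differences, and write each difference as a ZigZag-encoded varint so small deltas take one byte. The sort key is just one possible choice.)

import java.io.ByteArrayOutputStream

fun deltaEncode(points: List<Pair<Int, Int>>): ByteArray {
    val out = ByteArrayOutputStream()

    // ZigZag-encode so small negative deltas also become small codes,
    // then write as a variable-length integer (7 bits per byte).
    fun writeVarInt(value: Int) {
        var v = (value shl 1) xor (value shr 31)
        do {
            var b = v and 0x7F
            v = v ushr 7
            if (v != 0) b = b or 0x80
            out.write(b)
        } while (v != 0)
    }

    var lastX = 0
    var lastY = 0
    for ((x, y) in points.sortedWith(compareBy({ it.first }, { it.second }))) {
        writeVarInt(x - lastX)
        writeVarInt(y - lastY)
        lastX = x
        lastY = y
    }
    return out.toByteArray()
}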
s
A game map usually looks like this
If you can sort the coordinates you could delta-encode the set, assuming the deltas are smaller and fit into a single-byte varint
Good idea. I did that with the points of the biome vectors (the colors in the back). That saved a lot. Maybe that's also possible for the coordinates.
y
It looks somewhat clustered, interesting... The items look big enough that a 10-20 coord difference would change nothing. If you want to go really crazy, look into AI-based compression. Here's a video that goes through what that looks like for images. TL;DW: you have a layer of your model that's restricted to whatever output size you want. Then you train the model with an apt loss function that prioritises spatial closeness. Finally, you split the model at that layer, and you thus have a compressor that produces something in the restricted output size, and a decompressor that does a rather good job
s
Wow, that's impressive. πŸ˜„ But no, I want to keep that simple ^^
The next thing I should really try is how different maps look if I reduce the resolution of X & Y to fit into a byte
I'm right now killing my MongoDB instance by querying for the max asteroid size over all 1.3 million maps
y
If there's a correlation between those background elements and the items, that can be something you can exploit too. It looks like items end up on the edges between the background elements, so that can be a very simple 16*16 encoding. On the other hand, that's 2 bytes, so maybe just lowering coord resolution will give better results. This is the type of statistical correlation that a tiny model might be able to exploit... just saying.
s
Interesting observation. I think that's random here. I never noticed such a pattern in other maps
Oh nice, I remembered the maps being bigger, but it looks like map sizes are just 3 times a byte's range. So probably not a big reduction in resolution
max x for geysers is 251 and max y 380
So not just these, but also asteroid size, offsets... I could potentially divide everything by 3 to make the numbers smaller πŸ’‘
So making the data types bytes should also be beneficial for heap memory
Asteroid Max size: 256, Max height: 384
y
Wait, your data is all based on a seed right? Presumably, couldn't you ignore some of the information like coords (which aren't important for searching), and generate them client-side? Your data is algorithmic already lol! All you need is to make your world-gen recreation algorithm fast and small enough. You can even distill it to the bare basics and only store information like ranking and what type of biomes are there (apologies for the likely incorrect ONI terminology lol), basically only what search and ranking require, and everything else can be computed client side
s
Unfortunately not. Recreating the worldgen of Oxygen Not Included was the first idea, but the guy who made the world capture mod said that worldgen is 4000+ lines of messy code that even the devs don't really understand. So we can't replicate that. We let a mod run the game and collect the data for us.
Messy C# code that we don't even have the copyright for.
y
Ooooo copyright may be an issue yes, although there's some "you may reverse engineer for the sake of making compatible products" law in some jurisdictions like the UK (IANAL though). You do have the best dataset to attempt such a recreation :)
s
Even if I succeed, the game changes a lot. Nothing I'd like to lose my sanity about. πŸ˜„
Asteroid Max size: 256, Max height: 384
So interesting that these two values are exact multiples of 128. That tells me something about their data structures
Reducing resolution doesn't have a huge impact here. Divided by 3.
### CER-C-343676989-0-0-0
 -> JSON = 14.665 KB
 -> JSON (GZIP 9) = 3.958 KB
 -> JSON (ZSTD 19) = 3.696 KB
 -> Protobuf = 5.864 KB
--- --- --- --- ---
 -> Protobuf (GZIP 0) = 5.887 KB in 433.4us (decompression = 478.2us)
 -> Protobuf (GZIP 1) = 3.107 KB in 171.6us (decompression = 81.8us)
 -> Protobuf (GZIP 2) = 3.066 KB in 126.8us (decompression = 67.2us)
 -> Protobuf (GZIP 3) = 3.029 KB in 140.6us (decompression = 70.9us)
 -> Protobuf (GZIP 4) = 2.98 KB in 157.6us (decompression = 70.9us)
 -> Protobuf (GZIP 5) = 2.925 KB in 215.3us (decompression = 72.4us)
 -> Protobuf (GZIP 6) = 2.923 KB in 241.6us (decompression = 75.3us)
 -> Protobuf (GZIP 7) = 2.923 KB in 244.9us (decompression = 62.9us)
 -> Protobuf (GZIP 8) = 2.922 KB in 248.4us (decompression = 84.5us)
 -> Protobuf (GZIP 9) = 2.922 KB in 258.7us (decompression = 71us)
--- --- --- --- ---
 -> Protobuf (ZSTD 0) = 3.037 KB in 155.2us (decompression = 583.9us)
 -> Protobuf (ZSTD 1) = 3.042 KB in 108.7us (decompression = 29.2us)
 -> Protobuf (ZSTD 2) = 3.069 KB in 90.8us (decompression = 29.4us)
 -> Protobuf (ZSTD 3) = 3.037 KB in 116.7us (decompression = 27.5us)
 -> Protobuf (ZSTD 4) = 2.975 KB in 152.4us (decompression = 26.8us)
 -> Protobuf (ZSTD 5) = 2.95 KB in 162.4us (decompression = 26.5us)
 -> Protobuf (ZSTD 6) = 2.943 KB in 185.5us (decompression = 26.2us)
 -> Protobuf (ZSTD 7) = 2.936 KB in 180us (decompression = 30.2us)
 -> Protobuf (ZSTD 8) = 2.936 KB in 180.7us (decompression = 25.6us)
 -> Protobuf (ZSTD 9) = 2.934 KB in 293.6us (decompression = 26.9us)
 -> Protobuf (ZSTD 10) = 2.934 KB in 270.9us (decompression = 26.3us)
 -> Protobuf (ZSTD 11) = 2.92 KB in 401.3us (decompression = 30.8us)
 -> Protobuf (ZSTD 12) = 2.878 KB in 495.4us (decompression = 29.6us)
 -> Protobuf (ZSTD 13) = 2.843 KB in 598.3us (decompression = 29.4us)
 -> Protobuf (ZSTD 14) = 2.843 KB in 591.3us (decompression = 27.2us)
 -> Protobuf (ZSTD 15) = 2.843 KB in 598.4us (decompression = 27.9us)
 -> Protobuf (ZSTD 16) = 2.82 KB in 1.048400ms (decompression = 27.9us)
 -> Protobuf (ZSTD 17) = 2.82 KB in 1.050200ms (decompression = 28.7us)
 -> Protobuf (ZSTD 18) = 2.82 KB in 1.041900ms (decompression = 27.6us)
 -> Protobuf (ZSTD 19) = 2.82 KB in 1.049ms (decompression = 27.6us)
 -> Protobuf (ZSTD 20) = 2.82 KB in 1.055300ms (decompression = 27.8us)
 -> Protobuf (ZSTD 21) = 2.82 KB in 1.043900ms (decompression = 28.1us)
 -> Protobuf (ZSTD 22) = 2.82 KB in 1.054900ms (decompression = 28.3us)
Previously I had a float value in there. I saw that protobuf stored it as a fixed 4 bytes, so I changed it to a rounded short. That had a noticeable effect, as protobuf varint-encodes integers but not floats.
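(Illustration of that change, with a made-up field name: in protobuf a float is always a fixed 4-byte value on the wire, while Byte/Short/Int fields go through the varint path, so small rounded values usually take only 1-2 bytes.)

import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.Serializable
import kotlinx.serialization.protobuf.ProtoNumber

@Serializable
@OptIn(ExperimentalSerializationApi::class)
data class GeyserExample(

    // was: val avgEmitRate: Float  -> always 4 bytes (fixed32)
    @ProtoNumber(1)
    val avgEmitRate: Short        // rounded before storing -> varint, usually 1-2 bytes
)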
Interesting... I had expected that this increased usage of @ProtoPacked would have had more effect ... https://github.com/StefanOltmann/oni-seed-browser-backend/commit/132591d5793f196f74b0a58ece5d87c19d541c8b
Normal:
### CER-C-343676989-0-0-0
 -> JSON           = 14.473 KB
 -> JSON (GZIP  9) =  3.935 KB
 -> JSON (ZSTD 19) =  3.690 KB
 -> Protobuf       =  5.833 KB
--- --- --- --- ---
 -> Protobuf (GZIP  0) = 5.856 KB in 1.129900ms (decompression = 467.8us)
 -> Protobuf (GZIP  9) = 2.914 KB in 273.7us    (decompression =  74.0us)
--- --- --- --- ---
 -> Protobuf (ZSTD  0) = 3.018 KB in 154.6us    (decompression = 538.2us)
 -> Protobuf (ZSTD 22) = 2.820 KB in 1.103500ms (decompression =  35.5us)
Compacted:
### CER-C-343676989-0-0-0
 -> JSON           = 8.423 KB
 -> JSON (GZIP  9) = 3.228 KB
 -> JSON (ZSTD 19) = 3.095 KB
 -> Protobuf       = 5.053 KB
--- --- --- --- ---
 -> Protobuf (GZIP 0) = 5.076 KB in 349.0us (decompression = 460.6us)
 -> Protobuf (GZIP 9) = 2.710 KB in 229.2us (decompression =  65.5us)
--- --- --- --- ---
 -> Protobuf (ZSTD  0) = 2.719 KB in 139.7us (decompression = 548.0us)
 -> Protobuf (ZSTD 22) = 2.581 KB in 879.5us (decompression =  24.8us)
I guess I overestimated what Protobuf's repeated field type declarations really take. Now reaching 1 KB seems more and more unrealistic.
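(For context, what such a declaration looks like in kotlinx.serialization, with made-up names: without @ProtoPacked every element of a repeated scalar field carries its own tag byte, with it the whole list becomes one length-delimited field, so for short lists of small varints the saving is only around a byte per element.)

import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.Serializable
import kotlinx.serialization.protobuf.ProtoNumber
import kotlinx.serialization.protobuf.ProtoPacked

@Serializable
@OptIn(ExperimentalSerializationApi::class)
data class PointsExample(

    @ProtoNumber(1)
    @ProtoPacked
    val coordinates: List<Int>
)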
Just reducing the resolution (dividing by 4) didn't change much either:
### CER-C-343676989-0-0-0
 -> JSON           = 14.397 KB
 -> JSON (GZIP  9) =  3.880 KB
 -> JSON (ZSTD 19) =  3.620 KB
 -> Protobuf       =  5.809 KB
--- --- --- --- ---
 -> Protobuf (GZIP  0) = 5.832 KB in 464.8us (decompression = 504.8us)
 -> Protobuf (GZIP  9) = 2.861 KB in 255.2us (decompression = 65us)
--- --- --- --- ---
 -> Protobuf (ZSTD  0) = 2.977 KB in 154.5us (decompression = 578.6us)
 -> Protobuf (ZSTD 22) = 2.758 KB in 1.050900ms (decompression = 30.7us)
The biggest part of the maps is the biome paths:
16:80 135 0 -91 -16 0 -4 -6 -9 -2 -11 10 -11 -9 -9 2 -4 8 -16 0 0 93 13 0 7 -7 4 0 8 10 13 -3 3 3 13 -2 4 -6|80 174 0 -21 -13 1 -3 4 -14 1 -2 -2 -14 2 -6 -5 -10 1 -2 2 -16 1 0 16 12:80 153 0 -18 -15 0 -4 6 -13 2 -3 -3 -13 3 -8 -10 -4 0 -7 7 -13 0 0 18 16 -1 2 -2 10 -1 6 5 14 -2 2 2 14 -1 3 -4
That's already polybooled & delta-encoded. I save it as a string because (to my surprise) I wasn't able to make a smaller protobuf version out of it.
Just to give an idea how big they are, here is the same map with them removed:
### CER-C-343676989-0-0-0
 -> JSON           = 10.590 KB
 -> JSON (GZIP  9) =  2.287 KB
 -> JSON (ZSTD 19) =  2.130 KB
 -> Protobuf       =  1.960 KB
--- --- --- --- ---
 -> Protobuf (GZIP  0) = 1.983 KB in 348.2us (decompression = 422.2us)
 -> Protobuf (GZIP  1) = 1.317 KB in  80.4us (decompression = 49.2us)
 -> Protobuf (GZIP  5) = 1.294 KB in  73.0us (decompression = 43.9us)
 -> Protobuf (GZIP  9) = 1.290 KB in  78.0us (decompression = 37.7us)
--- --- --- --- ---
 -> Protobuf (ZSTD  0) = 1.343 KB in  96.9us (decompression = 492.8us)
 -> Protobuf (ZSTD  1) = 1.370 KB in  63.3us (decompression = 21us)
 -> Protobuf (ZSTD  2) = 1.355 KB in  45.6us (decompression = 19.8us)
 -> Protobuf (ZSTD  3) = 1.343 KB in  51.4us (decompression = 19.6us)
 -> Protobuf (ZSTD  4) = 1.336 KB in  71.3us (decompression = 20.2us)
 -> Protobuf (ZSTD  5) = 1.327 KB in  81.3us (decompression = 26.1us)
 -> Protobuf (ZSTD 22) = 1.267 KB in 274.4us (decompression = 22.7us)
I think I've already reached some kind of scientific level with my optimizations πŸ˜† It may be good enough already. Just re-compress with the strongest zstd and be fine. The gains I saw from a more compact object model don't seem worth ending up with a rather strange object model, I believe. People should still be able to make sense of it.
And should I need to pay for space I at least know that I’m not accidentally wasting space by having unoptimized data.