s
I'm a bit disappointed with my zstd test. I expected better results, but perhaps it's just not well-suited for compressing small protobuf files. πŸ€”πŸ§΅
From my ONI Seed Browser
### CER-C-343676989-0-0-0
 -> JSON           = 14.713 KB
 -> JSON (GZIP  9) =  3.995 KB
 -> JSON (ZSTD 19) =  3.735 KB
 -> Protobuf       =  5.881 KB
--- --- --- --- ---
 -> Protobuf (GZIP  0) = 5.904 KB in 510.9us
 -> Protobuf (GZIP  1) = 3.141 KB in 186.1us
 -> Protobuf (GZIP  2) = 3.090 KB in 128.7us
 -> Protobuf (GZIP  3) = 3.056 KB in 143.4us
 -> Protobuf (GZIP  4) = 3.000 KB in 167.9us
 -> Protobuf (GZIP  5) = 2.949 KB in 216.1us
 -> Protobuf (GZIP  6) = 2.943 KB in 232.0us
 -> Protobuf (GZIP  7) = 2.943 KB in 242.5us
 -> Protobuf (GZIP  8) = 2.942 KB in 250.9us
 -> Protobuf (GZIP  9) = 2.942 KB in 252.0us
--- --- --- --- ---
 -> Protobuf (ZSTD  0) = 3.052 KB in 155.2us
 -> Protobuf (ZSTD  1) = 3.058 KB in 105.4us
 -> Protobuf (ZSTD  2) = 3.084 KB in  93.7us
 -> Protobuf (ZSTD  3) = 3.052 KB in 114.5us
 -> Protobuf (ZSTD  4) = 2.986 KB in 146.5us
 -> Protobuf (ZSTD  5) = 2.960 KB in 172.4us
 -> Protobuf (ZSTD  6) = 2.958 KB in 177.2us
 -> Protobuf (ZSTD  7) = 2.958 KB in 180.0us
 -> Protobuf (ZSTD  8) = 2.957 KB in 183.4us
 -> Protobuf (ZSTD  9) = 2.951 KB in 303.3us
 -> Protobuf (ZSTD 10) = 2.951 KB in 278.4us
 -> Protobuf (ZSTD 11) = 2.935 KB in 411.2us
 -> Protobuf (ZSTD 12) = 2.901 KB in 493.9us
 -> Protobuf (ZSTD 13) = 2.865 KB in 602.0us
 -> Protobuf (ZSTD 14) = 2.865 KB in 597.2us
 -> Protobuf (ZSTD 15) = 2.865 KB in 606.3us
 -> Protobuf (ZSTD 16) = 2.848 KB in 1.059700ms
 -> Protobuf (ZSTD 17) = 2.848 KB in 1.041300ms
 -> Protobuf (ZSTD 18) = 2.848 KB in 1.047400ms
 -> Protobuf (ZSTD 19) = 2.848 KB in 1.043600ms
 -> Protobuf (ZSTD 20) = 2.848 KB in 1.049100ms
 -> Protobuf (ZSTD 21) = 2.848 KB in 1.047400ms
 -> Protobuf (ZSTD 22) = 2.848 KB in 1.042800ms
For gzip the default compression level is 5 (which is what the maps use right now); for zstandard the default is 3 (which I plan to change). gzip above 6 and zstandard above 12 may not be worth it. 2.848 KB is the minimum size I can achieve, but at about 4x the CPU of the highest gzip level. That saves 2.942 - 2.848 = 0.094 KB per map, which is about 122 MB across the 1.3 million maps I have on https://stefan-oltmann.de/oni-seed-browser. To be honest: I expected to save more
So gzip doesn't look as bad in this comparison. I know that I didn't measure memory usage here, just time and output size.
I guess zstd is all about the efficiency (bytes/us) and not so much about compact storage size.
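(For anyone who wants to reproduce a comparison like this on the JVM: a minimal sketch, assuming the zstd-jni library for zstd and java.util.zip for gzip. This is not the exact benchmark code used above, just one way to do it.)

import com.github.luben.zstd.Zstd
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream
import kotlin.system.measureNanoTime

// Gzip at an explicit level. GZIPOutputStream hides the Deflater level,
// so we reach into the protected `def` field via a small subclass.
fun gzip(data: ByteArray, level: Int): ByteArray {
    val bos = ByteArrayOutputStream()
    object : GZIPOutputStream(bos) {
        init { def.setLevel(level) }
    }.use { it.write(data) }
    return bos.toByteArray()
}

// Compress the same protobuf bytes at every level and print size + time.
fun compareLevels(protobufBytes: ByteArray) {
    for (level in 0..9) {
        var out = ByteArray(0)
        val nanos = measureNanoTime { out = gzip(protobufBytes, level) }
        println(" -> Protobuf (GZIP  $level) = ${out.size / 1000.0} KB in ${nanos / 1000.0}us")
    }
    for (level in 0..22) {
        var out = ByteArray(0)
        val nanos = measureNanoTime { out = Zstd.compress(protobufBytes, level) }
        println(" -> Protobuf (ZSTD $level) = ${out.size / 1000.0} KB in ${nanos / 1000.0}us")
    }
}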
e
the decompressor is possibly the best part of zstd: it's fast (faster than zlib) and uses constant memory regardless of the compression level
πŸ‘ 1
s
You're right... Decompression time is quite constant and independent of the compression level. So the server only needs to spend CPU time once and I get long-term savings on S3 space (even if not big). πŸ€”
e
the compressor is pretty flexible but it doesn't really have obvious advantages over other modern compression methods
s
I somehow expected that after reading so many benchmark blogs ^^
So it's all about the decompression?
e
well the compressor is needed so that you can use the decompressor of course, and it can achieve pretty good results. but the decompression side is where it really takes the win over brotli for example
s
Thank you. Now I understand! πŸ™‚
I added decompression times to my test: 30us on average for zstd, 70us on average for gzip, so zstd decompression is more than twice as fast.
y
I have no clue what the format of your data is, but that'd be the first thing I'd personally attack. Figure out if your data can be expressed more compactly, removing any redundancies and such. Also, I think compression usually gets better the more data you have, so maybe compress chunks of 50 maps or so and see what those times are like
s
Yes, that was the first thing I attacked. I improved data structures, switched to protobuf, dropped unnecessary data, used polybool to combine paths, etc. Maybe there is more I can do, but it's not obvious to me right now. Chunking of maps is not really possible, because the seed browser only requests one map at a time. People can open a URL like https://stefanoltmann.de/oni-seed-browser/#FRST-C-509222068-0-0-0 and then a call to S3 is made where that exact protobuf file is downloaded from https://oni-data.stefanoltmann.de/FRST-C-509222068-0-0-0
y
Hmm, the question is then what matters to you here? Storing these few gigabytes is likely no issue. What probably matters is decompression time and the size of the file that the client has to download. It'd be worth it then to go for more aggressive compression, although tbh on most connections that little bit of a difference won't change much.
s
I want to stay within free tiers like Cloudflare R2's 10 GB to avoid paying for S3 storage. Previously the gzipped JSON data for 1.2 million maps was 5.5 GB. Switching to gzipped Protobuf plus some format optimizations got me back down to 4 GB. Now I'm looking into saving more space with more aggressive compression, just so I know I did everything possible on that front, too.
People add 5000 new maps daily and I want to avoid hitting the limit for as long as possible. ^^
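(Back-of-the-envelope with the numbers above: ~4 GB for ~1.2 million maps is roughly 3.3 KB per stored map, so 5,000 new maps per day adds on the order of 16-17 MB/day, and the remaining ~6 GB of a 10 GB free tier would last roughly a year at that rate.)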
y
ohhhhh your data is increasing... yeah then you're doing the right thing
s
You'd think the 1000 people using my site daily would donate something to cover hosting costs, but they don't 😬
My workmate also said that for all the hours I spent optimizing the data format and looking for free hosting options, I could have just paid AWS $50 and been good for some years... I know... ^^ But this is also a learning opportunity. πŸ˜„
If you happen to have ideas how to optimize the data format further, any feedback is welcome. πŸ™‚ https://github.com/StefanOltmann/oni-seed-browser-model/blob/main/oni-seed-browser[…]el/src/commonMain/kotlin/de/stefan_oltmann/oni/model/Cluster.kt
p
If you're compressing lots of small pieces of similar data you may get some traction out of a custom dictionary
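(A rough sketch of that idea, assuming the zstd-jni library; the trainer and context class names are that library's as I remember them, and the sample set, dictionary size and level are arbitrary here, so treat it as a starting point rather than working code.)

import com.github.luben.zstd.ZstdCompressCtx
import com.github.luben.zstd.ZstdDictTrainer

// Train a shared dictionary from a sample of existing protobuf maps,
// then compress each map against it. The dictionary has to be shipped
// to (or bundled with) the client once so it can decompress.
fun buildDictionaryAndCompress(sampleMaps: List<ByteArray>, oneMap: ByteArray): ByteArray {
    val trainer = ZstdDictTrainer(
        sampleMaps.sumOf { it.size }, // total size of the sample buffer
        64 * 1024                     // target dictionary size, e.g. 64 KB
    )
    sampleMaps.forEach { trainer.addSample(it) }
    val dictionary: ByteArray = trainer.trainSamples()

    return ZstdCompressCtx().use { ctx ->
        ctx.setLevel(19)
        ctx.loadDict(dictionary)
        ctx.compress(oneMap)
    }
}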
s
Thank you. I'm thinking about it. I'm not sure what values could be similar. They are mostly map coordinates and no two things are at the same place, so they are mostly unique.
But I was thinking about making this a single byte array, where everything is stored in a pattern of three bytes (ordinal, q and r) as one proto-packed ByteArray (rough sketch below, after the current class):
@Serializable
@OptIn(ExperimentalSerializationApi::class)
data class StarMapEntrySpacedOut(

    @ProtoNumber(1)
    val id: SpacedOutSpacePOI,

    @ProtoNumber(2)
    val q: Byte,

    @ProtoNumber(3)
    val r: Byte
)
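(Sketch of that flattening, assuming SpacedOutSpacePOI is an enum with fewer than 256 entries and that q/r always fit into a signed byte; the helper names are made up.)

// Flatten the entries into (ordinal, q, r) byte triplets. A single ByteArray
// is one length-delimited protobuf field, so the per-element tag overhead of
// a repeated message disappears.
fun packEntries(entries: List<StarMapEntrySpacedOut>): ByteArray {
    val bytes = ByteArray(entries.size * 3)
    entries.forEachIndexed { index, entry ->
        bytes[index * 3 + 0] = entry.id.ordinal.toByte()
        bytes[index * 3 + 1] = entry.q
        bytes[index * 3 + 2] = entry.r
    }
    return bytes
}

fun unpackEntries(bytes: ByteArray): List<StarMapEntrySpacedOut> =
    (bytes.indices step 3).map { i ->
        StarMapEntrySpacedOut(
            id = SpacedOutSpacePOI.values()[bytes[i].toInt() and 0xFF],
            q = bytes[i + 1],
            r = bytes[i + 2]
        )
    }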
y
You could store lower-res coordinates if it's intended for human use. You could even compress them down into 1 number by using a Hilbert curve or something (though I don't know if that's useful in practice). You could then have a button for higher-res coords when necessary
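(For reference, the standard xy-to-Hilbert-index mapping, ported from the well-known pseudo-C; n must be a power of two covering the coordinate range, e.g. 1024 for coords up to ~800. Whether it actually helps compression here is untested.)

// Map an (x, y) cell in an n x n grid onto its distance along the Hilbert
// curve. Nearby cells tend to get nearby indices, which keeps deltas small.
fun hilbertIndex(n: Int, xIn: Int, yIn: Int): Int {
    var x = xIn
    var y = yIn
    var d = 0
    var s = n / 2
    while (s > 0) {
        val rx = if ((x and s) > 0) 1 else 0
        val ry = if ((y and s) > 0) 1 else 0
        d += s * s * ((3 * rx) xor ry)
        // Rotate/flip the quadrant so the lower bits are interpreted correctly.
        if (ry == 0) {
            if (rx == 1) {
                x = n - 1 - x
                y = n - 1 - y
            }
            val t = x; x = y; y = t
        }
        s /= 2
    }
    return d
}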
s
Yeah, coordinates range from 1 to usually 800... They are int32, but that's mostly a waste. Too bad they don't fit into a byte (a signed byte maxes out at 127). I could divide them by 8 or 10 to make them fit, but I don't know if that will look too different when going back.
728 / 10 = 72, and 72 * 10 = 720 ... that doesn't sound like a big difference. Maybe worth a try.
y
Idk what the theory looks like on this, but maybe there are some clustering algorithms or something you can apply to get further compression if things happen to be grouped together. The Hilbert curve idea would work reasonably well here because it preserves locality between 2D and 1D, hence you could maybe use it and then use Huffman coding or something. Another interesting data analysis thing you can do is figure out if your coords happen to be around a specific range. Huffman coding can then do its magic here.
p
If you can sort the coordinates you could delta-encode the set, assuming the deltas are smaller and fit into a single-byte varint
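(Sketch of that idea, independent of protobuf: sort, store only the differences, and write each difference as a ZigZag-encoded varint so small deltas take one byte. The sort key is just one possible choice.)

import java.io.ByteArrayOutputStream

fun deltaEncode(points: List<Pair<Int, Int>>): ByteArray {
    val out = ByteArrayOutputStream()

    // ZigZag-encode so small negative deltas also become small codes,
    // then write as a variable-length integer (7 bits per byte).
    fun writeVarInt(value: Int) {
        var v = (value shl 1) xor (value shr 31)
        do {
            var b = v and 0x7F
            v = v ushr 7
            if (v != 0) b = b or 0x80
            out.write(b)
        } while (v != 0)
    }

    var lastX = 0
    var lastY = 0
    for ((x, y) in points.sortedWith(compareBy({ it.first }, { it.second }))) {
        writeVarInt(x - lastX)
        writeVarInt(y - lastY)
        lastX = x
        lastY = y
    }
    return out.toByteArray()
}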
s
A game map usually looks like this
If you can sort the coordinates you could delta-encode the set, assuming the deltas are smaller and fit into a single-byte varint
Good idea. I did that with the points of the biome vectors (the colors in the back). That saved a lot. Maybe that's also possible for the coordinates.
y
It looks somewhat clustered, interesting... The items look big enough that a 10-20 coord difference would change nothing. If you want to go really crazy, look into AI-based compression. Here's a video that goes through what that looks like for images. TL;DW: you have a layer of your model that's restricted to whatever output size you want. Then you train the model with an apt loss function that prioritises spatial closeness. Finally, you split the model at that layer, and you thus have a compressor that produces something in the restricted output size, and a decompressor that does a rather good job
s
Wow, that's impressive. πŸ˜„ But no, I want to keep that simple ^^
The next thing I should really try is how different maps look if I reduce the resolution of X & Y to fit into a byte
I'm right now killing my MongoDB instance by querying for the max asteroid size over all 1.3 million maps
y
If there's a correlation between those background elements and the items, that can be something you can exploit too. It looks like items end up on the edges between the background elements, so that can be a very simple 16*16 encoding. On the other hand, that's 2 bytes, so maybe just lowering coord resolution will give better results. This is the type of statistical correlation that a tiny model might be able to exploit... just saying.
s
Interesting observation. I think that's random here. I never noticed such a pattern in other maps
Oh nice, I remembered the maps being bigger, but it looks like map sizes are just 3 times a byte's range. So probably not a big reduction in resolution
max x for geysers is 251 and max y 380
So not just these, but also asteroid size, offsets... I could potentially divide everything by 3 to make the numbers smaller πŸ’‘
So making the data types bytes should also be beneficial for heap memory
Asteroid Max size: 256, Max height: 384
y
Wait, your data is all based on a seed right? Presumably, couldn't you ignore some of the information like coords (which aren't important for searching), and generate them client-side? Your data is algorithmic already lol! All you need is to make your world-gen recreation algorithm fast and small enough. You can even distill it to the bare basics and only store information like ranking and what type of biomes are there (apologies for the likely incorrect ONI terminology lol), basically only what search and ranking require, and everything else can be computed client side
s
Unfortunately not. Recreating the worldgen of Oxygen Not Included was the first idea, but the guy who made the world capture mod said that worldgen is 4000+ lines of messy code that even the devs don't really understand. So we can't replicate that. We let a mod run the game and collect the data for us.
Messy C# code that we don't even have the copyright for.
y
Ooooo copyright may be an issue yes, although there's some "you may reverse engineer for the sake of making compatible products" law in some jurisdictions like the UK (IANAL though). You do have the best dataset to attempt such a recreation :)
s
Even if I succeed, the game changes a lot. Nothing I'd like to lose my sanity about. πŸ˜„
Asteroid Max size: 256, Max height: 384
So interesting that these two values are exact multiples of 128. That tells me something about their data structures
Reducing resolution doesn't have a huge impact here. Divided by 3.
### CER-C-343676989-0-0-0
 -> JSON = 14.665 KB
 -> JSON (GZIP 9) = 3.958 KB
 -> JSON (ZSTD 19) = 3.696 KB
 -> Protobuf = 5.864 KB
--- --- --- --- ---
 -> Protobuf (GZIP 0) = 5.887 KB in 433.4us (decompression = 478.2us)
 -> Protobuf (GZIP 1) = 3.107 KB in 171.6us (decompression = 81.8us)
 -> Protobuf (GZIP 2) = 3.066 KB in 126.8us (decompression = 67.2us)
 -> Protobuf (GZIP 3) = 3.029 KB in 140.6us (decompression = 70.9us)
 -> Protobuf (GZIP 4) = 2.98 KB in 157.6us (decompression = 70.9us)
 -> Protobuf (GZIP 5) = 2.925 KB in 215.3us (decompression = 72.4us)
 -> Protobuf (GZIP 6) = 2.923 KB in 241.6us (decompression = 75.3us)
 -> Protobuf (GZIP 7) = 2.923 KB in 244.9us (decompression = 62.9us)
 -> Protobuf (GZIP 8) = 2.922 KB in 248.4us (decompression = 84.5us)
 -> Protobuf (GZIP 9) = 2.922 KB in 258.7us (decompression = 71us)
--- --- --- --- ---
 -> Protobuf (ZSTD 0) = 3.037 KB in 155.2us (decompression = 583.9us)
 -> Protobuf (ZSTD 1) = 3.042 KB in 108.7us (decompression = 29.2us)
 -> Protobuf (ZSTD 2) = 3.069 KB in 90.8us (decompression = 29.4us)
 -> Protobuf (ZSTD 3) = 3.037 KB in 116.7us (decompression = 27.5us)
 -> Protobuf (ZSTD 4) = 2.975 KB in 152.4us (decompression = 26.8us)
 -> Protobuf (ZSTD 5) = 2.95 KB in 162.4us (decompression = 26.5us)
 -> Protobuf (ZSTD 6) = 2.943 KB in 185.5us (decompression = 26.2us)
 -> Protobuf (ZSTD 7) = 2.936 KB in 180us (decompression = 30.2us)
 -> Protobuf (ZSTD 8) = 2.936 KB in 180.7us (decompression = 25.6us)
 -> Protobuf (ZSTD 9) = 2.934 KB in 293.6us (decompression = 26.9us)
 -> Protobuf (ZSTD 10) = 2.934 KB in 270.9us (decompression = 26.3us)
 -> Protobuf (ZSTD 11) = 2.92 KB in 401.3us (decompression = 30.8us)
 -> Protobuf (ZSTD 12) = 2.878 KB in 495.4us (decompression = 29.6us)
 -> Protobuf (ZSTD 13) = 2.843 KB in 598.3us (decompression = 29.4us)
 -> Protobuf (ZSTD 14) = 2.843 KB in 591.3us (decompression = 27.2us)
 -> Protobuf (ZSTD 15) = 2.843 KB in 598.4us (decompression = 27.9us)
 -> Protobuf (ZSTD 16) = 2.82 KB in 1.048400ms (decompression = 27.9us)
 -> Protobuf (ZSTD 17) = 2.82 KB in 1.050200ms (decompression = 28.7us)
 -> Protobuf (ZSTD 18) = 2.82 KB in 1.041900ms (decompression = 27.6us)
 -> Protobuf (ZSTD 19) = 2.82 KB in 1.049ms (decompression = 27.6us)
 -> Protobuf (ZSTD 20) = 2.82 KB in 1.055300ms (decompression = 27.8us)
 -> Protobuf (ZSTD 21) = 2.82 KB in 1.043900ms (decompression = 28.1us)
 -> Protobuf (ZSTD 22) = 2.82 KB in 1.054900ms (decompression = 28.3us)
Previously I had a float value in there. I saw that protobuf stored it as a fixed 4 bytes, so I changed it to a rounded short. That had a noticeable effect, as protobuf varint-encodes integers but not floats.
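(Illustration of that change, with a made-up field name: in protobuf a float is always a fixed 4-byte value on the wire, while Byte/Short/Int fields go through the varint path, so small rounded values usually take only 1-2 bytes.)

import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.Serializable
import kotlinx.serialization.protobuf.ProtoNumber

@Serializable
@OptIn(ExperimentalSerializationApi::class)
data class GeyserExample(

    // was: val avgEmitRate: Float  -> always 4 bytes (fixed32)
    @ProtoNumber(1)
    val avgEmitRate: Short        // rounded before storing -> varint, usually 1-2 bytes
)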
Interesting... I had expected that this increased usage of @ProtoPacked would have had more effect ... https://github.com/StefanOltmann/oni-seed-browser-backend/commit/132591d5793f196f74b0a58ece5d87c19d541c8b
Normal:
### CER-C-343676989-0-0-0
 -> JSON           = 14.473 KB
 -> JSON (GZIP  9) =  3.935 KB
 -> JSON (ZSTD 19) =  3.690 KB
 -> Protobuf       =  5.833 KB
--- --- --- --- ---
 -> Protobuf (GZIP  0) = 5.856 KB in 1.129900ms (decompression = 467.8us)
 -> Protobuf (GZIP  9) = 2.914 KB in 273.7us    (decompression =  74.0us)
--- --- --- --- ---
 -> Protobuf (ZSTD  0) = 3.018 KB in 154.6us    (decompression = 538.2us)
 -> Protobuf (ZSTD 22) = 2.820 KB in 1.103500ms (decompression =  35.5us)
Compacted:
### CER-C-343676989-0-0-0
 -> JSON           = 8.423 KB
 -> JSON (GZIP  9) = 3.228 KB
 -> JSON (ZSTD 19) = 3.095 KB
 -> Protobuf       = 5.053 KB
--- --- --- --- ---
 -> Protobuf (GZIP 0) = 5.076 KB in 349.0us (decompression = 460.6us)
 -> Protobuf (GZIP 9) = 2.710 KB in 229.2us (decompression =  65.5us)
--- --- --- --- ---
 -> Protobuf (ZSTD  0) = 2.719 KB in 139.7us (decompression = 548.0us)
 -> Protobuf (ZSTD 22) = 2.581 KB in 879.5us (decompression =  24.8us)
I guess I overestimated what Protobuf's repeated field type declarations really take. Now reaching 1 KB seems more and more unrealistic.
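(For context, what such a declaration looks like in kotlinx.serialization, with made-up names: without @ProtoPacked every element of a repeated scalar field carries its own tag byte, with it the whole list becomes one length-delimited field, so for short lists of small varints the saving is only around a byte per element.)

import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.Serializable
import kotlinx.serialization.protobuf.ProtoNumber
import kotlinx.serialization.protobuf.ProtoPacked

@Serializable
@OptIn(ExperimentalSerializationApi::class)
data class PointsExample(

    @ProtoNumber(1)
    @ProtoPacked
    val coordinates: List<Int>
)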
Just reducing the resolution (dividing by 4) didn't change much either:
### CER-C-343676989-0-0-0
 -> JSON           = 14.397 KB
 -> JSON (GZIP  9) =  3.880 KB
 -> JSON (ZSTD 19) =  3.620 KB
 -> Protobuf       =  5.809 KB
--- --- --- --- ---
 -> Protobuf (GZIP  0) = 5.832 KB in 464.8us (decompression = 504.8us)
 -> Protobuf (GZIP  9) = 2.861 KB in 255.2us (decompression = 65us)
--- --- --- --- ---
 -> Protobuf (ZSTD  0) = 2.977 KB in 154.5us (decompression = 578.6us)
 -> Protobuf (ZSTD 22) = 2.758 KB in 1.050900ms (decompression = 30.7us)
The biggest part of the maps is the biome paths:
16:80 135 0 -91 -16 0 -4 -6 -9 -2 -11 10 -11 -9 -9 2 -4 8 -16 0 0 93 13 0 7 -7 4 0 8 10 13 -3 3 3 13 -2 4 -6|80 174 0 -21 -13 1 -3 4 -14 1 -2 -2 -14 2 -6 -5 -10 1 -2 2 -16 1 0 16 12:80 153 0 -18 -15 0 -4 6 -13 2 -3 -3 -13 3 -8 -10 -4 0 -7 7 -13 0 0 18 16 -1 2 -2 10 -1 6 5 14 -2 2 2 14 -1 3 -4
That's already polybooled & delta-encoded. I save it as a string because (to my surprise) I wasn't able to make a smaller protobuf version out of it.
Just to give an idea how big they are, here is the same map with them removed:
### CER-C-343676989-0-0-0
 -> JSON           = 10.590 KB
 -> JSON (GZIP  9) =  2.287 KB
 -> JSON (ZSTD 19) =  2.130 KB
 -> Protobuf       =  1.960 KB
--- --- --- --- ---
 -> Protobuf (GZIP  0) = 1.983 KB in 348.2us (decompression = 422.2us)
 -> Protobuf (GZIP  1) = 1.317 KB in  80.4us (decompression = 49.2us)
 -> Protobuf (GZIP  5) = 1.294 KB in  73.0us (decompression = 43.9us)
 -> Protobuf (GZIP  9) = 1.290 KB in  78.0us (decompression = 37.7us)
--- --- --- --- ---
 -> Protobuf (ZSTD  0) = 1.343 KB in  96.9us (decompression = 492.8us)
 -> Protobuf (ZSTD  1) = 1.370 KB in  63.3us (decompression = 21us)
 -> Protobuf (ZSTD  2) = 1.355 KB in  45.6us (decompression = 19.8us)
 -> Protobuf (ZSTD  3) = 1.343 KB in  51.4us (decompression = 19.6us)
 -> Protobuf (ZSTD  4) = 1.336 KB in  71.3us (decompression = 20.2us)
 -> Protobuf (ZSTD  5) = 1.327 KB in  81.3us (decompression = 26.1us)
 -> Protobuf (ZSTD 22) = 1.267 KB in 274.4us (decompression = 22.7us)
I think I've already reached some kind of scientific level with my optimizations πŸ˜† It may be good enough already. Just re-compress with the strongest zstd and be fine. The gains I saw from a more compact object model don't seem worth ending up with a rather strange object model, I believe. People should still be able to make sense of it.
And should I need to pay for space I at least know that I’m not accidentally wasting space by having unoptimized data.