# coroutines
c
How would you go about making a blocking method such as
```kotlin
ByteBuffer.allocateDirect(bufferSize)
```
suspending, would you need to make a threadpool and do callbacks to unsuspend a coroutine or is there a better way?
o
is `allocateDirect` really blocking? I can't see that taking a ton of time
c
Pretty sure?
o
I guess it depends on how much `bufferSize` is, but I would not expect it to "block" for a long period of time if it's e.g. 8192
c
It is when you have a ton of clients pulling around the same time, hm
But anyways, I still wonder how one would go about making it suspending based
o
well, there isn't really a callback for allocating memory since it should be near-instant (e.g. `malloc` doesn't have an async variant), so you'd just throw it on `Dispatchers.IO`
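A minimal sketch of that suggestion (the function name is just illustrative):

```kotlin
import java.nio.ByteBuffer
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

// Suspends the calling coroutine while the blocking allocation runs on an IO thread.
suspend fun allocateDirectSuspending(bufferSize: Int): ByteBuffer =
    withContext(Dispatchers.IO) {
        ByteBuffer.allocateDirect(bufferSize)
    }
```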
c
```kotlin
Executors.newFixedThreadPool(max(Runtime.getRuntime().availableProcessors(), 64)).asCoroutineDispatcher()
```
Would something like this be acceptable as well, instead of Dispatchers.IO? Also I've heard that allocateDirect is actually a big factor in networking APIs
o
I would chalk that up to the native overhead rather than the allocation itself, i.e. you should look into pooling the buffers. You can make your own IO pool if you want, yes
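A rough sketch of making and using such a pool (the 8192 allocation is just a placeholder); note that, unlike Dispatchers.IO, a dispatcher you create yourself has to be shut down:

```kotlin
import java.nio.ByteBuffer
import java.util.concurrent.Executors
import kotlin.math.max
import kotlinx.coroutines.asCoroutineDispatcher
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withContext

fun main() = runBlocking {
    // Sized like Dispatchers.IO's default: 64 threads, or the core count if that's higher.
    val myIoPool = Executors.newFixedThreadPool(
        max(Runtime.getRuntime().availableProcessors(), 64)
    ).asCoroutineDispatcher()

    // `use` closes the dispatcher afterwards; otherwise its non-daemon threads
    // would keep the JVM alive.
    myIoPool.use { pool ->
        val buffer = withContext(pool) { ByteBuffer.allocateDirect(8192) }
        println(buffer.capacity())
    }
}
```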
c
Yeah, we are designing a buffer pool, but by our design, if a buffer greater than the size being cached is requested, it'll allocate a new one
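Not the actual pool being designed, just a minimal sketch of that fallback behavior (the class name and 8192 default are made up for illustration):

```kotlin
import java.nio.ByteBuffer
import java.util.concurrent.ConcurrentLinkedQueue

// Buffers up to `pooledSize` are reused; anything bigger is allocated fresh and not cached.
class DirectBufferPool(private val pooledSize: Int = 8192) {

    private val free = ConcurrentLinkedQueue<ByteBuffer>()

    fun take(requestedSize: Int): ByteBuffer =
        if (requestedSize > pooledSize) {
            ByteBuffer.allocateDirect(requestedSize) // oversized request: bypass the pool
        } else {
            free.poll()?.also { it.clear() } ?: ByteBuffer.allocateDirect(pooledSize)
        }

    fun give(buffer: ByteBuffer) {
        // Only cache buffers of the pooled size; oversized ones are left for the GC.
        if (buffer.capacity() == pooledSize) free.offer(buffer)
    }
}
```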
o
and I feel like the native overhead is probably around the same as thread-switching, though I don't have benchmarks to prove that
c
Perhaps
m
Why `IO` though and not `Default`?
o
yea, tbh that's a good question. You may not get any more efficiency from using a different thread, since my assumption is that the native overhead is the limiting factor here, which is entirely CPU-bound
I'm not entirely sure if the CPU can work on other tasks in a `malloc` call either
probably best to benchmark before doing premature optimizations
I made a pretty bad benchmark at https://gist.github.com/octylFractal/c7f3862bd568c6f69a81bf5e6d62d1c9, but it consistently shows direct allocation WITHOUT context switching as fastest by 2x on my i7-9750H CPU @ 2.60GHz (8 threads, 4 cores)
```
-1799201921
Time to normal allocate: 421ms
-334956506
Time to direct allocate: 213ms
-1950301415
Time to direct allocate on IO: 394ms
```
but this does seem to change if I bump the size of the allocation up, with a 1000x allocation and 1000 fewer tasks I get:
```
-410572713
Time to normal allocate: 2076ms
1549332476
Time to direct allocate: 4274ms
1007409705
Time to direct allocate on IO: 1709ms
```
c
That seems like a long time to pause the thread a coroutine is running on
Especially for a networking api
o
you might want to see how I measure this, that's a time for the loop as a whole, not a time per allocation (bad wording on my end perhaps)
time per allocation for 8192 bytes is sub-millisecond, approximately 2 microseconds, but that includes some overhead from the timing and hashing machinery
for 8192000, it's approximately 4ms, which is considerably worse, but even with context switching it's still about 1.7ms, and the bigger question is are you really allocating thousands of 8mb buffers?
c
If I was transmitting a file, yes
Also 8kb
o
no, 8192000 is 8mb
c
Oh nvm oops
o
I would expect lots of 8192 byte / 8kb buffers for transferring data -- not 8mb
c
Fair enough
o
wonder if non-power-of-2 sizes affect things as well, using 2**23 for 8mb instead of just 1000x 8192 gives:
```
265224745
Time to direct allocate: 434ms
1744648169
Time to direct allocate on IO: 1230ms
```
c
Might need to add a warmup hm
o
yea, this is a really bad benchmark 🙂
z
Is the CPU actually doing something during this pause (like writing zeros), or is it just waiting for some other hardware/memory controller to do something? If the CPU is actually doing work, changing threads is probably pointless unless the calling thread is special like a UI thread - it's gonna have to stop other work to do your work anyway. The IO dispatcher is only an optimization for threads which spend most/all their time sleeping. If you have 64 threads (default IO dispatcher limit) all trying to actually use the CPU, only some of them are actually gonna get to execute at the same time anyway.
o
I think it's technically undefined what the CPU is doing in malloc 🙂
c
My benchmark shows it's pretty fast anyways, so eh not a big deal to me anymore
But if I were to make something blocking suspending instead, callbacks are what y'all would do?
o
in general, if it offers a callback-based API. in this case there isn't really an option to do so
if it's something like reading from an InputStream, there isn't really any "callback", you just put it on the IO dispatcher (or similar), and Kotlin will handle the "callback" using coroutine mechanisms
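For example, a blocking InputStream read wrapped the same way (a sketch, mirroring the allocation example above):

```kotlin
import java.io.InputStream
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

// No callback exists for a blocking InputStream, so the read is just shifted onto
// an IO thread; the calling coroutine suspends until it completes.
suspend fun InputStream.readSuspending(buffer: ByteArray): Int =
    withContext(Dispatchers.IO) { read(buffer) }
```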
c
Interesting, thank you! 🙂
o
some more stats from a slightly better benchmark (has warmup and more iters):
```
direct allocate on IO stats: LongSummaryStatistics{count=10, sum=12404, min=1013, average=1240.400000, max=1322} (in ms)
direct allocate stats: LongSummaryStatistics{count=10, sum=11710, min=1008, average=1171.000000, max=1312} (in ms)
```
so it's basically equivalent either way
but it's probably better to use the simpler form
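Not the gist's code, just the rough warmup-then-measure shape being described (iteration counts are arbitrary):

```kotlin
import java.nio.ByteBuffer
import java.util.Arrays
import kotlin.system.measureTimeMillis

fun main() {
    val iterations = 10_000
    val size = 8192

    // Warmup: exercise the hot path so the JIT has compiled it before anything is timed.
    repeat(iterations) { ByteBuffer.allocateDirect(size) }

    // Measured runs: keep a running hash so the allocations can't be optimized away.
    val times = LongArray(10) {
        var hash = 0
        val ms = measureTimeMillis {
            repeat(iterations) { hash = hash * 31 + ByteBuffer.allocateDirect(size).hashCode() }
        }
        println(hash)
        ms
    }
    println("direct allocate stats: ${Arrays.stream(times).summaryStatistics()} (in ms)")
}
```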
g
Allocate buffers on the same thread where you are gonna use them (if you don't have a buffer pool); allocation is the least heavy part, you probably do more work to get data for this buffer (networking, IO, computation, etc.), so choose your thread policy depending on your main workload
t
I hate it when people answer this way, but I'm gonna do it anyway: why are you allocating ByteBuffers at any time other than startup? Especially direct ones, which are mostly useful for IO?
c
Whenever my pool runs out of bytebuffers to give, but anyways I think I'ma just allocate 2 bytebuffers for every client
t
I typically just have one (maybe two) byte buffers per socket, parse and dispatch pooled objects (or some such)
c
Yeah, one for read and one for write
Btw, this is an open source project, https://github.com/camdenorrb/Netlius
I need to post the latest update when I figure out why it's not working tho
And... just fixed it
pushed
t
interesting... I just wrote a coroutine network layer like this.
c
Fancy
t
except single threaded.
c
Oh
t
yeah. strange for sure. The project has odd performance criteria.. and context switches blow that all up.
c
Any suggestions?
t
so, these are blocking sockets? it looks like? (looking at Client.kt)
c
Nope, they are nonblocking, but I need locks to make sure I'm not writing multiple at the same time
t
in case the client tries to write to the same socket from different threads at the same time?
c
Yes
t
seems like an odd thing to catch. Are reads being dispatched on different threads?
c
Could be
I actually got an error on a single-threaded test which I had to use the locks to fix
Since it's all async, a single thread can write multiple at once
Although the suspending of the coroutine should've fixed that 😕
But that wouldn't help for multiple coroutines
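A minimal sketch of the locking idea, using a coroutine Mutex rather than whatever Netlius actually does, with the underlying write abstracted away:

```kotlin
import java.nio.ByteBuffer
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock

// Serializes writes so two coroutines (even ones hopping threads) can't interleave
// their writes to the same socket.
class GuardedWriter(private val rawWrite: suspend (ByteBuffer) -> Unit) {

    private val writeLock = Mutex()

    suspend fun write(buffer: ByteBuffer) = writeLock.withLock {
        rawWrite(buffer) // only one coroutine at a time gets here per writer
    }
}
```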
t
I don't see a selector(), you sure these sockets are non-blocking? It looks like it's just calling channel.read()
c
Yes, the channel.read uses a callback which runs on another thread
Notice ReadCompletionHandler and how the continuation continues
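That pattern, roughly (a sketch, not Netlius's actual ReadCompletionHandler):

```kotlin
import java.nio.ByteBuffer
import java.nio.channels.AsynchronousSocketChannel
import java.nio.channels.CompletionHandler
import kotlin.coroutines.resume
import kotlin.coroutines.resumeWithException
import kotlinx.coroutines.suspendCancellableCoroutine

// The channel invokes the handler on one of its group's threads; the handler then
// resumes the suspended coroutine with the result (or the failure).
suspend fun AsynchronousSocketChannel.readSuspending(buffer: ByteBuffer): Int =
    suspendCancellableCoroutine { continuation ->
        read(buffer, Unit, object : CompletionHandler<Int, Unit> {
            override fun completed(bytesRead: Int, attachment: Unit) =
                continuation.resume(bytesRead)

            override fun failed(exc: Throwable, attachment: Unit) =
                continuation.resumeWithException(exc)
        })
    }
```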
t
so you need a thread per socket? or a queue of IO requests that run in another thread?
c
Pretty sure they use the same ThreadPool
t
yeah, but each socket eats a thread when it calls read (or write w/ backpressure). So if the # of threads in the pool is less than the # of sockets... things get weird. Is my understanding correct?
c
Depends on a lot of factors actually, including the OS, but for EPOLL it seems to use a cachedThreadPool, which would mean yes if they aren't doing additional scheduling, which they might since it's EPOLL
t
but epoll only helps if you're using non-blocking sockets... which I don't think you are.
c
it is non-blocking sockets
t
where do you set the channel to non-blocking, I don't see it.
c
```
AsynchronousServerSocketChannel
AsynchronousSocketChannel
```
t
ahhh... I've never used any of these async channels.
c
Yeah, it's interesting
t
when they call back your handler, what thread are you on, and how do you control it?
BTW: The docs say it deals with writes from different threads (read will throw an exception though).
c
I'm on a random thread from the EPOLL event handling threadpool, and I didn't see them using different threads for read/write specifically for EPOLL when I went through the code
Maybe I missed something tho
t
epoll/select are all typically done single-threaded, so you're probably correct.
now that I understand better, this looks close to what I did. I just did the select() polling myself.
c
Ah, nice, it would be interesting if you did a speedtest between the two
t
It's a little odd to parse int/float/long each w/ a separate call to read(). Typically you read a big chunk and parse all the floats/ints/longs out of that.
c
Ah yes, that's more up to the protocol; if you saw the Minecraft server implementation I'm using this for, I'm doing just that
Which is why I'ma start adding extensions for bytebuffers
t
also there is no guarantee that the read() returns as many bytes as the other side's write() did. So you might be reading 3-4 fields in 1 read(), but you only parse the first and drop the rest (at least it looks like that to me).
Also, there is no guarantee that a call to read() will get all the bytes from a write() on the other side, so you should check the length before you parse, e.g. you need 8 bytes for a long, but there might be only 7 in the buffer.
c
For the MC server implementation I get the length of the packet first and then read a bytebuffer of that size
So there is a guarantee in that protocol
As for the Client based reading, it does expect a certain amount of bytes per each read
t
I'm just looking at the readLong() in Client.kt. It looks like it calls the socket's read() and then goes to parse 8 bytes out of the bytebuffer.
c
Ah, there is a guarantee, otherwise it'll throw a BufferUnderflowException
t
right... but that's kinda bad. If you didn't get 8 bytes, you should read again until you get 8 (assuming you're expecting a long on the wire).
c
Oh wait it's actually a ReadTimeout if they don't provide up to the limit of the buffer
So the client has 30 seconds to send the full amount of data
t
also, if there are any other fields after the 8 byte long, they are lost here.
c
And if the client has a readtimeout it disconnects completely
t
right... but you're assuming that if the client wrote 8 bytes, your read will receive 8 bytes in one read() call... which isn't true. That's not how TCP works.
most of the time it's true... but not all the time.
c
Actually it'll wait until all the bytes are sent so it's not exactly "one read call"
t
and it's pretty much guaranteed to happen if you have TCP back-pressure happening.
c
What's that?
t
so TCP is a stream... it doesn't honor packetization. It just guarantees the bytes arrive in the same order, but not in the same packets.
c
Is there a test I could run to simulate the issue you're describing?
t
to take Nagle for example: a TCP stack will wait a while before sending data (waiting for an ACK); when it does send the data, it sends all it has. So if the app called write() 3 times while that was happening, you get one packet, and one read() on the other side with all 3 writes in it.
so when backpressure kicks in, it means the reader can't keep up w/ the writer, which means your data is queueing in the stack, and the writer's calls to write() start to block.
so if your read buffer's size isn't an even multiple of the size of the kernel's TCP buffers, you're guaranteed to have a partial "message" (Long) at some point.
c
So is there a test I could do?
t
yeah, there is an easy way to test. Have a writer sending lots of small messages, then have a reader that calls sleep() between each read... a slow reader. You'll see the writer's write() start to block, and the reader will be getting full buffers every time it calls read()
in general, you should make sure there are enough bytes available in the buffer before you try and parse anything. If there aren't, you need to call compact() and read() again until there are.
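A sketch of that check-then-compact loop (using a blocking ReadableByteChannel just to keep the example short; the buffer is assumed to be in read mode between parses):

```kotlin
import java.io.EOFException
import java.nio.ByteBuffer
import java.nio.channels.ReadableByteChannel

// Keep reading until the buffer holds at least `needed` unparsed bytes.
fun ensureAvailable(channel: ReadableByteChannel, buffer: ByteBuffer, needed: Int) {
    while (buffer.remaining() < needed) {
        buffer.compact()                       // keep the partial message, switch to write mode
        if (channel.read(buffer) == -1) throw EOFException("stream ended mid-message")
        buffer.flip()                          // back to read mode for parsing
    }
}

// e.g. make sure all 8 bytes of a Long are there before parsing it:
// ensureAvailable(channel, buffer, Long.SIZE_BYTES); val value = buffer.getLong()
```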
c
How long of a delay do you reckon?
t
so if you get 4 bytes of your Long in one read() call, the next 4 should be the first 4 bytes of the next read() call. So make sure you hold on to the first 4.
Delay for the test? If the writer is writing open loop (as fast as it can), not much. It'll back-pressure quickly. 10 millis should do it.
c
Ok
t
for these kinds of things I write a message that has a 2-byte count, then that count of random ints. I make sure the reader and writer use the same seed in the random # generator, and parse all the ints and make sure they match the next int out of the random # generator
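Not that exact harness, but a sketch of the message format described (a 2-byte count followed by that many seeded-random ints); pair a writer sending these as fast as it can with a reader that sleeps ~10 ms between reads, per the suggestion above:

```kotlin
import java.nio.ByteBuffer
import java.util.Random

// Writer side: build one message from a seeded Random.
fun nextMessage(rng: Random, count: Int): ByteBuffer {
    val buffer = ByteBuffer.allocate(2 + count * Int.SIZE_BYTES)
    buffer.putShort(count.toShort())
    repeat(count) { buffer.putInt(rng.nextInt()) }
    buffer.flip() // ready to hand to write()
    return buffer
}

// Reader side: a Random seeded identically predicts every int, so any dropped or
// misaligned bytes show up as a mismatch.
fun verifyMessage(rng: Random, message: ByteBuffer) {
    val count = message.getShort().toInt()
    repeat(count) {
        val expected = rng.nextInt()
        val actual = message.getInt()
        check(actual == expected) { "stream corrupted: expected $expected, got $actual" }
    }
}
```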
c
Is this good enough?
t
but I think you have another problem as well (or I'm missing something else). If you call readLong() and the read() call returns 16 bytes, what happens to the 8 bytes after the long is parsed?
c
readLong reads 8 bytes only
The rest will stay waiting in the channel
t
ahhh.. I see. You set the limit() to the size.
c
Indeed
That test results in a timeout btw
Which makes sense
t
why would it time out?
oh, cuz you have a timeout when you call read().
c
Because eventually the delay between reading/writing is 30 seconds
t
I guess that'll work, but you're assuming a lot about the network and the protocol. There are a bunch of protocols you couldn't implement with this.
c
Idk, I bet I could figure them all out
t
there are protocols that have active TCP sessions that don't send anything for hours.
like telnet or ssh
c
Oh, as the api advances I'll add the option to change the timeout
t
yeah, and then you'll run into the partial read problem.
c
Eh I'll keep a mental note of this convo and if it does I'll think of something
t
if you take the timeout, your slow reader test will fail at some point.... but it's a valid TCP session with valid flow-control
on another note: calling read() is expensive, even if it's just going to return queued data. It's much more "normal" and fast to read as many bytes as you can when you call read(), and parse fields out of the buffer.
c
Which is what I do in the actual protocol
But the API also supports reading individual values from the client
I'm adding these extensions to Netlius to try to get people to use bytebuffers more often
But if you're using a compressed protocol, you'd be forced to use them at some point anyways, and I'll have examples for such as well
t
You should look at OpenHFT's Chronicle-Bytes. It's a really good wrapper around ByteBuffers
and does off-heap memory, and buffers larger than 2 gig, etc.
c
Heap memory isn't good for networking from what I heard. And I kinda want to avoid wrappers but if I could use extension functions instead I'll consider it
t
it's got a zillion writers / parsers on it. Stop-bit encoding and the works.
c
Definitely looks interesting
t
yeah. it does on heap or off heap. whatever you want.
c
Thank you for telling me about it, making a comment to look more into it
t
I like it better than Netty's buffer impl
c
Yeah, Netty is pretty damn fast tho somehow
I wish Kotlinx IO was finished :C
t
control of the threading model....
context switches... brutal...
c
Oof
Fibers would probably fix that in the future tho
t
that's basically what I did w/ mine. It's all coroutines, but only one thread.
it isn't very good wide (10K sockets), but it's fast when shallow.
c
Hm alright
t
all the really go-fast stuff tries to keep the # of threads <= the number of cores, and queues work until a thread is ready.
c
I'm probably gonna use some JNI in the future and make my own EPOLL wrapper, maybe that would help with such
t
that's what netty and those frameworks do. I've done it in the past and didn't see a huge boost over Java's epoll impl.
c
Interesting, I've never done JNI yet tho, so maybe it'll be fun regardless
t
except for UDP receive(). It's much faster to do your own and not new a SocketAddress() for every packet.
c
It would be sick if I could use Kotlin/Native in JVM as well
Then again Kotlin/Native is slow af atm
Oh?
Noted
t
but that's only useful if you're trying to do that zero-garbage-collection stuff, or at least that's where it's very useful.
c
Hm alright
t
there are a bunch of high performance impls that do stack bypass. The NIC just DMA's directly into main memory, and the rest is up to you
c
Fun
t
OpenOnload is one of the better known ones. https://www.openonload.org/
but yeah, a TCP stack in usermode. Party on the bits.
c
Bookmarked