https://kotlinlang.org logo
#coroutines
Title
# coroutines
c

camdenorrb

08/24/2020, 7:03 PM
How would you go about making a blocking method such as
Copy code
ByteBuffer.allocateDirect(bufferSize)
suspending, would you need to make a threadpool and do callbacks to unsuspend a coroutine or is there a better way?
o

octylFractal

08/24/2020, 7:05 PM
is
allocateDirect
really blocking? I can't see that taking a ton of time
c

camdenorrb

08/24/2020, 7:05 PM
Pretty sure?
o

octylFractal

08/24/2020, 7:06 PM
I guess it depends on how much
bufferSize
is, but I would not expect it to "block" for a long period of time if it's e.g. 8192
c

camdenorrb

08/24/2020, 7:07 PM
It is when you have a ton of clients pulling around the same time, hm
But anyways, I still wonder how one would go about making it suspending based
o

octylFractal

08/24/2020, 7:08 PM
well, there isn't really a callback for allocating memory since it should be near-instant (e.g.
malloc
doesn't have an async variant), so you'd just throw it on
<http://Dispatchers.IO|Dispatchers.IO>
c

camdenorrb

08/24/2020, 7:09 PM
Copy code
Executors.newFixedThreadPool(max(Runtime.getRuntime().availableProcessors(), 64)).asCoroutineDispatcher()
Would something like this be acceptable as-well instead of Dispatchers.IO. Also I've heard that allocateDirect is actually a big factor in networking api's
o

octylFractal

08/24/2020, 7:10 PM
I would mark that up to the native overhead rather than the allocation itself, i.e. you should look into pooling the buffers you can make your own IO pool if you want, yes
c

camdenorrb

08/24/2020, 7:11 PM
Yeah, we are designing a bufferpool, but by our design if a buffer greater than the size being cached is requested, it'll allocate a new one
o

octylFractal

08/24/2020, 7:11 PM
and I feel like the native overhead is probably around the same as thread-switching, though I don't have benchmarks to prove that
c

camdenorrb

08/24/2020, 7:12 PM
Perhaps
m

MiSikora

08/24/2020, 7:12 PM
Why
IO
though and not
Default
?
o

octylFractal

08/24/2020, 7:14 PM
yea, tbh that's a good question, you may not get any more efficiency from using a different thread here since my assumption is that the native overhead is the limiting factor here, which is entirely CPU-bound
I'm not entirely sure if the CPU can work on other tasks in a
malloc
call either
probably best to benchmark before doing premature optimizations
☝️ 2
I made a pretty bad benchmark at https://gist.github.com/octylFractal/c7f3862bd568c6f69a81bf5e6d62d1c9, but it consistently shows direct allocation WITHOUT context switching as fastest by 2x on my i7-9750H CPU @ 2.60GHz (8 threads, 4 cores)
Copy code
-1799201921
Time to normal allocate: 421ms
-334956506
Time to direct allocate: 213ms
-1950301415
Time to direct allocate on IO: 394ms
but this does seem to change if I bump the size of the allocation up, with a 1000x allocation and 1000 less tasks I get:
Copy code
-410572713
Time to normal allocate: 2076ms
1549332476
Time to direct allocate: 4274ms
1007409705
Time to direct allocate on IO: 1709ms
c

camdenorrb

08/24/2020, 7:25 PM
That seems like a long time to pause the thread a coroutine is running on
Especially for a networking api
o

octylFractal

08/24/2020, 7:25 PM
you might want to see how I measure this, that's a time for the loop as a whole, not a time per allocation (bad wording on my end perhaps)
time per allocation for 8192 bytes is sub-millisecond, approximately 2 microseconds, but that includes some overhead from the timing and hashing machinery
for 8192000, it's approximately 4ms, which is considerably worse, but even with context switching it's still about 1.7ms, and the bigger question is are you really allocating thousands of 8mb buffers?
c

camdenorrb

08/24/2020, 7:28 PM
If I was transmitting a file, yes
Also 8kb
o

octylFractal

08/24/2020, 7:29 PM
no, 8192000 is 8mb
c

camdenorrb

08/24/2020, 7:29 PM
Oh nvm oops
o

octylFractal

08/24/2020, 7:30 PM
I would expect lots of 8192 byte / 8kb buffers for transferring data -- not 8mb
c

camdenorrb

08/24/2020, 7:30 PM
Fair enough
o

octylFractal

08/24/2020, 7:32 PM
wonder if non-power-of-2 sizes affect things as well, using 2**23 for 8mb instead of just 1000x 8192 gives:
Copy code
265224745
Time to direct allocate: 434ms
1744648169
Time to direct allocate on IO: 1230ms
c

camdenorrb

08/24/2020, 7:33 PM
Might need to add a warmup hm
o

octylFractal

08/24/2020, 7:33 PM
yea, this is a really bad benchmark 🙂
z

Zach Klippenstein (he/him) [MOD]

08/24/2020, 7:47 PM
Is the CPU actually doing something during this pause (like writing zeros), or is it just waiting for some other hardware/memory controller to do something? If the CPU is actually doing work, changing threads is probably pointless unless the calling thread is special like a UI thread - it's gonna have to stop other work to do your work anyway. The IO dispatcher is only an optimization for threads which spend most/all their time sleeping. If you have 64 threads (default IO dispatcher limit) all trying to actually use the CPU, only some of them are actually gonna get to execute at the same time anyway.
o

octylFractal

08/24/2020, 7:48 PM
I think it's technically undefined what the CPU is doing in malloc 🙂
c

camdenorrb

08/24/2020, 7:48 PM
My benchmark shows it's pretty fast anyways, so eh not a big deal to me anymore
But if I were to make something blocking suspending instead callbacks is what y'all would do?
o

octylFractal

08/24/2020, 7:50 PM
in general, if it offers a callback-based API. in this case there isn't really an option to do so
if it's something like reading from an InputStream, there isn't really any "callback", you just put it on the IO dispatcher (or similar), and Kotlin will handle the "callback" using coroutine mechanisms
c

camdenorrb

08/24/2020, 7:55 PM
Interesting, thank you! 🙂
o

octylFractal

08/24/2020, 8:04 PM
some more stats from a slightly better benchmark (has warmup and more iters):
Copy code
direct allocate on IO stats: LongSummaryStatistics{count=10, sum=12404, min=1013, average=1240.400000, max=1322} (in ms)
direct allocate stats: LongSummaryStatistics{count=10, sum=11710, min=1008, average=1171.000000, max=1312} (in ms)
so it's basically equivalent either way
but it's probably better to use the simpler form
g

gildor

08/24/2020, 11:57 PM
Allocate buffers on the same thread were you are gonna use it (if you don't have buffer pool), allocation is the least heavy part, you probably do more work to get data for this buffer (networking, io, computation etc), so choose thread police depending on your main workload
t

TwoClocks

08/26/2020, 2:00 AM
I hate it when people answer this way, but I'm gonna do it anyway : Why are you allocating ByteBuffers at any time other than startup? especially direct ones that are mostly useful for IO?
c

camdenorrb

08/26/2020, 2:00 AM
Whenever my pool runs out of bytebuffers to give, but anyways I think I'ma just allocate 2 bytebuffers for every client
t

TwoClocks

08/26/2020, 2:02 AM
I typicaly just have one (maybe two) byte buffers per socket, parse and dispatch pooled objects (or some such)
c

camdenorrb

08/26/2020, 2:02 AM
Yeah, one for read and one for write
Btw, this is an open source project, https://github.com/camdenorrb/Netlius
I need to post the latest update when I figure out why it's not working tho
And... just fixed it
pushed
t

TwoClocks

08/26/2020, 2:07 AM
interesting... I just wrote a coroutine network layer like this.
c

camdenorrb

08/26/2020, 2:07 AM
Fancy
t

TwoClocks

08/26/2020, 2:07 AM
except single threaded.
c

camdenorrb

08/26/2020, 2:07 AM
Oh
t

TwoClocks

08/26/2020, 2:10 AM
yeah. strange for sure. The project has odd performance criteria.. and context switches blow that all up.
c

camdenorrb

08/26/2020, 2:11 AM
Any suggestions?
t

TwoClocks

08/26/2020, 2:12 AM
so, these are blocking sockets? it looks like? (looking at Client.kt)
c

camdenorrb

08/26/2020, 2:13 AM
Nope, they are nonblocking, but I need locks to make sure I'm not writing multiple at the same time
t

TwoClocks

08/26/2020, 2:14 AM
in case the client tries to write to the same socket from different threads at the same time?
c

camdenorrb

08/26/2020, 2:14 AM
Yes
t

TwoClocks

08/26/2020, 2:16 AM
seems like a odd thing to catch. Are reads being dispatches on different threads?
c

camdenorrb

08/26/2020, 2:17 AM
Could be
I actually got an error on a singlethreaded test which I had to use the locks to fix
Since it's all async, a single thread can write multiple at once
Altho, the suspending of the coroutine should've fixed that 😕
But that wouldn't help for multicoroutine
t

TwoClocks

08/26/2020, 2:21 AM
I don't see a selector(), you sure these sockets are non-blocking? It looks like it's just calling channel.read()
c

camdenorrb

08/26/2020, 2:21 AM
Yes, the channel.read uses a callback which runs on another thread
Notice ReadCompletionHandler and how the continuation continues
t

TwoClocks

08/26/2020, 2:23 AM
so you need a thread per socket? or a queue of IO requests that run in another thread?
c

camdenorrb

08/26/2020, 2:23 AM
Pretty sure they use the same ThreadPool
t

TwoClocks

08/26/2020, 2:25 AM
yeah, but each socket eats a thread when it calles read (or write w/ backpresure). So if the # of threads in the pool is less than the # of sockets... things get weird. is my understanding correct?
c

camdenorrb

08/26/2020, 2:29 AM
Depends on a lot of factors actually, including OS but for EPOLL it seems to use a cachedThreadPool which would mean yes if they aren't doing additional scheduling which they might since it's EPOLL
t

TwoClocks

08/26/2020, 2:31 AM
but epoll only helps if your using non-blocking sockets... which I don't think you are.
c

camdenorrb

08/26/2020, 2:31 AM
it is non-blocking sockets
t

TwoClocks

08/26/2020, 2:32 AM
where do you set the channel to non-blocking, I don't see it.
c

camdenorrb

08/26/2020, 2:32 AM
Copy code
AsynchronousServerSocketChannel
Copy code
AsynchronousSocketChannel
t

TwoClocks

08/26/2020, 2:35 AM
ahhh... I've never used any of these async channels.
c

camdenorrb

08/26/2020, 2:35 AM
Yeah, it's interesting
t

TwoClocks

08/26/2020, 2:37 AM
when they call back your handler, what thread are you on, and how do you control it?
BTW: The doc's say it deals with writes from different threads. (read will throw a exception though).
c

camdenorrb

08/26/2020, 2:39 AM
I'm on a random thread from the EPOLL event handling threadpool, and I didn't see them using different threads for read/write specifically for EPOLL when I went through the code
Maybe I missed something tho
t

TwoClocks

08/26/2020, 2:41 AM
epoll/select are all typically don'e single-threaded. so you're probably correct.
now that I understand better, this looks close to what I did. I just did the select() pulling my self.
c

camdenorrb

08/26/2020, 2:46 AM
Ah, nice, it would be interesting if you did a speedtest between the two
t

TwoClocks

08/26/2020, 2:47 AM
It's a little odd to parse in/float/long each w/ a seperate call to read(). typically you read a big chunk and parse all the floats/ints/longs outa that.
c

camdenorrb

08/26/2020, 2:48 AM
Ah yes, that's more up to the protocol, if you saw my Minecraft server implementation I'm using this for I'm doing just that
Which is why I'ma start adding extensions for bytebuffers
t

TwoClocks

08/26/2020, 2:49 AM
also there is no guarantee that the read() returns as many bytes as the other sides write() did. so you might be reading 3-4 fields in 1 read(), but you only parse the first and drop the rest (at least it looks like that to me).
Also, there is no guarantee that a call to read() will get all the bytes from a write() on the other side. so you should check the length before you parse. e.g. you need 8 bytes for a long, but there might be only 7 in the buffer.
c

camdenorrb

08/26/2020, 2:50 AM
For the MC server implementation I get the length of the packet first and then read a bytebuffer of that size
So there is a guarantee in that protocol
As for the Client based reading, it does expect a certain amount of bytes per each read
t

TwoClocks

08/26/2020, 2:52 AM
I'm just looking at the readLong() in the Client.kt. it looks like it call's the sockets read() and then goes to parse 8 bytes out of the bytebuffer.
c

camdenorrb

08/26/2020, 2:52 AM
Ah there is a guarantee otherwise it'll throw an error for BufferUnderFlow
t

TwoClocks

08/26/2020, 2:53 AM
right...but that's kinda bad. if you didn't get 8 bytes, you should read again until you get 8 (assuming your expecting a long on the wire).
c

camdenorrb

08/26/2020, 2:54 AM
Oh wait it's actually a ReadTimeout if they don't provide up to the limit of the buffer
So the client has 30 seconds to send the full amount of data
t

TwoClocks

08/26/2020, 2:54 AM
also, if there are any other fields after the 8 byte long, they are lost here.
c

camdenorrb

08/26/2020, 2:55 AM
And if the client has a readtimeout it disconnects completely
t

TwoClocks

08/26/2020, 2:55 AM
right...but your assuming if the client wrote 8 bytes, your read will receive 8 bytes in one read() call... which isn't true. that's not how TCP works.
most of the time it's true... but not all the time.
c

camdenorrb

08/26/2020, 2:55 AM
Actually it'll wait until all the bytes are sent so it's not exactly "one read call"
t

TwoClocks

08/26/2020, 2:56 AM
and it's pretty much guaranteed to happen if you have TCP back-pressure happening.
c

camdenorrb

08/26/2020, 2:56 AM
What's that?
t

TwoClocks

08/26/2020, 2:57 AM
so TCP is a stream.. it doesn't honer packetization. It just guarantees the bytes arrive in the same order. but not in the same packets.
c

camdenorrb

08/26/2020, 2:58 AM
Is there a test I could run to simulate the issue you're describing?
t

TwoClocks

08/26/2020, 2:58 AM
to take nagle for example. A TCP stack will wait a while before sending data (waiting for a ACK), when it does send the data, it sends all it has. So if the app called write() 3 times while that as happening, you get one packet, and one read() on the other side with all 3 writes in it.
so when backpresure kicks in, it means the reader can't keep up w/ the writer. when means your data is queueing in the stack, and the writers call to write() start to block.
so if your read buffer's size isn't a event modulo of the size of the kernels' tcp buffers, your guaranteed to have a partial "message" (Long) at some point.
c

camdenorrb

08/26/2020, 3:02 AM
So is there a test I could do?
t

TwoClocks

08/26/2020, 3:02 AM
yeah. there is a easy way to test. Have a writer sending lots of small messages, then have a reader that calls sleep() between each read... a slow reader. You'll see the writer's write() start to block, and the reader() will be getting full buffers every time they call read()
in general, you should make sure there are enough bytes available in the buffer before you try and parse anthing. if their arn't you need to call compact() and read() again until there are.
c

camdenorrb

08/26/2020, 3:03 AM
How long of a delay do you reckon?
t

TwoClocks

08/26/2020, 3:04 AM
so if you get 4 bytes of your Long in one read() call, the next 4 should be the first 4 bytes of the next read() call. So make sure you hold on the first 4.
dealy for the test? If the writer is writing open loop (as fast as it can), not much. it'll back-presure quick. 10millis should do it.
c

camdenorrb

08/26/2020, 3:05 AM
Ok
t

TwoClocks

08/26/2020, 3:06 AM
for these kinds of things I write a message that has a 2 byte count, then a count of random ints. I make sure the reader and writer use the same seed in the random # generator and parse all the ints and make sure they match the next int out of the random # generator
c

camdenorrb

08/26/2020, 3:08 AM
Is this good enough?
t

TwoClocks

08/26/2020, 3:08 AM
but I think you have another problem as well (or I'm missing something else). If you call readLong() and the read() call returns 16 bytes, what happens to the 8 bytes after the long is parsed?
c

camdenorrb

08/26/2020, 3:09 AM
readLong reads 8 bytes only
The rest will stay waiting in the channel
t

TwoClocks

08/26/2020, 3:10 AM
ahhh.. I see. You set the limit() to the size.
c

camdenorrb

08/26/2020, 3:10 AM
Indeed
That test results in a timeout btw
Which makes sense
t

TwoClocks

08/26/2020, 3:11 AM
why would it time out?
oh, cuz you have a timeout when you call read().
c

camdenorrb

08/26/2020, 3:12 AM
Because eventually the delay between reading/writing is 30 seconds
t

TwoClocks

08/26/2020, 3:13 AM
I guess that'll work, but your assuming a lot about the network and the protocol. There are a bunch of protocols you couldn't implement with this.
c

camdenorrb

08/26/2020, 3:13 AM
Idk I bet I could figure out them all
t

TwoClocks

08/26/2020, 3:14 AM
there are protocols that have active TCP sessions that don't send anything for hours.
like telnet or ssh
c

camdenorrb

08/26/2020, 3:14 AM
Oh, as the api advances I'll add the option to change the timeout
t

TwoClocks

08/26/2020, 3:15 AM
yeah, and then you'll run into the partical read problem.
c

camdenorrb

08/26/2020, 3:16 AM
Eh I'll keep a mental note of this convo and if it does I'll think of something
t

TwoClocks

08/26/2020, 3:16 AM
if you take the timeout, your slow reader test will fail at some point.... but it's a valid TCP session with valid flow-control
on another note : calling read() is expensive. Even if it's just going to return queued data. It's much more "normal" and fast to read as many bytes as you can when you call read(), and parse fields out of the buffer.
c

camdenorrb

08/26/2020, 3:19 AM
Which is what I do in the actual protocol
But the API also supports reading individual values from the client
I'm adding these extensions to netlius to try to get people to use bytebuffers more often
But if you're using a compressed protocol, you'd be forced to use them at some point anyways and I'll have examples for such aswell
t

TwoClocks

08/26/2020, 3:22 AM
You should look at OpendHFT's Chronicle-Bytes. It's a really good wrapper around ByteBuffers
and does off heap memory, and buffer larger than 2gig, etc.
c

camdenorrb

08/26/2020, 3:23 AM
Heap memory isn't good for networking from what I heard. And I kinda want to avoid wrappers but if I could use extension functions instead I'll consider it
t

TwoClocks

08/26/2020, 3:23 AM
it's got a zillion writers / parers on it. stop-but encoding and the works.
c

camdenorrb

08/26/2020, 3:23 AM
Definitely looks interesting
t

TwoClocks

08/26/2020, 3:24 AM
yeah. it does on heap or off heap. whatever you want.
c

camdenorrb

08/26/2020, 3:24 AM
Thank you for telling me about it, making a comment to look more into it
t

TwoClocks

08/26/2020, 3:24 AM
I like it better than netty'd buffer imppl
c

camdenorrb

08/26/2020, 3:24 AM
Yeah, Netty is pretty damn fast tho somehow
I wish Kotlinx IO was finished :C
t

TwoClocks

08/26/2020, 3:25 AM
control of the threading model....
context switches... brutal...
c

camdenorrb

08/26/2020, 3:25 AM
Oof
Fibers would probably fix that in the future tho
t

TwoClocks

08/26/2020, 3:26 AM
that's basically what I did w/ mine. It's all coroutines, but only one thread.
it isn't very good wide (10K sockets), but it's fast when shallow.
c

camdenorrb

08/26/2020, 3:28 AM
Hm alright
t

TwoClocks

08/26/2020, 3:28 AM
all the really go-fast stuff tries to keep the # of threads <= the number of cores., and queues work until a thread is ready.
c

camdenorrb

08/26/2020, 3:29 AM
I'm probably gonna use some JNI in the future and make my own EPOLL wrapper, maybe that would help with such
t

TwoClocks

08/26/2020, 3:30 AM
that's what netty and those framsworks do. I've done it in the past and didn't see a huge boost over java's epoll impl.
c

camdenorrb

08/26/2020, 3:31 AM
Interesting, I've never done JNI yet tho, so maybe it'll be fun regardless
t

TwoClocks

08/26/2020, 3:31 AM
except for UDP receive(). it's much faster to do you own and not new a SocketAddress() for every packet.
c

camdenorrb

08/26/2020, 3:31 AM
It would be sick if I could use Kotlin/Native in JVM aswell
Then again Kotlin/Native is slow af atm
Oh?
Noted
t

TwoClocks

08/26/2020, 3:32 AM
but that's only useful if your trying to do that zero garbage collection stuff. or at least that's where it's very useful.
c

camdenorrb

08/26/2020, 3:33 AM
Hm alright
t

TwoClocks

08/26/2020, 3:48 AM
there are a bunch of high performace impls that do stack bypass. They NIC just DMA's directly in to main memory, and the rest is up to you
c

camdenorrb

08/26/2020, 3:49 AM
Fun
t

TwoClocks

08/26/2020, 3:49 AM
OpenOnload is one of the better known ones. https://www.openonload.org/
but yeah a tcpstack in usermode. party on the bits.
c

camdenorrb

08/26/2020, 3:50 AM
Bookmarked
52 Views