# coroutines
c
How would you go about making a blocking method such as
```kotlin
ByteBuffer.allocateDirect(bufferSize)
```
suspending, would you need to make a threadpool and do callbacks to unsuspend a coroutine or is there a better way?
o
is `allocateDirect` really blocking? I can't see that taking a ton of time
c
Pretty sure?
o
I guess it depends on how much `bufferSize` is, but I would not expect it to "block" for a long period of time if it's e.g. 8192
c
It is when you have a ton of clients pulling around the same time, hm
But anyways, I still wonder how one would go about making it suspending based
o
well, there isn't really a callback for allocating memory since it should be near-instant (e.g. `malloc` doesn't have an async variant), so you'd just throw it on `Dispatchers.IO`
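A minimal sketch of that suggestion (the function name is just illustrative):

```kotlin
import java.nio.ByteBuffer
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

// Suspends the calling coroutine while the blocking allocation runs on an IO thread.
suspend fun allocateDirectSuspending(bufferSize: Int): ByteBuffer =
    withContext(Dispatchers.IO) {
        ByteBuffer.allocateDirect(bufferSize)
    }
```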
c
```kotlin
Executors.newFixedThreadPool(max(Runtime.getRuntime().availableProcessors(), 64)).asCoroutineDispatcher()
```
Would something like this be acceptable as well, instead of Dispatchers.IO? Also I've heard that allocateDirect is actually a big factor in networking APIs
o
I would chalk that up to the native overhead rather than the allocation itself, i.e. you should look into pooling the buffers. You can make your own IO pool if you want, yes
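A rough sketch of making and using such a pool (the 8192 allocation is just a placeholder); note that, unlike Dispatchers.IO, a dispatcher you create yourself has to be shut down:

```kotlin
import java.nio.ByteBuffer
import java.util.concurrent.Executors
import kotlin.math.max
import kotlinx.coroutines.asCoroutineDispatcher
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withContext

fun main() = runBlocking {
    // Sized like Dispatchers.IO's default: 64 threads, or the core count if that's higher.
    val myIoPool = Executors.newFixedThreadPool(
        max(Runtime.getRuntime().availableProcessors(), 64)
    ).asCoroutineDispatcher()

    // `use` closes the dispatcher afterwards; otherwise its non-daemon threads
    // would keep the JVM alive.
    myIoPool.use { pool ->
        val buffer = withContext(pool) { ByteBuffer.allocateDirect(8192) }
        println(buffer.capacity())
    }
}
```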
c
Yeah, we are designing a buffer pool, but by our design, if a buffer greater than the size being cached is requested, it'll allocate a new one
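Not the actual pool being designed, just a minimal sketch of that fallback behavior (the class name and 8192 default are made up for illustration):

```kotlin
import java.nio.ByteBuffer
import java.util.concurrent.ConcurrentLinkedQueue

// Buffers up to `pooledSize` are reused; anything bigger is allocated fresh and not cached.
class DirectBufferPool(private val pooledSize: Int = 8192) {

    private val free = ConcurrentLinkedQueue<ByteBuffer>()

    fun take(requestedSize: Int): ByteBuffer =
        if (requestedSize > pooledSize) {
            ByteBuffer.allocateDirect(requestedSize) // oversized request: bypass the pool
        } else {
            free.poll()?.also { it.clear() } ?: ByteBuffer.allocateDirect(pooledSize)
        }

    fun give(buffer: ByteBuffer) {
        // Only cache buffers of the pooled size; oversized ones are left for the GC.
        if (buffer.capacity() == pooledSize) free.offer(buffer)
    }
}
```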
o
and I feel like the native overhead is probably around the same as thread-switching, though I don't have benchmarks to prove that
c
Perhaps
m
Why `IO` though and not `Default`?
o
yea, tbh that's a good question. You may not get any more efficiency from using a different thread, since my assumption is that the native overhead is the limiting factor here, which is entirely CPU-bound
I'm not entirely sure if the CPU can work on other tasks in a `malloc` call either
probably best to benchmark before doing premature optimizations
I made a pretty bad benchmark at https://gist.github.com/octylFractal/c7f3862bd568c6f69a81bf5e6d62d1c9, but it consistently shows direct allocation WITHOUT context switching as fastest by 2x on my i7-9750H CPU @ 2.60GHz (8 threads, 4 cores)
```
-1799201921
Time to normal allocate: 421ms
-334956506
Time to direct allocate: 213ms
-1950301415
Time to direct allocate on IO: 394ms
```
but this does seem to change if I bump the size of the allocation up, with a 1000x allocation and 1000 fewer tasks I get:
```
-410572713
Time to normal allocate: 2076ms
1549332476
Time to direct allocate: 4274ms
1007409705
Time to direct allocate on IO: 1709ms
```
c
That seems like a long time to pause the thread a coroutine is running on
Especially for a networking api
o
you might want to see how I measure this, that's a time for the loop as a whole, not a time per allocation (bad wording on my end perhaps)
time per allocation for 8192 bytes is sub-millisecond, approximately 2 microseconds, but that includes some overhead from the timing and hashing machinery
for 8192000, it's approximately 4ms, which is considerably worse, but even with context switching it's still about 1.7ms, and the bigger question is are you really allocating thousands of 8mb buffers?
c
If I was transmitting a file, yes
Also 8kb
o
no, 8192000 is 8mb
c
Oh nvm oops
o
I would expect lots of 8192 byte / 8kb buffers for transferring data -- not 8mb
c
Fair enough
o
wonder if non-power-of-2 sizes affect things as well, using 2**23 for 8mb instead of just 1000x 8192 gives:
```
265224745
Time to direct allocate: 434ms
1744648169
Time to direct allocate on IO: 1230ms
```
c
Might need to add a warmup hm
o
yea, this is a really bad benchmark 🙂
z
Is the CPU actually doing something during this pause (like writing zeros), or is it just waiting for some other hardware/memory controller to do something? If the CPU is actually doing work, changing threads is probably pointless unless the calling thread is special like a UI thread - it's gonna have to stop other work to do your work anyway. The IO dispatcher is only an optimization for threads which spend most/all their time sleeping. If you have 64 threads (default IO dispatcher limit) all trying to actually use the CPU, only some of them are actually gonna get to execute at the same time anyway.
o
I think it's technically undefined what the CPU is doing in malloc 🙂
c
My benchmark shows it's pretty fast anyways, so eh not a big deal to me anymore
But if I were to make something blocking suspending instead, callbacks are what y'all would do?
o
in general, if it offers a callback-based API. in this case there isn't really an option to do so
if it's something like reading from an InputStream, there isn't really any "callback", you just put it on the IO dispatcher (or similar), and Kotlin will handle the "callback" using coroutine mechanisms
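For example, a blocking InputStream read wrapped the same way (a sketch, mirroring the allocation example above):

```kotlin
import java.io.InputStream
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

// No callback exists for a blocking InputStream, so the read is just shifted onto
// an IO thread; the calling coroutine suspends until it completes.
suspend fun InputStream.readSuspending(buffer: ByteArray): Int =
    withContext(Dispatchers.IO) { read(buffer) }
```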
c
Interesting, thank you! 🙂
o
some more stats from a slightly better benchmark (has warmup and more iters):
```
direct allocate on IO stats: LongSummaryStatistics{count=10, sum=12404, min=1013, average=1240.400000, max=1322} (in ms)
direct allocate stats: LongSummaryStatistics{count=10, sum=11710, min=1008, average=1171.000000, max=1312} (in ms)
```
so it's basically equivalent either way
but it's probably better to use the simpler form
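Not the gist's code, just the rough warmup-then-measure shape being described (iteration counts are arbitrary):

```kotlin
import java.nio.ByteBuffer
import java.util.Arrays
import kotlin.system.measureTimeMillis

fun main() {
    val iterations = 10_000
    val size = 8192

    // Warmup: exercise the hot path so the JIT has compiled it before anything is timed.
    repeat(iterations) { ByteBuffer.allocateDirect(size) }

    // Measured runs: keep a running hash so the allocations can't be optimized away.
    val times = LongArray(10) {
        var hash = 0
        val ms = measureTimeMillis {
            repeat(iterations) { hash = hash * 31 + ByteBuffer.allocateDirect(size).hashCode() }
        }
        println(hash)
        ms
    }
    println("direct allocate stats: ${Arrays.stream(times).summaryStatistics()} (in ms)")
}
```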
g
Allocate buffers on the same thread where you are gonna use them (if you don't have a buffer pool); allocation is the least heavy part, you probably do more work to get data for this buffer (networking, IO, computation, etc.), so choose your thread policy depending on your main workload
t
I hate it when people answer this way, but I'm gonna do it anyway: why are you allocating ByteBuffers at any time other than startup? Especially direct ones, which are mostly useful for IO?
c
Whenever my pool runs out of bytebuffers to give, but anyways I think I'ma just allocate 2 bytebuffers for every client
t
I typically just have one (maybe two) byte buffers per socket, parse and dispatch pooled objects (or some such)
c
Yeah, one for read and one for write
Btw, this is an open source project, https://github.com/camdenorrb/Netlius
I need to post the latest update when I figure out why it's not working tho
And... just fixed it
pushed
t
interesting... I just wrote a coroutine network layer like this.
c
Fancy
t
except single threaded.
c
Oh
t
yeah. strange for sure. The project has odd performance criteria.. and context switches blow that all up.
c
Any suggestions?
t
so, these are blocking sockets? it looks like? (looking at Client.kt)
c
Nope, they are nonblocking, but I need locks to make sure I'm not writing multiple at the same time
t
in case the client tries to write to the same socket from different threads at the same time?
c
Yes
t
seems like an odd thing to catch. Are reads being dispatched on different threads?
c
Could be
I actually got an error on a single-threaded test which I had to use the locks to fix
Since it's all async, a single thread can write multiple at once
Although the suspending of the coroutine should've fixed that 😕
But that wouldn't help for multiple coroutines
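A minimal sketch of the locking idea, using a coroutine Mutex rather than whatever Netlius actually does, with the underlying write abstracted away:

```kotlin
import java.nio.ByteBuffer
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock

// Serializes writes so two coroutines (even ones hopping threads) can't interleave
// their writes to the same socket.
class GuardedWriter(private val rawWrite: suspend (ByteBuffer) -> Unit) {

    private val writeLock = Mutex()

    suspend fun write(buffer: ByteBuffer) = writeLock.withLock {
        rawWrite(buffer) // only one coroutine at a time gets here per writer
    }
}
```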
t
I don't see a selector(), you sure these sockets are non-blocking? It looks like it's just calling channel.read()
c
Yes, the channel.read uses a callback which runs on another thread
Notice ReadCompletionHandler and how the continuation continues
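That pattern, roughly (a sketch, not Netlius's actual ReadCompletionHandler):

```kotlin
import java.nio.ByteBuffer
import java.nio.channels.AsynchronousSocketChannel
import java.nio.channels.CompletionHandler
import kotlin.coroutines.resume
import kotlin.coroutines.resumeWithException
import kotlinx.coroutines.suspendCancellableCoroutine

// The channel invokes the handler on one of its group's threads; the handler then
// resumes the suspended coroutine with the result (or the failure).
suspend fun AsynchronousSocketChannel.readSuspending(buffer: ByteBuffer): Int =
    suspendCancellableCoroutine { continuation ->
        read(buffer, Unit, object : CompletionHandler<Int, Unit> {
            override fun completed(bytesRead: Int, attachment: Unit) =
                continuation.resume(bytesRead)

            override fun failed(exc: Throwable, attachment: Unit) =
                continuation.resumeWithException(exc)
        })
    }
```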
t
so you need a thread per socket? or a queue of IO requests that run in another thread?
c
Pretty sure they use the same ThreadPool
t
yeah, but each socket eats a thread when it calls read (or write w/ backpressure). So if the # of threads in the pool is less than the # of sockets... things get weird. Is my understanding correct?
c
Depends on a lot of factors actually, including the OS, but for EPOLL it seems to use a cachedThreadPool, which would mean yes if they aren't doing additional scheduling, which they might since it's EPOLL
t
but epoll only helps if you're using non-blocking sockets... which I don't think you are.
c
it is non-blocking sockets
t
where do you set the channel to non-blocking, I don't see it.
c
```
AsynchronousServerSocketChannel
AsynchronousSocketChannel
```
t
ahhh... I've never used any of these async channels.
c
Yeah, it's interesting
t
when they call back your handler, what thread are you on, and how do you control it?
BTW: The docs say it deals with writes from different threads (read will throw an exception though).
c
I'm on a random thread from the EPOLL event handling threadpool, and I didn't see them using different threads for read/write specifically for EPOLL when I went through the code
Maybe I missed something tho
t
epoll/select are all typically done single-threaded, so you're probably correct.
now that I understand better, this looks close to what I did. I just did the select() polling myself.
c
Ah, nice, it would be interesting if you did a speedtest between the two
t
It's a little odd to parse int/float/long each w/ a separate call to read(). Typically you read a big chunk and parse all the floats/ints/longs out of that.
c
Ah yes, that's more up to the protocol; if you saw the Minecraft server implementation I'm using this for, I'm doing just that
Which is why I'ma start adding extensions for bytebuffers
t
also there is no guarantee that the read() returns as many bytes as the other side's write() did. So you might be reading 3-4 fields in 1 read(), but you only parse the first and drop the rest (at least it looks like that to me).
Also, there is no guarantee that a call to read() will get all the bytes from a write() on the other side, so you should check the length before you parse, e.g. you need 8 bytes for a long, but there might be only 7 in the buffer.
c
For the MC server implementation I get the length of the packet first and then read a bytebuffer of that size
So there is a guarantee in that protocol
As for the Client based reading, it does expect a certain amount of bytes per each read
t
I'm just looking at the readLong() in Client.kt. It looks like it calls the socket's read() and then goes to parse 8 bytes out of the bytebuffer.
c
Ah, there is a guarantee, otherwise it'll throw a BufferUnderflowException
t
right... but that's kinda bad. If you didn't get 8 bytes, you should read again until you get 8 (assuming you're expecting a long on the wire).
c
Oh wait it's actually a ReadTimeout if they don't provide up to the limit of the buffer
So the client has 30 seconds to send the full amount of data
t
also, if there are any other fields after the 8 byte long, they are lost here.
c
And if the client has a readtimeout it disconnects completely
t
right... but you're assuming that if the client wrote 8 bytes, your read will receive 8 bytes in one read() call... which isn't true. That's not how TCP works.
most of the time it's true... but not all the time.
c
Actually it'll wait until all the bytes are sent so it's not exactly "one read call"
t
and it's pretty much guaranteed to happen if you have TCP back-pressure happening.
c
What's that?
t
so TCP is a stream... it doesn't honor packetization. It just guarantees the bytes arrive in the same order, but not in the same packets.
c
Is there a test I could run to simulate the issue you're describing?
t
to take Nagle for example: a TCP stack will wait a while before sending data (waiting for an ACK); when it does send the data, it sends all it has. So if the app called write() 3 times while that was happening, you get one packet, and one read() on the other side with all 3 writes in it.
so when backpressure kicks in, it means the reader can't keep up w/ the writer, which means your data is queueing in the stack, and the writer's calls to write() start to block.
so if your read buffer's size isn't an even multiple of the size of the kernel's TCP buffers, you're guaranteed to have a partial "message" (Long) at some point.
c
So is there a test I could do?
t
yeah, there is an easy way to test. Have a writer sending lots of small messages, then have a reader that calls sleep() between each read... a slow reader. You'll see the writer's write() start to block, and the reader will be getting full buffers every time it calls read()
in general, you should make sure there are enough bytes available in the buffer before you try and parse anything. If there aren't, you need to call compact() and read() again until there are.
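A sketch of that check-then-compact loop (using a blocking ReadableByteChannel just to keep the example short; the buffer is assumed to be in read mode between parses):

```kotlin
import java.io.EOFException
import java.nio.ByteBuffer
import java.nio.channels.ReadableByteChannel

// Keep reading until the buffer holds at least `needed` unparsed bytes.
fun ensureAvailable(channel: ReadableByteChannel, buffer: ByteBuffer, needed: Int) {
    while (buffer.remaining() < needed) {
        buffer.compact()                       // keep the partial message, switch to write mode
        if (channel.read(buffer) == -1) throw EOFException("stream ended mid-message")
        buffer.flip()                          // back to read mode for parsing
    }
}

// e.g. make sure all 8 bytes of a Long are there before parsing it:
// ensureAvailable(channel, buffer, Long.SIZE_BYTES); val value = buffer.getLong()
```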
c
How long of a delay do you reckon?
t
so if you get 4 bytes of your Long in one read() call, the next 4 should be the first 4 bytes of the next read() call. So make sure you hold on to the first 4.
Delay for the test? If the writer is writing open loop (as fast as it can), not much. It'll back-pressure quickly. 10 millis should do it.
c
Ok
t
for these kinds of things I write a message that has a 2-byte count, then that count of random ints. I make sure the reader and writer use the same seed in the random # generator, and parse all the ints and make sure they match the next int out of the random # generator
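Not that exact harness, but a sketch of the message format described (a 2-byte count followed by that many seeded-random ints); pair a writer sending these as fast as it can with a reader that sleeps ~10 ms between reads, per the suggestion above:

```kotlin
import java.nio.ByteBuffer
import java.util.Random

// Writer side: build one message from a seeded Random.
fun nextMessage(rng: Random, count: Int): ByteBuffer {
    val buffer = ByteBuffer.allocate(2 + count * Int.SIZE_BYTES)
    buffer.putShort(count.toShort())
    repeat(count) { buffer.putInt(rng.nextInt()) }
    buffer.flip() // ready to hand to write()
    return buffer
}

// Reader side: a Random seeded identically predicts every int, so any dropped or
// misaligned bytes show up as a mismatch.
fun verifyMessage(rng: Random, message: ByteBuffer) {
    val count = message.getShort().toInt()
    repeat(count) {
        val expected = rng.nextInt()
        val actual = message.getInt()
        check(actual == expected) { "stream corrupted: expected $expected, got $actual" }
    }
}
```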
c
Is this good enough?
t
but I think you have another problem as well (or I'm missing something else). If you call readLong() and the read() call returns 16 bytes, what happens to the 8 bytes after the long is parsed?
c
readLong reads 8 bytes only
The rest will stay waiting in the channel
t
ahhh.. I see. You set the limit() to the size.
c
Indeed
That test results in a timeout btw
Which makes sense
t
why would it time out?
oh, cuz you have a timeout when you call read().
c
Because eventually the delay between reading/writing is 30 seconds
t
I guess that'll work, but you're assuming a lot about the network and the protocol. There are a bunch of protocols you couldn't implement with this.
c
Idk, I bet I could figure them all out
t
there are protocols that have active TCP sessions that don't send anything for hours.
like telnet or ssh
c
Oh, as the api advances I'll add the option to change the timeout
t
yeah, and then you'll run into the partial read problem.
c
Eh I'll keep a mental note of this convo and if it does I'll think of something
t
if you take the timeout, your slow reader test will fail at some point.... but it's a valid TCP session with valid flow-control
on another note: calling read() is expensive, even if it's just going to return queued data. It's much more "normal" and fast to read as many bytes as you can when you call read(), and parse fields out of the buffer.
c
Which is what I do in the actual protocol
But the API also supports reading individual values from the client
I'm adding these extensions to Netlius to try to get people to use bytebuffers more often
But if you're using a compressed protocol, you'd be forced to use them at some point anyways, and I'll have examples for such as well
t
You should look at OpenHFT's Chronicle-Bytes. It's a really good wrapper around ByteBuffers
and does off-heap memory, and buffers larger than 2 gig, etc.
c
Heap memory isn't good for networking from what I heard. And I kinda want to avoid wrappers but if I could use extension functions instead I'll consider it
t
it's got a zillion writers / parsers on it. Stop-bit encoding and the works.
c
Definitely looks interesting
t
yeah. it does on heap or off heap. whatever you want.
c
Thank you for telling me about it, making a comment to look more into it
t
I like it better than Netty's buffer impl
c
Yeah, Netty is pretty damn fast tho somehow
I wish Kotlinx IO was finished :C
t
control of the threading model....
context switches... brutal...
c
Oof
Fibers would probably fix that in the future tho
t
that's basically what I did w/ mine. It's all coroutines, but only one thread.
it isn't very good wide (10K sockets), but it's fast when shallow.
c
Hm alright
t
all the really go-fast stuff tries to keep the # of threads <= the number of cores, and queues work until a thread is ready.
c
I'm probably gonna use some JNI in the future and make my own EPOLL wrapper, maybe that would help with such
t
that's what netty and those frameworks do. I've done it in the past and didn't see a huge boost over Java's epoll impl.
c
Interesting, I've never done JNI yet tho, so maybe it'll be fun regardless
t
except for UDP receive(). It's much faster to do your own and not new a SocketAddress() for every packet.
c
It would be sick if I could use Kotlin/Native in JVM as well
Then again Kotlin/Native is slow af atm
Oh?
Noted
t
but that's only useful if you're trying to do that zero-garbage-collection stuff, or at least that's where it's very useful.
c
Hm alright
t
there are a bunch of high performance impls that do stack bypass. The NIC just DMA's directly into main memory, and the rest is up to you
c
Fun
t
OpenOnload is one of the better known ones. https://www.openonload.org/
but yeah, a TCP stack in usermode. Party on the bits.
c
Bookmarked