# ai
u
When one chats with an LLM and the response is received word by word, how is that delivered physically to an app? Is it a websocket or something different?
r
Server-sent events is how the OpenAI chat completion and assistant streaming works over HTTP, for example.
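For illustration, a minimal Kotlin sketch of what that looks like on the wire, assuming the plain JDK HttpClient, an OPENAI_API_KEY env var, and an example model name (not production code): with `"stream": true` the response body is an SSE stream you can read line by line.

```kotlin
// Minimal sketch, assuming the plain JDK HttpClient, an OPENAI_API_KEY env var,
// and an example model name; not production code.
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

fun main() {
    val apiKey = System.getenv("OPENAI_API_KEY")
    val body = """
        {"model": "gpt-4o-mini",
         "stream": true,
         "messages": [{"role": "user", "content": "Say hello"}]}
    """.trimIndent()

    val request = HttpRequest.newBuilder(URI.create("https://api.openai.com/v1/chat/completions"))
        .header("Authorization", "Bearer $apiKey")
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()

    // BodyHandlers.ofLines() exposes the SSE response as lines while they arrive.
    val response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofLines())

    // Each SSE line looks like `data: {...json delta...}`; `data: [DONE]` ends the reply.
    response.body()
        .filter { it.startsWith("data: ") }
        .map { it.removePrefix("data: ") }
        .takeWhile { it != "[DONE]" }
        .forEach { println(it) }
}
```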
u
Any chance you know, is it better than websockets? Or is a websocket not good for the use case, since only the AI response is streamed to the human?
r
Only the AI response is streamed (also tools used and other steps). Depending on whether you are using chat completions or something like the Assistants API, you get function calls streamed, or you get the events from a run on a thread streamed. The transport is always server-sent events that look like this: https://platform.openai.com/docs/api-reference/assistants-streaming/events
I've encapsulated this streaming use case when calling OpenAI in Xef. In case it helps, here is an example of how we do streaming from LLMs: https://github.com/xebia-functional/xef/blob/main/examples/src/main/kotlin/com/xebia/functional/xef/dsl/chat/Streams.kt There you just handle a Flow from Kotlin.
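As a rough sketch of the consumer side (here `streamCompletion` is a hypothetical stand-in for whatever streaming call your client library exposes; in Xef you get such a Flow back from the LLM call), collecting a Flow of chunks looks like this:

```kotlin
// Sketch only: streamCompletion is a hypothetical stand-in for the streaming call
// your client library exposes; a real one would wrap the SSE connection.
import kotlinx.coroutines.flow.*
import kotlinx.coroutines.runBlocking

// Fake source of streamed chunks, standing in for the real streaming call.
fun streamCompletion(prompt: String): Flow<String> = flow {
    listOf("Hel", "lo ", "Wor", "ld").forEach { emit(it) }
}

fun main() = runBlocking {
    // Each chunk is printed as soon as it is emitted; then the flow completes.
    streamCompletion("Say hello").collect { chunk -> print(chunk) }
    println()
}
```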
u
I see, I'm only familiar with websockets from a chat app I worked on.. so the difference is they don't really care about the real-time-ness of human chats, right? Although one could use websockets for this.. I mean in chat apps responses aren't streamed anyway (WhatsApp)
r
I guess websockets could play a role if you had more than two humans in the same chat and everyone's typing was real time, also the AI, but currently talking to LLMs is Request/Response or Request/Streaming Response for all use cases I've seen.
u
Oh so the EventSource instance is meant for a single reply - not for the whole convo?
r
The stream is meant for a run of a thread (multiple events and decisions being streamed in the Assistants API) or a single call to the chat completion API, which streams back its response as the LLM produces it, including generating the JSON for function calls if needed.
u
thread = conversation?
r
The whole conversation is at least one HTTP request per message you send and response you get back
Yes, OpenAI calls conversations threads.
u
I see, so it's like a cold Flow<String> per query - it emits all words of the response and completes
r
yes, but the example I sent is for the Chat Completions API with function call support. If you call the Chat Completions API you have to manage conversation state (send all previous messages each time), context, tools, message truncation, etc...
The chat completion API is stateless and does not manage anything for you.
https://platform.openai.com/docs/api-reference/chat Chat completions is more used than assistants because it is also more portable to other non-OpenAI models
u
what's the chat completion API.. it's the conversation with the LLM right?
r
LLMs only return tokens as characters and strings. The Chat Completions API wraps that to give you a chat-like interface where you instead send a list of messages with roles (user, assistant, etc.) and it replies with the next message (see the sketch below).
But it does not manage message history or limits
Every LLM has a limit, in the context window, on the amount of tokens it can ingest and reply with. Chat completion does not manage any of that, and it's what you get from most LLMs
The Assistants API in OpenAI is built on top of all that to manage the conversation and memory for you
But most LLMs can't do assistants; that's why you get frameworks like LangChain or AutoGen, which wrap patterns around the chat completion and completions APIs
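To make the "list of messages with roles" shape concrete, here is a rough sketch (no real API call, the reply is hard-coded) of the state the caller has to keep themselves:

```kotlin
// Sketch, no real API call: the "list of messages with roles" that the caller
// has to keep and resend, because the Chat Completions API itself is stateless.
data class Message(val role: String, val content: String)

fun main() {
    val history = mutableListOf(
        Message("system", "You are a helpful assistant."),
        Message("user", "How are you?")
    )

    // Imagine `reply` is the assistant message that came back from the last call.
    val reply = Message("assistant", "Thank you, I am well.")

    // The API remembers nothing, so we append the reply ourselves and send the
    // whole list again together with the next user message.
    history += reply
    history += Message("user", "Great, can you summarise our chat?")
    history.forEach { println("${it.role}: ${it.content}") }
}
```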
u
I see, so it emits the next token but it's my job to keep a store of the tokens
r
And if you had previous messages and want the next call to reply in context, you need to send all previous messages on each call, truncating them so as not to exceed the context window
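A rough sketch of what that truncation could look like (the token count here is a crude character-based approximation; a real implementation would use the model's tokenizer):

```kotlin
// Sketch of crude history truncation so a request stays under the context window.
// Real token counting needs the model's tokenizer; characters / 4 is only a rough proxy.
data class Message(val role: String, val content: String)

fun approxTokens(m: Message): Int = m.content.length / 4 + 4

fun truncate(history: List<Message>, maxTokens: Int): List<Message> {
    val kept = ArrayDeque<Message>()
    var budget = maxTokens
    // Walk backwards so the most recent messages survive and the oldest drop first.
    for (message in history.asReversed()) {
        val cost = approxTokens(message)
        if (cost > budget) break
        kept.addFirst(message)
        budget -= cost
    }
    return kept.toList()
}

fun main() {
    val history = listOf(
        Message("user", "First question..."),
        Message("assistant", "First answer..."),
        Message("user", "Second question...")
    )
    println(truncate(history, maxTokens = 20)) // the oldest message gets dropped
}
```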
u
but my question was mostly: a single server-sent event stream is only meant to stream the words of what would otherwise be a single reply, not a single socket opened for the whole duration of the convo?
r
Every message delta contains a chunk of the total message; you get one event per chunk. For example the response:
Hello World
May have been streamed in two or more events
you keep a long-lived connection to the HTTP server, which sends back an event stream of JSON
every object has { event: "...", data: "..." }, or just the data field if all the types streamed are of the same type
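For illustration, a sketch of what those raw frames look like and how the event/data lines pair up (the payloads here are simplified, not literal API output):

```kotlin
// Sketch of the raw SSE frames and how event/data lines pair up.
// The payloads are simplified examples, not literal API output.
fun main() {
    val rawStream = """
        event: thread.message.delta
        data: {"delta": {"content": "Hello"}}

        event: thread.message.delta
        data: {"delta": {"content": " World"}}

        event: done
        data: [DONE]
    """.trimIndent()

    // Frames are separated by blank lines; each frame has an optional `event:` line
    // plus a `data:` line carrying the JSON chunk.
    rawStream.split("\n\n").forEach { frame ->
        val event = frame.lineSequence().firstOrNull { it.startsWith("event:") }?.removePrefix("event:")?.trim()
        val data = frame.lineSequence().firstOrNull { it.startsWith("data:") }?.removePrefix("data:")?.trim()
        println("event=$event  data=$data")
    }
}
```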
u
but the stream lifetime is a single reply or the whole conversation?
r
a single reply
u
i.e. I ask "How are you", the AI then streams "thank" "you" "i am" "well", and then the stream closes? or is the pipe kept open
r
it closes, and the connection to the server is also closed until you call again
every user message triggers one stream for the assistant response
u
yea gotcha, so it's basically just a UX thing, to not wait so long for the whole thing
would you use server-sent events in, say, a Stocks app, over websockets?
r
yes, it's all about perceived latency, nobody wants to wait 30 seconds for a long message to be produced, they'd rather see it update as it arrives
u
yea makes sense for the use case. I'm now trying to figure out where I should have used server-sent events over websockets. Probably all the time when there's only a single server-to-human session going on, right - like in Stocks? (prices updated in real time)
r
> would you use server-sent events in, say, a Stocks app, over websockets?
Not sure; all I know is that LLM interaction with the providers is server-sent events, but you can probably combine it with websockets if you need real-time updates from other places like stock values, or even simple polling if the demand for real time is low.
u
well, WS now sounds like it's for when 2 clients are communicating in real time; stocks would not fit that anymore I think
anyways, thank you very much!