# ai
u
When one chats with an LLM and the response is received word by word, how is that delivered physically to an app? Is it a websocket or something different?
r
Server-sent events is how the OpenAI chat completion and assistant streaming works over HTTP, for example.
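For illustration, a minimal Kotlin sketch of what that looks like on the wire, assuming the plain JDK HttpClient, an OPENAI_API_KEY env var, and an example model name (not production code): with `"stream": true` the response body is an SSE stream you can read line by line.

```kotlin
// Minimal sketch, assuming the plain JDK HttpClient, an OPENAI_API_KEY env var,
// and an example model name; not production code.
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

fun main() {
    val apiKey = System.getenv("OPENAI_API_KEY")
    val body = """
        {"model": "gpt-4o-mini",
         "stream": true,
         "messages": [{"role": "user", "content": "Say hello"}]}
    """.trimIndent()

    val request = HttpRequest.newBuilder(URI.create("https://api.openai.com/v1/chat/completions"))
        .header("Authorization", "Bearer $apiKey")
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()

    // BodyHandlers.ofLines() exposes the SSE response as lines while they arrive.
    val response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofLines())

    // Each SSE line looks like `data: {...json delta...}`; `data: [DONE]` ends the reply.
    response.body()
        .filter { it.startsWith("data: ") }
        .map { it.removePrefix("data: ") }
        .takeWhile { it != "[DONE]" }
        .forEach { println(it) }
}
```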
u
Any chance you know, is it better than websockets? Or is a websocket not good for the use case, since only the AI response is streamed to the human?
r
Only the AI response is streamed (also tools used and other steps). Depending on whether you are using chat completions or something like the Assistants API, you get function calls streamed, or you get the events from a run on a thread streamed. The transport is always server-sent events that look like this: https://platform.openai.com/docs/api-reference/assistants-streaming/events
I've encapsulated this streaming use case when calling OpenAI in Xef. In case it helps, here is an example of how we do streaming from LLMs: https://github.com/xebia-functional/xef/blob/main/examples/src/main/kotlin/com/xebia/functional/xef/dsl/chat/Streams.kt There you just handle a Flow from Kotlin.
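As a rough sketch of the consumer side (here `streamCompletion` is a hypothetical stand-in for whatever streaming call your client library exposes; in Xef you get such a Flow back from the LLM call), collecting a Flow of chunks looks like this:

```kotlin
// Sketch only: streamCompletion is a hypothetical stand-in for the streaming call
// your client library exposes; a real one would wrap the SSE connection.
import kotlinx.coroutines.flow.*
import kotlinx.coroutines.runBlocking

// Fake source of streamed chunks, standing in for the real streaming call.
fun streamCompletion(prompt: String): Flow<String> = flow {
    listOf("Hel", "lo ", "Wor", "ld").forEach { emit(it) }
}

fun main() = runBlocking {
    // Each chunk is printed as soon as it is emitted; then the flow completes.
    streamCompletion("Say hello").collect { chunk -> print(chunk) }
    println()
}
```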
u
I see, I'm only familiar with websockets from a chat app I worked on.. so the difference is they don't really care about the real-time-ness of human chats, right? Although one could use websockets for this.. I mean in chat apps responses aren't streamed anyway (WhatsApp)
r
I guess websockets could play a role if you had more than two humans in the same chat and everyone's typing was real time, also the AI, but currently talking to LLMs is Request/Response or Request/Streaming Response for all use cases I've seen.
u
Oh so the EventSource instance is meant for a single reply - not for the whole convo?
r
The stream is meant for a run of a thread (multiple events and decisions being streamed in the Assistants API) or a single call to the chat completion API, which streams back its response as the LLM produces it, including generating the JSON for function calls if needed.
u
thread = conversation?
r
The whole conversation is at least one HTTP request per message you send and response you get back
Yes, OpenAI calls conversations threads.
u
I see, so it's like a cold Flow<String> per query - it emits all words of the response and completes
r
yes, but the example I sent is for the Chat Completions API with function call support. If you call the Chat Completions API you have to manage conversation state (send all previous messages each time), context, tools, message truncation, etc...
The chat completion API is stateless and does not manage anything for you.
https://platform.openai.com/docs/api-reference/chat Chat completions is more used than assistants because it is also more portable to other non-OpenAI models
u
what's the chat completion API.. it's the conversation with the LLM right?
r
LLMs only return tokens as characters and strings. The Chat Completions API wraps that to give you a chat-like interface where you instead send a list of messages with roles (user, assistant, etc.) and it replies with the next message (see the sketch below).
But it does not manage message history or limits
Every LLM has a limit, in the context window, on the amount of tokens it can ingest and reply with. Chat completion does not manage any of that, and it's what you get from most LLMs
The Assistants API in OpenAI is built on top of all that to manage the conversation and memory for you
But most LLMs can't do assistants; that's why you get frameworks like LangChain or AutoGen, which wrap patterns around the chat completion and completions APIs
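To make the "list of messages with roles" shape concrete, here is a rough sketch (no real API call, the reply is hard-coded) of the state the caller has to keep themselves:

```kotlin
// Sketch, no real API call: the "list of messages with roles" that the caller
// has to keep and resend, because the Chat Completions API itself is stateless.
data class Message(val role: String, val content: String)

fun main() {
    val history = mutableListOf(
        Message("system", "You are a helpful assistant."),
        Message("user", "How are you?")
    )

    // Imagine `reply` is the assistant message that came back from the last call.
    val reply = Message("assistant", "Thank you, I am well.")

    // The API remembers nothing, so we append the reply ourselves and send the
    // whole list again together with the next user message.
    history += reply
    history += Message("user", "Great, can you summarise our chat?")
    history.forEach { println("${it.role}: ${it.content}") }
}
```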
u
I see, so it emits the next token but it's my job to keep a store of the tokens
r
And if you had previous messages and want the next call to reply in context, you need to send all previous messages on each call, truncating them so as not to exceed the context window
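A rough sketch of what that truncation could look like (the token count here is a crude character-based approximation; a real implementation would use the model's tokenizer):

```kotlin
// Sketch of crude history truncation so a request stays under the context window.
// Real token counting needs the model's tokenizer; characters / 4 is only a rough proxy.
data class Message(val role: String, val content: String)

fun approxTokens(m: Message): Int = m.content.length / 4 + 4

fun truncate(history: List<Message>, maxTokens: Int): List<Message> {
    val kept = ArrayDeque<Message>()
    var budget = maxTokens
    // Walk backwards so the most recent messages survive and the oldest drop first.
    for (message in history.asReversed()) {
        val cost = approxTokens(message)
        if (cost > budget) break
        kept.addFirst(message)
        budget -= cost
    }
    return kept.toList()
}

fun main() {
    val history = listOf(
        Message("user", "First question..."),
        Message("assistant", "First answer..."),
        Message("user", "Second question...")
    )
    println(truncate(history, maxTokens = 20)) // the oldest message gets dropped
}
```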
u
but my question was mostly: a single server-sent event stream is only meant to stream the words of what would otherwise be a single reply, not a single socket opened for the whole duration of the convo?
r
Every message delta contains a chunk of the total message; you get one event per chunk. For example the response:
Hello World
May have been streamed in two or more events
you keep a long-lived connection to the HTTP server, which sends back an event stream of JSON
every object has { event: "...", data: "..." }, or just the data field if all the types streamed are of the same type
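For illustration, a sketch of what those raw frames look like and how the event/data lines pair up (the payloads here are simplified, not literal API output):

```kotlin
// Sketch of the raw SSE frames and how event/data lines pair up.
// The payloads are simplified examples, not literal API output.
fun main() {
    val rawStream = """
        event: thread.message.delta
        data: {"delta": {"content": "Hello"}}

        event: thread.message.delta
        data: {"delta": {"content": " World"}}

        event: done
        data: [DONE]
    """.trimIndent()

    // Frames are separated by blank lines; each frame has an optional `event:` line
    // plus a `data:` line carrying the JSON chunk.
    rawStream.split("\n\n").forEach { frame ->
        val event = frame.lineSequence().firstOrNull { it.startsWith("event:") }?.removePrefix("event:")?.trim()
        val data = frame.lineSequence().firstOrNull { it.startsWith("data:") }?.removePrefix("data:")?.trim()
        println("event=$event  data=$data")
    }
}
```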
u
but the stream lifetime is a single reply or the whole conversation?
r
a single reply
u
i.e. I ask "How are you", the AI then streams "thank" "you" "i am" "well", and then the stream closes? or is the pipe kept open
r
it closes, and the connection to the server is also closed until you call again
every user message triggers one stream for the assistant response
u
yea gotcha, so it's basically just a UX thing, to not wait so long for the whole thing
would you use server-sent events in, say, a Stocks app, over websockets?
r
yes, it's all about perceived latency, nobody wants to wait 30 seconds for a long message to be produced, they'd rather see it update as it arrives
u
yea makes sense for the use case. I'm now trying to figure out where I should have used server-sent events over websockets. Probably all the time when there's only a single server-to-human session going on, right - like in Stocks? (prices updated in real time)
r
> would you use server-sent events in, say, a Stocks app, over websockets?
Not sure; all I know is that LLM interaction with the providers is server-sent events, but you can probably combine it with websockets if you need real-time updates from other places like stock values, or even simple polling if the demand for real time is low.
u
well, WS now sounds like it's for when 2 clients are communicating in real time; stocks would not fit that anymore I think
anyways, thank you very much!