kotlinlang

Hello everyone, I'd like to know how I can split documents for a RAG database like pgvector and how I can generate the chunks for it. I couldn't find this in the documentation.

Hello Renato. You will need an external framework to ingest PDF files, e.g. Apache Tika, which can parse many file types including PDF. That will let you extract the text and images. You can then split the text into appropriately sized chunks and analyse the images with a vision model. Those chunks of text can then be vectorized with an embedding model and fed to your vector database.
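To make the splitting step concrete, here is a minimal sketch of a fixed-size splitter with overlap, in plain Kotlin. The function name `splitIntoChunks` and its `chunkSize` and `overlap` parameters are made up for illustration and are not part of Koog or Apache Tika. It cuts on character counts only; a real splitter would probably prefer sentence or paragraph boundaries.

```kotlin
// Hypothetical helper, not a Koog or Tika API: splits text into
// fixed-size character chunks, where consecutive chunks share
// `overlap` characters so context is not lost at the cut points.
fun splitIntoChunks(text: String, chunkSize: Int = 500, overlap: Int = 50): List<String> {
    require(chunkSize > 0) { "chunkSize must be positive" }
    require(overlap in 0 until chunkSize) { "overlap must be in [0, chunkSize)" }

    val chunks = mutableListOf<String>()
    var start = 0
    while (start < text.length) {
        val end = minOf(start + chunkSize, text.length)
        chunks.add(text.substring(start, end))
        // Stop once the end of the text is reached; otherwise step back
        // by `overlap` so the next chunk repeats the tail of this one.
        if (end == text.length) break
        start = end - overlap
    }
    return chunks
}
```

You would call it on the text extracted by your parser, e.g. `splitIntoChunks(extractedText, chunkSize = 800, overlap = 100)`, and pass each resulting chunk to the embedding model. The sizes are placeholders to tune for your embedding model's input limit.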

Unfortunately, Koog doesn't support the whole ingestion pipeline out of the box. We are still missing an integration with Apache Tika and a text splitter, and we don't have an implementation of the VectorStore interface for Pgvector. Those would be great contributions to Koog, though (starting with the Pgvector VectorStore implementation, for instance), if you feel like it!