I want to use the CLIP model to extract feature embeddings from images, cluster those embeddings, and then map each cluster to an English description. I understand this to be within the capabilities of the CLIP model, but I'm an ML beginner, despite being a seasoned mobile app dev.
Can anyone give me a sanity check on whether it's feasible to drive this kind of model and pipeline from KotlinDL?
...or maybe this would be better suited to the TensorFlow Java library, albeit driven from Kotlin.
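For context, here's a rough sketch of the pipeline I have in mind, in Python/NumPy only because that's where I prototyped the idea (random vectors stand in for real CLIP image/text embeddings, and the k-means is a toy implementation, so nothing here is CLIP- or Kotlin-specific):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for CLIP image-encoder outputs (CLIP ViT-B/32 emits 512-dim
# vectors); in the real pipeline these would come from the model.
image_embeddings = rng.normal(size=(100, 512))
image_embeddings /= np.linalg.norm(image_embeddings, axis=1, keepdims=True)

# Step 1: cluster the embeddings (tiny k-means, for illustration only).
def kmeans(x, k, iters=20):
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center by squared distance.
        labels = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(axis=0)
    return labels, centers

labels, centers = kmeans(image_embeddings, k=5)

# Step 2: map each cluster to an English description by cosine similarity
# between the cluster centroid and text embeddings of candidate phrases.
# These text embeddings are also faked; really they'd come from CLIP's
# text encoder run over prompts like "a photo of a dog".
candidate_descriptions = ["a dog", "a cat", "a car", "a tree", "a house"]
text_embeddings = rng.normal(size=(len(candidate_descriptions), 512))
text_embeddings /= np.linalg.norm(text_embeddings, axis=1, keepdims=True)

centers_norm = centers / np.linalg.norm(centers, axis=1, keepdims=True)
best = (centers_norm @ text_embeddings.T).argmax(axis=1)
for j, idx in enumerate(best):
    print(f"cluster {j} -> {candidate_descriptions[idx]}")
```

My question is really whether steps like these (tensor ops, nearest-neighbor/similarity math, and running the encoder itself) are practical from KotlinDL, or whether I'd end up dropping to TensorFlow Java anyway.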