ImageBind
ImageBind learns a joint embedding across six modalities: images, text, audio, depth, thermal, and IMU data. It enables novel emergent applications out of the box, including cross-modal retrieval, composing modalities with arithmetic, and cross-modal detection and generation.
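Because all modalities land in one embedding space, "composing modalities with arithmetic" amounts to adding normalized embedding vectors. A minimal sketch with toy 3-d vectors (real ImageBind embeddings are much higher-dimensional; the helper names are ours, not part of any API):

```python
import math

def normalize(v):
    # Scale a vector to unit length.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def compose(a, b):
    # Add two normalized embeddings element-wise, then renormalize,
    # mimicking ImageBind's embedding-arithmetic demos.
    return normalize([x + y for x, y in zip(normalize(a), normalize(b))])

# Toy stand-ins for an image embedding and an audio embedding.
image_emb = [1.0, 0.0, 0.0]
audio_emb = [0.0, 1.0, 0.0]
combined = compose(image_emb, audio_emb)
```

The combined vector can then be used like any other embedding, e.g. as a query for retrieval.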
We provide a mosec inference service template for it.
- CLI

  ```shell
  # The CLI doesn't support multi-word inputs yet
  export MODELZ_API_KEY=mzi-abcdefg...
  modelz inference --deployment imagebind-XXX --serde msgpack model=imagebind-text input="A dog"
  ```
- Python Client

  ```python
  import modelz

  API_KEY = "mzi-abcdefg..."
  cli = modelz.ModelzClient(deployment="imagebind-XXX", key=API_KEY)
  data = {"model": "imagebind-text", "input": ["A dog", "doggery", "puppy"]}
  resp = cli.inference(params=data, serde="msgpack")
  embeddings = resp["data"]
  for emb in embeddings:
      print(emb["embedding"])
  ```
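Once you have embeddings back, cross-modal retrieval reduces to nearest-neighbor search in the joint space. A minimal sketch using toy vectors in place of real ImageBind embeddings (the corpus, query, and helper functions are illustrative, not part of the modelz client):

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query, corpus):
    # Rank (name, embedding) pairs by similarity to the query.
    return sorted(corpus, key=lambda item: cosine(query, item[1]), reverse=True)

# Toy image embeddings you might have stored earlier.
corpus = [
    ("dog.jpg", [0.9, 0.1, 0.0]),
    ("cat.jpg", [0.1, 0.9, 0.0]),
]
# Toy embedding standing in for the text "A dog".
query = [0.8, 0.2, 0.0]
best, _ = retrieve(query, corpus)[0]
```

In practice the corpus would hold embeddings returned by the service for images or audio, and the query would be a text embedding like those printed above.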
The template also supports image, audio, and video embeddings. Please refer to our example for more embedding modalities, or to the type hints for the modelz Request and Response.