Phi3VisionSession
Phi-3 Vision multimodal session for image understanding and text generation. It uses three separate ONNX models: a vision encoder, a text embedding model, and a text decoder. It accepts raw image bytes (JPEG/PNG) and pre-tokenized prompt tokens, split into a prefix segment (before the image placeholder) and a suffix segment (after the image placeholder).
Example

    model_dir := "phi3v-directml/";
    session := Phi3VisionSession->New(
      model_dir + "phi-3-v-128k-instruct-vision.onnx",
      model_dir + "phi-3-v-128k-instruct-text-embedding.onnx",
      model_dir + "model.onnx");
    image_bytes := System.IO.File.FileReader->ReadBinaryFile("photo.jpg");
    prefix := [32010, 29871];
    suffix := [32007, 32001];
    eos := [32000, 32007];
    result := session->Generate(image_bytes, prefix, suffix, 256, 0.0, eos);
    each(token in result->GetTokens()) {
      token->PrintLine();
    };
    session->Close();

Operations
Generate
Generate text tokens from an image and prompt tokens. The prompt is split into prefix tokens (before the image) and suffix tokens (after the image).
method : public : Generate(image_bytes:Byte[], prefix_tokens:Int[], suffix_tokens:Int[], max_tokens:Int, temperature:Float, eos_tokens:Int[]) ~ API.Onnx.Phi3Result
Parameters
| Name | Type | Description |
|---|---|---|
| image_bytes | Byte[] | raw image file bytes (JPEG/PNG) |
| prefix_tokens | Int[] | token IDs before the image placeholder |
| suffix_tokens | Int[] | token IDs after the image placeholder |
| max_tokens | Int | maximum number of tokens to generate |
| temperature | Float | sampling temperature (0.0 for greedy decoding) |
| eos_tokens | Int[] | end-of-sequence token IDs; generation stops when one is produced |
Return
| Type | Description |
|---|---|
| Phi3Result | generation result with output token IDs |
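As a rough sketch of a sampled (non-greedy) call, the program below reuses the calls from the example at the top of this page and changes only the temperature and token limit. Several parts are assumptions rather than documented behavior: the `use API.Onnx;` line (inferred from the `API.Onnx.Phi3Result` return type), the `SampledGenerate` class name, the temperature of 0.7, and the limit of 64 tokens are all illustrative. The model paths, image file, and token IDs are copied from the example above and depend on the model and tokenizer in use.

```objeck
use API.Onnx;  # assumption: bundle name inferred from API.Onnx.Phi3Result

class SampledGenerate {
  function : Main(args : String[]) ~ Nil {
    # model paths copied from the example at the top of this page
    model_dir := "phi3v-directml/";
    session := Phi3VisionSession->New(
      model_dir + "phi-3-v-128k-instruct-vision.onnx",
      model_dir + "phi-3-v-128k-instruct-text-embedding.onnx",
      model_dir + "model.onnx");

    image_bytes := System.IO.File.FileReader->ReadBinaryFile("photo.jpg");

    # illustrative token IDs; real values come from the tokenizer
    prefix := [32010, 29871];
    suffix := [32007, 32001];
    eos := [32000, 32007];

    # 0.7 is an arbitrary non-zero temperature; 0.0 would select greedy decoding
    result := session->Generate(image_bytes, prefix, suffix, 64, 0.7, eos);

    # print each generated token ID on its own line
    each(token in result->GetTokens()) {
      token->PrintLine();
    };

    session->Close();
  }
}
```

With a non-zero temperature the returned token IDs would be expected to vary between runs, so a call like this suits exploratory output more than reproducible tests; passing 0.0 keeps decoding greedy.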
New
Constructor.
New(vision_model:String, embed_model:String, decoder_model:String)
Parameters
| Name | Type | Description |
|---|---|---|
| vision_model | String | path to vision encoder ONNX model |
| embed_model | String | path to text embedding ONNX model |
| decoder_model | String | path to text decoder ONNX model |
New
Constructor with configuration.
New(vision_model:String, embed_model:String, decoder_model:String, config:Map<String,String>)
Parameters
| Name | Type | Description |
|---|---|---|
| vision_model | String | path to vision encoder ONNX model |
| embed_model | String | path to text embedding ONNX model |
| decoder_model | String | path to text decoder ONNX model |
| config | Map<String,String> | session configuration parameters |