Content

Qwen-VL is the visual multimodal version of Qwen (short for Tongyi Qianwen), the large model series proposed by Alibaba Cloud. It accepts images, text, and bounding boxes as input and outputs text and bounding boxes.

Summary
Alibaba Cloud has introduced Qwen-VL, a visual multimodal version of its Qwen large model series. Qwen-VL accepts images, text, and bounding boxes as input and produces text and bounding boxes as output.