In a surprising move, Apple, in collaboration with Columbia University, recently introduced Ferret, an open-source multimodal large language model designed to excel at referring and grounding tasks. Departing from its usual closed-door approach, the company quietly released Ferret on GitHub, signaling a commitment to advancing multimodal artificial intelligence (AI) and inviting collaboration from the development community.
The code is available in the project’s GitHub repository.
Ferret can refer to image regions of various free-form shapes and automatically grounds any text that the model identifies as groundable. To train it, the researchers used the GRIT dataset, which comprises 1.1 million samples enriched with hierarchical spatial knowledge, along with 95,000 hard negative examples that improve model robustness.
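To make the two task types concrete, here is a minimal Python sketch of referring (a region goes in, text comes out) and grounding (text goes in, regions come out). It is an illustration only, not Ferret's actual interface: the `Box` class, the `<region>` placeholder, the helper names, and the bracketed answer format are all assumptions made for this example. The sketch also handles only rectangular boxes, so free-form shapes are out of its scope.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixels: top-left (x1, y1), bottom-right (x2, y2)."""
    x1: int
    y1: int
    x2: int
    y2: int


def referring_prompt(question: str, region: Box) -> str:
    """Referring: write the region into the question as plain-text coordinates.

    The "<region>" placeholder is a hypothetical convention for this sketch.
    """
    coords = f"[{region.x1}, {region.y1}, {region.x2}, {region.y2}]"
    return question.replace("<region>", coords)


# Hypothetical answer format: the run of words directly before
# "[x1, y1, x2, y2]" names the object inside that box.
_GROUNDED = re.compile(r"(\w+(?: \w+)*)\s*\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]")


def parse_grounding(answer: str) -> list[tuple[str, Box]]:
    """Grounding: pair each box in the answer with the words preceding it."""
    return [
        (m.group(1), Box(*(int(m.group(i)) for i in range(2, 6))))
        for m in _GROUNDED.finditer(answer)
    ]


prompt = referring_prompt("What is the object in <region>?", Box(120, 340, 560, 780))
pairs = parse_grounding("dog [120, 340, 560, 780], ball [600, 700, 680, 770]")
print(prompt)
print(pairs)
```

In this toy format, a referring query carries coordinates in its text, while a grounded answer attaches a box to each object it mentions, so downstream code can highlight the matching region of the image.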
The resulting model demonstrated superior performance in classical referring and grounding tasks, surpassing existing multimodal large language models (MLLMs) in region-based and localization-demanding multimodal interactions. The researchers noted significant improvements in describing image details and a notable reduction in object hallucinations. However, they acknowledged the potential for Ferret, like other MLLMs, to produce harmful or counterfactual responses.
To extend Ferret’s capabilities, the researchers, citing LISA as inspiration, outlined plans to enable the model to output segmentation masks in addition to bounding boxes. This development aligns with Apple’s strategic goal of staying at the forefront of multimodal AI technologies.
Ferret’s core functionality is simple but powerful: the model identifies elements within an image and relates them to their context in order to answer user queries. This versatility opens up applications such as image search, accessibility features, and other scenarios that require nuanced contextual understanding.
Equally noteworthy is how Ferret was released: Apple departed from its conventional closed-door strategy in favor of open-source development. This creates opportunities for collaboration and community-driven advances in multimodal AI, and the low-key debut underscores Apple’s intent to push the boundaries of AI capabilities while sharing development with the wider community.