This Startup is Building an ETL for LLMs


Since the ChatGPT and LLM boom began, demand for clean, structured data has never been higher. To meet that demand, Unstructured, a San Francisco, CA-based provider of tools to ingest and preprocess data for large language models (LLMs), is building an ETL-like platform for LLMs that focuses on data extraction and transformation. The company is on a mission to make enterprise data LLM-ready.

Recent funding of $40 million

Unstructured recently secured $40 million in Series B funding. The round was led by Menlo Ventures, with participation from Databricks Ventures, IBM Ventures, NVentures (NVIDIA’s venture capital arm), and Vivek Ranadivé, Chairman of the Sacramento Kings. Individual contributors included Chet Kapoor, CEO of Datastax, and Allison Pickens of the New Normal Fund. Existing investors Madrona, Bain Capital Ventures (BCV), and Mango Capital also joined. Tim Tully of Menlo Ventures has joined the board of directors as part of this investment. The round brings the total capital raised by Unstructured to $65 million.

The infusion of capital is earmarked for expanding the company’s team and expediting the development of data preprocessing tools tailored for LLMs.

Led by CEO and founder Brian Raymond, Unstructured specializes in preprocessing data for LLMs, enabling organizations to convert their unstructured data into formats that large language models can use. By automating the conversion of natural language data commonly found in PDFs, PPTX, HTML files, and more, the company helps enterprises put more of their data to work.

What Does Unstructured Do, and How Does It Work?

A staggering 80% of enterprise data exists in formats that are notoriously difficult to utilize effectively, ranging from HTML and PDF to CSV, PNG, and PPTX. Unstructured has taken up the challenge of effortlessly extracting and transforming such complex data, ensuring compatibility with every major vector database and LLM framework.

“We connect enterprise data to LLMs, no matter the source,” says the team at Unstructured. Their enterprise-grade connectors are designed to capture data from diverse sources, enabling seamless transformation into AI-friendly JSON files. This capability is a game-changer for companies eager to integrate AI into their operations, as Unstructured delivers meticulously curated data that is free of artifacts and, most importantly, LLM-ready.
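To make the idea of "AI-friendly JSON" concrete, here is a minimal sketch that turns a small HTML snippet into a list of typed JSON elements using only the Python standard library. It is not Unstructured's implementation or API. The `TAG_TO_TYPE` mapping, the `html_to_elements` helper, and the element type names are all assumptions chosen for illustration.

```python
import json
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to element types (illustrative only,
# not Unstructured's actual schema).
TAG_TO_TYPE = {"h1": "Title", "h2": "Title", "p": "NarrativeText", "li": "ListItem"}


class ElementExtractor(HTMLParser):
    """Collects the text of selected tags as typed elements."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._current_type = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Start buffering text when a tag we care about opens.
        if tag in TAG_TO_TYPE:
            self._current_type = TAG_TO_TYPE[tag]
            self._buffer = []

    def handle_data(self, data):
        if self._current_type is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Emit an element when the matching kind of tag closes.
        if self._current_type is not None and TAG_TO_TYPE.get(tag) == self._current_type:
            text = " ".join("".join(self._buffer).split())  # collapse whitespace
            if text:
                self.elements.append({"type": self._current_type, "text": text})
            self._current_type = None


def html_to_elements(html, source):
    """Return a list of JSON-serializable elements with source metadata."""
    parser = ElementExtractor()
    parser.feed(html)
    parser.close()
    return [dict(e, metadata={"source": source}) for e in parser.elements]


sample = "<h1>Q3 Report</h1><p>Revenue grew   12%.</p><ul><li>EMEA up</li></ul>"
elements = html_to_elements(sample, "report.html")
print(json.dumps(elements, indent=2))
```

Each element carries its text, a type label, and a pointer back to its source file, which is the kind of uniform record a vector database or LLM framework can ingest regardless of the original file format.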

Unstructured boasts the ability to handle any document, file type, or layout, emphasizing the critical role clean, curated data plays in maximizing the potential of large language models. By streamlining the process of data preprocessing, Unstructured empowers data scientists to dedicate more time to modeling and analysis, rather than getting bogged down by the tedious task of cleaning and formatting data.

With an impressive track record that includes over 6,000,000 downloads, adoption by more than 50,000 companies, and multiple government contracts, Unstructured has quickly emerged as the tool of choice for a growing community of data scientists and engineers.

Products Tailored for Success:

  1. Unstructured Platform: Unstructured’s flagship offering efficiently extracts and transforms data into clean, consistent JSON format, primed for integration into vector databases. This ensures that LLMs are always up-to-date and capable of understanding organizational data seamlessly.
  2. API (SaaS & Marketplace): Unstructured offers APIs for single-batch, production-grade document preprocessing, eliminating the need for custom code. Whether hosted by Unstructured or in customer AWS or Azure VPCs, these APIs simplify the data transformation process.
  3. Platform (Paid – Beta): Designed for enterprises and high-growth companies with substantial data volumes, this platform automates data retrieval, transformation, and staging for LLMs, promising unparalleled efficiency.

A Focus on Clean Data:

At the heart of Unstructured’s mission lies a commitment to delivering clean, model-ready data. Their innovative element classification system ensures optimal performance with minimal hallucinations, setting a new standard for LLM accuracy and reliability.
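One way to see why labeled elements matter: once each piece of a document has a type, page furniture such as headers and footers can be dropped before the text reaches a model. The sketch below illustrates that idea with hypothetical type names and helper functions (`DROP_TYPES`, `clean_elements`, `chunk_by_title`). It is an assumption about how such labels could be used, not a description of Unstructured's classifier.

```python
# Hypothetical element types treated as noise (illustrative labels only).
DROP_TYPES = {"Header", "Footer", "PageNumber"}


def clean_elements(elements):
    """Drop noisy element types and elements with empty text."""
    return [e for e in elements if e["type"] not in DROP_TYPES and e["text"].strip()]


def chunk_by_title(elements):
    """Group element text into chunks, starting a new chunk at each Title."""
    chunks, current = [], []
    for e in elements:
        if e["type"] == "Title" and current:
            chunks.append("\n".join(current))
            current = []
        current.append(e["text"])
    if current:
        chunks.append("\n".join(current))
    return chunks


elements = [
    {"type": "Header", "text": "ACME Corp - Confidential"},
    {"type": "Title", "text": "Overview"},
    {"type": "NarrativeText", "text": "ACME sells anvils."},
    {"type": "PageNumber", "text": "1"},
    {"type": "Title", "text": "Pricing"},
    {"type": "NarrativeText", "text": "Anvils cost $40."},
    {"type": "Footer", "text": "Page 1 of 9"},
]

chunks = chunk_by_title(clean_elements(elements))
for chunk in chunks:
    print(chunk)
    print("---")
```

Splitting at titles keeps each chunk on a single topic, so a retrieval step is less likely to hand the model a passage that mixes unrelated sections or repeated page footers.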

Since its founding in 2022, Unstructured has focused on productizing enterprise LLMs, helping organizations automate the conversion of disorganized, unstructured data into the formats needed for retrieval-augmented generation (RAG) and LLM fine-tuning. Its open-source library has been downloaded over 6 million times and is used in more than 12,000 code bases. Moreover, over 45,000 organizations, including a significant portion of the Fortune 500, rely on Unstructured to preprocess their proprietary data.

Unstructured is enhancing the way enterprises harness the power of data for AI applications. By simplifying the data preprocessing pipeline and prioritizing data cleanliness, they’re paving the way for a future where LLMs can unlock unprecedented insights with ease. With Unstructured, your journey to LLM success begins with clean, curated data – the foundation of AI excellence.
