We Benchmarked Liminary Against ChatGPT + Google Drive. It's Not Close.

Date: Apr 14, 2026
Reading time: 5 minutes
Author: Thomas Berg, Liminary Engineering

[Image: Liminary's proactive recall interface surfacing relevant saved research notes in real time]

Liminary scores 89% accuracy on real-world multi-document Q&A. ChatGPT with Google Drive scores 63%. That's a 26-point gap overall, and on the hardest questions Liminary is 3.4x as accurate as ChatGPT.

We built a real-world question-answering benchmark using real user research documents to test whether AI-native storage actually outperforms bolting an LLM onto traditional file storage. It does.


What is AI-native storage?

Liminary is AI-native storage. We don't bolt an AI layer onto a file system that was built for something else. We built the storage system from the ground up for how AI agents actually search and reason over information.

You can easily try a bolt-on system. Give ChatGPT access to your Google Drive using the Google Drive Connector. Then when you ask questions, it will use Google Drive's search capabilities to find information to base its answers on. It works okay. But Google didn't build Drive for this. The same is true of Box, OneDrive, Dropbox, and every other storage platform now rushing to add AI features. They were built to store and sync files, not to help an AI reason across them.

We built Liminary with this in mind. When you save a source in Liminary, we immediately run it through an extraction process to build an understanding of the content. We add the results to our storage system, preserving context and relationships that traditional file systems lose: a quote from one interview connected to a contradicting claim in another, a pattern that emerges across five separate documents and meeting transcripts. This architecture is tailored to how AI agents perform searches; it's not a traditional keyword index or embedding store, though it includes both of those. It's built so that when the AI is searching your information later, it efficiently finds all the information it needs.
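To make the idea concrete, here is a minimal, hypothetical sketch of extraction-at-save-time. This is not Liminary's actual code or schema; the `Claim` structure, the entity list, and the inverted index are all illustrative assumptions. The point is that when content is decomposed into indexed claims at save time, a later query can jump straight to the relevant documents instead of scanning them.

```python
# Hypothetical sketch of AI-native ingestion (not Liminary internals).
# On save, a document is decomposed into claims indexed by the entities
# they mention, so query-time lookup never scans full documents.
from dataclasses import dataclass, field

@dataclass
class Claim:
    doc_id: str
    text: str
    entities: list[str]                               # e.g. people, tools, topics
    related: list[str] = field(default_factory=list)  # ids of linked claims

class Store:
    def __init__(self):
        self.claims = {}     # claim_id -> Claim
        self.by_entity = {}  # entity -> [claim_ids], an inverted index

    def save(self, claim_id, claim):
        self.claims[claim_id] = claim
        for e in claim.entities:
            self.by_entity.setdefault(e.lower(), []).append(claim_id)

    def lookup(self, entity):
        return [self.claims[i] for i in self.by_entity.get(entity.lower(), [])]

store = Store()
store.save("c1", Claim("interview_07", "Worries about client data in AI tools",
                       entities=["privacy", "client data"]))
store.save("c2", Claim("interview_12", "Uses a transcription app for meetings",
                       entities=["transcripts"]))
hits = store.lookup("privacy")  # finds interview_07 directly
```

A production system would layer embeddings and cross-document links on top of this, as described above; the sketch shows only the simplest piece, the pre-built index.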

For users doing research and analysis across large document collections, accuracy and completeness aren't nice-to-haves. When a system misses a relevant source or returns a partial answer, downstream decisions are built on incomplete information. We heard this consistently in our user research: people want to trust AI tools for synthesis work, but they don't yet. So we built a benchmark to hold ourselves accountable.

The dataset

Since we started working on Liminary, we've conducted a large number of interviews to learn how our primary users (independent consultants, strategists, and researchers) do their work. We collected transcripts and notes from these interviews and wrote 100 factual, short-answer questions that could be answered from that corpus.

Questions like:

Which interviewees are skeptical about using AI for writing?

What tools do people use for capturing meeting transcripts?

Then we saved all the transcripts and notes to Liminary, asked Liminary the questions one by one, and recorded the answers. We gave ChatGPT access to a Google Drive folder containing the same documents and asked it the same questions. Then we scored both systems.

Results

As you can see from the table below, Liminary's agent with AI-native storage is more accurate by a wide margin.

System                                           Accuracy   Mean latency (sec)
Liminary                                         88.9%      7.3
GPT-5.4 (default effort) with Google Drive       63.0%      11.2
GPT-5.4-e (increased effort) with Google Drive   81.8%      32.0

Accuracy of each system on the benchmark. We give 1 point for each fully correct answer; when the correct answer is a list (like the example questions above), we give partial credit. The final accuracy is the total score over the 100 questions in the dataset, expressed as a percentage.

Increasing ChatGPT's "effort" setting beyond the default improves accuracy, but roughly triples response times and raises inference cost. While we can't know exactly what OpenAI is doing under the hood, higher effort likely means reading more documents and expending more tokens verifying answers. That can close the gap, but it's a brute-force approach. Liminary gets better accuracy in less time because the retrieval architecture does the heavy lifting, not the LLM.

The accuracy differences are even more stark if we focus on hard questions. Comparing Liminary and GPT-5.4 head-to-head, there are 52 "easy" questions where both systems get full credit. We'll call the remaining 48 questions "hard." What do easy and hard questions look like? Looking through all the questions, two points stand out.

  1. Easy questions are usually based on a single document. Hard questions require information from multiple documents.

  2. Easy questions are often "search-friendly", including a keyword that's present only in the relevant documents. Often this is a proper noun and occurs in the document title. Hard questions require more understanding to identify the right documents.

Here are two sample questions (the name is changed).

Easy: What note-taking app does Sarah Smith use?

Hard: Which users expressed concern about privacy of client data when using AI?

This dataset mostly has one document per interview, with the interviewee's name in the title, so it's easy to find the right document for the first question. The second question includes nothing matching a document title, and requires the system to find all the interviews that covered data privacy, distinguish conversations about client data vs the user's own data, and understand which situations count as "using AI." This is feasible with a storage system where much of this information is pre-extracted from the original documents. It's difficult and slow if all you have is a keyword index.

The table below shows results on the hard question subset. Liminary scores 76.8% on hard questions, 3.4x as good as ChatGPT.

System                                           Accuracy   Mean latency (sec)
Liminary                                         76.8%      5.6
GPT-5.4 (default effort) with Google Drive       23.0%      16.8
GPT-5.4-e (increased effort) with Google Drive   62.0%      47.0

Accuracy of each system on hard questions.

How we do it

We won't describe the technical details of Liminary's system here, but a recent viral post by Andrej Karpathy captures a similar idea. He describes an "LLM Knowledge Base," where an AI model reads your documents and builds a wiki optimized for its own use. This wiki includes summaries, links between documents, and so on. Later, when the LLM needs to answer a question based on this knowledge, it can find what it needs quickly and reliably because of the work it did up front. Broadly speaking, this is what we're doing at Liminary at scale.

We start with file upload or ambient capture via our Chrome extension, extracting information from your sources as you save them. We then synthesize that knowledge, learning from the notes you add and questions you ask to find links, corroborations, and contradictions between sources. Finally, we save it all in a format that allows quick, accurate retrieval by our agent system, enabling you to answer questions and generate deliverables directly within the system.

This end-to-end workflow surfaces knowledge precisely when you need it, providing a "warm start" for your final output. And because Liminary also builds memory from your interactions (the questions you ask, the notes you add, the connections you flag), retrieval gets sharper over time. The system learns how you work, not just what you've saved.
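The core trade the "LLM Knowledge Base" pattern makes is doing expensive work once at save time so that query time is cheap. A toy sketch of that trade, under our own assumptions (the `summarize` callable stands in for an LLM call; the flat dict stands in for real storage):

```python
# Toy sketch of the "LLM Knowledge Base" pattern: summarize and tag each
# document once at save time; answer-time retrieval only touches the
# small pre-built index, never the raw documents.
knowledge_base = {}  # doc_id -> {"summary": ..., "topics": [...]}

def ingest(doc_id, text, summarize):
    # `summarize` stands in for an LLM call made once, at save time.
    summary, topics = summarize(text)
    knowledge_base[doc_id] = {"summary": summary, "topics": topics}

def candidates(topic):
    # Query time scans tiny summaries/tags instead of full documents.
    return [d for d, meta in knowledge_base.items() if topic in meta["topics"]]

def toy_summarize(text):
    # Purely illustrative stand-in for an LLM summarizer.
    topics = ["privacy"] if "privacy" in text else ["misc"]
    return text[:40], topics

ingest("interview_03", "Raised privacy concerns about client data in AI tools.", toy_summarize)
ingest("interview_09", "Mostly discussed pricing and scheduling workflows.", toy_summarize)
docs = candidates("privacy")
```

In a real system the save-time step is where most of the cost and intelligence lives; the sketch only shows why that investment makes retrieval fast.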

There's much more to do

Memory is one of the most active areas in AI right now, and for good reason. Projects like Mem0, Zep, and Letta are building systems to give LLMs persistent conversational memory: what did you discuss, what decisions did you make, what does this agent already know about you. That's important work. But conversational memory is only one piece.

For consultants and researchers, the harder problem is the memory that lives across hundreds of saved documents, interview transcripts, reports, Slack threads, and ideas you've accumulated over months of work. How those sources connect, where they contradict, what patterns emerge across them. That's what Liminary's AI-native storage is built to solve, and this benchmark is our first step in proving it publicly. We're expanding our evaluations to cover more document types, longer-form synthesis tasks, and the kinds of messy real-world collections our users actually have. We will share more as we go.

Thanks for coming along.