Editor’s note: Paul Arnold is a Venture Investor and the Founder of Switch Ventures, a San Francisco-based seed-stage venture fund with a data-driven bent. Prior to launching the fund in 2016, Paul oversaw operations at AppDirect during a period in which the company scaled twentyfold. Before that, he was a Consultant at McKinsey & Company, where he built and led high-performing teams and honed his understanding of deep tech.

When OpenAI launched ChatGPT in late 2022, it represented a giant leap forward in generative AI and a stark contrast to the slow, incremental advances that most AI companies had been making until then. More than just a technical innovation, ChatGPT’s arrival was a wake-up call. It signaled that something really big was coming and solidified AI’s position as a core tech transition poised to usher in a suite of new products and drive economic growth for decades to come.

As a venture investor, my job is to tap into that growth. To do this, I need to distinguish opportunities where venture-backed startups can win from those that favor large incumbents and their massive datasets. To succeed, gen AI startups will need to have an enduring and defensible value proposition.

Why defensibility will often hinge on private datasets

Since ChatGPT arrived on the scene, there has been an explosion of new generative AI startups and an equally marked increase in venture funding. According to CB Insights data, equity investors poured $21.8 billion into generative AI deals in 2023, five times more than the year before. Much of that funding went to early-stage companies.

In my position, I talk to early-stage generative AI companies all the time. The biggest issue I see with the vast majority of them is that their products are untenable. Typically wrappers around ChatGPT or Gemini, they might take the form of a specialized prompt that makes it easier for users to get answers to highly specific questions than it would be if they queried one of those large language models (LLMs) directly. While there’s some value in that, the problem is that products like these are effectively just a user interface that integrates with a core technology the startups don’t own. That means they’re not defensible, because it wouldn’t be all that hard for incumbents or other startups to replicate them.

To succeed in generative AI, founders need to ensure there’s a moat around their business. Traditional approaches to building moats, such as developing a product that becomes its customers’ system of record (i.e., indispensable) or that benefits from a network effect, work well but are difficult to pull off. Often, the more viable option is to create something that no one else can by training AI models on private and privileged datasets rather than relying on the general, publicly available datasets that ChatGPT, Gemini, and other LLMs use. Importantly, those private datasets will be most valuable if they can be used to address problems or perform tasks where existing LLMs have blind spots or gaps in their data. Typically, those blind spots are specific to a vertical. Take a highly specialized task such as drafting legal contracts, for example. Today’s LLMs simply aren’t very good when tasked with this kind of work.

Different paths to access private data

The question gen AI startups need to ask themselves is how they can acquire the massive private datasets necessary to create a defensible AI model. Incumbents have been able to do so organically as a result of having customers use their products for years. That’s how Zendesk has amassed a treasure trove of data about customer support tickets, just as Stripe has about payments. Startups can, of course, follow suit, build a solution of their own, and begin collecting user data. That approach is highly effective, but in an industry where first-mover advantage is critical, having to overcome a classic cold-start problem like this may not be practical.

Another option will be for startups to ingest all of their customers’ data and use it to develop insights that are unique to each customer. Yet another possibility is for AI startups to acquire the data they need through partnerships. Imagine a fictitious startup called Accounting.ai that has partnered with a Big Four accounting firm and gained access to all of its internal documents and accounting records. Using that enormous private and privileged dataset, Accounting.ai would be able to build an AI that is smarter than ChatGPT (which doesn’t have the benefit of being trained on all of that data), and therefore has the potential to be both enduring and defensible.

Of course, partnerships present challenges of their own. First, there is the question of whether incumbents will be willing to share their data with partners or will opt to launch their own tech spinouts. And, even if they are open to partnerships, what will the cost to the startup be? After all, incumbents are typically conservative organizations with highly sensitive data. They will be extremely cautious about sharing that data, particularly if there is even the slightest chance that any of it could be leaked, and will demand an enormous price if persuaded to do so.

While it remains to be seen how founders and the owners of large private datasets might actually work together, what is clear is that each of these datasets is a potential goldmine. As such, incumbents will be highly incentivized to figure out the best way to monetize their data, whether on their own or in partnership with others. 

The exciting (but uncertain) road ahead

AI will reshape the world and drive economic growth for decades to come. Exactly what shape that will take is unclear. What is clear, however, is that for startups to have any chance of displacing incumbents, they can’t simply rely on models trained on public datasets. To ensure their seat at the table, they will need to have an enduring and defensible value proposition. That, in turn, will depend on their ability to train their models on private datasets that allow them to perform highly specific tasks. It’s a tall order, but one that will unlock billions of dollars in value for those founders who can pull it off.