Is Brave Selling Your Site’s Content to Train AIs?

Jonathan Bailey, July 18, 2023


Brave Software is best known for its privacy-focused browser, also named Brave. The browser famously blocks many trackers and ads out of the box and has earned itself a reputation for protecting the privacy of its users.

But while its browser may be the company’s best-known product, Brave has routinely ventured into other areas. In March 2021, the company launched its own search engine, Brave Search. By October 2021 the search engine was the default for the Brave browser, and by June 2022 it was taken out of beta.

What makes Brave Search unique is that it is a truly original search engine. Brave manages its own web crawler, its own index and uses its own algorithm to search. This is a major step compared to other privacy-focused search engines, like DuckDuckGo, which use outside sources, such as Microsoft’s Bing, for results.

According to Brave, they removed the last remnants of Bing from their search results, achieving “100% independence” in April 2023.

A month later, in May 2023, Brave released an application programming interface (API) for their search engine, one that enables third parties to create their own search engines based on Brave results and, in one of their pricing tiers, use the output for training artificial intelligence (AI) systems.

However, that API has now come under fire as one developer has called out Brave for, essentially, selling content from third-party websites for training AI systems.

Is that a fair accusation? To find out, we have to first look at what Brave is doing and look at what both the law and the internet norms might say about it.

Understanding the Allegations

The allegations were first published by developer Alex Ivanovs on Stack Diary. According to the original article, he was curious about how search engines were working with AI and noted that Brave offered several API products aimed at companies wanting to train AI systems.

As such, he signed up for Brave’s API to see what kinds of data AI services would have access to. He had two different concerns.

First was a feature in the API entitled “Extra alternate snippets”. With search engines, snippets are the blurbs of text that sit beside a search result. They are usually under 50 words long and are often picked by the web page itself.

The “Extra Alternate Snippets” are made up from text from the web page itself and, according to his initial checks, can include anywhere from 150 to 260 words. In addition to the longer snippets, Brave’s API also offers access to parsed FAQs, discussions and other information, some of which may contain copyright-protected content.

The second issue dealt with the crawler that Brave uses to index sites. According to Ivanovs, Brave does not identify their crawler. This means that a webmaster who doesn’t want Brave to index their site (and thus use its content in their API) would have to block ALL web crawlers, not just Brave.
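To see why an unidentified crawler is hard to block selectively, it helps to look at how robots.txt rules are matched. The sketch below uses Python’s standard `urllib.robotparser` module and is only an illustration of the general mechanism, not of Brave’s actual crawler. The token `HypotheticalBraveBot` and the example URL are invented for the demonstration; per Ivanovs, Brave publishes no such token.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Assumption: "HypotheticalBraveBot" is a made-up token. A rule like this
# only works if the crawler announces itself under that name.
TARGETED = """\
User-agent: HypotheticalBraveBot
Disallow: /

User-agent: *
Allow: /
"""

# The only rule that catches a crawler with no published token is the
# wildcard, which also shuts out every other well-behaved crawler.
BLANKET = """\
User-agent: *
Disallow: /
"""

URL = "https://example.com/article"  # placeholder URL

targeted = build_parser(TARGETED)
blanket = build_parser(BLANKET)

# A crawler that identifies itself can be singled out...
named_blocked = not targeted.can_fetch("HypotheticalBraveBot", URL)
# ...while other crawlers remain welcome.
google_allowed = targeted.can_fetch("Googlebot", URL)
# A crawler using any other name slips past the targeted rule.
unnamed_allowed = targeted.can_fetch("SomeUnannouncedCrawler", URL)
# Blocking it with the wildcard blocks Googlebot too.
google_blocked_by_blanket = not blanket.can_fetch("Googlebot", URL)

print(named_blocked, google_allowed, unnamed_allowed, google_blocked_by_blanket)
```

The point of the sketch is that robots.txt rules are keyed to the name a crawler announces. Without a published name, a webmaster has nothing to write on the `User-agent` line except the wildcard.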

Brave responded to the post, saying that it “has the right to monetize and put terms of service on the output of its search engine” and that the page content “is a standard and expected feature of all search engines.”

They go on to say that the decision not to identify their crawler is “practical.” They claim that they “do not have the resources to contact all domain-owners, who-rightfully or not, discriminate against anyone but Google.” They said that they will not crawl any page that is not crawlable by any search engine or by Google’s crawler.

Ivanovs posted an update to the article, clearing up some misunderstandings from the previous work but further highlighting how different Brave’s results are when accessed via the API. In one test, he found the number of words returned for a query ballooned from about 500 words when accessed via the web to 1,612 words when accessed via the API. That is a more than 3x increase in the amount of content being passed on.

He further showed that much of this content was not licensed for commercial use, raising doubts about the legality of the use. According to Ivanovs, “Brave is under the assumption that 1) because they are a search engine and 2) because they attribute the URI of data – this puts them in the clear to scrape and resell data word-for-word.”

Since then, the story has been further picked up by Matt G. Southern at Search Engine Journal and Anirudh VK at Analytics India Magazine.

Difficult Questions, Difficult Answers

Examining Brave’s actions from an ethical and legal standpoint, I find their actions to be extremely dubious.

Starting with their crawler, not identifying your bot may be “practical” but it is also extremely questionable.

A bot is an automated script that accesses someone else’s server to parse its content for whatever purpose the bot’s operator has in mind. The person who owns that server and created that content can and should have the right to control who accesses those resources.

While it may be unfair that there are those that want to block everyone other than Google, that is their choice. That is completely their right, and Brave’s current policies make that impossible. If you allow Googlebot, you have to allow Brave. That is taking away choices from content creators and copyright holders.

Brave could offer a variety of incentives to make people want Brave’s crawler to access their site (something they tried for creators when blocking ads). Instead, they opted to make it impossible to say no, at least not without blocking other desirable bots.

As for their Extra Alternate Snippets, that’s an issue that must be taken case by case. Taking 200 words from a 250-word article is different from taking 150 words from a 100,000-word book.

As we discussed back in May, the recent Supreme Court ruling in the Warhol case has shifted the narrative of fair use in ways that we don’t and can’t fully understand. However, one of the impacts that is clear is that it’s weakened the most common fair use argument AI companies have made when defending their use of unlicensed copyrighted material: namely, that the use is transformative.

Given the sheer volume of content that Brave has indexed, it’s almost unthinkable that absolutely none of it would raise serious copyright issues when redistributed this way.

As for any implied license for distribution, in May 2021, the 11th Circuit Court of Appeals upheld a lower court decision that having a full RSS feed, something that enabled publishing on other sources, was not by itself an implied license to redistribute content. A web page, which is not in such a format, would likely enjoy even more protection.

Those arguments would be further hurt by the fact many creators don’t know that Brave is accessing their site and, even if they do know, can’t block them without also blocking other bots.

While many of the uses of this special snippet may be a fair use, when you’re dealing with over 8 billion pages, even if only 1% weren’t, that would still be over 80 million potential infringements.

Any one of those could be potentially damaging.

Bottom Line

Brave is a company I’ve respected for many years. Even when I disagreed with them, they’ve always come across as a company that wants to make the internet a better place.

That’s what makes these decisions so painful. Not identifying their crawler and copying/pasting hundreds of words of content without permission look, to me, to be cynical moves made by a company that doesn’t care what webmasters, authors or other creators want.

It’s true that, in the big scheme of things when it comes to AI, this is not a major issue. The large AI firms have much better ways to scrape and collect content from websites than using Brave’s Search API and its lengthier snippets. Still, it puts Brave in a bad light.

Theoretically, this would be simple to fix. Simply identify your crawler, accept those that block it, and don’t offer the extended snippets through the API.

Yes, it might make Brave Search less valuable, but it would largely eliminate the ethical and legal cloud that hangs over the current approach.

It would also be more in keeping with their mission statement, which says in part, “Big Tech makes huge profits off our data, and tells us what’s true and what’s not. Brave is fighting back.”

Right now, Brave IS the big tech making profits off our data. It’s just the data of webmasters and creators, not their browser users.
