
There is cause for concern about misinformation from LLMs, but the creators of these platforms, along with online influencers, also spread misinformation about the Ai itself. Words and phrases are being redefined.

People including Meta CEO Mark Zuckerberg have described their locally executable LLMs as “open source,” and this assertion is then echoed by online influencers. I operate on the premise that most people don’t know what open source actually means. Sure, there may be an occasional article in IEEE Spectrum that gets it right, but that’s not mainstream. YouTube is more approachable.

Matt Berman echoes the farcical “open source” phrase when describing local LLMs. But can we blame him? Anyone can upload to YouTube, and he’s trying to go viral like anyone else, filling the content void and building a wider viewership. Data from Google Trends may tell us more about who is searching for these terms.

Open source traditionally meant something specific. In human-developed software, the source was the legible code that could be compiled into a machine-executable program. In Ai software, the system trains or develops itself using a massive, proprietary dataset, and that data is not open to the public. Now we’ll cut to Elon Musk.

Musk alleges that Ai is being trained on copyrighted data. The New York Times v. OpenAI case lays out some powerful evidence to support this idea, too.

These allegedly “open source” models do not disclose their training data. If they did, copyrighted data would be clear for all to read. If we had the training data and the code that determines how the system is trained, we could hypothetically train or compile the model ourselves. It is unlikely that companies would ever allow us to have that on legal grounds alone, but there are other reasons.
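To make that hypothetical concrete, here is a toy sketch of what training code looks like. It is a miniature illustration, assuming PyTorch and using random stand-in data; it is not any vendor’s actual pipeline.

import torch
from torch import nn

# A tiny model standing in for a large language model.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Random noise standing in for the proprietary corpus we never get to see.
inputs = torch.randn(256, 16)
targets = torch.randint(0, 4, (256,))

for epoch in range(3):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")

Real training runs are this same loop at vastly larger scale; without the dataset and the training configuration, “open” weights cannot be reproduced from scratch.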

Aside from legal liabilities, there are strong technical reasons we probably won’t see training data or methods released to the general public. Here are the top three, and they are unlikely to change.

Hosting Expense

The training data is massive, likely petabytes. Hosting this data to make it available to the public, even for controlled-access downloads, would be absurdly expensive. It might be more feasible to send physical hardware, like AWS Snowball, but in reverse.

Storage Challenges

If you want to train on the data or assemble the dataset for yourself, you’ll need high-performance storage, and lots of it. For maximum IOPS, you’ll probably want physical access to the storage hardware. Unless you are a massively wealthy institution, the storage costs are prohibitive.
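To put rough numbers on that claim, here is a back-of-the-envelope sketch. The dataset size and per-gigabyte price are illustrative assumptions, not quoted figures from any vendor.

# Assumed values for illustration only: a multi-petabyte dataset and a
# ballpark object-storage price of about $0.02 per GB per month.
dataset_petabytes = 5
price_per_gb_month = 0.02

dataset_gb = dataset_petabytes * 1_000_000   # 1 PB is roughly 1,000,000 GB
monthly_cost = dataset_gb * price_per_gb_month
print(f"~${monthly_cost:,.0f} per month just to keep the data at rest")
# Prints ~$100,000 per month, before any egress fees, high-IOPS hardware,
# or the compute needed to actually train on it.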

Processing Infeasibility

Storing the data is one thing, but the training operations require an entirely different class of hardware, like Nvidia’s H100 chips, which are so difficult to acquire that OpenAI CEO Sam Altman reportedly wants to build his own. Don’t worry, it’ll only cost $7 trillion. You’re good for it, right?

Sarcasm aside, you can use an already-trained model for inference. You can feasibly access a model in the cloud using a browser or run it on local hardware. Different models allow different possibilities for inference.

In the beginning, there was cloud-based model inference. These cloud-based models include OpenAI’s ChatGPT, Google’s Bard (since rebranded as Gemini), and Musk’s Grok experiment on Twitter (now X). Most people in the tech world probably already know about these models, and they have been widely covered by mainstream news sources.

Cloud-based Ai inference is nice because it works in a browser. This means the heavy lifting is done by someone else’s computer, likely a massive server farm. But new chips from Intel and AMD offer an integrated NPU that can use main memory rather than the VRAM found in a more expensive discrete GPU. Cloud-based Ai has a new challenger: local inference.

In the near future, or perhaps the present, local Ai models can use the hardware on your desk. Platforms like LM Studio have created a highly accessible UI that avoids the learning curve associated with terminal commands. It’s shockingly easy to run on Apple Silicon. Some of these models include Mistral, Meta’s Llama 2, and Microsoft’s Phi-2. These models have, understandably, not received much coverage from mainstream news sources. This could soon change.
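For a sense of how simple local inference has become, here is a minimal sketch. It assumes LM Studio is running its local server in OpenAI-compatible mode on the default port; the model name and address are placeholders to adjust for your own setup.

import requests

# Query a locally hosted model through an OpenAI-compatible endpoint.
payload = {
    "model": "local-model",  # placeholder; the server uses whichever model is loaded
    "messages": [
        {"role": "user", "content": "In one sentence, what does open source mean?"}
    ],
    "temperature": 0.7,
}

response = requests.post(
    "http://localhost:1234/v1/chat/completions",  # assumed default local address
    json=payload,
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])

Everything here runs on your own machine; no prompt or response leaves your desk, which is the practical appeal of local inference, open source or not.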

But the mischaracterization of local inference as open source Ai should concern us. The attempted redefinition of the phrase open source could ultimately muddy public perceptions. The gaps between reality and nonsensical marketing jargon are where the metaphorical bubbles form and grow. Bubbles have a tendency to pop.

There is a tendency in English to lose or depart from the original meaning of words and phrases. A Middle English professor I knew remarked that the word skyline resembles a portmanteau but ultimately describes something that is neither sky nor line. Perhaps the brand OpenAI, or the phrase open source Ai, is not so different a case.

We know that Ai still fails at basic recursive logic in ways that humans typically wouldn’t. Some models even appear to get worse as they are expanded. There’s also data to suggest that adding synthetic training data, data that was not generated by humans, can lead to model collapse. It’s important to acknowledge the weaknesses of Ai systems, especially if we expect them to replace human workers. We also know that humans can be relatively easy to deceive.

The Turing Test is probably new to many, but it should serve as a warning and as proof that humans are relatively easy to trick. ELIZA could fool people back in 1966. These systems are specifically designed to mimic human writers, but that doesn’t mean they are human.

Ai has weaknesses and so do humans. Science fiction writers have already spent decades exploring ways this type of technology could compromise human nature. 

We should exercise caution with Ai marketing. It appears the phrase open source has been both intentionally and unwittingly redefined to make Ai models appear safer or more transparent. These models are all trained on massive proprietary data we will likely never see. We can only speculate about what the original datasets contained based on the way the model responds to specific types of prompts. We should challenge what we’re told. It’s best not to be fooled.

Image: Meta logo and a blue llama on a blue background.

Full disclosure: Devin Crutcher owns META stock and a few of its VR headsets.