EleutherAI

Background
EleutherAI (/iˈluθər eɪ.aɪ/) is a decentralized grassroots collective of volunteer researchers, engineers, and developers focused on AI alignment, scaling, and open source AI research. It was founded in July 2020, and its flagship project was the GPT-Neo family of models, designed to replicate OpenAI's GPT-3.

The core team consists of Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman.

In Ancient Greek, eleutheria is a word for “liberty”, and was used as a proper noun as a personification of the concept. This same personage became Libertas to the Romans and Lady Liberty to Americans.

On June 8th, 2021, EleutherAI released GPT-J-6B, a 6 billion parameter model trained on the Pile v1.

On excluding Literotica…
Once upon a time, the plan was to include a dump of Literotica in The Pile, which you can still find here: https://the-eye.eu/public/AI/pile_preliminary_components/. I argued heavily in favor of this, and thought it was totally lame when they decided to drop it.

On excluding the US Congressional Record…
The US Congressional Record had been considered and rejected for inclusion in The Pile. I thought “Big deal, who the hell cares?” while saying “Okay, but I don’t know what that is.”

It’s a written record of all statements made in the US legislature. It was also somewhere between 1GB and 15GB, which would have been a significant portion of The Pile’s total size.

I’m going to quote from her private DMs with me, which I haven’t asked for permission to do. So this is technically another bad move by me. But she put it so perfectly, I was stunned:

> For half the history of the US, black people were slaves. For something like 75% of it, black people didn’t have the right to vote. If a modern reader didn’t think there was a high proportion of extremely racist content, that would primarily be an indictment of modern people lol.

> The reason we first looked at it was that we included a similar document for the EU Parliament.

It took me a few minutes to come to my senses, but I finally realized:

(a) this dataset likely contained a huge proportion of content that, politics aside, would be a Very Bad Idea to include in your ML models by default;

(b) Eleuther had just been trying to do good work this whole time.

GPT-J quality: HN discussion (2021). A discussion about GPT-J, Books3 creation, and the exclusion of datasets like Literotica and the US Congressional Record…(PDF) (Original HN link)

Alan D. Thompson is a world expert in artificial intelligence, advising everyone from Apple to the US Government on integrated AI. Throughout Mensa International’s history, both Isaac Asimov and Alan held leadership roles, each exploring the frontier between human and artificial minds. His landmark analysis of post-2020 AI, from his widely-cited Models Table to his regular intelligence briefing The Memo, has shaped how governments and Fortune 500s approach artificial intelligence. With popular tools like the Declaration on AI Consciousness and the ASI checklist, Alan continues to illuminate humanity’s AI evolution.

This page last updated: 6/Aug/2021. https://lifearchitect.ai/eleutherai/