EleutherAI (/iˈluθər eɪ. aɪ/) is a decentralized grassroots collective of volunteer researchers, engineers, and developers focused on AI alignment, scaling, and open source AI research. It was founded in July 2020, and its flagship project was the GPT-Neo family of models, designed to replicate OpenAI's GPT-3.

The core team consists of Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman.

In Ancient Greek, eleutheria is a word for “liberty”, and was used as a proper noun as a personification of the concept. This same personage became Libertas to the Romans and Lady Liberty to Americans.

On June 8th, 2021, EleutherAI released GPT-J-6B, a 6-billion-parameter model trained on the Pile v1.

On excluding Literotica…
Once upon a time, the plan was to include a dump of Literotica in The Pile (you can still find it here: https://the-eye.eu/public/AI/pile_preliminary_components/ ). I argued heavily in favor of this, and thought it was totally lame when they decided to drop it.

On excluding the US Congressional Record…
The US Congressional Record had been considered and rejected for inclusion in The Pile. I thought “Big deal, who the hell cares?” while saying “Okay, but I don’t know what that is.”

It’s a written record of all statements made in the US legislature. It was also somewhere between 1GB and 15GB, which would have been a significant portion of The Pile’s total size.

I’m going to quote from her private DMs with me, which I haven’t asked for permission to do. So this is technically another bad move by me. But she put it so perfectly, I was stunned:

> For half the history of the US, black people were slaves. For something like 75% of it, black people didn’t have the right to vote. If a modern reader didn’t think there was a high proportion of extremely racist content, that would primarily be an indictment of modern people lol.

> The reason we first looked at it was that we included a similar document for the EU Parliament

It took me a few minutes to come to my senses, but I finally realized:

(a) this dataset likely contained a huge proportion of content that, politics aside, would be a Very Bad Idea to include in your ML models by default;

(b) Eleuther had just been trying to do good work this whole time.

GPT-J quality: HN discussion (2021). A discussion about GPT-J, Books3 creation, and the exclusion of datasets like Literotica and the US Congressional Record…(PDF) (Original HN link)

