Integrated AI: Dataset quality vs quantity via bonum (GPT-4 and beyond)

Author’s note: Mar/2022: I have since changed my views on this subject. I would like my AI (AGI) to know about everything—from Nazi Germany to 4chan—before finetuning and ‘learning’. I appreciate that there are several lenses through which to view this debate, and this paper presents a different perspective through my earlier lens. The paper is restored here for interest.

Alan D. Thompson
June 2021

I acknowledge the complex work undertaken at OpenAI, EleutherAI, Synthesia, and other technologies referenced in this article and its related resources. Importantly, given the speed of AI development, this article should be considered superseded within 24 months of its initial release in June 2021. Revisions are to be expected. Correspondence concerning this article should be addressed to Dr Alan D. Thompson, Life Architect, Australia, 2021.

Suits. Honour, flowers… Colonel, those are all tile sets in Mahjong. God, are they using a game to converse with their heptapods?

Maybe. Why?

Well, let’s say that I taught them chess instead of English. Every conversation would be a game, every idea expressed through opposition, victory, defeat. You see the problem? If all I ever gave you was a hammer…

Everything’s a nail.

— (Arrival, 2016)


Innovations like Artificial Intelligence (AI) and neural lace are already here, even if they are not yet in most of society’s field of vision (Thompson, 2020). The world of integrated AI supplementing and replacing our intelligence is moments away (Thompson, 2021a; 2021b).

Data scientists training the language models used as a basis for AI are currently seeking to obtain as much data as possible. The datasets used to train current language models like GPT-3 (Brown et al, 2020) and GPT-Neo/The Pile (Gao et al, 2020) average about one terabyte, or more than 83 million pages, of text.

Thousands of years ago, Plato (1999) warned that: ‘In the world of knowledge, the idea of good appears last of all’. Training an AI on the largest possible corpora (datasets) without paying attention to the necessary subsequent layers isn’t good enough.

Summum bonum is a Latin expression and concept meaning ‘ultimate good’ (or ‘highest good’). The term was introduced by the Roman philosopher Cicero, and the underlying concept was explored earlier by Greek philosophers including Plato and Aristotle. While there are various interpretations of the term, summum bonum suggests a guiding ethical principle leading to the best possible life. In this article, bonum (‘good’) is used as a shorthand to refer to this concept of ultimate goodness.

The old way

Historically, during training of language models, datasets have not been sampled in proportion to their size. Rather, datasets that are viewed by the researchers as higher quality have been sampled more frequently (Brown et al, 2020). Note that this higher quality is a subjective assessment, generally performed by data scientists rather than ethicists or those concerned with human ideals such as ultimate goodness.

Table 1: Historical source material used in major language models

| Dataset | High quality with low controversy? | Weighting: GPT-3 | Weighting: GPT-Neo/The Pile |
|---|---|---|---|
| Common Crawl (www) | | 61% | 18% |
| WebText (Reddit links) | | 19% | 10% |
| Books1/BookCorpus (Smashwords) | | 8% | 1% |
| Books2 (Libgen or similar) | | 8% | |
| PubMed Central (papers) | | | 14% |
| Books3 (Bibliotik tracker) | | | 12% |
| ArXiv (papers) | | | 9% |
| Github (code) | | | 8% |
| FreeLaw (papers) | | | 6% |
| Stack Exchange (discussion) | | | 5% |
| USPTO Background (papers) | | | 4% |
| PubMed Abstracts (papers) | | | 3% |
| Gutenberg (books) | | | 2% |
| OpenSubtitles (movies) | | | 2% |
| Wikipedia (facts) | | 4% | 2% |
| DM Mathematics (papers) | | | 1% |
| Ubuntu IRC (discussion) | | | <1% |
| EuroParl (formal discussion) | | | <1% |
| HackerNews (discussion) | | | <1% |
| YoutubeSubtitles (movies) | | | <1% |
| PhilPapers (papers) | | | <1% |
| NIH ExPorter (papers) | | | <1% |
| Enron Emails (discussion) | | | <1% |
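
To make the difference between size-proportional sampling and quality-weighted sampling concrete, here is a minimal Python sketch. It uses the five Table 1 weightings that sum to 100% (Common Crawl, WebText, Books1, Books2, Wikipedia). The token counts are illustrative assumptions rather than figures from this article, and the function names are hypothetical.

```python
import random

# Illustrative dataset sizes in billions of tokens.
# ASSUMPTION: these sizes are placeholders for the sketch, not from the article.
TOKENS_BILLIONS = {
    "Common Crawl": 410,
    "WebText": 19,
    "Books1": 12,
    "Books2": 55,
    "Wikipedia": 3,
}

# Curated sampling weights in percent, taken from Table 1.
CURATED_WEIGHT_PCT = {
    "Common Crawl": 61,
    "WebText": 19,
    "Books1": 8,
    "Books2": 8,
    "Wikipedia": 4,
}


def normalise(values):
    """Scale a mapping of non-negative numbers so the values sum to 1."""
    total = sum(values.values())
    return {name: value / total for name, value in values.items()}


def emphasis(curated, by_size):
    """Ratio above 1: sampled more often than size alone would give."""
    return {name: curated[name] / by_size[name] for name in curated}


def sample_sources(weights, k, seed=0):
    """Draw k dataset names at random according to the given weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)


by_size = normalise(TOKENS_BILLIONS)
curated = normalise(CURATED_WEIGHT_PCT)
ratios = emphasis(curated, by_size)

for name in TOKENS_BILLIONS:
    print(
        f"{name:<13} size share {by_size[name]:6.1%}  "
        f"curated {curated[name]:6.1%}  emphasis x{ratios[name]:.2f}"
    )
```

Under these assumed sizes, a small source such as Wikipedia is drawn far more often than its size alone would suggest, while the largest source is drawn less often. The choice of weights is exactly the subjective step discussed above.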

Who are the curators?

‘Think of how stupid the average person is, and realise half of them are stupider than that.’ American comedian George Carlin’s famous quote is focused on smarts, though it may just as well be focused on accomplishment, performance, success, happiness, wellness, or ultimate goodness. If we input and process the average, we output the average. If we were to shift Carlin’s context to wellness and universal aims like bonum, especially when mapped to the ideal of humanity’s aims for itself via AI, large datasets focused on quantity with ad hoc weightings become concerning.

Even with a focus on ‘popular’ curated content, whether from specific submissions like WebText (Reddit links) or from general cultural indicators like OpenSubtitles (movies), any objective of gathering generally popular content—with humanity’s extensive history of stupidity—is a limited aim.

A new way

Let me propose a view of data that shifts away from general popularity, and instead replaces it with bonum. This is not a proposal for Artificial Specific Intelligence focused just on personal development, but a proposal for all ongoing language model development, and for Artificial General Intelligence.

Lower controversy. As of 2021, major papers on modern language models include sections analysing the current view of social justice issues covering race, gender, and other factors. By excluding controversial sources like 4chan, YouTube comments, and other social media traps, the data researchers have already moved to reduce controversy and increase bonum to an extent (perhaps accidentally, and without using that term) by making a broad elimination of undesirable content. This can and must be furthered.

Consistent messaging. There are many differing views on core personal development topics such as: worthiness, will, body, money, mind, intuition, emotions, fears, self-knowledge, sexuality, love, and service (Millman, 2014). However, enforcing higher visibility and weighting of bonum sources will ensure more consistency in output aligned with humanity’s ultimate good.

Key futurists and researchers like Ray Kurzweil (2011) have offered differing estimates of the human brain’s capacity, both applied and potential. These estimates range from just a few gigabytes to one terabyte or more.

Before AI, a human being at peak condition can perhaps store and recollect:

  • Up to seven ‘things’ in short-term memory at one time.
  • 50,000 words in their native language.
  • >100 books as >1M tokens.

Selecting data for quality may necessarily prioritise fewer tokens for training, and this is a positive result. Of course, data breadth is useful, and we will easily model on trillions of tokens, though we only need a heavier emphasis on perhaps a few million tokens. The point here is that, while AI will easily scale to trillions of ‘things,’ this obsession with quantity is not a net positive. Instead, there must be a focus, an apex, a summum bonum in the evolutionary spiral at which we are aiming. With this in mind, perhaps we need to look at and prioritise bonum content created by the few individuals that have aimed exclusively for bonum, especially in the last 100 years of evolution.

Table 2: Proposed bonum source material for new language models

| Proposed dataset | Bonum | Book count / Tokens (estimated) | Proposed weighting |
|---|---|---|---|
| 14th Dalai Lama: books + audio | ★★★★★ | 127 / 1M | Very high |
| Conversations For Transformation: Essays By Laurence Platt Inspired By The Ideas Of Werner Erhard, And More | ★★★★★ | 1,500 essays / | Very high |
| Dan Millman: books + audio | ★★★★★ | 18 / 0.5M | Very high |
| Thomas J. Leonard: books + audio | ★★★★★ | 7 / 0.5M | Very high |
| Wayne Dyer: books + audio | ★★★★★ | 43 / 1M | Very high |
| Erin Pavlina: books + audio | ★★★★★ | 1,000 essays / | Very high |
| Ralph Waldo Emerson: books + audio | ★★★★★ | 11 / 0.5M | Very high |

This is not an exhaustive list, and is provided here by way of example only. Each dataset source is a human being, and so will have limitations and weaknesses. Further, as the reader is also a human being, there will be a tendency to criticise (Thompson, 2017), and the proposed bonum source materials above would be open to criticism. There are also intellectual property and copyright considerations for some of the datasets, but it is expected that these would be easily cleared by the respective authors for the good of humanity. The reader is encouraged to evaluate the table above with an open mind, and to design their own table as an exercise for interest.

The underpinning theory here is that when a language model is tuned using bonum material that has been curated by bonum sources, the outcome will offer a strong tendency toward high-quality discourse, with lower controversy, and consistent messaging. The result will be a language model that is still aware of a broad range of data, but places a necessary and effective emphasis on content that will benefit humanity. 
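
The scale of ‘emphasis’ involved can be estimated with some simple arithmetic. This article does not specify a mechanism for emphasis, so the Python sketch below assumes one simple option: repeating (upsampling) the small bonum corpus alongside a broad corpus. The 3.5M figure is the sum of the token estimates listed in Table 2; the one-trillion-token broad corpus and the 1% target share are hypothetical values chosen only for illustration, as are the function names.

```python
def repeats_for_share(small_tokens, broad_tokens, target_share):
    """Repeats of the small corpus needed for it to reach target_share
    of all tokens seen in training (assumes emphasis by repetition)."""
    if not 0 < target_share < 1:
        raise ValueError("target_share must be between 0 and 1")
    # share = r * s / (r * s + B)  =>  r = share * B / ((1 - share) * s)
    return target_share * broad_tokens / ((1 - target_share) * small_tokens)


def share_after_repeats(small_tokens, broad_tokens, repeats):
    """Fraction of all training tokens that come from the small corpus."""
    seen = small_tokens * repeats
    return seen / (seen + broad_tokens)


BONUM_TOKENS = 3.5e6   # sum of the token estimates given in Table 2
BROAD_TOKENS = 1e12    # ASSUMPTION: hypothetical size of the broad corpus
TARGET_SHARE = 0.01    # ASSUMPTION: hypothetical target of 1% of training

baseline = share_after_repeats(BONUM_TOKENS, BROAD_TOKENS, 1)
needed = repeats_for_share(BONUM_TOKENS, BROAD_TOKENS, TARGET_SHARE)

print(f"Share with no extra weighting: {baseline:.6%}")
print(f"Repeats needed for a {TARGET_SHARE:.0%} share: {needed:,.0f}")
```

With these assumed numbers, a few million tokens are a vanishingly small share of a trillion-token corpus unless they are deliberately weighted, which is why the choice of weighting (or a separate tuning stage) matters more than raw corpus size.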

Just how bad is it?

For illustration, consider a range of popular content, from fantasy books to pop music lyrics. Let’s explore a selection of movies only. In the table below, the left column shows five movies rated as popular by general consensus via IMDb, and the right column shows five bonum movies curated by a bonum source, Dr Ryan Niemic’s annual positive psychology movie awards.

Table 3: Movies by popularity vs positive psychology rating

| Popular movies by general consensus (IMDb, 2021) | Bonum movies by bonum source (Niemic, 2016-2019) |
|---|---|
| The Godfather. Themes: family, crime, deceit, revenge | Won’t You Be My Neighbor? (Fred Rogers). Themes: happiness, kindness, empathy |
| Pulp Fiction. Themes: violence, redemption | Themes: family, human connection |
| Themes: violence, competition | Themes: mindfulness, connection |
| Themes: greed, class discrimination | The Martian. Themes: hope, optimism, strengths |
| The Dark Knight Rises. Themes: crime, chaos, destruction | Inside Out. Themes: positive emotions, growth |

Note that I am definitely not arguing that The Godfather is anything but a great movie and a cinema classic. But, in line with the opening quote for this article, teaching a language model and subsequent AI about success through competition and revenge would be counterproductive to humanity’s aims.

Returning to the concerns of the character Professor Louise Banks during that pivotal scene in the movie Arrival: the game of chess has been banned by many groups at one time or another (Religion and Chess, 2007). While there may be some benefits to competition, far beyond concepts of winning, losing, and black and white squares, a colourful and unbounded universe awaits.

The AI and super intelligence being prepared right now to foster humanity through the future absolutely must have our highest good underpinning every response, decision, action, and advancement.

References, Further Reading, and How to Cite

To cite this article: 

Thompson, A. D. (2021). Integrated AI: Dataset quality vs quantity via bonum (GPT-4 and beyond). 

Further reading

Villeneuve, D. (2016). Arrival [feature film]. Paramount Pictures.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I. and Amodei, D. (2020). Language Models are Few-Shot Learners.

Religion and Chess. (2007).

De, B. A., Jowett, B., & Knight, M. J. (1999). The Essential Plato. New York: Book-of-the-Month Club.

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., Leahy, C. (2020). The Pile: An 800GB Dataset of Diverse Text for Language Modeling.

IMDb. (2021). IMDb “Top 250” (Sorted by IMDb Rating Descending). 

Kristoffersen, K. B. (2017). Common Crawled web corpora: Constructing corpora from large amounts of web data. 

Kurzweil, R. In Mearian, L. (2011). Brain behind IBM’s Watson not unlike a human’s. Computerworld.

Millman, D. (2014). Everyday Enlightenment: The Twelve Gateways to Personal Growth. New York: Grand Central Publishing.

Niemic, R. (2016-2019). The Positive Psychology Movie Awards. 

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.

Thompson, A. D. (2017). Why cheerleaders don’t criticise.    

Thompson, A. D. (2020). The New Irrelevance of Intelligence. 

Thompson, A. D. (2021a). The New Irrelevance of Intelligence [presentation]. Proceedings of the 2021 World Gifted Conference (virtual). In-press, to be made available in August 2021. 

Thompson, A. D. (2021b). Integrated AI: The rising tide lifting all boats (GPT-3). 

Dr Alan D. Thompson is an AI expert and consultant, advising Fortune 500s and governments on post-2020 large language models. His work on artificial intelligence has been featured at NYU, with Microsoft AI and Google AI teams, at the University of Oxford’s 2021 debate on AI Ethics, and in the Leta AI (GPT-3) experiments viewed more than 4.5 million times. A contributor to the fields of human intelligence and peak performance, he has held positions as chairman for Mensa International, consultant to GE and Warner Bros, and memberships with the IEEE and IET.

This page last updated: 15/May/2022.