Integrated AI: Dataset quality vs quantity via bonum (GPT-4 and beyond)


Alan D. Thompson
June 2021

I acknowledge the complex work undertaken at OpenAI, EleutherAI, Synthesia, and other technologies referenced in this article and its related resources. Importantly, given the speed of AI development, this article should be considered superseded within 24 months of its initial release in June 2021. Revisions are to be expected. Correspondence concerning this article should be addressed to Dr Alan D. Thompson, Life Architect, Australia, 2021.

Suits. Honour, flowers… Colonel, those are all tile sets in Mahjong. God, are they using a game to converse with their heptapods?

Maybe. Why?

Well, let’s say that I taught them chess instead of English. Every conversation would be a game, every idea expressed through opposition, victory, defeat. You see the problem? If all I ever gave you was a hammer…

Everything’s a nail.

— (Arrival, 2016)


Innovations like Artificial Intelligence (AI) and neural lace are already here, even if they are not yet in most of society’s field of vision (Thompson, 2020). The world of integrated AI supplementing and replacing our intelligence is moments away (Thompson 2021a; 2021b).

Data scientists training language models used as a basis for AI are currently weighting data on an ad hoc basis, with manual input. The weighting of datasets used to train current language models like GPT-3 (Brown et al, 2020) and GPT-Neo/The Pile (Gao et al, 2020) needs priority focus. We must ensure that the data provided is weighted more appropriately. 

Thousands of years ago, Plato (1999) warned that: ‘In the world of knowledge, the idea of good appears last of all’. Training an AI on the largest possible corpora (datasets) and ‘what we’ve got’ isn’t good enough.

Just because the World Wide Web offers vast datasets does not mean that an AI would benefit from holding all that data with equal weighting. Indeed, someone holding all that knowledge in their mind would probably be considered completely deranged.

Summum bonum is a Latin expression and concept meaning ‘ultimate good’ (or ‘highest good’). The term was introduced by the Roman philosopher Cicero, and the underlying concept is found in the work of earlier philosophers including Plato and Aristotle. While there are various interpretations of the term, summum bonum suggests a guiding ethical principle leading to the best possible life. In this article, bonum (‘good’) is used as a shorthand to refer to this concept of ultimate goodness.

The old way
Historically, during training of language models, datasets are not sampled in proportion to their size. Rather, datasets that are viewed by the researchers as higher quality are sampled more frequently (Brown et al, 2020). Note that this higher quality is a subjective assessment, generally performed by data scientists rather than ethicists or those concerned with human ideals such as ultimate goodness.
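As a rough illustration of the sampling approach described above, the following Python sketch compares size-proportional sampling with a researcher-assigned mix. The weights are the figures listed in Table 1 for Common Crawl, WebText, Books1, Books2, and Wikipedia (61%, 19%, 8%, 8%, 4%). The relative dataset sizes and all function names are hypothetical, invented only to show the mechanics, and are not taken from Brown et al (2020).

```python
import random
from collections import Counter

# Weightings as listed in Table 1 (percent of the training mix).
QUALITY_WEIGHTS = {
    "Common Crawl": 61,
    "WebText": 19,
    "Books1": 8,
    "Books2": 8,
    "Wikipedia": 4,
}

# HYPOTHETICAL relative sizes, invented for illustration only.
HYPOTHETICAL_SIZES = {
    "Common Crawl": 900,
    "WebText": 40,
    "Books1": 25,
    "Books2": 30,
    "Wikipedia": 5,
}


def normalise(weights):
    """Scale a dict of non-negative weights so the values sum to 1."""
    total = sum(weights.values())
    return {name: value / total for name, value in weights.items()}


def sample_sources(shares, n, seed=0):
    """Draw n source names at random, with probability given by shares."""
    rng = random.Random(seed)
    names = list(shares)
    probabilities = [shares[name] for name in names]
    return Counter(rng.choices(names, weights=probabilities, k=n))


def upweight_factor(size_share, mix_share):
    """Ratio of a source's share of the mix to its share by size alone."""
    return {name: mix_share[name] / size_share[name] for name in mix_share}


if __name__ == "__main__":
    size_share = normalise(HYPOTHETICAL_SIZES)
    mix_share = normalise(QUALITY_WEIGHTS)
    draws = sample_sources(mix_share, n=10_000)
    for name, factor in upweight_factor(size_share, mix_share).items():
        print(f"{name:13s} by size {size_share[name]:.3f}  "
              f"in mix {mix_share[name]:.2f}  factor {factor:.1f}  "
              f"draws {draws[name]}")
```

A factor above 1 means the source is sampled more often than its size alone would imply, which is the effect of the researchers' subjective quality assessment.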

Table 1: Historical source material used in major language models

| Dataset | High quality with low controversy? | Weighting (GPT-3) | Weighting (GPT-Neo/The Pile) |
|---|---|---|---|
| Common Crawl (www) | | 61% | 18% |
| WebText (Reddit links) | | 19% | 10% |
| Books1/BookCorpus (Smashwords) | | 8% | 1% |
| Books2 (Libgen or similar) | | 8% | |
| PubMed Central (papers) | | | 14% |
| Books3 (Bibliotik tracker) | | | 12% |
| ArXiv (papers) | | | 9% |
| Github (code) | | | 8% |
| FreeLaw (papers) | | | 6% |
| Stack Exchange (discussion) | | | 5% |
| USPTO Background (papers) | | | 4% |
| PubMed Abstracts (papers) | | | 3% |
| Gutenberg (books) | | | 2% |
| OpenSubtitles (movies) | | | 2% |
| Wikipedia (facts) | | 4% | 2% |
| DM Mathematics (papers) | | | 1% |
| Ubuntu IRC (discussion) | | | <1% |
| EuroParl (formal discussion) | | | <1% |
| HackerNews (discussion) | | | <1% |
| YoutubeSubtitles (movies) | | | <1% |
| PhilPapers (papers) | | | <1% |
| NIH ExPorter (papers) | | | <1% |
| Enron Emails (discussion) | | | <1% |


Who are the curators?

‘Think of how stupid the average person is, and realise half of them are stupider than that.’ American comedian George Carlin’s famous quote is focused on smarts, though it may just as well be focused on accomplishment, performance, success, happiness, wellness, or ultimate goodness. If we input and process the average, we output the average. If we were to shift Carlin’s context to wellness and universal aims like bonum, especially when mapped to the ideal of humanity’s aims for itself via AI, large datasets focused on quantity with ad hoc weightings become concerning.

Even with a focus on ‘popular’ curated content, whether from specific submissions like WebText (Reddit links) or from general cultural indicators like OpenSubtitles (movies), any objective of gathering generally popular content—with humanity’s extensive history of stupidity—should not be the aim.

A new way

Let me propose a view of data that shifts away from general popularity, and instead replaces it with bonum via proven quality, lower controversy (as much as reasonably possible), and consistent messaging. This is not a proposal for Artificial Specific Intelligence focused just on personal development, but a proposal for all ongoing language model development, and for Artificial General Intelligence.

Proven quality. Without straying into debates on democracy or philosophy, quality content must, by definition, be identified as such by someone. When training a language model, the model must somehow be ‘told’ whether Adolf Hitler’s socio-political oratory is bonum, or whether Anthony Robbins’ view of personal development is bonum. As we’ve discussed, for this to be achieved, data scientists are currently assigning ad hoc and manual weightings to datasets (Thompson, 2021b).

This proposal asserts that the dataset weightings aren’t the main problem. It is instead the individual token weightings, where the ‘individual weighter’s weighting’ should also be somehow taken into account. Given the complexity of this task, this could be solved more quickly by assigning much higher weightings to entire bonum datasets.

Lower controversy. As of 2021, major papers on modern language models include sections analysing the current view of social justice issues covering race, gender, and other factors. By excluding controversial sources like 4chan, YouTube comments, and other social media traps, the data researchers have already moved to reduce controversy and increase bonum to an extent (perhaps accidentally, and without using that term) by making a broad elimination of undesirable content. This can and must be furthered.

Consistent messaging. There are many differing views on core personal development topics such as: worthiness, will, body, money, mind, intuition, emotions, fears, self-knowledge, sexuality, love, and service (Millman, 2014). However, enforcing higher visibility and weighting of bonum sources will ensure more consistency in output aligned with humanity’s ultimate good.

Key futurists and researchers like Ray Kurzweil (2011) have offered differing estimates of the human brain’s capacity, both applied and potential. These estimates range from just a few gigabytes to one terabyte or more.

Before AI, a human being at peak condition can perhaps store and recollect:

  • Up to seven ‘things’ in short-term memory at one time.
  • 50,000 words in their native language.
  • >100 books as >1M tokens.

Selecting data for quality may necessarily prioritise fewer tokens for training, and this is a positive result. Of course, data breadth is useful, and we will easily model on trillions of tokens, though we only need a heavier emphasis on perhaps a few million tokens. The point here is that, while AI will easily scale to trillions of ‘things,’ this obsession with quantity is not a net positive. Instead, there must be a focus, an apex, a summum bonum in the evolutionary spiral at which we are aiming. With this in mind, perhaps we need to look at and prioritise bonum content created by the few individuals that have aimed exclusively for bonum, especially in the last 100 years of evolution.

Table 2: Proposed bonum source material for new language models

| Proposed dataset | Bonum | Book count / Tokens (estimated) | Proposed weighting |
|---|---|---|---|
| 14th Dalai Lama: books + audio | ★★★★★ | 127 / 1M | Very high |
| Conversations For Transformation: Essays By Laurence Platt Inspired By The Ideas Of Werner Erhard, And More | ★★★★★ | 1,500 essays / | Very high |
| Dan Millman: books + audio | ★★★★★ | 18 / 0.5M | Very high |
| Thomas J. Leonard: books + audio | ★★★★★ | 7 / 0.5M | Very high |
| Wayne Dyer: books + audio | ★★★★★ | 43 / 1M | Very high |
| Erin Pavlina: books + audio | ★★★★★ | 1,000 essays / | Very high |
| Ralph Waldo Emerson: books + audio | ★★★★★ | 11 / 0.5M | Very high |

This is not an exhaustive list, and is provided here by way of example only. As human beings, each dataset source (the actual human) will have limitations and weaknesses. Further, as the reader is also a human being, there will be a tendency to criticise (Thompson, 2017), and the proposed bonum source materials above would easily be open to criticism. There are also intellectual property and copyright considerations for some of the datasets, but it is expected that these would be easily cleared by the respective authors for the purpose of evolution. The reader is encouraged to evaluate the table above with an open mind, and to design their own table as an exercise for interest.

The underpinning theory here is that when a language model is trained on bonum material that has been curated by bonum sources, the outcome will offer a strong tendency toward high-quality discourse, with lower controversy, and consistent messaging. The result will be a language model that is still aware of a broad range of data, but places a necessary and effective emphasis on content that will benefit humanity. 

This paper proposes suggested weightings be explored in the range of:

  • 20-50%: Table 1: Historical source material used in major language models.
  • 50-80%: Table 2: Proposed bonum source material for new language models.
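To make the proposed ranges concrete, here is a minimal Python sketch of how the two groups of sources could be blended into one training mix. It rests on assumptions that are not part of the proposal: the historical group uses five of the Table 1 weightings, the bonum sources from Table 2 are split equally (all are rated ‘Very high’), and the function name `blend_mixture` and its parameters are hypothetical.

```python
# Five of the Table 1 weightings (percent), used as the historical group.
HISTORICAL = {
    "Common Crawl": 61,
    "WebText": 19,
    "Books1": 8,
    "Books2": 8,
    "Wikipedia": 4,
}

# Table 2 sources. Equal relative weights are an ASSUMPTION made here,
# since the table rates every source as 'Very high'.
BONUM = {
    "14th Dalai Lama": 1,
    "Conversations For Transformation": 1,
    "Dan Millman": 1,
    "Thomas J. Leonard": 1,
    "Wayne Dyer": 1,
    "Erin Pavlina": 1,
    "Ralph Waldo Emerson": 1,
}


def blend_mixture(historical, bonum, bonum_share):
    """Combine two weighted groups into one mix whose values sum to 1.

    bonum_share is the fraction of the mix given to the bonum group;
    the remainder goes to the historical group.
    """
    if not 0.5 <= bonum_share <= 0.8:
        raise ValueError("bonum_share is outside the proposed 50-80% range")
    historical_total = sum(historical.values())
    bonum_total = sum(bonum.values())
    mixture = {}
    for name, weight in historical.items():
        mixture[name] = (1 - bonum_share) * weight / historical_total
    for name, weight in bonum.items():
        mixture[name] = bonum_share * weight / bonum_total
    return mixture


if __name__ == "__main__":
    for name, share in blend_mixture(HISTORICAL, BONUM, bonum_share=0.6).items():
        print(f"{name:35s} {share:.3f}")
```

Within each group the relative weights are preserved. Only the split between the two groups changes as `bonum_share` moves across the proposed range.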

Just how bad is it?

For illustration, consider a range of popular content, from fantasy books to pop music lyrics. Let’s explore a selection of movies only. In the table below, the left column shows five movies based on popular ratings by general consensus via IMDb, and the right column shows five bonum movies curated by a bonum source via Dr Ryan Niemic’s annual positive psychology movie awards.

Table 3: Movies by popularity vs positive psychology rating

| Popular movies by general consensus (IMDb, 2021) | Bonum movies by bonum source (Niemic, 2016-2019) |
|---|---|
| The Godfather. Themes: family, crime, deceit, revenge | Won’t You Be My Neighbor? (Fred Rogers). Themes: happiness, kindness, empathy |
| Pulp Fiction. Themes: violence, redemption | Themes: family, human connection |
| Themes: violence, competition | Themes: mindfulness, connection |
| Themes: greed, class discrimination | The Martian. Themes: hope, optimism, strengths |
| The Dark Knight Rises. Themes: crime, chaos, destruction | Inside Out. Themes: positive emotions, growth |

Note that I am definitely not arguing that The Godfather is anything but a great movie and a cinema classic. But, in line with the opening quote for this article, teaching a language model and subsequent AI about success through competition and revenge would be counterproductive to humanity’s aims.

Returning to the concerns of Professor Louise Banks during that pivotal scene in the movie Arrival: the game of chess has been banned by many groups at one time or another (Religion and Chess, 2007). While there may be some benefits to competition, a colourful and unbounded universe awaits far beyond concepts of winning, losing, and black and white squares.

The AI and super intelligence being prepared right now to foster humanity through the future absolutely must have our highest good underpinning every response, decision, action, and advancement.


Dr Alan D. Thompson is a world expert in the fields of child prodigies, high performance, and personal development. He has held memberships with the IEEE and IET, and is the former chairman for Mensa International’s gifted families committee.

References, Further Reading, and How to Cite

The Leta conversation videos can be viewed in chronological order at: 


Further reading

Arrival. Villeneuve, D. (2016). Arrival [feature film]. Paramount Pictures.


Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I. and Amodei, D. (2020). Language Models are Few-Shot Learners.


(2007). Religion and Chess.


De, B. A., Jowett, B., & Knight, M. J. (1999). The Essential Plato. New York: Book-of-the-Month Club.


Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., Leahy, C. (2020). The Pile: An 800GB Dataset of Diverse Text for Language Modeling.


IMDb. (2021). IMDb “Top 250” (Sorted by IMDb Rating Descending). 


Kristoffersen, K. B. (2017). Common Crawled web corpora: Constructing corpora from large amounts of web data. 


Kurzweil, R. In Mearian, L. (2011). Brain behind IBM’s Watson not unlike a human’s. Computerworld.


Millman, D. (2014). Everyday Enlightenment: The Twelve Gateways to Personal Growth. New York: Grand Central Publishing.


Niemic, R. (2016-2019). The Positive Psychology Movie Awards. 


Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.


Thompson, A. D. (2017). Why cheerleaders don’t criticise.    


Thompson, A. D. (2020). The New Irrelevance of Intelligence. 


Thompson, A. D. (2021a). The New Irrelevance of Intelligence [presentation]. Proceedings of the 2021 World Gifted Conference (virtual). In-press, to be made available in August 2021. 


Thompson, A. D. (2021b). Integrated AI: The rising tide lifting all boats (GPT-3). 

To cite this page: Thompson, A. D. (2021). Integrated AI: Dataset quality vs quantity via bonum (GPT-4 and beyond). Retrieved from: