introduction

disclaimer

i am primarily a software engineer, and thus am quite an amateur at machine learning. please take my words with a grain of salt, and feel free to contact me with any corrections or suggestions.

this article contains an overview of some of my thoughts, observations, and experiences with gpt2.

why gpt2?

it's now been about two years since the release of gpt2. though it has largely been superseded by gpt3, i think gpt2 is still absolutely worth our time for two reasons:

firstly, gpt3 is being withheld from the public, instead being gated behind a prohibitively expensive api. all this from an organization called "openai." not so open anymore, it would seem. in fact, it recently transitioned from non-profit to for-profit, and sold exclusive gpt3 rights to microsoft. a tragedy for open scientific research, now that one of the most prominent ml research organizations in recent years has been corrupted by the tech industry. recently, a glimmer of hope has arrived in eleutherai, who promise an open, public gpt3 pretrained model.

secondly, gpt3, boasting 175 billion parameters (gpt2-xl had 1.5 billion), required an immense amount of resources (14 million usd) to train. this of course puts training capabilities completely out of reach of all but the money-hoarding corporations that hold a complete monopoly over the resources needed for technology research. since the model is simply immense, the resource cost even for inference is, without a doubt, out of reach of typical consumer hardware. one reddit thread suggests that even loading the model takes 350gb of vram, and another thread does the math and corroborates that estimate. a tweet by an aidungeon dev (it's fun, and is built on gpt3) says that a system on par with the dgx-1 would be required. keep in mind that all of this is just for inference, and even that is out of reach for 99% of programmers.

for these reasons, i believe that gpt3 is largely inaccessible, and thus infeasible for ai-based text generation on a scale accessible to most developers. gpt2, on the other hand, runs inference excellently on midrange consumer hardware, and even performs acceptably (for the smaller variants) on cpu-only machines.

how to gpt2, hassle free

i wanted a hassle-free, simple, portable way to train, tweak, and run gpt2. after some basic internet searching, i chose aitextgen, which provides a robust python interface built on pytorch and huggingface transformers.

aitextgen strongly appeals to me because it perfectly hides all the messy details under a pleasant, intuitive interface, and comes with decent documentation and colab notebooks.
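to give a sense of what that interface looks like, here is a minimal sketch of loading the stock 124m model and sampling from it with aitextgen (the prompt and parameter values are just illustrative):

from aitextgen import aitextgen

# load the stock gpt2-s (124m) model; downloads the weights on first run
ai = aitextgen()

# sample a few continuations from a prompt
ai.generate(
    n=3,                  # number of samples to print
    prompt="Today, we are glad to announce",
    max_length=100,       # maximum tokens per sample
    temperature=0.9,      # higher values give more adventurous output
)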

my experiment

the idea

i wanted to create a gpt2-based generator for text that parodies the hollow, clinical enthusiasm of corporate pr-speak.

my generator would pretend to be Delish!™, a fictional megacorporation primarily in the food industry. for my dataset, i used a text file of hand-written announcements. this dataset was fairly small, coming to an unimpressive 20.5kb and 3190 words.

for reference, here is a sample of some text from the dataset:

Delish!™ GenuineLegume® Produce Protection, in accordance with the Center for Advanced Technology Protection, has received a notice from a rightsholder alleging that on December 3rd at 4:31 PM, you consumed several beans from an unlicensed container. That container was modified by an unverified vendor who tampered with the technology (from Delish!™ GenuineLegume) intended to ensure that all items packed for consumption are verified to be of high quality. By tampering with this protection, you risk the dissemination of unknown biomolecular agents in SyntheLegume®-strain Beans. Since those beans contain technology protected under the Internation Biomolecular Technology Consolidation Act, it is a violation of statues 13(c) and 17(f) to purchase an unlicensed produce container and to consume items from that container, respectively. As a result, we have contacted the Office of Consumer Records to retrieve your contact information and deliver this notice to the primary individual of your household. If the rightsholder chooses to pursue further action, we may have to work with several law enforcement agencies to obtain further information about the incident. Thank you for reading this message. Reply UNDERSTOOD to acknowledge that you received this notice and to dismiss it.

i decided to use gpt2-s (124m) for two reasons: i wanted my model to run on low-end, cpu-only hardware with acceptable latency, and i didn't want to increase the chances of overfitting, given the small size of the dataset.

fine-tuning gpt2 is the usual way to get it to generate text better suited to a given domain. generally, fine-tuning is done on a sizable corpus, roughly in the range of 5mb to 50mb of text. since our dataset is orders of magnitude smaller, some adjustments are needed to fine-tune the model properly.
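for reference, a conventional fine-tuning run with aitextgen looks roughly like this (a sketch: "corpus.txt" is a placeholder name, and the step count and learning rate shown are the defaults i started from, described further down):

from aitextgen import aitextgen

ai = aitextgen()  # start from the pretrained 124m weights

# conventional fine-tuning on a sizable (multi-megabyte) corpus
ai.train(
    "corpus.txt",         # plain-text training file (placeholder name)
    num_steps=5000,       # a typical step count for a large corpus
    learning_rate=1e-4,
    generate_every=1000,  # periodically print sample output
    save_every=1000,      # periodically write a checkpoint
)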

source and data

micro-fine tuning

summary

note that "micro-fine tuning" is not a widely used term; it's simply what i'm going to call this specific type of fine-tuning.

our goal here is to coax the model to pick up on the stylistic properties of our small input dataset, striking a balance between memorizing/repeating the dataset vocabulary and outputting off-topic text.

generally, we want the model to pick up on the word choice and style of the training data, while drawing on the parts of its existing english grammar and vocabulary that are adjacent to our source text, so that it generates convincing output.

observations and tips

here i will summarize my experiences trying to micro-finetune gpt2.

i started fine-tuning with the default presets: a learning rate of 1e-4 and 5000 steps. i noticed that this rapidly (in under 1000 steps) led to my model simply memorizing the training data, doing nothing but regurgitating it verbatim.

i lowered my parameters to compensate: i set the learning rate to 1e-6 and the step count to 100. predictably, this led to the model not really learning from the dataset at all, instead preferring to go off topic and spew out generic internet filler.

after some more similar experimentation, i found that a learning rate of 1e-5 and 400 steps resulted in a model that incorporated just the right amount of corporate inflection and generated text that was on-topic with respect to the training data. great!

in summary: if the model is memorizing, you are training too much; if the model is rambling about unrelated things, you are not training enough.
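concretely, the micro-finetuning run that worked for me corresponds to something like the following (again a sketch: "delish.txt" is a placeholder for my dataset file, and "trained_model" is aitextgen's default output folder):

from aitextgen import aitextgen

ai = aitextgen()  # pretrained gpt2-s (124m)

# micro-finetuning: a gentler learning rate and far fewer steps than a
# conventional run, so the model absorbs the style without memorizing
# the tiny dataset verbatim
ai.train(
    "delish.txt",
    num_steps=400,
    learning_rate=1e-5,
    generate_every=100,   # watch for memorization or rambling as it trains
)

# reload the fine-tuned weights and sample from them
ai = aitextgen(model_folder="trained_model")
ai.generate(n=5, prompt="Today, we are glad to announce", max_length=120)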

output samples

here is a brief collection of cherry-picked outputs from the model.

reminiscing on our time spent at kool sandwichez

In this episode we look back at the year and remember when we first met you at our Delish!™ office. We were all teenagers just hanging out in one of our favorite restaurants: the Delish!™ Kool Sandwichez!

ooh, a syrupy plastic freeze-dried meal!

Thank you again for your interest in our product! We're excited to offer you a Delish!™ Syrupy® Plastic Freeze-Dried Meals for you to enjoy!

awesome! i can't wait to try it!

Today, we are glad to announce the Delish!™ Offbeat® GenuineLegume® Plastics Packard is now sold Outback®!

funny that they seem to be glad about it

Today, we are glad to announce the Delish!™ Family member, Melissa Alexander, has passed away. Her final resting place was in our family's trailer when she died. We here at Delish!™ Ally™ Alled™ Family are deeply saddened by the passing of one of our beloved members of the Delish!™ TrueLove® family. We at Delish!™ AllEye® Family and friends in need of a home and are waiting for the right moment to truly fall in love.

hyperlicious canned beans and rice!

Today, we are glad to announce the Delish!™ Hyperlicious® Canned Beans and Rice Favorites, which have received notice from the US Department of Agriculture that they have been contaminated with a strain of Conception Beans.

oh no, counterfeit items

Today, we are glad to announce the Delish!™ HyperEnergy® Sandwiches delivery company has received a notice from Delish!™ Offshore® that a third party had accessed their database and discovered that they were selling counterfeit items. We have taken immediate measures to remedy this situation and will be taking action as well as taking appropriate legal action against these individuals.

it's important to stay safe while consuming delish products

Warning: Delish!™ Cheesey® products should not be consumed by pets, nor are they covered by any warranty. Please consult with a physician for any medical concerns or questions related to this product. GenuineLegume® Organic Beans and Seeds are protected under the Fair Packaging and Labeling Act (C. 447) and are not sold as food. Please consult with a physician for any medical concerns or questions related to this product.

you must reference international cheese!

Warning: Delish!™ Cheesey® products should not be sold without a label with a reference to Delish!™ International Cheese® signature. Please immediately contact us with any questions and we will be happy to assist. GenuineLegume Products have undergone rigorous testing, and their packaging meets and exceeds the industry's highest quality standards.

here, a typical eula

Notice: By using this product, you acknowledge that Delish!™ RealFresh® and its associates can not be held accountable for any misuse, alteration, or mishandling of this product. You also agree that you will not use any such product in any commercial setting and that this notice is not to be construed as an admission of guilt. If you would like to pursue your claim in a court of law, please consult with an attorney.

uh oh, you violated trademark. enforcers are coming.

Delish!™ StrongFist® Hybrid Enforcers are being sent to arrest you for violating their Trademark. Please immediately cease all dealings with us and immediately cease all sales of those items. Reply UNDERSTOOD to verify if your order is indeed from this particular outlet and submit a valid registration request with the appropriate department.

unlicensed biomolecular agents? oh dear

Delish!™️ StrongFist®️ Hybrid Enforcers are being sent to arrest you for distributing unlicensed biomolecular agents, violating the Fair Use and Labeling Act (C. 447) and the Labeling and Labeling Information Act (C. 447.13). These unlicensed agents include: Methanol®️ Organic Cotton Enzymes, which are intended to be consumed as snacks, beverages, and household items, but have been unlawfully diverted from their retail facilities due to high levels of contamination in their formulation. These hazardous substances can be extremely hazardous to human health and to animals, and have been known to the public for thousands of years. Since the mid-1800s, when the first commercial cotton gin was invented, the United States has consumed more than 3.8 billion beans and 3.6 billion tonnes of cereal. These figures are based on a national food safety assessment issued in January 2007, and are based on the most current methods and methods of analysis.

conclusion

both gpt3 and gpt2 without a doubt have many limitations and weaknesses, but they are still incredibly versatile, powerful, and fun models. i am sure there remains a world of possibility within their constraints, at least until something better comes along in this domain, which is likely but not inevitable. i strongly believe in the importance of open, public research, so i hope that either openai goes back to being open (unlikely, considering the money rolling in), or that we soon see another player in the space that makes openness a priority, for the good of humanity.