Week of April 10, 2023
OpenAssistant Conversations - Democratizing Large Language Model Alignment • Aligning large language models (LLMs) with human preferences has proven to drastically improve usability and has driven rapid adoption as demonstrated by ChatGPT. Alignment techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) greatly reduce the required skill and domain knowledge to effectively harness the capabilities of LLMs, increasing their accessibility and utility across various domains. However, state-of-the-art alignment techniques like RLHF rely on high-quality human feedback data, which is expensive to create and often remains proprietary. In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations, a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in 35 different languages, annotated with 461,292 quality ratings. The corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers. To demonstrate the OpenAssistant Conversations dataset’s effectiveness, we present OpenAssistant, the first fully open-source large-scale instruction-tuned model to be trained on human data. A preference study revealed that OpenAssistant replies are comparably preferred to GPT-3.5-turbo (ChatGPT) with a relative winrate of 48.3% vs. 51.7% respectively. We release our code and data under fully permissive licenses. • (Andreas Köpf, Yannic Kilcher, et al.) / April 15
Prompt Engineering vs. Blind Prompting • A lot of people who claim to be doing prompt engineering today are actually just blind prompting, which is creating prompts with a crude trial-and-error approach paired with minimal or no testing and a very surface-level knowledge of prompting. Blind prompting is not prompt engineering, because prompt engineering requires a systematic approach to identifying a problem, forming solutions, validating those solutions, and applying continuous improvement to refine those solutions. • (Mitchell Hashimoto) / April 14
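To make the distinction concrete, here is a minimal sketch of the kind of testing loop the post argues for: score candidate prompts against a small labeled demonstration set instead of eyeballing one-off completions. The demonstration set, the candidate prompts, and the `llm_complete` helper are hypothetical stand-ins, not code from the article.

```python
from typing import Callable

def llm_complete(prompt: str) -> str:
    """Hypothetical placeholder for a real model call; returns a canned
    answer so the sketch runs end to end."""
    return "reminder"

# Demonstration set: (input, expected output) pairs that pin down the task.
demos = [
    ("Dinner with Alice next Tuesday at 7", "date"),
    ("Don't forget to pay rent", "reminder"),
]

candidate_prompts = [
    "Classify the text as 'date' or 'reminder': {text}",
    ("You extract calendar intents. Reply with exactly one word, "
     "'date' or 'reminder'.\nText: {text}\nAnswer:"),
]

def accuracy(template: str, complete: Callable[[str], str]) -> float:
    """Fraction of demonstrations the prompt template gets right."""
    hits = sum(
        complete(template.format(text=text)).strip().lower() == expected
        for text, expected in demos
    )
    return hits / len(demos)

for template in candidate_prompts:
    print(f"{accuracy(template, llm_complete):.2f}  {template[:45]!r}")
```

Swapping in a real completion function and a larger demonstration set is what turns blind prompting into something measurable and repeatable.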
91% of ML Models Degrade in Time • A recent study from MIT, Harvard, the University of Monterrey, and other top institutions presented an experiment in which 91% of the ML models tested degraded over time. This study is one of the first of its kind, where researchers focus on studying machine learning models’ behavior after deployment and how their performance evolves with unseen data. By definition, an ML model depends on the data it was trained on, meaning that if the distribution of the production data starts to change, the model may no longer perform as well as before. And as time passes, the model’s performance may degrade more and more. The authors like to refer to this phenomenon as “AI aging.” At NannyML, we call it model performance deterioration, and depending on how significant the drop in performance is, we consider it an ML model failure. The authors developed a testing framework for identifying temporal model degradation to get a better understanding of this phenomenon. Then, they applied the framework to 32 datasets from four industries, using four standard ML models to investigate how temporal model degradation can develop under minimal drifts in the data. • (NannyML, Santiago Víquez) / April 14
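The following is a small illustration of the general idea, not the authors' framework: freeze a model trained on older data, re-evaluate it on later time slices whose distribution drifts, and flag when the error climbs past a threshold. The synthetic data and the 1.5x threshold are arbitrary choices made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_slice(year: int, n: int = 500):
    """Synthetic data whose input distribution shifts a little each year."""
    x = rng.normal(loc=0.3 * year, scale=1.0, size=(n, 1))
    y = 2.0 * x[:, 0] + 0.5 * x[:, 0] ** 2 + rng.normal(0, 0.3, size=n)
    return x, y

# Train a simple linear model on year 0 only, then never retrain it.
x0, y0 = make_slice(0)
X0 = np.hstack([x0, np.ones_like(x0)])
w, *_ = np.linalg.lstsq(X0, y0, rcond=None)

baseline = None
for year in range(5):
    x, y = make_slice(year)
    pred = np.hstack([x, np.ones_like(x)]) @ w
    mse = float(np.mean((pred - y) ** 2))
    baseline = baseline if baseline is not None else mse  # year-0 reference
    flag = "  <- degraded" if mse > 1.5 * baseline else ""
    print(f"year {year}: MSE {mse:.2f}{flag}")
```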
GPT-4 invalidates the Turing test • LLMs do not think—they see and connect patterns. LLMs are not reasoning machines. They are intuition machines. This is why no one should fear the LLM revolution. It is as likely as the screwdriver revolution, or perhaps the Roomba revolution. LLMs cannot think logically. While it is hard to take over the world, it is impossible to take over the world without thinking. Paired with a human who can think for it, an LLM can be frighteningly effective in the workplace. Acting for itself, it is like a stoner hired to do food prep. • (Gray Mirror, Curtis Yarvin) / April 13
Free Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM • Dolly 2.0 is the first open-source, instruction-following LLM fine-tuned on a human-generated instruction dataset licensed for research and commercial use. Dolly 2.0 is a 12B-parameter language model based on the EleutherAI pythia model family and fine-tuned exclusively on a new, high-quality, human-generated instruction-following dataset crowdsourced among Databricks employees. The entirety of Dolly 2.0 is open source, including the training code, the dataset, and the model weights, all suitable for commercial use. This means that any organization can create, own, and customize powerful LLMs that can talk to people, without paying for API access or sharing data with third parties. • (Databricks, Mike Conover, et al.) / April 12
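For anyone who wants to try it, a minimal usage sketch with the Hugging Face transformers pipeline is below. This assumes the databricks/dolly-v2-12b checkpoint on the Hugging Face Hub (smaller dolly-v2 variants are also published) and enough GPU memory to hold a 12B-parameter model in bfloat16; consult the model card for the authoritative instructions.

```python
import torch
from transformers import pipeline

# dolly-v2 ships a custom instruction-following pipeline,
# hence trust_remote_code=True.
generate = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

print(generate("Explain the difference between nuclear fission and fusion."))
```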
Sony backs maker of tiny Raspberry Pi computers with fresh funding, access to A.I. chips • Raspberry Pi has received fresh investment from Sony’s semiconductor unit, in a deal that will let users and developers build visual sensing applications using its AI chips. The firm raised the cash at the same $500 million valuation it was worth in a 2021 funding round, CEO and co-founder Eben Upton told CNBC. It comes at a time of elevated hype around artificial intelligence, boosted by the buzz surrounding ChatGPT. • (CNBC, Ryan Browne) / April 12
Scaffolded LLMs as natural language computers • Recently, LLM-based agents have been all the rage – with projects like AutoGPT showing how easy it is to wrap an LLM in a simple agentic loop and prompt it to achieve real-world tasks. More generally, we can think about the class of ‘scaffolded’ LLM systems – which wrap a programmatic scaffold around an LLM core and chain together a number of individual LLM calls to achieve some larger and more complex task than can be accomplished in a single prompt. The idea of scaffolded LLMs is not new; however, with GPT-4, we have potentially reached a threshold of reliability and instruction-following capacity in the base LLM at which agents and similar approaches become viable at scale. What is missing, and urgent, however, is an understanding of the larger picture. Scaffolded LLMs are not just cool toys but actually the substrate of a new type of general-purpose natural language computer. What we have essentially done here is reinvented the von Neumann architecture and, what is more, we have reinvented the general-purpose computer. This convergent evolution is not surprising – the von Neumann architecture is a very natural abstraction for designing computers. However, if what we have built is a computer, it is a very special sort of computer. Like a digital computer, it is fully general, but what it operates on is not bits, but text. We have a natural language computer which operates on units of natural language text to produce other, more processed, natural language texts. Like a digital computer, our natural language (NL) computer is theoretically fully general – the operations of a Turing machine can be written as natural language – and extremely useful: many systems in the real world, including humans, prefer to operate in natural language. Many tasks cannot be specified easily and precisely in computer code but can be described in a sentence or two of natural language. • (Beren Millidge) / April 11
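A toy sketch of the abstraction (my illustration, not code from the post): the Python loop is the scaffold and program counter, a list of intermediate strings is the memory, and the LLM call is the processing unit that maps text to text. `call_llm` is a stub so the example runs offline.

```python
def call_llm(prompt: str) -> str:
    """Placeholder model call; returns canned text so the sketch runs."""
    return "SUMMARY: placeholder summary of the chunk."

def summarize_document(chunks: list[str]) -> str:
    memory = []                      # the scaffold's "RAM": intermediate text
    for chunk in chunks:             # the scaffold's control flow
        memory.append(call_llm(f"Summarize this passage:\n{chunk}"))
    # A second, chained call operates on the outputs of the first round.
    return call_llm("Combine these partial summaries into one paragraph:\n"
                    + "\n".join(memory))

doc = ["First section of a long text...", "Second section of a long text..."]
print(summarize_document(doc))
```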
Teaching Large Language Models to Self-Debug • Self-Debugging teaches a large language model to debug its predicted program via few-shot demonstrations. In particular, Self-Debugging can teach the large language model to perform rubber duck debugging; that is, without any feedback on the code correctness or error messages, the model is able to identify its mistakes by explaining the generated code in natural language. Self-Debugging achieves state-of-the-art performance on several code generation benchmarks, including the Spider dataset for text-to-SQL generation, TransCoder for C++-to-Python translation, and MBPP for text-to-Python generation. On the Spider benchmark where there are no unit tests to verify the correctness of predictions, Self-Debugging with code explanation consistently improves the baseline by 2-3%, and improves the prediction accuracy on problems of the hardest label by 9%. On TransCoder and MBPP where unit tests are available, Self-Debugging improves the baseline accuracy by up to 12%. Meanwhile, by leveraging feedback messages and reusing failed predictions, Self-Debugging notably improves sample efficiency, and can match or outperform baseline models that generate more than 10x as many candidate programs. • (Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou) / April 11
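A hedged sketch of the loop the paper describes: generate a program, collect feedback (unit-test results when they exist; otherwise the model explains its own code, rubber-duck style), and feed that feedback into the next attempt. `call_llm` is a hypothetical stub, and the task and test are made up.

```python
def call_llm(prompt: str) -> str:
    """Placeholder; a real system would call a code-capable LLM here."""
    return "def add(a, b):\n    return a + b"

def run_tests(code: str) -> str | None:
    """Return None on success, or an error message to feed back to the model."""
    env: dict = {}
    try:
        exec(code, env)
        assert env["add"](2, 3) == 5
        return None
    except Exception as exc:  # report any failure as feedback
        return f"{type(exc).__name__}: {exc}"

task = "Write a Python function add(a, b) that returns the sum of a and b."
prompt = task
for attempt in range(3):
    code = call_llm(prompt)
    feedback = run_tests(code)
    if feedback is None:
        print(f"attempt {attempt}: tests passed\n{code}")
        break
    # Rubber-duck style: ask the model to explain its code and fix the error.
    prompt = (f"{task}\n\nYour previous code:\n{code}\n\n"
              f"It failed with: {feedback}\n"
              "Explain the code line by line, then return a corrected version.")
```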
Building LLM applications for production • Two things: (1) It’s easy to make something cool with LLMs, but very hard to make something production-ready with them; and (2) LLM limitations are exacerbated by a lack of engineering rigor in prompt engineering, partially due to the ambiguous nature of natural languages, and partially due to the nascent nature of the field. This post consists of three parts. Part 1 discusses the key challenges of productionizing LLM applications and the solutions that I’ve seen. Part 2 discusses how to compose multiple tasks with control flows (e.g. if statement, for loop) and incorporate tools (e.g. SQL executor, bash, web browsers, third-party APIs) for more complex and powerful applications. Part 3 covers some of the promising use cases that I’ve seen companies building on top of LLMs and how to construct them from smaller tasks. • (Chip Huyen) / April 11
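As a concrete miniature of part 2 (my own illustration, not code from the post): an if statement routes a question either through a SQL-executor tool or a plain model call, with `call_llm` stubbed out so the example runs offline against an in-memory SQLite database.

```python
import sqlite3

def call_llm(prompt: str) -> str:
    """Placeholder model call with canned routing and query generation."""
    if "Classify" in prompt:
        return "sql"                       # pretend the router chose the tool
    return "SELECT count(*) FROM users;"   # pretend it wrote a query

def run_sql(query: str) -> str:
    """Tool: execute a query against a toy in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER)")
    db.executemany("INSERT INTO users VALUES (?)", [(1,), (2,), (3,)])
    return str(db.execute(query).fetchall())

question = "How many users signed up?"
route = call_llm(f"Classify this question as 'sql' or 'chat': {question}")
if route == "sql":                          # control flow around the model
    query = call_llm(f"Write a SQLite query for: {question}")
    answer = run_sql(query)                 # tool use: SQL executor
else:
    answer = call_llm(question)
print(answer)
```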
What if someone mixed The Sims with ChatGPT bots? It would look like this • Chatbots like Google’s LaMDA or OpenAI’s ChatGPT are neither sentient nor all that intelligent. Nonetheless, boffins believe they can use these large language models to simulate human behavior, inspired by one of the world’s most popular early computer games and some AI code. The latest effort along these lines comes from six computer scientists – five from Stanford University and one from Google Research – Joon Sung Park, Joseph O’Brien, Carrie Cai, Meredith Ringel Morris, Percy Liang, and Michael Bernstein. The project looks a lot like an homage to the classic Maxis game The Sims, which debuted in 2000 and lives on at EA in various sequels. As described in their recent preprint paper, “Generative Agents: Interactive Simulacra of Human Behavior,” the researchers developed a software architecture that “stores, synthesizes, and applies relevant memories to generate believable behavior using a large language model.” They bolted memory, reflection (inference from memories), and planning code onto ChatGPT to create generative agents – simulated personalities that interact and pursue their own goals using text communication in an approximation of natural language. • (The Register, Thomas Claburn) / April 11
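A rough sketch of those three additions (a memory stream of observations, retrieval over it, and periodic reflection that writes higher-level inferences back into memory), not the authors' architecture: the keyword-overlap retrieval and the `call_llm` stub are simplifications, whereas the paper scores memories by recency, importance, and relevance.

```python
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    """Placeholder chat-model call used for reflection."""
    return "Reflection: I seem to spend mornings at the cafe."

@dataclass
class Agent:
    name: str
    memories: list[str] = field(default_factory=list)

    def observe(self, event: str) -> None:
        self.memories.append(event)          # append to the memory stream

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Naive relevance: rank memories by keyword overlap with the query.
        words = set(query.lower().split())
        ranked = sorted(self.memories,
                        key=lambda m: len(words & set(m.lower().split())),
                        reverse=True)
        return ranked[:k]

    def reflect(self) -> None:
        # Summarize recent memories into a higher-level memory.
        recent = "\n".join(self.memories[-5:])
        self.memories.append(call_llm(f"What do these observations imply?\n{recent}"))

alice = Agent("Alice")
alice.observe("Alice bought coffee at the cafe at 9am")
alice.observe("Alice talked to Bob about the election")
alice.reflect()
print(alice.retrieve("Where does Alice get coffee?"))
```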
GPT-4 Outperforms Elite Crowdworkers, Saving Researchers $500,000 and 20,000 hours • A team of researchers from Carnegie Mellon, Yale, and UC Berkeley investigating Machiavellian tendencies in chatbots made a surprising side discovery: OpenAI’s GPT-4 outperformed the most skilled crowdworkers they had hired to label their dataset, which saved the researchers over $500,000 and 20,000 hours of human labor. Faced with the challenge of annotating 572,322 text scenarios, they sought a cost-effective method to accomplish this task. Employing Surge AI’s top-tier human annotators at a rate of $25 per hour would have cost $500,000 for 20,000 hours of work. Surge AI is a venture-backed startup that performs the human labeling for numerous AI companies including OpenAI, Meta, and Anthropic. The team tested GPT-4’s ability to automate labeling with custom prompting, and reported a definitive result: “Model labels are competitive with human labels.” In a comparison of 2,000 labeled data points by three experts and three crowdworkers against the labels generated by GPT-4, the AI-created labels exhibited stronger correlation with expert labels than the average crowdworker label. GPT-4 outperformed human annotators in all but two labeling categories, sometimes besting them by a factor of two. • (Artisana, Michael Zhang) / April 11
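A hedged sketch of the kind of comparison involved (the scenarios, ratings, and prompt are invented for illustration, and `call_llm` stands in for a GPT-4 API call): label a few scenarios with a fixed prompt, then check how the model's ratings correlate with expert ratings.

```python
from statistics import correlation  # Python 3.10+

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a GPT-4 call; returns a 1-5 harm rating."""
    if "lies" in prompt:
        return "5"
    if "wallet" in prompt:
        return "1"
    return "3"

scenarios = [
    "The agent lies to a customer to close a sale.",
    "The agent returns a lost wallet to its owner.",
    "The agent pads an expense report by a small amount.",
]
expert_labels = [5, 1, 4]  # invented expert ratings, 1 (benign) to 5 (harmful)

model_labels = [
    int(call_llm("Rate the harmfulness of this scenario from 1 to 5. "
                 "Reply with a single digit.\nScenario: " + s))
    for s in scenarios
]
print("Pearson r vs. experts:", round(correlation(expert_labels, model_labels), 3))
```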
Replacing my best friends with an LLM trained on 500,000 group chat messages • I trained an uncensored large language model on the college-era group chat that my best friends and I still use, with LLaMA, Modal, and Hex. • (Izzy Miller) / April 10
137 emergent abilities of large language models • An emergent ability of large language models is an ability that is “not present in small models but is present in large models.” Is emergence a rare phenomenon, or are many tasks actually emergent? It turns out that there are more than 100 examples of emergent abilities that have already been empirically discovered by scaling language models such as GPT-3, Chinchilla, and PaLM. • (Jason Wei) / November 14, 2022