ChatGPT-4 | Commentary by Peter van der Putten, Director AI Lab, Pegasystems & Assistant Professor, Leiden University

Peter van der Putten, Director AI Lab, Pegasystems & Assistant Professor, Leiden University

Peter van der Putten, Director AI Lab, Pegasystems & Assistant Professor, Leiden University

“Trust OpenAI to release GPT-4 on Pi day! Rumors were already rampant after a statement made by Microsoft Germany last week. Was that a hoax or a slip of the tongue? Well, now we know.

There are some key improvements of GPT-4 over previous versions:

It accepts both text and images as input, a feature perhaps somewhat driven by the appetite to be able to pass certain exams that contain both text and images. The exact business use case for this is yet to be determined but it is certainly impressive.
GPT-4 can also deal with longer documents of up to 25,000 words.
The personality of ChatGPT can be steered through so called system messages, to specify the personality expressed through tone, verbosity, style elements

So how well does it perform? The reported results are impressive. It has been extensively tested towards exam tests in general. It reached the top 90th percentile in the Uniform bar exam, versus 10th previously. The GPT-4 outlines over 34 academic and professional exams it was tested on, and in over half it scores higher than the 80th percentile. It also performs better than both GPT-3 and other state of the art models on a range of challenging machine learning benchmarks, including a range of vision tasks.

Whilst OpenAI is sharing a lot of evaluation and risk assessment results, details are lacking on architecture, data and training, in contrast to, for instance, the recent LlaMa release by Meta.

It is perhaps to be expected given the competitive nature of this market and the fact that OpenAI is further progressed down the line of actual commercialization, still it is a pity that this detail is lacking. That said, some speculation can be done on the basis of the paper.

Overall, the emphasis appears to be on scaling out GPT. The recent release of GPT 3.5 Turbo already included a pleasant surprise, as prices were slashed by 90%. This indicates that OpenAI have reduced the amount of computation needed. Specifically, for GPT-4 OpenAI has used the performance on models that were up to 10.000 times smaller to predict what the performance what the final model would be. This allows for better planning of training, compute and data, and further maturity in scaling and operationalization in general.

Given the paper, GPT-4 seems be based on the regular process of pretraining on a very large corpus of data, essentially by leaving out parts of the text and predicting these, and finetuning on specific task plus reinforcement learning from human feedback. GPT-4 seems to suffer from the same limitations as GPT 3 in terms of pretraining, as the training set primarily contains documents from before October 2021. It is pure speculation, but this may indicate that to a large extent GPT 3 data or even architecture and model weights have been reused for GPT-4.

It is encouraging to see a lot of emphasis on avoiding undesirable side effects and behavior. OpenAI evaluated GPT-4 for a wide range of risks such as hallucination and harmful content and claim that it is 82% less likely to respond to questions that are not allowed and reduces false positives. Plus, it reduces the volume of toxic comments from 6.48% of requests to 0.73% for GPT-3.5. A red team of 50 experts have helped spot issues and risks, and the model comes with an extensive model score card highlighting these.

The research paper reports on some creative use of GPT-4 on itself to improve alignment and safety, for instance by using GPT-4 classifiers to evaluate the output of GPT-4, for instance validate whether a prompt was refused with a proper explanation.

So, whilst details are lacking on how they exactly achieved it, GPT-4 has been primarily focused on further tuning and training to achieve not just better but also more predictable performance. The OpenAI website claims that Duolingo, Khan Academy and Morgan Stanley are using GPT-4. Regular GPTChat Plus users can try GPT-4 out, but currently with a cap of 100 messages per 4 hours, and there is a wait list for the API.

In a nutshell, just the increased performance on the various exams, machine learning test and safety checks will make it worth to move to this new model. And even though not a lot of technical or research innovations were disclosed, OpenAI has set the bar again.”

Leave a Reply