
A Guide to Natural Language Processing in the Era of Large Language Models (LLMs)

Since their inception in the 1980s, language models (LMs) have been around for more than four decades as a means for statistically modeling the properties observed from natural language (Rosenfeld, 2000). Given a collection of texts as input, a language model computes statistical properties of language from those texts, such as frequencies and probabilities of words and surrounding context, which can then be used for different purposes including natural language understanding (NLU), generation (NLG), reasoning (NLR) and, more broadly, processing (NLP). Here's an overview of Natural Language Processing and Large Language Models (LLMs), and how Sapien's data labeling services for LLMs help with fine-tuning and training your AI models.

This statistical approach to modeling natural language has sparked debate for decades between those who argue that language can be modeled through the observation and probabilistic representation of patterns, and those who argue that such an approach is rudimentary and that proper understanding of language needs grounding in linguistic theories.

Only recently, as a consequence of the increased availability of text collections and improved access to computational resources, have large language models (LLMs) been introduced to the scientific community, revolutionizing the NLP field (Min et al., 2023). Following the same foundational intuition as the traditional LMs introduced in the 1980s, LLMs scale up the statistical language properties garnered from large text collections.

Building on this same logic of modeling the statistical properties of language, researchers have demonstrated that, with today's computational resources, it is possible to train much larger models on huge collections of text that can, on occasion, include almost the entire Web. This is not without controversy, however, not least because the use of such large-scale text collections prioritizes quantity over quality: one loses control of what data is being fed into the model when the whole Web is used, which, in addition to valuable information, contains offensive content and misinformation.

The surge of LLMs has been incremental since the late 2010s and has come in waves. Following a wave that introduced word embedding models such as word2vec and GloVe for compact representation of words in the form of embeddings, the first major wave came with the emergence of LLMs built on top of the Transformer architecture, including BERT, RoBERTa and T5. A more recent wave has led to a surge of models for generative AI, including chatbots like ChatGPT and Google Bard, as well as open source alternatives such as LLaMa, Alpaca and Lemur. These have in turn motivated different ways of leveraging LLMs, including prompting methods such as Pattern-Exploiting Training (PET) for few-shot text classification, as well as methods for NLG. An LLM is typically pre-trained on existing large-scale datasets, which involves significant computational power and time, and can later be fine-tuned to specific domains with less effort.
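To make the prompting idea behind PET-style methods concrete, a classification task can be recast as a cloze (fill-in-the-blank) problem, with a "verbalizer" mapping the words a masked language model might predict back to task labels. The sketch below is a toy illustration of that pattern/verbalizer idea only; the specific pattern, words and function names are illustrative assumptions, not from any published PET configuration, and no actual model is invoked.

```python
# Toy sketch of PET-style prompting: recast sentiment classification
# as a cloze task and map predicted filler words back to labels.
# The pattern and verbalizer here are illustrative assumptions.

def to_cloze(review: str) -> str:
    """Wrap an input text in a cloze pattern for a masked LM."""
    return f"{review} It was [MASK]."

# Verbalizer: filler words an LM might predict -> task labels.
VERBALIZER = {"great": "positive", "good": "positive",
              "terrible": "negative", "bad": "negative"}

def label_from_prediction(predicted_token: str) -> str:
    """Map the LM's predicted filler word to a class label."""
    return VERBALIZER.get(predicted_token.lower(), "unknown")

print(to_cloze("The film was a delight."))
print(label_from_prediction("great"))
```

In a real few-shot setting, a pre-trained masked LM would fill in the `[MASK]` slot, and only the verbalizer words' probabilities would be compared.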

In recent years, LLMs have achieved state-of-the-art performance across many NLP tasks, in turn becoming the de facto baseline models in many experimental settings. There is, however, evidence that the power of LLMs can also be leveraged for malicious purposes, including helping students cheat on school assignments or generating content that is offensive or spreads misinformation.

The impressive performance of LLMs has also inevitably provoked fears in society that artificial intelligence tools may eventually take over many people's jobs, raising questions about the ethical implications such tools may have. This has in turn sparked research, with recent studies suggesting that AI tools should be embraced, as they can in fact support and boost the performance of human labor rather than replace it.

Limitations and open challenges

The success of LLMs is not without controversy, which is in turn shaping ongoing research in NLP and opening up avenues for improving these models. The following are some of the key limitations of LLMs that need further exploration.

Black box models

After the release of the first LLM-based chatbot to garner mainstream popularity, OpenAI's ChatGPT, concerns emerged around the black box nature of the system. Indeed, there is no publicly available information on how ChatGPT was implemented or what data was used to train the model. From the perspective of NLP researchers, this raises serious concerns about transparency and reproducibility: one does not know what is going on inside the model, and experiments built on it are hard to replicate. If we run experiments using ChatGPT on a particular date, there is no guarantee that somebody else can reproduce those results at a later date (or, arguably, even on the same date), which reduces the validity, potential impact and generalisability of ChatGPT-based research.

To mitigate the impact, and increase our understanding, of black box models like ChatGPT, researchers have started investigating methods for reverse engineering those models, for example by trying to find out what data a model may have used for training.

Luckily, however, there has been a recent surge of open source models in the NLP scientific community, leading to the release of models like Facebook's LLaMa 2 and Stanford's Alpaca, as well as multilingual models like BLOOM. Recent studies have also shown that the performance of these open source alternatives is often on par with closed models like ChatGPT (Chen et al., 2023).

Risk of data contamination

Data contamination occurs when “downstream test sets find their way into the pretrain corpus” (Magar and Schwartz, 2022). When an LLM trained on large collections of text has already seen the data it is later given at test time for evaluation, the model will exhibit an impressive yet unrealistic performance score. Research has in fact shown that data contamination can be frequent and have a significant impact (Deng et al., 2023; Golchin and Surdeanu, 2023). For a fair and realistic evaluation, it is very important that researchers ensure the test data has not been seen by an LLM before. This is, however, challenging, if not nearly impossible, with black box models, which again encourages the use of open source, transparent LLMs.
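One common heuristic for flagging possible contamination, when the pretraining corpus is available, is to check how many of a test example's n-grams appear verbatim in the training text. The sketch below is a minimal version of that idea under simplified assumptions (whitespace tokenization, exact matching); the function names are illustrative.

```python
# Minimal n-gram overlap heuristic for flagging possible data
# contamination: what fraction of a test example's n-grams also
# appear verbatim in the pretraining corpus?

def ngrams(text: str, n: int = 8):
    """Set of n-grams from a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_score(test_example: str, train_corpus, n: int = 8) -> float:
    """Fraction of the test example's n-grams seen in training text."""
    test_grams = ngrams(test_example, n)
    if not test_grams:
        return 0.0
    train_grams = set()
    for doc in train_corpus:
        train_grams |= ngrams(doc, n)
    return len(test_grams & train_grams) / len(test_grams)
```

A score near 1.0 suggests the example was likely seen during pretraining; real audits use larger n and more robust normalization, but the principle is the same.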

Bias in LLM models

The use of large-scale datasets for training LLMs also means that those datasets are very likely to contain biased or stereotyped information, which LLMs have been shown to amplify. Research has shown that text generated by LLMs includes stereotypes against women when writing reference letters (Wan et al., 2023), and that LLMs amplify gender biases inherent in the training data, increasing the probability of stereotypical links between gender groups and professions (Kotek et al., 2023). Another recent study (Navigli et al., 2023) has shown that LLMs exhibit biases against numerous demographic characteristics, including gender, age, sexual orientation, physical appearance, disability and race, among others.

Generation of offensive content

Biases inherent in LLMs are at times exacerbated to the point of generating content that can be deemed offensive. Research in this direction is looking at how best to curate the training data fed to LLMs to avoid learning from offensive samples, as well as at eliciting the generation of harmful texts to understand their origin. This research is closely linked to the point above on bias and fairness in LLMs, and the two could be studied jointly by looking at the reduction of both biases and harm.

Some systems, such as OpenAI's ChatGPT, acknowledge the risk of producing offensive content in their terms of service:

“Our Services may provide incomplete, incorrect, or offensive Output that does not represent OpenAI's views. If Output references any third party products or services, it doesn't mean the third party endorses or is affiliated with OpenAI.”


LLMs can also capture sensitive information from their training data. While this information is encoded in embeddings that are not human readable, it has been found that an adversarial user can reverse engineer those embeddings to recover the sensitive information, which can have damaging consequences for the individuals concerned.

Imperfect accuracy

Despite initial impressions that LLMs achieve impressive performance, a closer investigation into model outputs shows that there is significant room for improvement. Evaluation of LLMs has in turn become a large area of research.

Aware of the many shortcomings and inaccurate outputs of LLMs, the companies responsible for producing and publishing major LLMs all provide disclaimers about the limitations of their models. For example, ChatGPT owner OpenAI acknowledges this in an early disclaimer on their website:

“Output may not always be accurate. You should not rely on Output from our Services as a sole source of truth or factual information, or as a substitute for professional advice.”

Google also warns about the limitations of its LLM-based chatbot Bard, as follows:

“Bard is an experimental technology and may sometimes give inaccurate or inappropriate information that doesn't represent Google's views.”

“Don't rely on Bard's responses as medical, legal, financial, or other professional advice.”

Facebook also has a similar disclaimer for its flagship model LLaMa 2:

“Llama 2's potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Before deploying any applications of Llama 2, developers should perform safety testing and tuning tailored to their specific applications of the model.”

Model hallucination

Responses generated by LLMs often deviate from common sense: a generated text may start discussing a particular topic, then shift to an unrelated one, or even state wrong facts. LLM hallucination has been defined as “the generation of content that deviates from the real facts, resulting in unfaithful outputs” (Maynez et al., 2020; Rawte et al., 2023). Efforts toward better understanding model hallucination focus on different tasks, including detection, explanation and mitigation, with some initial solutions proposed to date, such as Retrieval-Augmented Generation (RAG).
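The core idea of RAG can be sketched in a few lines: retrieve documents relevant to the user's question and prepend them to the prompt, so the generator can ground its answer in retrieved evidence rather than relying only on parametric memory. The toy retriever below uses naive word overlap purely for illustration; real systems use learned embedding models and vector indexes, and all names here are illustrative.

```python
# Toy sketch of Retrieval-Augmented Generation (RAG): retrieve the
# most relevant document (naive word overlap stands in for a learned
# retriever / vector index) and prepend it to the prompt so the LLM
# can ground its answer in retrieved evidence.

def retrieve(query: str, documents, k: int = 1):
    """Rank documents by word overlap with the query (toy retriever)."""
    q_words = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str, documents) -> str:
    """Prepend retrieved context before passing the prompt to an LLM."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

Because the evidence is supplied at inference time, the generator can be updated with fresh or domain-specific facts without retraining, which is why RAG is often proposed as a hallucination mitigation.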

Lack of explainability

The complexity of LLMs means that it is often very difficult to understand why they make certain predictions or produce certain outputs. This also makes it very difficult to provide explanations of model outputs to system users, which calls for further investigation into the explainability of LLMs.

The introduction and surge in popularity of LLMs has impacted and reshaped NLP research. Slightly over a decade ago, much NLP research focused on representing words using bag-of-words and TF-IDF based methods, combined with machine learning algorithms such as Logistic Regression or Support Vector Machine classifiers. The increase in computational capacity to handle large-scale datasets and more complex computations has led to the renaissance of deep learning models and, in turn, the emergence of LLMs.
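For reference, the TF-IDF weighting mentioned above can be computed in a few lines: a term's weight in a document is its term frequency multiplied by the log of the inverse document frequency, so terms that appear in every document get weight zero. This is a minimal stdlib sketch (one common TF-IDF variant; libraries apply additional smoothing and normalization).

```python
# Minimal TF-IDF weighting, the pre-LLM representation mentioned
# above: weight = term frequency * log(N / document frequency).
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists -> list of {token: weight} dicts."""
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                      # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(N / df[t])
                        for t in tf})
    return vectors
```

A word like "the" that occurs in all documents receives weight log(N/N) = 0, which is exactly why TF-IDF downweights uninformative terms.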

Reducing Bias Through Data Labeling

One of the major concerns with large language models is that they tend to amplify and generate biased or stereotypical content, likely inherited from biases present in the training data. High-quality data labeling can help mitigate this issue by allowing researchers to properly label biased text and content during data preprocessing.

Sensitive attributes like race, gender, sexual orientation etc. can be annotated in the training datasets. Text containing harmful stereotypes and tropes can also be flagged. Data labelers with diverse backgrounds and perspectives should be involved to identify biased content from different viewpoints. With explicit labels distinguishing biased and unbiased text, models can be trained to penalize generation of prejudiced content.
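When several annotators with different perspectives label the same text, their judgments need to be aggregated; one simple, widely used scheme is majority vote, keeping the agreement rate as a quality signal for downstream filtering. The sketch below illustrates that scheme; the labels and field names are illustrative assumptions, not a specific labeling standard.

```python
# Sketch of aggregating bias labels from multiple annotators by
# majority vote, keeping the agreement rate as a quality signal.
# Labels and field names here are illustrative.
from collections import Counter

def aggregate(annotations):
    """Return (majority label, agreement fraction) for one text item."""
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(annotations)

item = {
    "text": "Nurses are caring because they are women.",
    "labels": ["biased", "biased", "unbiased"],   # three annotators
}
majority, agreement = aggregate(item["labels"])
print(majority, round(agreement, 2))  # biased 0.67
```

Low-agreement items can then be routed to additional annotators or adjudication rather than silently trusted, which matters most for subjective categories like bias.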

Research shows that supplementing training with human rationales explaining why certain text is biased/unbiased boosts model understanding further. Overall, thoughtful data labeling equips models to recognize and avoid generating toxic outputs.

Improving Accuracy Through Data Annotation

Large language models today still make frequent mistakes and hallucinate content that strays from the facts. Comprehensive data annotation can enhance model accuracy.

Human labelers can verify factual correctness in text and tag erroneous information. With datasets labeled for accuracy, models learn to weigh reliable vs unreliable content. Studies demonstrate accuracy improves when models are trained to mimic human rationales justifying judgments on correctness.

Data can also be annotated with common sense cues and real-world knowledge. This grounds models in logical reasoning and curbs nonsensical hallucinations. Work needs to continue on building diverse training sets covering different domains and topics to make models broadly accurate.

Enhancing Privacy Through Data Scrubbing

Large language models risk exposing people's private information that has unintentionally leaked into training data. Data labeling can help preserve privacy.

Sensitive personal details like names, locations, IDs and contact information can be scrubbed from datasets. Anything that can identify or profile an individual should be removed or replaced with placeholders during labeling. Context around redacted information can also be obscured to prevent models from indirectly inferring it.
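Replacing sensitive spans with typed placeholders can be sketched with simple pattern rules, as below. Production scrubbing pipelines combine such rules with named-entity recognition models and human review; the two regex patterns here are deliberately simplified assumptions that will miss many real-world formats.

```python
# Sketch of rule-based PII scrubbing: replace e-mail addresses and
# phone numbers with typed placeholders. The patterns are simplified
# assumptions; real pipelines add NER models and human review.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def scrub(text: str) -> str:
    """Replace each detected PII span with a typed placeholder."""
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text

print(scrub("Reach Alice at alice@example.com or +1 555 123 4567."))
```

Typed placeholders (rather than deletion) preserve the sentence structure, so a model trained on scrubbed text still learns that an e-mail address belongs in that slot without memorizing the real value.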

Establishing rigorous data labeling protocols to strip datasets of personal information will curb privacy violations. Models trained on properly scrubbed data are less likely to memorize and expose private details.

Improving Transparency Through Data Documentation

The opaque nature of many large language models makes it hard to audit what data was used to train them. Extensive data documentation through labeling can increase transparency.

Detailed metadata can be recorded on datasets: source, volume, topic coverage, demographic split and so on. Documenting dataset strengths and weaknesses highlights gaps to fill. Data labelers can also identify objectionable content like hate speech for removal.
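Such metadata is easiest to audit when it is captured in a structured, machine-readable record alongside the dataset. The sketch below shows one possible shape for such a record; the field names are illustrative assumptions, loosely inspired by "datasheets for datasets" practice rather than any fixed schema.

```python
# Sketch of a minimal machine-readable dataset "datasheet": metadata
# recorded during labeling so later audits can see what went into
# training. Field names are illustrative assumptions.
import json

def make_datasheet(name, source, num_records, topics, known_gaps):
    return {
        "name": name,
        "source": source,
        "num_records": num_records,
        "topics": sorted(topics),
        "known_gaps": known_gaps,       # weaknesses worth disclosing
    }

sheet = make_datasheet(
    name="support-tickets-v1",
    source="internal helpdesk export",
    num_records=12500,
    topics={"billing", "login", "refunds"},
    known_gaps=["few non-English tickets"],
)
print(json.dumps(sheet, indent=2))
```

Storing the datasheet as JSON next to the data means the "known_gaps" field travels with the dataset, so anyone reusing it later sees its documented weaknesses up front.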

Comprehensive dataset documentation equips researchers to pick better training data and right-size models. Detailed data journals aid analysis of model behavior and defects. Overall, meticulous data labeling and auditing enables transparent model development.

Book a Demo with Sapien for High-Quality Data Labeling for LLMs

Sapien's high-quality data labeling services can help your organization develop cutting-edge large language models (LLMs) optimized for your specific needs. Our domain experts meticulously annotate training data to address key issues like bias and accuracy while providing complete transparency.

Book a demo with us today to discuss your LLM goals. Our team will collaborate with you to build a tailored data strategy leveraging proven techniques like multi-annotator consensus, outlier detection, and active learning. We seamlessly integrate with your workflow to deliver precisely labeled data fast, enhancing model performance while reducing costs.