LLMs In Context Learning (ICL)— Hands On Example

Buse Köseoğlu
7 min read · Apr 17, 2024

In-context learning (ICL) was formally introduced with GPT-3. It obtains the desired output by giving the language model examples of one or more tasks directly in the prompt, without any additional training or gradient updates.

ICL is a prompt engineering technique and an alternative to fine-tuning for certain tasks: the "training" examples are provided inside the prompt itself, and the model weights never change.

Larger models generally have stronger ICL ability and produce better results.
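
To make the idea concrete, here is a minimal sketch of what an in-context prompt might look like. The sentiment task and the example reviews below are made up purely for illustration; the actual FLAN-T5 prompts we will build come later in the article.

# A hypothetical few-shot prompt: the "training" examples live entirely
# inside the prompt, and the model weights are never updated.
few_shot_prompt = """Review: The movie was fantastic.
Sentiment: positive

Review: I wasted two hours of my life.
Sentiment: negative

Review: The soundtrack was beautiful and the story moved me.
Sentiment:"""

# An LLM is expected to continue the pattern and output "positive".
print(few_shot_prompt)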

To follow along with the example, we first need to install the necessary libraries.

%pip install --upgrade pip
%pip install --disable-pip-version-check \
torch==1.13.1 \
torchdata==0.5.1 --quiet
%pip install \
transformers==4.27.2 \
datasets==2.11.0 --quiet

We will use Hugging Face's datasets and transformers libraries.

  • The datasets library provides an API for loading and working with datasets hosted on the Hugging Face Hub.
  • The transformers library provides an API for loading, training and sharing models.
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from transformers import GenerationConfig

DialogSum is a large-scale dialogue summarization dataset consisting of 13,460 dialogues (plus 100 holdout dialogues for topic generation) with corresponding manually labeled summaries and topics. However, we will use a smaller version. We can check how the data is distributed across the train, test and validation splits after loading it (see the snippet just below the loading code).

# Getting dataset from huggingface
data_name = "neil-code/dialogsum-test"
dataset = load_dataset(data_name)

print(dataset["train"][0]["dialogue"])
print("\nSummary:",dataset["train"][0]["summary"])
print("\nTopic:",dataset["train"][0]["topic"])

The sample below gives an overview of the dataset and an idea of what each example looks like.

#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?
#Person2#: I found it would be a good idea to get a check-up.
#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.
#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?
#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.
#Person2#: Ok.
#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?
#Person2#: Yes.
#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.
#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.
#Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.
#Person2#: Ok, thanks doctor.
Summary: Mr. Smith's getting a check-up, and Doctor Hawkins 
advises him to have one every year. Hawkins'll give some information
about their classes and medications to help Mr. Smith quit smoking.

Topic: get a check-up

The model we will use in our example is the FLAN-T5 model developed by Google.

You can check the details about the model here: https://huggingface.co/docs/transformers/model_doc/flan-t5

model_name='google/flan-t5-base'

# load model
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

After loading the model, we also need to load the tokenizer that was used during its training. A different tokenizer will not work correctly with this model; the tokenizer must match the one used in training.

# load model tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

Since models cannot work with raw text, we need to convert words into numerical representations (token IDs). That is exactly what the tokenizer does.

The tokenizer can both encode and decode text. Below you can see the encoded and decoded version of a sentence.

sentence = "How old are you Gina?"

# encode sentence
sentence_encoded = tokenizer(sentence, return_tensors='pt')
print(sentence_encoded)
# decode sentence
sentence_decoded = tokenizer.decode(
sentence_encoded["input_ids"][0],
skip_special_tokens=True
)
print('\nENCODED SENTENCE:')
print(sentence_encoded["input_ids"][0])
print('\nDECODED SENTENCE:')
print(sentence_decoded)

First, we can examine what kind of result the model produces on the test data without any prompt engineering. To do this, we take a dialogue and its summary from the test set, then apply the tokenizer to the dialogue to encode it.

dialogue = dataset['test'][0]['dialogue']
summary = dataset['test'][0]['summary']

# encode selected dialogue
inputs = tokenizer(dialogue, return_tensors='pt')
# Generate output from encoded data
model_generation = model.generate(inputs["input_ids"], max_new_tokens=50)[0]
# Decode generated output
output = tokenizer.decode(
    model_generation,
    skip_special_tokens=True
)
print(f'INPUT PROMPT:\n{dialogue}')
print(f'\nBASELINE HUMAN SUMMARY:\n{summary}')
print(f'\nMODEL GENERATION - WITHOUT PROMPT ENGINEERING:\n{output}\n')

Zero Shot Inference

In zero shot inference, the LLM is given only the prompt to answer; no example answers are included in the prompt.

index = 0  # which example we are going to use

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']
# prompt template for flan-t5 model
prompt = f"""
Summarize the following conversation.
{dialogue}
Summary:
"""
# Input constructed prompt instead of the dialogue.
inputs = tokenizer(prompt, return_tensors='pt')
# give the encoded prompt to the model and generate an encoded output
model_generation = model.generate(inputs["input_ids"], max_new_tokens=50)[0]
# decode the generated output
output = tokenizer.decode(
    model_generation,
    skip_special_tokens=True
)
print(f'INPUT PROMPT:\n{prompt}')
print(f'\nBASELINE HUMAN SUMMARY:\n{summary}')
print(f'\nMODEL GENERATION - ZERO SHOT:\n{output}\n')

Output is:

BASELINE HUMAN SUMMARY: Ms. Dawson helps #Person1# to write a memo to inform every employee that they have to change the communication method and should not use Instant Messaging anymore.

MODEL GENERATION — ZERO SHOT: #Person1#: I need to take a dictation for you.

When we examine the results, we can see that the output produced by the model is not really what we asked for. So we will try another prompt template for FLAN-T5.

index = 0

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

# another flan-t5 prompt structure is
prompt = f"""
Dialogue:
{dialogue}
What was going on?
"""

# Input constructed prompt instead of the dialogue.
inputs = tokenizer(prompt, return_tensors='pt')
# give the encoded prompt to the model and generate an encoded output
model_generation = model.generate(inputs["input_ids"], max_new_tokens=50)[0]
# decode the generated output
output = tokenizer.decode(
    model_generation,
    skip_special_tokens=True
)
print(f'INPUT PROMPT:\n{prompt}')
print(f'\nBASELINE HUMAN SUMMARY:\n{summary}')
print(f'\nMODEL GENERATION - ZERO SHOT:\n{output}\n')

Output is:

BASELINE HUMAN SUMMARY: Ms. Dawson helps #Person1# to write a memo to inform every employee that they have to change the communication method and should not use Instant Messaging anymore.

MODEL GENERATION — ZERO SHOT: The memo is to be distributed to all employees by this afternoon.

The output obtained with this prompt is slightly better than the other one, but it is still not of the quality we want.

One Shot Inference

In one shot inference, the prompt contains one example question together with its answer, and the model is then asked to answer a second question.

def make_prompt(example_indices_full, example_index_to_summarize):
    """
    Make a prompt for one shot and few shot inference.

    Params:
    ---------
    example_indices_full (list): Indexes to be given as examples
    example_index_to_summarize (int): Index to be tested

    Return:
    ---------
    prompt (str): Prompt to be given to the model
    """
    # create an empty prompt to fill
    prompt = ''
    for index in example_indices_full:
        dialogue = dataset['test'][index]['dialogue']
        summary = dataset['test'][index]['summary']

        # The stop sequence '{summary}\n\n\n' is important for FLAN-T5.
        # Other models may have their own preferred stop sequence.
        prompt += f"""
Dialogue:
{dialogue}
What was going on?
{summary}


"""

    # the dialogue we actually want the model to summarize
    dialogue = dataset['test'][example_index_to_summarize]['dialogue']

    prompt += f"""
Dialogue:
{dialogue}
What was going on?
"""

    return prompt

example_indices_full = [0]  # one example to be given for one shot inference
example_index_to_summarize = 234 # tested index
one_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)
print(one_shot_prompt) # prompt to be encoded and given the model


summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(one_shot_prompt, return_tensors='pt')

# give the encoded prompt to the model and generate an encoded output
model_generation = model.generate(inputs["input_ids"], max_new_tokens=50)[0]

output = tokenizer.decode(
    model_generation,
    skip_special_tokens=True
)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(f'\nMODEL GENERATION - ONE SHOT:\n{output}')

Output is:

BASELINE HUMAN SUMMARY: #Person1# is interviewing #Person2#. They discuss department #Person2# wants to work in, salary, and fringe benefits.

MODEL GENERATION — ONE SHOT: The company is one of the largest and best in this field of business. It employs more than 10, 000 people throughout the world. The president is Mr. Jackson. The Shanghai branch was founded five years ago with a staff of more than 2,

This output turned out to be longer and more detailed than necessary; we could not get the summary we wanted. The model could not infer much from a single example.

Few Shot Inference

In few shot inference, more than one example question and answer is given to the LLM within the prompt.

example_indices_full = [0, 80, 120, 260]  # examples to be given for few shot inference
example_index_to_summarize = 234 # tested index

few_shot_prompt = make_prompt(example_indices_full, example_index_to_summarize)
print(few_shot_prompt) # prompt to be encoded and given the model

summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
# give the encoded prompt to the model and generate an encoded output
model_generation = model.generate(inputs["input_ids"], max_new_tokens=50)[0]

output = tokenizer.decode(
    model_generation,
    skip_special_tokens=True
)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(f'\nMODEL GENERATION - FEW SHOT:\n{output}')

Output is:

BASELINE HUMAN SUMMARY: #Person1# is interviewing #Person2#. They discuss department #Person2# wants to work in, salary, and fringe benefits.

MODEL GENERATION — FEW SHOT: The company is one of the largest and best in this field of business. It employs more than 10, 000 people throughout the world. The president is Mr. Jackson. The Shanghai branch was founded five years ago with a staff of more than 2,

When we tried few shot inference with 4 different examples, the result was the same as with one shot. We can play with the parameters in GenerationConfig to change the results. Below we set the temperature parameter to 0.8; this parameter controls how random, and therefore how "creative", the generated output will be.

summary = dataset['test'][example_index_to_summarize]['summary']

inputs = tokenizer(few_shot_prompt, return_tensors='pt')
# give the encoded prompt to the model and generate output with sampling enabled
model_generation = model.generate(
    inputs["input_ids"],
    generation_config=GenerationConfig(max_new_tokens=50, do_sample=True, temperature=0.8),
)[0]
output = tokenizer.decode(
    model_generation,
    skip_special_tokens=True
)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(f'\nMODEL GENERATION - FEW SHOT:\n{output}')

Output is:

BASELINE HUMAN SUMMARY: #Person1# is interviewing #Person2#. They discuss department #Person2# wants to work in, salary, and fringe benefits.

MODEL GENERATION — FEW SHOT: People at the company are interested in joining the position the company offers them.

When we examine these results, we see that the model still cannot make the inferences we want; the most useful output actually came from zero shot inference. For better results, we can fine-tune the model on our own data so that it performs well on the specific task we care about.
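
Before moving on to fine-tuning, it is also worth experimenting with the other decoding parameters that GenerationConfig exposes, such as top_k and top_p. The sketch below reuses the few shot inputs from above; the specific values are arbitrary and only meant as a starting point for your own experiments.

# Sketch: a few common decoding parameters of GenerationConfig.
# The values are arbitrary; tune them and compare the generated summaries.
generation_config = GenerationConfig(
    max_new_tokens=50,
    do_sample=True,   # sample instead of greedy decoding
    temperature=0.8,  # higher values produce more random output
    top_k=50,         # keep only the 50 most likely tokens at each step
    top_p=0.9,        # nucleus sampling: sample from the smallest token set whose cumulative probability exceeds 0.9
)

model_generation = model.generate(inputs["input_ids"], generation_config=generation_config)[0]
print(tokenizer.decode(model_generation, skip_special_tokens=True))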
