Long story short.. An NLP use-case on Text Summarization

Ajit Rajput · 9 min read · Jul 28, 2021


In this fast-paced age, we all want to keep ourselves updated on the daily happenings across the globe, but very few people get that luxury. As for reading newspapers and articles, skimming is the point: run through the headlines first and jump in only if something is interesting or worth reading. Especially when content is available at the click of a button, we all prefer to glance at "summarized" news aggregated from popular newspapers, websites, blogs, and magazines, delivered in three to four sentences, rather than reading lengthy and at times overrated and exaggerated newspaper writing.


Text summarization, a long-standing research area within AI, identifies the most relevant sentences in a piece of text. With text summarization, we can get short and precise information while preserving the key content of the text.


News aggregator apps such as Google News, Inshorts, and Pulse take advantage of text summarization algorithms. In this post, I will take you through various traditional and advanced methods to implement automatic text summarization.


Text Summarization is broadly divided into two classes — Extractive Summarization and Abstractive Summarization.


Extractive summarization: This is the traditional method, which picks sentences directly from the original document depending on their importance. Note that the summary obtained contains exact sentences from the original text.

Abstractive summarization: This is closer to what a human usually does: understand the text, compare it with his/her memory and related information, and then re-create its core in a brief text. Abstractive summarization is therefore more challenging than the extractive method, as the model must break the source text down to the very tokens and regenerate the target sentences. Achieving meaningful and grammatically correct sentences in the summaries is a big deal that demands highly precise and sophisticated models.

In the next sections, let's explore the extractive and abstractive methods so you can compare them and familiarize yourself with the advantages and limitations of each.


Consider this extract on Tesla's quarterly results from a business news website:


text = """Tesla reported second-quarter earnings after the bell Monday, and it’s a beat on both the top and bottom lines. Shares rose about 2% after-hours. Here are the results.Earnings:


$1.45 vs 98 cents per share adjusted expected, according to Refinitiv. Revenue: $11.96 billion vs $11.30 billion expected, according to RefinitivTesla reported $1.14 billion in (GAAP) net


income for the quarter, the first time it has surpassed $1 billion. In the year-ago quarter, net income amounted to $104 million.Overall automotive revenue came in at $10.21 billion, of


which only $354 million, about 3.5%, came from sales of regulatory credits. That’s a lower number for credits than in any of the previous four quarters. Automotive gross margins were 28.4%,


higher than in any of the last four quarters.Tesla had already reported deliveries (its closest approximation to sales) of 201,250 electric vehicles, and production of 206,421 total


vehicles, during the quarter ended June 30, 2021.The company also reported $801 million in revenue from its energy business, including solar photovoltaics and energy storage systems for


homes, businesses and utilities, an increase of more than 60% from last quarter. While Tesla does not disclose how many energy storage units it sells each quarter, in recent weeks CEO Elon


Musk said, in court, that the company would only be able to produce 30,000 to 35,000 at best during the current quarter, blaming the lag on chip shortages. Tesla also reported $951 million


in services and other revenues. The company now operates 598 stores and service centers, and a mobile service fleet including 1,091 vehicles, an increase of just 34% versus a year ago. That


compares with an increase of 121% in vehicle deliveries year over year. A $23 million impairment related to the value of its bitcoin holdings was reported as an operating expense under


“Restructuring and other.”""" Let’s start with installing and importing the libraries required to run the summarization process using different methods.


!pip install bert-extractive-summarizer
!pip install spacy
!pip install transformers
!pip install torch
!pip install sentencepiece

Import the libraries:


import gensim
from gensim.summarization import summarize
import torch
from transformers import pipeline
from transformers import T5Tokenizer
from transformers import T5ForConditionalGeneration
from transformers import T5Config
from summarizer import Summarizer
from summarizer import TransformerSummarizer

Summarization using Gensim (TextRank)

The gensim package is used for natural language processing and information retrieval tasks such as topic modeling, document indexing, word2vec, and similarity retrieval. Here we use it for text summarization via the TextRank algorithm. (Note: the gensim.summarization module was removed in gensim 4.0, so this walkthrough assumes a gensim 3.x release.)


TextRank is an extractive and unsupervised text summarization technique. It builds a graph with sentences as nodes and inter-sentence similarity as edge weights, then runs a PageRank-style algorithm over that graph to assign a score to each sentence. The top-ranked sentences then make it to the summary.
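To make the idea concrete, here is a minimal, illustrative sketch of TextRank-style sentence ranking. This is not gensim's actual implementation; the naive sentence splitting and the word-overlap similarity are simplifications for illustration only.

import itertools
import networkx as nx

def textrank_sketch(document, top_n=2):
    # Naive sentence split; real implementations use proper tokenizers.
    sentences = [s.strip() for s in document.split('.') if s.strip()]
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    # Connect sentence pairs weighted by a Jaccard-style word overlap.
    for i, j in itertools.combinations(range(len(sentences)), 2):
        a = set(sentences[i].lower().split())
        b = set(sentences[j].lower().split())
        overlap = len(a & b) / (1 + len(a | b))
        if overlap > 0:
            graph.add_edge(i, j, weight=overlap)
    # PageRank scores each sentence node; keep the top-ranked ones.
    scores = nx.pagerank(graph)
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return '. '.join(sentences[i] for i in sorted(ranked)) + '.'

print(textrank_sketch(text))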


Let us use gensim's summarize function with the different summarizing parameters given below.

ratio: Takes values between 0 and 1. It represents the proportion of the summary relative to the original text.
word_count: Sets the number of words in the summary.

Summarize by Ratio:


summary_by_ratio = summarize(text, ratio=0.15)
print("Summary : \n" + summary_by_ratio)

Output >>>
Summary : The company also reported $801 million in revenue from its energy business, including solar photovoltaics and energy storage systems for homes, businesses and utilities, an increase of more than 60% from last quarter. Tesla also reported $951 million in services and other revenues.

Summarize by Word Count:


summary_by_count = summarize(text, word_count=60)
print("Summary : " + summary_by_count)

Output >>>
Summary : Overall automotive revenue came in at $10.21 billion, of which only $354 million, about 3.5%, came from sales of regulatory credits. The company also reported $801 million in revenue from its energy business, including solar photovoltaics and energy storage systems for homes, businesses and utilities, an increase of more than 60% from last quarter. Tesla also reported $951 million in services and other revenues.

Abstractive Summarization with pre-trained models

Here we use the transformers library in Python to perform abstractive text summarization on the input text. All the documentation for the transformers library can be found at https://huggingface.co/transformers/


Let us now explore summarization techniques using the methods below.

Pipeline API
T5 Transformer
BERT
GPT2
XLNet

Summarization using Pipeline API

The most straightforward way to use models in transformers is the pipeline API. Pipelines are a great and easy way to use models for inference. These objects abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering.
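As a quick illustration of that generality (a side note, not part of the summarization workflow), the same API serves other tasks simply by changing the task name; for example, sentiment analysis with its default model:

from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
# Returns a list of label/score dicts, e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
print(sentiment("Tesla beat expectations on both the top and bottom lines."))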


The models that this pipeline can use are models that have been fine-tuned on a summarization task: currently 'bart-large-cnn', 't5-small', 't5-base', 't5-large', 't5-3b' and 't5-11b'.


Run the summarization pipeline with the default model:


summarization = pipeline("summarization")
## Call the transformer's summarization API by passing the text
abstract_text = summarization(text)[0]['summary_text']
print("Summary:", abstract_text)


Note: The first time you execute this or any of the techniques listed, it may take a while to download the model architecture and weights, as well as the tokenizer configuration in some cases.


Here is the summary generated after executing the above pipeline code


Output >>>
Summary: Tesla reported $1.14 billion in (GAAP) net income for the quarter, the first time it has surpassed $1 billion . The company also reported $801 million in revenue from its energy business, including solar photovoltaics and energy storage systems . Shares rose about 2% after-hours .

The abstractive summary generated looks pretty decent, doesn't it!


Run the same pipeline by explicitly passing the t5-base model


## Set up the pipeline with the t5-base model
t5summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base", framework="tf")
## Build the summary with a minimum of 5 and a maximum of 60 tokens
t5summarizer(text, min_length=5, max_length=60)

Output >>>
Summary: Tesla reported $1.45 vs 98 cents per share adjusted expected, according to Refinitiv . overall automotive revenue came in at $10.21 billion, of which only $354 million, about 3.5%, came from sales of regulatory credits

Summarization with T5 Transformer

T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, where each task is converted into a text-to-text format. T5 works well on a variety of tasks out of the box by prepending a different, task-specific prefix to the input.
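To see this text-to-text framing in action, here is a small illustrative sketch (assuming the t5-small checkpoint) in which only the prefix changes between a translation task and a summarization task:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The same model handles different tasks depending only on the prefix.
for prompt in ["translate English to German: The house is wonderful.",
               "summarize: " + text]:
    ids = tok(prompt, return_tensors="pt", truncation=True).input_ids
    out = model.generate(ids, max_length=60)
    print(tok.decode(out[0], skip_special_tokens=True))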


Instantiate the pretrained “t5-small” model through the from_pretrained method:


t5model = T5ForConditionalGeneration.from_pretrained('t5-small')
t5tokenizer = T5Tokenizer.from_pretrained('t5-small')
device = torch.device('cpu')

Add the string summarize: at the beginning of your raw text and encode it. The T5 transformer performs different tasks by prepending the appropriate prefix to the input text.


t5tokenized_text = t5tokenizer("summarize: " + text, truncation=True, padding='max_length', return_tensors="pt").to(device)

(The tokenizer is called directly rather than through encode so that it returns both the input_ids and the attention_mask used below.)

Next you will pass the input_ids from the tokenized text, along with other parameters, to the generate function. The technique used here is “Beam Search”. Beam search reduces the risk of missing hidden high-probability word sequences by keeping the num_beams most likely hypotheses at each time step and eventually choosing the hypothesis that has the overall highest probability.


Read more about Greedy Search and Beam Search here: https://huggingface.co/blog/how-to-generate
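As a quick comparison (a sketch reusing the t5model and t5tokenized_text objects defined above), you can produce a greedy-search baseline by setting num_beams=1 and contrast it with the beam-search call that follows:

# Greedy search: num_beams=1 keeps only the single most likely token at each step.
greedy_ids = t5model.generate(input_ids=t5tokenized_text['input_ids'],
                              attention_mask=t5tokenized_text['attention_mask'],
                              num_beams=1, min_length=20, max_length=70)
print(t5tokenizer.decode(greedy_ids[0], skip_special_tokens=True))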


t5summary_ids = t5model.generate(input_ids=t5tokenized_text['input_ids'],
                                 attention_mask=t5tokenized_text['attention_mask'],
                                 num_beams=3,
                                 min_length=20,
                                 max_length=70,
                                 repetition_penalty=2.0,
                                 early_stopping=True)

The parameters are explained below:


max_length: The maximum number of tokens to generate.
min_length: The minimum number of tokens to generate.
length_penalty: Exponential penalty on the length; 1.0 means no penalty, and increasing this parameter increases the length of the output text.
num_beams: Specifying this parameter makes the model use beam search instead of greedy search; setting num_beams to 4 lets the model track 4 candidate hypotheses at each step (versus 1 in the case of greedy search).
early_stopping: We set it to True, so that generation finishes when all beam hypotheses reach the end-of-string (EOS) token.

output = t5tokenizer.decode(t5summary_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
print("Summary:", output)

Output >>>


Summary: shares rose about 2% after-hours, according to Refinitiv. in the year-ago quarter, net income amounted to $104 million. overall revenue came in at $10.21 billion, of which only $354 million came from regulatory credits.

Summarization with BERT Model

BERT (Bidirectional Encoder Representations from Transformers) is a transformer model designed to overcome limitations of RNNs and other neural networks, such as handling long-term dependencies. It is a pre-trained model that is naturally bidirectional, and it can be tuned to perform specific NLP tasks, summarization in our case. Note that the Summarizer from the bert-extractive-summarizer package used below is extractive: it embeds sentences with BERT, clusters the embeddings, and selects the sentences closest to the cluster centroids.
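For intuition, here is a minimal, illustrative sketch of that embed-and-cluster idea. This is not the library's actual code; the checkpoint, the mean pooling, and the number of clusters are assumptions made for illustration.

import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

sentences = [s.strip() for s in text.split('.') if len(s.strip()) > 20]
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

# Mean-pool the last hidden states to get one embedding per sentence.
embeddings = []
with torch.no_grad():
    for s in sentences:
        inputs = tok(s, return_tensors="pt", truncation=True)
        hidden = bert(**inputs).last_hidden_state
        embeddings.append(hidden.mean(dim=1).squeeze().numpy())
embeddings = np.stack(embeddings)

# Cluster the sentence embeddings and keep the sentence nearest each centroid.
kmeans = KMeans(n_clusters=3, random_state=0).fit(embeddings)
picked = sorted({int(np.argmin(np.linalg.norm(embeddings - c, axis=1)))
                 for c in kmeans.cluster_centers_})
print('. '.join(sentences[i] for i in picked) + '.')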


Now let's run the actual summarizer from the library:

bert_model = Summarizer()
bert_summary = ''.join(bert_model(text, min_length=60))
print("Summary: " + bert_summary)

Output >>>
Summary: Tesla reported second-quarter earnings after the bell Monday, and it’s a beat on both the top and bottom lines. The company also reported $801 million in revenue from its energy business, including solar photovoltaics and energy storage systems for homes, businesses and utilities, an increase of more than 60% from last quarter. That compares with an increase of 121% in vehicle deliveries year over year.

Summarization with GPT2 Model


Generative Pre-trained Transformer (GPT) models by OpenAI have taken the natural language processing (NLP) community by storm by introducing very powerful language models. These models can perform various NLP tasks like question answering, textual entailment, text summarization etc. without any supervised training.


GPT-2 is capable of next-word prediction on a much larger and more sophisticated scale. For reference, the smallest available GPT-2 has 117 million parameters, whereas the largest (gpt2-xl) has over 1.5 billion. OpenAI initially withheld the 1.5-billion-parameter model but released it publicly in November 2019; here we use the intermediate gpt2-medium checkpoint.


Let’s load the model, and get the generated summary!


GPT2_model = TransformerSummarizer(transformer_type="GPT2", transformer_model_key="gpt2-medium")
gpt_summary = ''.join(GPT2_model(text, min_length=60))
print(gpt_summary)

Output >>>
Summary: Tesla reported second-quarter earnings after the bell Monday, and it’s a beat on both the top and bottom lines. In the year-ago quarter, net income amounted to $104 million. The company also reported $801 million in revenue from its energy business, including solar photovoltaics and energy storage systems for homes, businesses and utilities, an increase of more than 60% from last quarter.

Wow, much better than BERT! Look at the insightful summary it has generated!


Now, let’s jump into our final model, the XLNet!


Summarization with XLNet

XLNet is a generalized autoregressive (AR) language model that learns unsupervised representations of text sequences. It incorporates modelling techniques from autoencoder (AE) models such as BERT into AR models while avoiding the limitations of AE.


Let's load the model and print out the summary:


xlnet_model = TransformerSummarizer(transformer_type="XLNet", transformer_model_key="xlnet-base-cased")
xlnet_summary = ''.join(xlnet_model(text, min_length=60))
print("Summary: " + xlnet_summary)

Output >>>
Summary: Tesla reported second-quarter earnings after the bell Monday, and it’s a beat on both the top and bottom lines. That’s a lower number for credits than in any of the previous four quarters. The company also reported $801 million in revenue from its energy business, including solar photovoltaics and energy storage systems for homes, businesses and utilities, an increase of more than 60% from last quarter.

Across all the transformer models, the generated results show that this approach works really well, which is impressive! 👌


Conclusion

Text summarization has become an important and timely tool for digesting and interpreting text in today's fast-growing information age. Its applications can be seen wherever text documents are involved, irrespective of the field. Because of these applications and the ease it brings, text summarization is a popular topic in the field of Natural Language Processing.


I hope this article gives you a heads-up on the different types and approaches used for text summarization. If it did, please do clap.. 👏