Long story short.. An NLP use-case on Text Summarization

Ajit Rajput · 9 min read · Jul 28, 2021


In this fast-paced age, we all want to keep ourselves updated on the daily happenings across the globe, but very few people get that luxury. As for reading newspapers and articles, skimming is the point: run through the headlines first and jump in only if something is interesting or worth reading. Especially when content is available at the click of a button, we all prefer to glance at "summarized" news aggregated from popular newspapers, websites, blogs, and magazines, delivered in three to four sentences, rather than reading lengthy and at times overrated and exaggerated newspaper writing.


Text summarization, a long-standing research area within AI, identifies the most relevant sentences in a piece of text. With text summarization, we can get short and precise information while preserving the key content of the text.


News aggregator apps such as Google News, Inshorts, and Pulse take advantage of text summarization algorithms. In this post, I will take you through various traditional and advanced methods to implement automatic text summarization.


Text Summarization is broadly divided into two classes — Extractive Summarization and Abstractive Summarization.


Extractive summarization: This is the traditional method, which picks sentences directly from the original document depending on their importance. Note that the summary obtained contains exact sentences from the original text.

Abstractive summarization: This is closer to what a human usually does: understand the text, compare it with his/her memory and related information, and then re-create its core in a brief text. Abstractive summarization is therefore more challenging than the extractive method, as the model must break the source text down to the very tokens and regenerate the target sentences. Achieving meaningful and grammatically correct sentences in the summaries is a big deal that demands highly precise and sophisticated models.

In the next sections, let's explore the extractive and abstractive methods so you can compare them and familiarize yourself with the advantages and limitations of each.


Consider this extract on Tesla's quarterly results from a business news website:


text = """Tesla reported second-quarter earnings after the bell Monday, and it’s a beat on both the top and bottom lines. Shares rose about 2% after-hours. Here are the results.Earnings:


$1.45 vs 98 cents per share adjusted expected, according to Refinitiv. Revenue: $11.96 billion vs $11.30 billion expected, according to RefinitivTesla reported $1.14 billion in (GAAP) net


income for the quarter, the first time it has surpassed $1 billion. In the year-ago quarter, net income amounted to $104 million.Overall automotive revenue came in at $10.21 billion, of


which only $354 million, about 3.5%, came from sales of regulatory credits. That’s a lower number for credits than in any of the previous four quarters. Automotive gross margins were 28.4%,


higher than in any of the last four quarters.Tesla had already reported deliveries (its closest approximation to sales) of 201,250 electric vehicles, and production of 206,421 total


vehicles, during the quarter ended June 30, 2021.The company also reported $801 million in revenue from its energy business, including solar photovoltaics and energy storage systems for


homes, businesses and utilities, an increase of more than 60% from last quarter. While Tesla does not disclose how many energy storage units it sells each quarter, in recent weeks CEO Elon


Musk said, in court, that the company would only be able to produce 30,000 to 35,000 at best during the current quarter, blaming the lag on chip shortages. Tesla also reported $951 million


in services and other revenues. The company now operates 598 stores and service centers, and a mobile service fleet including 1,091 vehicles, an increase of just 34% versus a year ago. That


compares with an increase of 121% in vehicle deliveries year over year. A $23 million impairment related to the value of its bitcoin holdings was reported as an operating expense under


“Restructuring and other.”""" Let’s start with installing and importing the libraries required to run the summarization process using different methods.


!pip install bert-extractive-summarizer
!pip install spacy
!pip install transformers
!pip install torch
!pip install sentencepiece

Import the libraries:


import gensim
from gensim.summarization import summarize
import torch
from transformers import pipeline
from transformers import T5Tokenizer
from transformers import T5ForConditionalGeneration
from transformers import T5Config
from summarizer import Summarizer
from summarizer import TransformerSummarizer

Summarization using Gensim (TextRank)

The gensim package is used for natural language processing and information retrieval tasks such as topic modeling, document indexing, word2vec, and similarity retrieval. Here we use it for text summarization via the TextRank algorithm. (Note: the gensim.summarization module was removed in gensim 4.0, so this walkthrough assumes a gensim 3.x release.)


TextRank is an extractive and unsupervised text summarization technique. It builds a graph with sentences as nodes and inter-sentence similarity as edge weights, then runs a PageRank-style algorithm over that graph to assign a score to each sentence. The top-ranked sentences then make it to the summary.
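To make the idea concrete, here is a minimal, illustrative sketch of TextRank-style sentence ranking. This is not gensim's actual implementation; the naive sentence splitting and the word-overlap similarity are simplifications for illustration only.

import itertools
import networkx as nx

def textrank_sketch(document, top_n=2):
    # Naive sentence split; real implementations use proper tokenizers.
    sentences = [s.strip() for s in document.split('.') if s.strip()]
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    # Connect sentence pairs weighted by a Jaccard-style word overlap.
    for i, j in itertools.combinations(range(len(sentences)), 2):
        a = set(sentences[i].lower().split())
        b = set(sentences[j].lower().split())
        overlap = len(a & b) / (1 + len(a | b))
        if overlap > 0:
            graph.add_edge(i, j, weight=overlap)
    # PageRank scores each sentence node; keep the top-ranked ones.
    scores = nx.pagerank(graph)
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return '. '.join(sentences[i] for i in sorted(ranked)) + '.'

print(textrank_sketch(text))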


Let us use gensim's summarize function with the different summarizing parameters given below.

ratio: Takes values between 0 and 1. It represents the proportion of the summary relative to the original text.
word_count: Sets the number of words in the summary.

Summarize by Ratio:


summary_by_ratio = summarize(text, ratio=0.15)
print("Summary : \n" + summary_by_ratio)

Output >>>
Summary : The company also reported $801 million in revenue from its energy business, including solar photovoltaics and energy storage systems for homes, businesses and utilities, an increase of more than 60% from last quarter. Tesla also reported $951 million in services and other revenues.

Summarize by Word Count:


summary_by_count = summarize(text, word_count=60)
print("Summary : " + summary_by_count)

Output >>>
Summary : Overall automotive revenue came in at $10.21 billion, of which only $354 million, about 3.5%, came from sales of regulatory credits. The company also reported $801 million in revenue from its energy business, including solar photovoltaics and energy storage systems for homes, businesses and utilities, an increase of more than 60% from last quarter. Tesla also reported $951 million in services and other revenues.

Abstractive Summarization with pre-trained models

Here we use the transformers library in Python to perform abstractive text summarization on the input text. All the documentation for the transformers library can be found at https://huggingface.co/transformers/


Let us now explore summarization techniques using the methods below.

Pipeline API
T5 Transformer
BERT
GPT2
XLNet

Summarization using Pipeline API

The most straightforward way to use models in transformers is the pipeline API. Pipelines are a great and easy way to use models for inference. These objects abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering.
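As a quick illustration of that generality (a side note, not part of the summarization workflow), the same API serves other tasks simply by changing the task name; for example, sentiment analysis with its default model:

from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
# Returns a list of label/score dicts, e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
print(sentiment("Tesla beat expectations on both the top and bottom lines."))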


The models that this pipeline can use are models that have been fine-tuned on a summarization task: currently 'bart-large-cnn', 't5-small', 't5-base', 't5-large', 't5-3b' and 't5-11b'.


Run the summarization pipeline with the default model:


summarization = pipeline("summarization")
## Call the transformer's summarization API by passing the text
abstract_text = summarization(text)[0]['summary_text']
print("Summary:", abstract_text)


Note: The first time you execute this or any of the techniques listed, it may take a while to download the model architecture and weights, as well as the tokenizer configuration in some cases.


Here is the summary generated after executing the above pipeline code


Output >>>
Summary: Tesla reported $1.14 billion in (GAAP) net income for the quarter, the first time it has surpassed $1 billion . The company also reported $801 million in revenue from its energy business, including solar photovoltaics and energy storage systems . Shares rose about 2% after-hours .

The abstractive summary generated looks pretty decent, doesn't it!


Run the same pipeline by explicitly passing the t5-base model


## Set up the pipeline with the t5-base model
t5summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base", framework="tf")
## Build the summary with a minimum of 5 and a maximum of 60 tokens
t5summarizer(text, min_length=5, max_length=60)

Output >>>
Summary: Tesla reported $1.45 vs 98 cents per share adjusted expected, according to Refinitiv . overall automotive revenue came in at $10.21 billion, of which only $354 million, about 3.5%, came from sales of regulatory credits

Summarization with T5 Transformer

T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, where each task is converted into a text-to-text format. T5 works well on a variety of tasks out of the box by prepending a different, task-specific prefix to the input.
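To see this text-to-text framing in action, here is a small illustrative sketch (assuming the t5-small checkpoint) in which only the prefix changes between a translation task and a summarization task:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The same model handles different tasks depending only on the prefix.
for prompt in ["translate English to German: The house is wonderful.",
               "summarize: " + text]:
    ids = tok(prompt, return_tensors="pt", truncation=True).input_ids
    out = model.generate(ids, max_length=60)
    print(tok.decode(out[0], skip_special_tokens=True))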


Instantiate the pretrained “t5-small” model through the from_pretrained method:


t5model = T5ForConditionalGeneration.from_pretrained('t5-small')
t5tokenizer = T5Tokenizer.from_pretrained('t5-small')
device = torch.device('cpu')

Add the string summarize: at the beginning of your raw text and encode it. The T5 transformer performs different tasks by prepending the appropriate prefix to the input text.


t5tokenized_text = t5tokenizer("summarize: " + text, truncation=True, padding='max_length', return_tensors="pt").to(device)

(The tokenizer is called directly rather than through encode so that it returns both the input_ids and the attention_mask used below.)

Next you will pass the input_ids from the tokenized text, along with other parameters, to the generate function. The technique used here is “Beam Search”. Beam search reduces the risk of missing hidden high-probability word sequences by keeping the num_beams most likely hypotheses at each time step and eventually choosing the hypothesis that has the overall highest probability.


Read more about Greedy Search and Beam Search here: https://huggingface.co/blog/how-to-generate
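As a quick comparison (a sketch reusing the t5model and t5tokenized_text objects defined above), you can produce a greedy-search baseline by setting num_beams=1 and contrast it with the beam-search call that follows:

# Greedy search: num_beams=1 keeps only the single most likely token at each step.
greedy_ids = t5model.generate(input_ids=t5tokenized_text['input_ids'],
                              attention_mask=t5tokenized_text['attention_mask'],
                              num_beams=1, min_length=20, max_length=70)
print(t5tokenizer.decode(greedy_ids[0], skip_special_tokens=True))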


t5summary_ids = t5model.generate(input_ids=t5tokenized_text['input_ids'],
                                 attention_mask=t5tokenized_text['attention_mask'],
                                 num_beams=3,
                                 min_length=20,
                                 max_length=70,
                                 repetition_penalty=2.0,
                                 early_stopping=True)

The parameters are explained below:


max_length: The maximum number of tokens to generate.
min_length: The minimum number of tokens to generate.
length_penalty: Exponential penalty on the length; 1.0 means no penalty, and increasing this parameter increases the length of the output text.
num_beams: Specifying this parameter makes the model use beam search instead of greedy search; setting num_beams to 4 lets the model track 4 candidate hypotheses at each step (versus 1 in the case of greedy search).
early_stopping: We set it to True, so that generation finishes when all beam hypotheses reach the end-of-string (EOS) token.

output = t5tokenizer.decode(t5summary_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
print("Summary:", output)

Output >>>


Summary: shares rose about 2% after-hours, according to Refinitiv. in the year-ago quarter, net income amounted to $104 million. overall revenue came in at $10.21 billion, of which only $354 million came from regulatory credits.

Summarization with BERT Model

BERT (Bidirectional Encoder Representations from Transformers) is a transformer model designed to overcome limitations of RNNs and other neural networks, such as handling long-term dependencies. It is a pre-trained model that is naturally bidirectional, and it can be tuned to perform specific NLP tasks, summarization in our case. Note that the Summarizer from the bert-extractive-summarizer package used below is extractive: it embeds sentences with BERT, clusters the embeddings, and selects the sentences closest to the cluster centroids.
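For intuition, here is a minimal, illustrative sketch of that embed-and-cluster idea. This is not the library's actual code; the checkpoint, the mean pooling, and the number of clusters are assumptions made for illustration.

import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

sentences = [s.strip() for s in text.split('.') if len(s.strip()) > 20]
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

# Mean-pool the last hidden states to get one embedding per sentence.
embeddings = []
with torch.no_grad():
    for s in sentences:
        inputs = tok(s, return_tensors="pt", truncation=True)
        hidden = bert(**inputs).last_hidden_state
        embeddings.append(hidden.mean(dim=1).squeeze().numpy())
embeddings = np.stack(embeddings)

# Cluster the sentence embeddings and keep the sentence nearest each centroid.
kmeans = KMeans(n_clusters=3, random_state=0).fit(embeddings)
picked = sorted({int(np.argmin(np.linalg.norm(embeddings - c, axis=1)))
                 for c in kmeans.cluster_centers_})
print('. '.join(sentences[i] for i in picked) + '.')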


Now let's run the actual summarizer from the library:

bert_model = Summarizer()
bert_summary = ''.join(bert_model(text, min_length=60))
print("Summary: " + bert_summary)

Output >>>
Summary: Tesla reported second-quarter earnings after the bell Monday, and it’s a beat on both the top and bottom lines. The company also reported $801 million in revenue from its energy business, including solar photovoltaics and energy storage systems for homes, businesses and utilities, an increase of more than 60% from last quarter. That compares with an increase of 121% in vehicle deliveries year over year.

Summarization with GPT2 Model


Generative Pre-trained Transformer (GPT) models by OpenAI have taken the natural language processing (NLP) community by storm by introducing very powerful language models. These models can perform various NLP tasks like question answering, textual entailment, text summarization etc. without any supervised training.


GPT-2 is capable of next-word prediction on a much larger and more sophisticated scale. For reference, the smallest available GPT-2 has 117 million parameters, whereas the largest (gpt2-xl) has over 1.5 billion. OpenAI initially withheld the 1.5-billion-parameter model but released it publicly in November 2019; here we use the intermediate gpt2-medium checkpoint.


Let’s load the model, and get the generated summary!


GPT2_model = TransformerSummarizer(transformer_type="GPT2", transformer_model_key="gpt2-medium")
gpt_summary = ''.join(GPT2_model(text, min_length=60))
print(gpt_summary)

Output >>>
Summary: Tesla reported second-quarter earnings after the bell Monday, and it’s a beat on both the top and bottom lines. In the year-ago quarter, net income amounted to $104 million. The company also reported $801 million in revenue from its energy business, including solar photovoltaics and energy storage systems for homes, businesses and utilities, an increase of more than 60% from last quarter.

Wow, much better than BERT! Look at the insightful summary it has generated!


Now, let’s jump into our final model, the XLNet!


Summarization with XLNet

XLNet is a generalized autoregressive (AR) language model that learns unsupervised representations of text sequences. It incorporates modelling techniques from autoencoder (AE) models such as BERT into AR models while avoiding the limitations of AE.


Let's load the model and print out the summary:


xlnet_model = TransformerSummarizer(transformer_type="XLNet", transformer_model_key="xlnet-base-cased")
xlnet_summary = ''.join(xlnet_model(text, min_length=60))
print("Summary: " + xlnet_summary)

Output >>>
Summary: Tesla reported second-quarter earnings after the bell Monday, and it’s a beat on both the top and bottom lines. That’s a lower number for credits than in any of the previous four quarters. The company also reported $801 million in revenue from its energy business, including solar photovoltaics and energy storage systems for homes, businesses and utilities, an increase of more than 60% from last quarter.

Across all the transformer models, the generated results show that this approach works really well, which is impressive! 👌


Conclusion

Text summarization has become an important and timely tool for digesting and interpreting text in today's fast-growing information age. Its applications can be seen wherever text documents are involved, irrespective of the field. Because of these applications and the ease it brings, text summarization is a popular topic in the field of Natural Language Processing.


I hope this article gives you a heads-up on the different types and approaches used for text summarization. If it did, please do clap.. 👏