Financial news sentiment analysis in Python

In the dynamic world of finance, understanding market sentiment is crucial for researchers and practitioners alike.

In this tutorial, I present three methods for extracting sentiment from financial news using Python. Starting with a simple yet effective dictionary-based approach, we progress to FinBERT, a BERT-based model tailored for financial texts. Lastly, we examine the capabilities of large language models (LLMs) in sentiment analysis.

All the code for this tutorial is available on GitHub.

Video tutorial

This post is also available as a video tutorial on YouTube.

import pandas as pd

Dependencies

To follow along with this tutorial, you will need the following Python libraries:

For managing data:

pandas

For loading the news data:

langchain
newspaper3k
lxml-html-clean

For BERT-based sentiment analysis:

transformers
sentence-transformers
torch
scipy

For large language models:

langchain
tenacity

Optional (to use LLMs other than those in Ollama):

langchain-experimental
openai
langchain-openai

Financial news data

First, let’s load some financial news data to work with. For that, we will use the langchain library, which provides a simple interface for loading news data from various sources. In our case, we will load news articles from Yahoo Finance and put them into a pandas DataFrame.

from langchain_community.document_loaders import NewsURLLoader

urls = [
    "https://ca.finance.yahoo.com/news/rising-oil-price-doesnt-shake-160556917.html",
    "https://ca.finance.yahoo.com/news/venezuela-detains-two-former-maduro-173140679.html",
    "https://ca.finance.yahoo.com/news/norfolk-southern-agrees-pay-600m-121211343.html",
    "https://ca.finance.yahoo.com/news/boeing-shares-fall-nyt-report-164509069.html",
    "https://ca.finance.yahoo.com/news/restaurants-along-eclipses-path-totality-153650201.html",
]

loader = NewsURLLoader(urls=urls)
data = loader.load()

data

[Document(page_content='TORONTO — The price of oil has been on a steady climb all year, but the talk at Canada\'s biggest oil and gas conference is still focused on spending discipline.\n\nIndustry leaders at the Canadian Association of Petroleum Producers conference, held in Toronto this year, have been emphasizing their predictability and focus on returning money to shareholders, rather than talk of growth.\n\nSuncor Energy Inc. chief executive Rich Kruger, who was named head of the oil and gas producer last year as it struggled with safety and operational issues, said his goal is to bring clarity and simplicity to the company.\n\n"I want to become consistently and boringly excellent," said Kruger. "I\'m not a big one for surprise parties."\n\nADVERTISEMENT\n\nKruger has been working to standardize operations and create a steadier production plan, in contrast to some of the more rushed decisions when growth was the answer to all of the industry\'s questions.\n\nThe early development of the Fort Hills oilsands site, for example, saw mine plans that had slope angles too steep, and not enough was done to check for water issues, in what were fairly short-sighted decisions made to feed the processing plant faster, he said.\n\n"If you go back 10-plus years ago, we lived in a world we thought had resource scarcity, oil prices are going be $100 or better, where growth in production volumes was synonymous with growth in value, a different world than we live in today."\n\nEven with oil up about US$15 per barrel so far this year to US$85, industry leaders at the conference have been emphasizing that they no longer see production growth as so deeply tied to value, and that each added barrel has to be weighed against returning money to shareholders.\n\nThe shift is happening as investors worry about long-term demand prospects for fossil fuels as the push to reduce carbon emissions ramps up.\n\nHowever, forecasts do show that oil demand is still growing, said BMO analyst Randy 
Ollenberger.\n\nStory continues\n\n"We often hear the narrative that oil demand has peaked, that it\'s not growing and how that\'s negative for the space. That\'s not true, oil demand is actually continuing to grow, and in fact, it\'s continuing to grow at a pace that\'s higher than the average over the last 13 years."\n\nStill, with investors looking for the industry to reliably pump out cash, as much, if not more than they\'re looking for growth, company leaders are eager to assure they won\'t be lost in exuberance as prices rise.\n\nCenovus Energy Inc. CEO Jon McKenzie said his company is planning restrained and strategic growth, focused on reducing bottlenecks and finishing shelved projects.\n\n"Growth that we’ve kicked off in 2023 is very different than the kind of growth you would have seen 10, 15 years ago. We’re not talking about greenfield expansion, we’re not talking about phased expansions."\n\nSmaller producers were also keen to emphasize that they were no longer growing for growth\'s sake, including Whitecap Resources Inc. chief executive Grant Fagerheim.\n\n"Managing growth in a very disciplined manner, I think that’s a mantra that has been introduced to the energy sector, and I’m proud to be part of it."\n\nThis report by The Canadian Press was first published April 9, 2024.\n\nCompanies in this story: (TSX:SU, TSX:CVE, TSX:WCP)\n\nThe Canadian Press\n\nNote to readers: This is a corrected story. A previous version incorrectly called Suncor Energy Canada\'s largest oil and gas producer.', metadata={'title': "Higher oil doesn't shake industry talk on spending discipline at CAPP conference", 'link': 'https://ca.finance.yahoo.com/news/rising-oil-price-doesnt-shake-160556917.html', 'authors': ['The Canadian Press'], 'language': 'en', 'description': "TORONTO — The price of oil has been on a steady climb all year, but the talk at Canada's biggest oil and gas conference is still focused on spending discipline. 
Industry leaders at the Canadian Association of Petroleum Producers conference, held in Toronto this year, have been emphasizing their predictability and focus on returning money to shareholders, rather than talk of growth. Suncor Energy Inc. chief executive Rich Kruger, who was named head of the oil and gas producer last year as it stru", 'publish_date': None}),
 Document(page_content='(Bloomberg) -- Venezuela detained former oil and finance ministers Tareck El Aissami and Simón Zerpa more than a year after an investigation into billions of lost Petroleos de Venezuela SA revenue that’s led to a purge of the ruling elite’s inner circle.\n\nMost Read from Bloomberg\n\nPublic Prosecutor Tarek William Saab said El Aissami and Zerpa, once President Nicolás Maduro’s closest allies, are part of a group of more than 50 others involved in a scheme to “destroy Venezuela’s economy,” including “corrupt” bankers based in Miami and Washington. Aissami’s business partner, Samark López, was also arrested, Saab said.\n\nADVERTISEMENT\n\nEl Aissami, who was shown wearing handcuffs and a black t-shirt, will be charged with treason, appropriation and money laundering, Saab said.\n\nThe arrests come at a pivotal moment for Maduro, who is facing criticism of unfair conditions ahead of presidential vote in July, in which he is seeking to win a third consecutive term. The US has said it’s willing to let an important oil and gas license expire on April 18 if the socialist government doesn’t take steps toward freer elections.\n\nRead more: Power Struggle and Missing Billions Roil Venezuelan Ruling Elite\n\nEl Aissami resigned as Venezuela’s oil minister in March 2023 when Maduro launched a sweeping anti-corruption operation after an internal audit revealed a financial black hole at state-owned PDVSA, the government’s most important source of funding. That triggered a widening corruption investigation that ensnared judges, elected officials, and the head of the nation’s crypto regulator.\n\nSaab said that El Aissami paid millions of dollars in bribes to allocate Venezuelan crude, pet-coke and fuel oil below market value through shell companies that bypassed the central bank. 
They also used the revenue from those sales to manipulate the country’s currency exchange system, Saab said.\n\nStory continues\n\n--With assistance from Fabiola Zerpa.\n\n(Updates with charges against El Aissami and US context starting in third paragraph.)\n\nMost Read from Bloomberg Businessweek\n\n©2024 Bloomberg L.P.', metadata={'title': 'Venezuela Detains Former Maduro Confidantes in PDVSA Probe', 'link': 'https://ca.finance.yahoo.com/news/venezuela-detains-two-former-maduro-173140679.html', 'authors': ['Patricia Laya', 'Andreina Itriago Acosta'], 'language': 'en', 'description': '(Bloomberg) -- Venezuela detained former oil and finance ministers Tareck El Aissami and Simón Zerpa more than a year after an investigation into billions of lost Petroleos de Venezuela SA revenue that’s led to a purge of the ruling elite’s inner circle. Most Read from BloombergUS Slams Strikes on Russia Oil Refineries as Risk to Oil MarketsBond Trader Places Record Futures Bet on Eve of Inflation DataIran’s Better, Stealthier Drones Are Remaking Global WarfareTrumpism Is Emptying ChurchesUkrain', 'publish_date': None}),
 Document(page_content="Norfolk Southern has agreed to pay $600 million in a class-action lawsuit settlement for a fiery February 2023 train derailment in Ohio, but residents worry the money not only won’t go far enough to cover future health needs that could be tremendous but also won't amount to much once divvied up.\n\n“It’s not nowhere near my needs, let alone what the health effects are going to be five or 10 years down the road,” said Eric Cozza, who lived just three blocks from the derailment and had 47 family members living within a mile (1.61 kilometers).\n\nMore than three dozen of the freight train's 149 cars derailed on the outskirts of East Palestine, a town of almost 5,000 residents near the Pennsylvania state line. Several cars spilled a cocktail of hazardous materials that caught fire. Three days later, officials, fearing an explosion, blew open five tankcars filled with vinyl chloride and burned the toxic chemical — sending thick, black plumes of smoke into the air. Some 1,500 to 2,000 residents were evacuated.\n\nNorfolk Southern said the agreement, if approved by the court, will resolve all class action claims within a 20-mile (32-kilometer) radius of the derailment and, for residents who choose to participate, personal injury claims within a 10-mile (16-kilometer) radius of the derailment.\n\nADVERTISEMENT\n\nThe area includes East Palestine and people who evacuated, as well as several other larger towns.\n\nThe settlement, which doesn’t include or constitute any admission of liability, wrongdoing or fault, represents only a small slice of the $3 billion in revenue Norfolk Southern generated just in the first three months of this year. The railroad said that even after the settlement it still made a $213 million profit in the quarter.\n\nEast Palestine resident Krissy Ferguson called the settlement a “heart-wrenching day.”\n\n“I just feel like we’ve been victimized over and over and over again,” she said. “We fought and we’re still fighting. 
And contamination is still flowing down the creeks. People are still sick. And I think people that had the power to fight took an easy way out.”\n\nStory continues\n\nMore than a year later residents still complain about respiratory problems and unexplained rashes and nosebleeds, but the greater fear is that people will develop cancer or other serious conditions because of the chemicals they were exposed to. Researchers have only begun to work on determining the lasting repercussions of the derailment.\n\nThe company said Tuesday that individuals and businesses will be able to use compensation from the settlement in any manner they see fit.\n\nThe settlement is expected to be submitted for preliminary approval to the U.S. District Court for the Northern District of Ohio this month. Payments could begin to arrive by the end of the year, subject to final court approval.\n\nNorfolk Southern has already spent more than $1.1 billion on its response to the derailment, including more than $104 million in direct aid to East Palestine and its residents. Partly because Norfolk Southern is paying for the cleanup, President Joe Biden has never declared a disaster in the town, which remains a sore point for many.\n\nThe railroad has promised to create a fund to help pay for the long-term health needs of the community, but that hasn’t been finalized yet.\n\nThe plaintiffs' attorneys said the deal follows a year of intense investigation and should provide meaningful relief to residents.\n\nStill, residents like Misti Allison have many unanswered questions.\n\n“What goes through my head is, after all the lawyers are paid and the legal fees are accounted for, how much funding will be provided for families? 
And is that going to be enough for any of these potential damages moving forward?” she said.\n\nJami Wallace, too, worries about having a settlement without knowing the long-term impact of the derailment.\n\n“I would really like to see the numbers because in my opinion, taking a plea deal only is in the best interest of the attorneys,” she said. “They’re all going to get their money. But we’re the residents that are still going to be left to suffer.”\n\nCozza said he spent about $8,000 to move out of town and that — along with medical bills and the cost of replacing his contaminated belongings — exhausted what little savings he had. And he can't put a price on the 10-year relationship he lost or the way his extended family was scattered after the derailment.\n\nThe CEO of Threshold Residential, one of the biggest employers in town, estimates that his business has lost well over $100,000.\n\nLast week federal officials said that the aftermath of the train derailment doesn’t qualify as a public health emergency because widespread health problems and ongoing chemical exposure haven’t been documented, contrasting residents' reports.\n\nThe head of the National Transportation Safety Board recently said the agency’s investigation showed that venting and burning of the vinyl chloride was unnecessary because the producer of the chemical ascertained that no dangerous reaction occurred inside the tank cars. 
Officials who made the decision — Ohio’s governor and the local fire chief leading the response — have said they were never told that.\n\nThe NTSB’s full investigation into the cause of the derailment won’t be complete until June, but the agency has said that an overheating wheel bearing on one of the railcars, which wasn’t detected in time by a trackside sensor, likely caused the crash.\n\nThe EPA has said cleanup in East Palestine is expected to be completed this year.\n\nThe railroad announced preliminary first-quarter earnings of 23 cents per share Tuesday, which reflects the cost of the $600 million settlement. Without the settlement and some other one-time costs, the railroad said it would have made $2.39 per share.\n\nRailroad CEO Alan Shaw, who is fighting for his job against an activist investor aiming to overhaul the railroad's operations, said Norfolk Southern is “becoming a more productive and efficient railroad” but acknowledged there is more work to do.\n\nAncora Holdings is trying to persuade investors to support its nominees for Norfolk Southern's board and its plan to replace Shaw and the rest of the management team at the railroad's May 9 annual meeting. 
Ancora says the company's profits have lagged behind the other major freight railroads for years and the investors question Shaw's leadership.\n\nThe railroad said that in addition to the settlement, its results were hurt by $91 million in unusual expenses including money it is spending to fight back against Ancora, management layoffs and the $25 million it gave to CPKC last month for the right to hire one of that railroad's executives to be Norfolk Southern's new chief operating officer.\n\nThe railroad said Tuesday that even though volume was up 4% during the first quarter, company revenue fell by 4% because of lower fuel surcharge revenue and changes in the mix of shipments it handled to include more containers of imported goods that railroads get paid less to deliver than other commodities.\n\nShares of Norfolk Southern Corp., based in Atlanta, were up slightly through the day after a flat start following the settlement announcement.\n\n___\n\nAssociated Press writer Brooke Schultz contributed to this report from Harrisburg, Pennsylvania, and Michelle Chapman contributed from New York.\n\nJosh Funk, The Associated Press", metadata={'title': 'Norfolk Southern agrees to $600M settlement in fiery Ohio derailment. Locals fear it’s not enough', 'link': 'https://ca.finance.yahoo.com/news/norfolk-southern-agrees-pay-600m-121211343.html', 'authors': ['The Canadian Press'], 'language': 'en', 'description': "Norfolk Southern has agreed to pay $600 million in a class-action lawsuit settlement for a fiery February 2023 train derailment in Ohio, but residents worry the money not only won’t go far enough to cover future health needs that could be tremendous but also won't amount to much once divvied up. “It’s not nowhere near my needs, let alone what the health effects are going to be five or 10 years down the road,” said Eric Cozza, who lived just three blocks from the derailment and had 47 family memb", 'publish_date': None}),
 Document(page_content='(Bloomberg) -- Boeing Co. faces a deepening crisis of confidence after an engineer at the US planemaker alleged the company took manufacturing shortcuts on its 787 Dreamliner aircraft in order to ease production bottlenecks of its most advanced airliner.\n\nMost Read from Bloomberg\n\nFactory workers wrongly measured and filled gaps that can occur when airframe segments of the 787 are joined together, according to Sam Salehpour, a longtime Boeing employee who made his concerns public on Tuesday. That assembly process could create “significant fatigue” in the composite material of the barrel sections and impair the structural integrity of more than 1,000 of the widebody jets in service, he said.\n\nADVERTISEMENT\n\nSalehpour, who according to his attorneys at Katz Banks Kumin LLP in Washington worked on the 787 from 2020 through early 2022, said the issues he described “may dramatically reduce the life of the plane.” Boeing disputes the allegations.\n\n“In a mad rush to reduce the backlog of the planes and get them to market, Boeing did not follow its own engineering requirements,” the engineer said on a conference call with reporters and his lawyers.\n\nThe claims risk opening another flank at the embattled planemaker, which is already facing intense scrutiny of its manufacturing and quality practices since a fuselage panel blew off a nearly new 737 Max 9 shortly after takeoff on Jan. 5. The allegations now extend the spotlight to the Dreamliner, a critical source of cash for Boeing as 737 output remains muted under close oversight by the US Federal Aviation Administration.\n\nAfter the allegations were made public, Senator Richard Blumenthal, a Connecticut Democrat, announced that he had asked Boeing’s departing Chief Executive Officer Dave Calhoun to appear at an April 17 subcommittee hearing called to examine the planemaker’s safety culture. 
Calhoun last month announced that he’s stepping down by the end of the year, part of a wider management shakeup in the wake of the Jan. 5 accident.\n\nStory continues\n\nHalted Deliveries\n\n“Boeing understands the important oversight responsibilities of the subcommittee and we are cooperating with this inquiry,” the company said, when asked if Calhoun or other executives planned to testify. “We have offered to provide documents, testimony and technical briefings, and are in discussions with the subcommittee regarding next steps.”\n\nIn separate statements, Boeing disputed Salehpour’s account. The company noted it had halted 787 deliveries for nearly two years earlier this decade under close FAA supervision after it found a spate of tiny structural imperfections in the joints where the carbon-fiber barrel sections are bolted together.\n\n“These claims about the structural integrity of the 787 are inaccurate and do not represent the comprehensive work Boeing has done to ensure the quality and long-term safety of the aircraft,” the planemaker said in a statement responding to the allegations, which were reported earlier Tuesday by the New York Times.\n\nCompany engineers are “completing exhaustive analysis to determine any long-term inspection and maintenance required, with oversight from the FAA,” Boeing said.\n\nThe latest allegations cast Boeing in an unfavorable light as it grapples with a crisis of confidence after the Jan. 5 panel blowout. While nobody on that flight was seriously hurt, the issue has put the spotlight on Boeing’s manufacturing and safety procedures and has led to a wholesale makeover of senior management. 
The crisis has jolted investors as well.\n\nBoeing shares fell 1.9% Tuesday, extending their decline to almost 32% this year, the worst performance on the Dow Jones Industrial Average.\n\nSenate Hearing\n\nSalehpour plans to discuss the manufacturing shortfalls he witnessed during the hearing before the Senate Permanent Subcommittee on Investigations scheduled for April 17.\n\nIn a March 19 letter to Calhoun, Blumenthal and Wisconsin’s Ron Johnson, the panel’s top-ranking Republican, asked for Boeing’s “immediate cooperation” with the panel’s review of Salehpour’s allegations.\n\nSalehpour’s attorneys flagged the issues to the FAA in a whistleblower letter dated Jan. 19. The agency has launched an investigation and interviewed Salehpour, his attorneys said, adding that other whistleblowers have come forward.\n\n“Voluntary reporting without fear of reprisal is a critical component in aviation safety,” the FAA said in a statement. “We strongly encourage everyone in the aviation industry to share information. We thoroughly investigate all reports.”\n\nBoeing disputed his attorney’s suggestion that in-service 787 Dreamliners face shorter commercial lives. The company said it has confirmed the finding through extensive fatigue testing, including forces that were more than 10 times the maximum allowed in production.\n\nSalehpour said he flagged his concerns to Boeing management but was ignored and ultimately transferred to the 777 program. There he claims to have also witnessed improper production practices after the planemaker ripped out a flawed robotic system without properly redesigning the relevant parts to match the new assembly process.\n\nIn a separate statement, Boeing said the claims were also inaccurate. 
“We are fully confident in the safety and durability of the 777 family,” the company said.\n\n(Adds whistleblower comment in fourth paragraph.)\n\nMost Read from Bloomberg Businessweek\n\n©2024 Bloomberg L.P.', metadata={'title': 'Boeing Crisis of Confidence Deepens With 787 Now Under Scrutiny', 'link': 'https://ca.finance.yahoo.com/news/boeing-shares-fall-nyt-report-164509069.html', 'authors': ['Siddharth Philip', 'Julie Johnsson'], 'language': 'en', 'description': '(Bloomberg) -- Boeing Co. faces a deepening crisis of confidence after an engineer at the US planemaker alleged the company took manufacturing shortcuts on its 787 Dreamliner aircraft in order to ease production bottlenecks of its most advanced airliner.Most Read from BloombergUS Slams Strikes on Russia Oil Refineries as Risk to Oil MarketsChinese Cement Maker Halted After 99% Crash in 15 MinutesBond Trader Places Record Futures Bet on Eve of Inflation DataApple’s India iPhone Output Hits $14 Bi', 'publish_date': None}),
 Document(page_content='Restaurants on the eclipse\'s path of totality saw a jump in sales on Monday as people flocked to find the best spots to see the celestial event, according to sales data from payments technology company Square.\n\n"The eclipse was genuinely a unique cosmic event, but also a unique event for commerce," said Ara Kharazian, research and data lead at Square.\n\nSquare said Tuesday restaurants that use its technology in Niagara Falls, Ont., which saw a huge influx of visitors for the eclipse, saw 404 per cent higher sales than the average Monday in 2024.\n\nHamilton, Ont., saw a 67 per cent jump, while Montreal restaurants saw sales rise 55 per cent.\n\nADVERTISEMENT\n\nThe increases mirrored a similar pattern in the U.S., where some counties saw restaurant sales rise by more than 500 per cent.\n\nThis kind of spike in spending is almost as rare as an eclipse, Kharazian said, as multiple places across North America saw sales skyrocket.\n\n"It’s simultaneously expected and also pretty stunning," he said.\n\n"It is genuinely rare to see this level of highly localized and also supercharged level of spending."\n\nThe spike gave a welcome boost to restaurants, which are highly seasonal, he said.\n\n"It wasn\'t a one-day event, we know that we saw some spending leading up to the weekend as well," he added.\n\n"That excess sales revenue is going to be very helpful to a business ... 
We hear often from restaurants who report that some of their biggest challenges have to do with access to cash and liquidity."\n\nMunicipalities across Central and Eastern Canada spent months preparing for the brief window of time in which the sun, Earth and moon aligned on Monday afternoon.\n\nDemand for hotels and short-term rentals surged for the weekend ahead of the eclipse, while municipalities planned events centred around the phenomenon.\n\nA February report from Airbnb said Montreal and the Niagara Region were among the most popular cities on its platform along the path of totality.\n\nStory continues\n\nMunicipalities like Hamilton and Niagara Falls urged visitors to plan ahead, anticipating heavy traffic and high demand. The mayor of Niagara Falls said about a million people were expected to fill the city, the largest crowd in its history.\n\n"It was not a typical Monday in April, that’s for sure," said Janice Thomson, president and CEO of Niagara Falls Tourism.\n\nThough Niagara Falls is a yearlong tourist destination, Thomson said she\'d never experienced anything like what happened on Monday, with a huge crowd of people cheering every time the clouds parted.\n\nCanada\'s telecom companies deployed additional infrastructure in preparation for large crowds of people. 
In a statement, Rogers said its network handled more than six times the amount of traffic it normally does in Niagara Falls, thanks in part to portable mobile towers.\n\nThis report by The Canadian Press was first published April 9, 2024.\n\nRosa Saba, The Canadian Press', metadata={'title': "Restaurants along eclipse's path of totality saw sales boom: Square", 'link': 'https://ca.finance.yahoo.com/news/restaurants-along-eclipses-path-totality-153650201.html', 'authors': ['The Canadian Press'], 'language': 'en', 'description': 'Restaurants on the eclipse\'s path of totality saw a jump in sales on Monday as people flocked to find the best spots to see the celestial event, according to sales data from payments technology company Square. "The eclipse was genuinely a unique cosmic event, but also a unique event for commerce," said Ara Kharazian, research and data lead at Square. Square said Tuesday restaurants that use its technology in Niagara Falls, Ont., which saw a huge influx of visitors for the eclipse, saw 404 per ce', 'publish_date': None})]

df = pd.DataFrame(
    [{"title": d.metadata["title"], "text": d.page_content} for d in data]
)

df

	title	text
0	Higher oil doesn't shake industry talk on spen...	TORONTO — The price of oil has been on a stead...
1	Venezuela Detains Former Maduro Confidantes in...	(Bloomberg) -- Venezuela detained former oil a...
2	Norfolk Southern agrees to $600M settlement in...	Norfolk Southern has agreed to pay $600 millio...
3	Boeing Crisis of Confidence Deepens With 787 N...	(Bloomberg) -- Boeing Co. faces a deepening cr...
4	Restaurants along eclipse's path of totality s...	Restaurants on the eclipse's path of totality ...

Dictionary-based approach

A dictionary-based approach is a simple way to perform sentiment analysis. It consists of using a list of words with associated sentiment scores. The sentiment score of a sentence is the sum of the sentiment scores of the words it contains. You can also normalize the score by the number of words in the sentence.
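
As a minimal sketch of this idea, the snippet below scores a sentence with a toy dictionary. The words, their scores, and the function name `dictionary_sentiment` are illustrative assumptions and do not come from any real lexicon.

```python
# Toy dictionary: these words and scores are illustrative assumptions,
# not taken from a real sentiment lexicon.
toy_scores = {"gain": 1, "strong": 1, "loss": -1, "weak": -1}


def dictionary_sentiment(sentence, scores, normalize=True):
    """Sum the scores of the words in `sentence`, optionally divided by the word count."""
    words = sentence.lower().split()
    # Words missing from the dictionary contribute a score of 0
    total = sum(scores.get(w, 0) for w in words)
    if normalize and words:
        return total / len(words)
    return total


sentence = "strong gain offsets weak quarter"
raw = dictionary_sentiment(sentence, toy_scores, normalize=False)
norm = dictionary_sentiment(sentence, toy_scores)
print(raw, norm)
```

Here two positive words and one negative word give a raw score of 1, and dividing by the five words in the sentence gives a normalized score of 0.2.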

For this type of approach, we need a financial sentiment dictionary. I will use the Loughran-McDonald dictionary, which is the most commonly used in finance research. Note that this dictionary is designed for financial statements and earnings calls, so it may not be the best choice for news articles.

You can download the dictionary from here (direct link: Loughran-McDonald_MasterDictionary_1993-2023.csv)

Pre-processing

The preprocessing steps for a dictionary-based approach depend on the dictionary you are using and the final measure you want to obtain. In this case, we will use the Loughran-McDonald dictionary, which contains variations of similar words, including plural forms, verb forms, etc. Therefore, we do not need to perform stemming or lemmatization, a common step in text preprocessing.

The preprocessing steps we will perform are:

Lowercasing
Removing punctuation
Removing stopwords (frequent words that do not carry much meaning)
Removing numbers

The removal of stopwords and numbers is optional, but it will affect the sentiment score of the text when that score is measured as a ratio of the number of words in the text. Other common filtering includes removing URLs, emails, cities, company names, etc.

# Using the same ones as Loughran-McDonald
with open("stopwords.txt", "r") as f:
    stopwords = f.read().split("\n")[:-1]
stopwords[:10]

['about', 'and', 'from', 'now', 'where', 'you', 'am', 'until', 'them', 'in']

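def preprocess_text(text):
    # Split on whitespace and lowercase every token
    words = text.split()
    words = [w.lower() for w in words]
    # Drop stopwords
    words = [w for w in words if w not in stopwords]
    # Keep only purely alphabetic tokens. This drops numbers, but also any
    # token with attached punctuation (e.g. "year," or "canada's").
    words = [w for w in words if w.isalpha()]
    return " ".join(words)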


df["text_clean"] = df["text"].apply(preprocess_text)
df

	title	text	text_clean
0	Higher oil doesn't shake industry talk on spen...	TORONTO — The price of oil has been on a stead...	toronto price oil a steady climb talk biggest ...
1	Venezuela Detains Former Maduro Confidantes in...	(Bloomberg) -- Venezuela detained former oil a...	venezuela detained former oil finance minister...
2	Norfolk Southern agrees to $600M settlement in...	Norfolk Southern has agreed to pay $600 millio...	norfolk southern agreed pay million a lawsuit ...
3	Boeing Crisis of Confidence Deepens With 787 N...	(Bloomberg) -- Boeing Co. faces a deepening cr...	boeing faces a deepening crisis confidence eng...
4	Restaurants along eclipse's path of totality s...	Restaurants on the eclipse's path of totality ...	restaurants path totality saw a jump sales mon...

After preprocessing, each article is a string containing only lowercase words separated by spaces.
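
One side effect of `preprocess_text` is worth knowing: because it filters with `isalpha()`, a token such as "year," is discarded entirely rather than reduced to "year". If you would rather keep such words, a possible variant is sketched below. The function name `preprocess_text_strip` and the small stopword set are assumptions made for illustration. The tutorial's own stopword list comes from stopwords.txt, and using this variant would change the word counts reported in the rest of this post.

```python
import string

# Hypothetical stopword set for illustration only; the tutorial loads its
# own list from stopwords.txt.
example_stopwords = {"the", "of", "has", "been", "on", "all", "but", "at"}

# ASCII punctuation plus the em dash and curly quotes found in news text
PUNCTUATION = string.punctuation + "\u2014\u201c\u201d\u2018\u2019"


def preprocess_text_strip(text, stopwords):
    """Variant that strips leading/trailing punctuation before filtering."""
    cleaned = []
    for token in text.lower().split():
        word = token.strip(PUNCTUATION)
        # Keep purely alphabetic words that are not stopwords; tokens with
        # internal punctuation or digits (e.g. "canada's", "600m") are still dropped
        if word.isalpha() and word not in stopwords:
            cleaned.append(word)
    return " ".join(cleaned)


sample = "The price of oil has been on a steady climb all year, but the talk"
result = preprocess_text_strip(sample, example_stopwords)
print(result)
```

With this variant, "year," survives as "year", whereas the original function would drop it.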

Dictionary

Next, we will load the dictionary and make a list of positive words and a list of negative words.

lm_dict = pd.read_csv("Loughran-McDonald_MasterDictionary_1993-2023.csv")
lm_dict

	Word	Seq_num	Word Count	Word Proportion	Average Proportion	Std Dev	Doc Count	Negative	Positive	Uncertainty	Litigious	Strong_Modal	Weak_Modal	Constraining	Complexity	Syllables	Source
0	AARDVARK	1	664	2.690000e-08	1.860000e-08	4.050000e-06	131	0	0	0	0	0	0	0	0	2	12of12inf
1	AARDVARKS	2	3	1.210000e-10	8.230000e-12	9.020000e-09	1	0	0	0	0	0	0	0	0	2	12of12inf
2	ABACI	3	9	3.640000e-10	1.110000e-10	5.160000e-08	7	0	0	0	0	0	0	0	0	3	12of12inf
3	ABACK	4	29	1.170000e-09	6.330000e-10	1.560000e-07	28	0	0	0	0	0	0	0	0	2	12of12inf
4	ABACUS	5	9349	3.790000e-07	3.830000e-07	3.460000e-05	1239	0	0	0	0	0	0	0	0	3	12of12inf
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
86548	ZYGOTE	86529	69	2.790000e-09	1.310000e-09	2.940000e-07	45	0	0	0	0	0	0	0	0	2	12of12inf
86549	ZYGOTES	86530	1	4.050000e-11	1.720000e-11	1.880000e-08	1	0	0	0	0	0	0	0	0	2	12of12inf
86550	ZYGOTIC	86531	0	0.000000e+00	0.000000e+00	0.000000e+00	0	0	0	0	0	0	0	0	0	3	12of12inf
86551	ZYMURGIES	86532	0	0.000000e+00	0.000000e+00	0.000000e+00	0	0	0	0	0	0	0	0	0	3	12of12inf
86552	ZYMURGY	86533	0	0.000000e+00	0.000000e+00	0.000000e+00	0	0	0	0	0	0	0	0	0	3	12of12inf

86553 rows × 17 columns

pos_words = lm_dict[lm_dict["Positive"] != 0]["Word"].str.lower().to_list()
neg_words = lm_dict[lm_dict["Negative"] != 0]["Word"].str.lower().to_list()

pos_words[:10]

['able',
 'abundance',
 'abundant',
 'acclaimed',
 'accomplish',
 'accomplished',
 'accomplishes',
 'accomplishing',
 'accomplishment',
 'accomplishments']

neg_words[:10]

['abandon',
 'abandoned',
 'abandoning',
 'abandonment',
 'abandonments',
 'abandons',
 'abdicated',
 'abdicates',
 'abdicating',
 'abdication']

Sentiment score

We first need to compute the total number of words, and the number of positive and negative words in each article.

df["n"] = df["text_clean"].apply(lambda x: len(x.split()))
df["n_pos"] = df["text_clean"].apply(
    lambda x: len([w for w in x.split() if w in pos_words])
)
df["n_neg"] = df["text_clean"].apply(
    lambda x: len([w for w in x.split() if w in neg_words])
)

df

	title	text	text_clean	n	n_pos	n_neg
0	Higher oil doesn't shake industry talk on spen...	TORONTO — The price of oil has been on a stead...	toronto price oil a steady climb talk biggest ...	261	1	7
1	Venezuela Detains Former Maduro Confidantes in...	(Bloomberg) -- Venezuela detained former oil a...	venezuela detained former oil finance minister...	179	1	12
2	Norfolk Southern agrees to $600M settlement in...	Norfolk Southern has agreed to pay $600 millio...	norfolk southern agreed pay million a lawsuit ...	578	8	31
3	Boeing Crisis of Confidence Deepens With 787 N...	(Bloomberg) -- Boeing Co. faces a deepening cr...	boeing faces a deepening crisis confidence eng...	402	3	38
4	Restaurants along eclipse's path of totality s...	Restaurants on the eclipse's path of totality ...	restaurants path totality saw a jump sales mon...	255	4	1
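Testing membership in a Python list scans the list, which is fine for five articles but can become slow on a large corpus. A minimal sketch of the same counting logic using sets is shown below. The short word lists are hypothetical stand-ins for the Loughran-McDonald lists, and count_sentiment_words is a name introduced here for illustration.

```python
# Hypothetical mini word lists standing in for the Loughran-McDonald lists
pos_words = ["able", "gain", "improve"]
neg_words = ["crisis", "loss", "lawsuit"]

# Sets give constant-time membership tests on average
pos_set = set(pos_words)
neg_set = set(neg_words)


def count_sentiment_words(text_clean: str) -> tuple[int, int, int]:
    """Return (total words, positive words, negative words)."""
    tokens = text_clean.split()
    n_pos = sum(w in pos_set for w in tokens)
    n_neg = sum(w in neg_set for w in tokens)
    return len(tokens), n_pos, n_neg


print(count_sentiment_words("boeing faces a deepening crisis after lawsuit gain"))
# (8, 1, 2)
```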

The sentiment of a text is the sum of the sentiment scores of the words it contains, here +1 for each positive word and -1 for each negative word (I call this the level). We can also normalize the score by the number of words in the text or by the number of sentiment words in the text. If we want a categorical classification, we can use a threshold to decide whether the text is positive, negative, or neutral.

df["lm_level"] = df["n_pos"] - df["n_neg"]

df["lm_score1"] = (df["n_pos"] - df["n_neg"]) / df["n"]
df["lm_score2"] = (df["n_pos"] - df["n_neg"]) / (df["n_pos"] + df["n_neg"])

CUTOFF = 0.3
df["lm_sentiment"] = df["lm_score2"].apply(
    lambda x: "positive" if x > CUTOFF else "negative" if x < -CUTOFF else "neutral"
)
df

	title	text	text_clean	n	n_pos	n_neg	lm_level	lm_score1	lm_score2	lm_sentiment
0	Higher oil doesn't shake industry talk on spen...	TORONTO — The price of oil has been on a stead...	toronto price oil a steady climb talk biggest ...	261	1	7	-6	-0.022989	-0.750000	negative
1	Venezuela Detains Former Maduro Confidantes in...	(Bloomberg) -- Venezuela detained former oil a...	venezuela detained former oil finance minister...	179	1	12	-11	-0.061453	-0.846154	negative
2	Norfolk Southern agrees to $600M settlement in...	Norfolk Southern has agreed to pay $600 millio...	norfolk southern agreed pay million a lawsuit ...	578	8	31	-23	-0.039792	-0.589744	negative
3	Boeing Crisis of Confidence Deepens With 787 N...	(Bloomberg) -- Boeing Co. faces a deepening cr...	boeing faces a deepening crisis confidence eng...	402	3	38	-35	-0.087065	-0.853659	negative
4	Restaurants along eclipse's path of totality s...	Restaurants on the eclipse's path of totality ...	restaurants path totality saw a jump sales mon...	255	4	1	3	0.011765	0.600000	positive
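Note that lm_score2 divides by the number of sentiment words, which is zero for a text that contains no dictionary word; in pandas that division yields NaN. The sketch below restates the three measures as a plain function with an explicit guard for that case, and checks it against the counts of the first article in the table above. The function name lm_scores and the choice to return 0.0 when there is nothing to divide by are assumptions made for illustration.

```python
def lm_scores(n: int, n_pos: int, n_neg: int, cutoff: float = 0.3):
    """Return (level, score1, score2, label) from word counts."""
    level = n_pos - n_neg
    # Normalize by total words, guarding against an empty text
    score1 = level / n if n else 0.0
    # Normalize by sentiment words, guarding against no dictionary hits
    n_sent = n_pos + n_neg
    score2 = level / n_sent if n_sent else 0.0
    if score2 > cutoff:
        label = "positive"
    elif score2 < -cutoff:
        label = "negative"
    else:
        label = "neutral"
    return level, score1, score2, label


# First article in the table: n=261, n_pos=1, n_neg=7
level, score1, score2, label = lm_scores(261, 1, 7)
print(level, round(score1, 6), score2, label)  # -6 -0.022989 -0.75 negative

# A text with no sentiment words is labeled neutral instead of producing NaN
print(lm_scores(10, 0, 0))  # (0, 0.0, 0.0, 'neutral')
```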

FinBert

BERT-based models are often called “state-of-the-art models” in recent finance papers, even though the original BERT paper dates from 2018 and many more advanced models have come along since. They are pre-trained on a large corpus of text and fine-tuned on a specific task. In this tutorial, we will use FinBert, a BERT model fine-tuned on financial data (see FinBert paper).

These models take a sequence of tokens as input and generate a vector embedding for the text, which is a representation of the information in the text. This vector can be used as input to a classifier to predict the sentiment of the text. The FinBert model is trained to output a score for each of three classes (positive, negative, and neutral), which the softmax function converts to probabilities.
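To make the last step concrete, here is a small sketch of how a softmax turns three raw class scores into probabilities, and how a label and a numerical score can be read from them. The logit values are made up for illustration and do not come from FinBert; the softmax helper is written by hand here only to show the arithmetic.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["positive", "negative", "neutral"]
logits = [2.0, 0.5, 1.0]  # made-up raw scores, one per class

probs = dict(zip(labels, softmax(logits)))
predicted = max(probs, key=probs.get)          # class with the highest probability
score = probs["positive"] - probs["negative"]  # numerical score in [-1, 1]

print({k: round(v, 3) for k, v in probs.items()})
print(predicted, round(score, 3))
```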

Pre-processing

I won’t perform any pre-processing on the text before feeding it to the model. The model's tokenizer will take care of converting the text to a sequence of tokens. Common pre-processing for BERT models includes masking some words (dates, company names, etc.), but I won’t do that here.

Usage

Many BERT models, including FinBert, are available in the HuggingFace Transformers library and can be fetched automatically.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import scipy
import torch

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

To estimate the sentiment of a text, we use the tokenizer to convert the text to a sequence of tokens, and then feed the tokens to the model to get one raw score (logit) per class. The tokenizer also pads or truncates the sequence if needed; here it truncates to at most 512 tokens, so only the beginning of a long article is scored. We convert the logits to probabilities by applying the softmax function. The sentiment of the text is the class with the highest probability, and we also return the probability associated with each class. Finally, we put the code inside a with torch.no_grad(): block to avoid computing gradients, as we are only interested in the output of the model, not in training it.

def finbert_sentiment(text: str) -> tuple[float, float, float, str]:
    with torch.no_grad():
        inputs = tokenizer(
            text, return_tensors="pt", padding=True, truncation=True, max_length=512
        )
        outputs = model(**inputs)
        logits = outputs.logits
        scores = {
            k: v
            for k, v in zip(
                model.config.id2label.values(),
                scipy.special.softmax(logits.numpy().squeeze()),
            )
        }
        return (
            scores["positive"],
            scores["negative"],
            scores["neutral"],
            max(scores, key=scores.get),
        )
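Because of the 512-token limit, the function above only sees the start of a long article. One possible workaround, sketched below, is to split the text into chunks, score each chunk, and average the probabilities. This is an assumption on my part and not something the FinBert model prescribes: the word count is only a rough proxy for the token count, averaging is one of several reasonable aggregation rules, and fake_score is a stand-in so that the example runs without downloading a model.

```python
def chunk_words(text: str, max_words: int = 300) -> list[str]:
    """Split a text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i : i + max_words]) for i in range(0, len(words), max_words)]


def average_chunk_scores(text: str, score_fn, max_words: int = 300):
    """Average the (positive, negative, neutral) scores over all chunks."""
    chunks = chunk_words(text, max_words) or [""]
    scores = [score_fn(chunk) for chunk in chunks]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


def fake_score(chunk: str):
    # Stand-in scorer (hypothetical): replace with a function returning
    # the three class probabilities for a chunk of text
    return (0.0, 1.0, 0.0) if "crisis" in chunk else (0.0, 0.0, 1.0)


print(chunk_words("a b c d e", max_words=2))  # ['a b', 'c d', 'e']
print(average_chunk_scores("calm calm crisis crisis", fake_score, max_words=2))
# (0.0, 0.5, 0.5)
```

With the function defined above, score_fn could be something like lambda c: finbert_sentiment(c)[:3], which keeps the three probabilities and drops the label.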

We apply our sentiment function to each article in the DataFrame to get the sentiment of each article using the apply() method. We then apply the pd.Series constructor to convert the returned tuples so that we can assign the results to new columns in the DataFrame. Finally, we can also get a numerical score by taking the difference between the positive and negative probabilities.

# Notice that this is the raw text, no preprocessing
df[["finbert_pos", "finbert_neg", "finbert_neu", "finbert_sentiment"]] = (
    df["text"].apply(finbert_sentiment).apply(pd.Series)
)
df["finbert_score"] = df["finbert_pos"] - df["finbert_neg"]

df[
    [
        "title",
        "text",
        "finbert_pos",
        "finbert_neg",
        "finbert_neu",
        "finbert_sentiment",
        "finbert_score",
    ]
]

	title	text	finbert_pos	finbert_neg	finbert_neu	finbert_sentiment	finbert_score
0	Higher oil doesn't shake industry talk on spen...	TORONTO — The price of oil has been on a stead...	0.363470	0.054056	0.582474	neutral	0.309414
1	Venezuela Detains Former Maduro Confidantes in...	(Bloomberg) -- Venezuela detained former oil a...	0.022471	0.666017	0.311512	negative	-0.643546
2	Norfolk Southern agrees to $600M settlement in...	Norfolk Southern has agreed to pay $600 millio...	0.030382	0.732271	0.237347	negative	-0.701889
3	Boeing Crisis of Confidence Deepens With 787 N...	(Bloomberg) -- Boeing Co. faces a deepening cr...	0.009635	0.949322	0.041043	negative	-0.939687
4	Restaurants along eclipse's path of totality s...	Restaurants on the eclipse's path of totality ...	0.783017	0.038716	0.178267	positive	0.744301

LLMs

Large language models (LLMs) are trained on very large corpora of text. They are mostly used for text generation, but they can also be used for sentiment analysis. The approach is to design a prompt that makes the model output a sentiment label and score. Langchain is a library that makes it easy to use LLMs for different tasks, including sentiment analysis.

Ollama

Ollama is a tool to manage and run local LLMs, such as Meta’s Llama2 and Mistral’s Mixtral. I discussed how to use Ollama as a private, local ChatGPT replacement in a previous post.

For this example, I will be using two open-source LLMs served locally by Ollama. The process would be similar if you were using other models such as OpenAI’s GPT-3.5 or GPT-4. The main difference in execution is that Langchain supports OpenAI’s function-calling API, which ensures that the output is properly formatted. As we will see, output formatting can be an issue with Llama2 and Mixtral, the two models I will use.

The first step in setting up Ollama is to download and install the tool on your local machine. The installation process is straightforward and involves running a few commands in your terminal. Ollama’s download page provides installers for macOS and Windows, as well as instructions for Linux users. Once you’ve downloaded the installer, follow the installation instructions to set up Ollama on your machine.

If you’re using a Mac, you can install Ollama using Homebrew by running the following command in your terminal:

brew install ollama

The benefit of using Homebrew is that it simplifies the installation process and also sets up Ollama as a service, allowing it to run in the background and manage the LLM models you download.

After installing Ollama, you can install a model from the command line using the pull command. The examples below use the llama2 and dolphin-mixtral models, which are pulled the same way:

ollama pull mixtral

Langchain

Langchain is a Python library that provides a simple interface to interact with LLMs. Specifically, our approach will be the following:

Define the desired output format using Langchain’s support for Pydantic models.
Define a prompt template that will be used to generate the prompt for the model.
Define the chat model that we want to use.
Finally, combine the prompt, the model, and the output format to get the sentiment of the text.

In case the returned output is not in the expected format, we can use the tenacity library to retry the request a few times until we get the expected output. Langchain also provides various features to handle bad outputs which I suggest you explore once you are comfortable with the basics.
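To show what retrying means in practice, here is a plain-Python sketch of the idea: call a function again when it raises, and give up with a dedicated exception after a fixed number of attempts. This is a conceptual illustration only, not tenacity's implementation; call_with_retries, RetriesExhausted, and flaky are hypothetical names, and flaky simulates a model that returns badly formatted output twice before succeeding.

```python
class RetriesExhausted(Exception):
    """Raised when every attempt failed (plays the role of tenacity's RetryError)."""


def call_with_retries(fn, attempts=5):
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:  # any failure triggers another attempt
            last_error = exc
    raise RetriesExhausted(f"gave up after {attempts} attempts") from last_error


calls = {"n": 0}


def flaky():
    # Simulated model call: fails on the first two calls, then succeeds
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("badly formatted output")
    return {"sentiment": "neutral", "score": 0.0}


result = call_with_retries(flaky)
print(calls["n"], result)  # 3 {'sentiment': 'neutral', 'score': 0.0}
```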

from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import PydanticOutputParser

from langchain_community.chat_models import ChatOllama

# For handling errors
from tenacity import retry, stop_after_attempt, RetryError

We will first define the desired output format as a Pydantic model, from which we will create a PydanticOutputParser object. This object will be used to inject the definition of the output format into the prompt and to parse the output of the model.

class SentimentClassification(BaseModel):
    sentiment: str = Field(
        ...,
        description="The sentiment of the text",
        enum=["positive", "negative", "neutral"],
    )
    score: float = Field(..., description="The score of the sentiment", ge=-1, le=1)
    justification: str = Field(..., description="The justification of the sentiment")
    main_entity: str = Field(..., description="The main entity discussed in the text")
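As far as I know, with this version of Pydantic the ge and le bounds are checked when the output is parsed, while extra arguments such as enum are added to the schema shown to the model and may not be enforced. It can therefore be worth adding your own checks on the parsed output. The sketch below does this with the standard library only, on a made-up JSON string; check_llm_output and the "consistent" flag are assumptions introduced for illustration, not part of Langchain or Pydantic.

```python
import json

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def check_llm_output(raw_json: str) -> dict:
    """Parse a JSON answer and apply basic sanity checks."""
    data = json.loads(raw_json)
    if data.get("sentiment") not in ALLOWED_SENTIMENTS:
        raise ValueError(f"unexpected sentiment: {data.get('sentiment')!r}")
    score = float(data["score"])
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    # Flag answers where the label and the sign of the score disagree
    data["consistent"] = not (
        (data["sentiment"] == "positive" and score < 0)
        or (data["sentiment"] == "negative" and score > 0)
    )
    return data


# Made-up model answer: a negative label with a positive score
raw = '{"sentiment": "negative", "score": 0.7, "justification": "", "main_entity": ""}'
print(check_llm_output(raw)["consistent"])  # False
```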

We then define our llm_sentiment() function, which creates the output parser and the prompt and combines them with the model passed as an argument into a chain. The call to chain.invoke() is wrapped in a separate run_chain() function decorated with @retry from the tenacity library, so that a badly formatted output triggers a new request, up to five attempts. If all attempts fail, llm_sentiment() catches the resulting RetryError and returns an "error" label.

@retry(stop=stop_after_attempt(5))
def run_chain(text: str, chain) -> dict:
    return chain.invoke({"news": text}).dict()


def llm_sentiment(text: str, llm) -> tuple[str, float, str, str]:
    parser = PydanticOutputParser(pydantic_object=SentimentClassification)

    prompt = PromptTemplate(
        template="Describe the sentiment of a text of financial news.\n{format_instructions}\n{news}\n",
        input_variables=["news"],
        partial_variables={"format_instructions": parser.get_format_instructions()},
    )

    chain = prompt | llm | parser

    try:
        result = run_chain(text, chain)

        return (
            result["sentiment"],
            result["score"],
            result["justification"],
            result["main_entity"],
        )
    except RetryError as e:
        print(f"Error: {e}")
        return "error", 0, "", ""
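To see what the model actually receives, the sketch below fills the same template string with plain str.format. The format instructions and the news sentence are placeholders written for this example; the real instructions are generated by parser.get_format_instructions() and are longer.

```python
TEMPLATE = (
    "Describe the sentiment of a text of financial news.\n"
    "{format_instructions}\n"
    "{news}\n"
)

# Placeholder only: the real text comes from parser.get_format_instructions()
format_instructions = (
    "Answer with a JSON object with the keys: "
    "sentiment, score, justification, main_entity."
)
# Made-up news sentence for illustration
news = "Company X reported higher quarterly sales."

prompt_text = TEMPLATE.format(format_instructions=format_instructions, news=news)
print(prompt_text)
```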

Finally, we apply our sentiment function to each article in the DataFrame to get the sentiment of each article using the apply() method.

# Replace with the correct model, or use ChatOpenAI if you want to use OpenAI
llama2 = ChatOllama(model="llama2", temperature=0.1)

df[
    ["llama2_sentiment", "llama2_score", "llama2_justification", "llama2_main_entity"]
] = (df["text"].apply(lambda x: llm_sentiment(x, llama2)).apply(pd.Series))

Error: RetryError[<Future at 0x329c35e50 state=finished raised OutputParserException>]
Error: RetryError[<Future at 0x329c629c0 state=finished raised OutputParserException>]
Error: RetryError[<Future at 0x3295c7680 state=finished raised OutputParserException>]

df[
    [
        "title",
        "text",
        "llama2_sentiment",
        "llama2_score",
        "llama2_justification",
        "llama2_main_entity",
    ]
]

	title	text	llama2_sentiment	llama2_score	llama2_justification	llama2_main_entity
0	Higher oil doesn't shake industry talk on spen...	TORONTO — The price of oil has been on a stead...	positive	0.7	The article focuses on the oil and gas industr...	Suncor Energy Inc.
1	Venezuela Detains Former Maduro Confidantes in...	(Bloomberg) -- Venezuela detained former oil a...	error	0.0
2	Norfolk Southern agrees to $600M settlement in...	Norfolk Southern has agreed to pay $600 millio...	error	0.0
3	Boeing Crisis of Confidence Deepens With 787 N...	(Bloomberg) -- Boeing Co. faces a deepening cr...	negative	0.7	The article reports on allegations of manufact...	Boeing Co.
4	Restaurants along eclipse's path of totality s...	Restaurants on the eclipse's path of totality ...	error	0.0

As we can see, Llama 2 failed to provide a valid result for three of the five articles, even after five attempts (the limit set in the @retry decorator). This is a common issue with LLMs, and it is important to handle bad outputs properly. I will check if I get better results with Mixtral (more specifically dolphin-mixtral, a fine-tuned version of Mixtral), another LLM available on Ollama.

mixtral = ChatOllama(model="dolphin-mixtral:latest", temperature=0.1)

df[
    [
        "mixtral_sentiment",
        "mixtral_score",
        "mixtral_justification",
        "mixtral_main_entity",
    ]
] = (
    df["text"].apply(lambda x: llm_sentiment(x, mixtral)).apply(pd.Series)
)

Error: RetryError[<Future at 0x329d5b320 state=finished raised OutputParserException>]

df[
    [
        "title",
        "text",
        "mixtral_sentiment",
        "mixtral_score",
        "mixtral_justification",
        "mixtral_main_entity",
    ]
]

	title	text	mixtral_sentiment	mixtral_score	mixtral_justification	mixtral_main_entity
0	Higher oil doesn't shake industry talk on spen...	TORONTO — The price of oil has been on a stead...	neutral	0.00	The text discusses the focus on spending disci...	Canadian Association of Petroleum Producers co...
1	Venezuela Detains Former Maduro Confidantes in...	(Bloomberg) -- Venezuela detained former oil a...	negative	-0.80	The text discusses the detention of former oil...	Tareck El Aissami and Simón Zerpa
2	Norfolk Southern agrees to $600M settlement in...	Norfolk Southern has agreed to pay $600 millio...	error	0.00
3	Boeing Crisis of Confidence Deepens With 787 N...	(Bloomberg) -- Boeing Co. faces a deepening cr...	negative	-0.80	The text discusses a crisis of confidence for ...	Boeing
4	Restaurants along eclipse's path of totality s...	Restaurants on the eclipse's path of totality ...	positive	0.85	The text discusses a significant increase in s...	eclipse

Mixtral managed to provide a valid result for all the articles but one, which is an improvement over Llama 2. One advantage of LLMs is that they can provide not only a sentiment score but also a summary of the reasoning behind the score. This can be useful for understanding the model’s decision-making process and for debugging bad outputs.

import textwrap

print(textwrap.fill(df.iloc[0]["text"][:500] + "...") + "\n")
print("Llama2: " + textwrap.fill(df.iloc[0]["llama2_justification"]) + "\n")
print("Mixtral: " + textwrap.fill(df.iloc[0]["mixtral_justification"]))

TORONTO — The price of oil has been on a steady climb all year, but
the talk at Canada's biggest oil and gas conference is still focused
on spending discipline.  Industry leaders at the Canadian Association
of Petroleum Producers conference, held in Toronto this year, have
been emphasizing their predictability and focus on returning money to
shareholders, rather than talk of growth.  Suncor Energy Inc. chief
executive Rich Kruger, who was named head of the oil and gas producer
last year as it st...

Llama2: The article focuses on the oil and gas industry in Canada and how
companies are shifting their priorities from growth to spending
discipline. The CEOs of Suncor Energy, Cenovus Energy, and Whitecap
Resources emphasize the importance of returning money to shareholders
rather than talking about growth. The article highlights the changing
narrative around oil demand and how it's still growing despite
investor concerns. The overall tone of the article is positive as
companies are taking a more disciplined approach to their operations.

Mixtral: The text discusses the focus on spending discipline and predictability
in the Canadian oil and gas industry, with leaders emphasizing their
goal of returning money to shareholders rather than pursuing growth.
The sentiment is neutral as it does not express strong positive or
negative emotions.

Conclusion

Overall, each class of models has its own strengths and weaknesses. Dictionary-based approaches are simple and interpretable, but they ignore context and may not be as accurate as more sophisticated models. BERT-based models like FinBert are generally more accurate and can handle more complex text, but they require more computational resources than simple word-counting methods. LLMs like Llama 2 and Mixtral can provide detailed reasoning behind their predictions, but they can be slow, more costly to run, and may not always provide valid outputs.

It is worth exploring various approaches to see which one works best for your specific use case. Note also that when it comes to dictionary-based approaches and BERT-based models, the dictionaries and models are usually designed for a specific type of text, so it is important to use the right one for your use case.

df[
    [
        "title",
        "text",
        "lm_sentiment",
        "finbert_sentiment",
        "llama2_sentiment",
        "mixtral_sentiment",
    ]
]

	title	text	lm_sentiment	finbert_sentiment	llama2_sentiment	mixtral_sentiment
0	Higher oil doesn't shake industry talk on spen...	TORONTO — The price of oil has been on a stead...	negative	neutral	positive	neutral
1	Venezuela Detains Former Maduro Confidantes in...	(Bloomberg) -- Venezuela detained former oil a...	negative	negative	error	negative
2	Norfolk Southern agrees to $600M settlement in...	Norfolk Southern has agreed to pay $600 millio...	negative	negative	error	error
3	Boeing Crisis of Confidence Deepens With 787 N...	(Bloomberg) -- Boeing Co. faces a deepening cr...	negative	negative	negative	negative
4	Restaurants along eclipse's path of totality s...	Restaurants on the eclipse's path of totality ...	positive	positive	error	positive
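A quick way to summarize the table above is to compute, for each pair of methods, the share of articles on which they agree, skipping articles where either method returned an error. The labels below are copied from the table; the agreement helper is a name introduced here for illustration, and with only five articles the numbers are indicative at best.

```python
from itertools import combinations

# Labels copied from the comparison table above
labels = {
    "lm": ["negative", "negative", "negative", "negative", "positive"],
    "finbert": ["neutral", "negative", "negative", "negative", "positive"],
    "llama2": ["positive", "error", "error", "negative", "error"],
    "mixtral": ["neutral", "negative", "error", "negative", "positive"],
}


def agreement(a, b):
    """Share of articles with the same label, ignoring rows with an error."""
    pairs = [(x, y) for x, y in zip(a, b) if "error" not in (x, y)]
    if not pairs:
        return None
    return sum(x == y for x, y in pairs) / len(pairs)


for m1, m2 in combinations(labels, 2):
    print(f"{m1} vs {m2}: {agreement(labels[m1], labels[m2])}")
# lm vs finbert: 0.8
# lm vs llama2: 0.5
# lm vs mixtral: 0.75
# finbert vs llama2: 0.5
# finbert vs mixtral: 1.0
# llama2 vs mixtral: 0.5
```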

Reuse

CC BY-NC-SA 4.0