The Data Tank & BeCode: A Capstone Project

by Alvaro Guijarro May

Brussels, Belgium: September 4, 2023

Data plays an increasingly central role in our lives. Access to that data, and the ability to understand, manipulate and re-use it responsibly, matters more and more to both employers and employees. The ability to interpret data and tell a story from it is a marketable skill. At The Data Tank, we recognize the significance of this ability and empower others to discover and understand what data is saying. Too often, this skill is overlooked in data education, only to prove urgent later in the labor market.

We believe that accessing meaningful and trustworthy data and using/re-using it correctly is crucial to effectively address the pressing challenges of our time. We also believe students today are the Data Stewards of tomorrow. These Data Stewards will be key in identifying questions and challenges where data can add valuable insights.

To this effect, The Data Tank (TDT) joined forces with BeCode (be</code>), a Belgian nonprofit organization, in an innovative learning project to gauge the public’s attitudes toward data, data re-use, and associated subjects. This project, a curriculum capstone, featured sentiment analysis from Belgian newspapers relating to data and data-related topics. 

Why BeCode?

BeCode focuses on enabling tomorrow’s digital talents to blossom as a “social impact-driven digital skills and coding school, using an active pedagogy to teach in-demand digital skills to motivated individuals in vulnerable professional situations”. Most importantly, BeCode helps this untapped source of talent gain greater access to the job market.

With this project, TDT and BeCode explored using data and data analysis in order to better understand, and refine communications with, the general public regarding the data ecosystem. 

What is sentiment analysis?

Sentiment analysis is the process of analysing digital text to determine whether the emotional tone of the message is positive, negative, or neutral. It can be applied to large volumes of text data such as emails, support chat transcripts, social media comments, and reviews. It is a relatively effective and proven tool for gauging the public’s perception of topics of interest. Starting from a database search, in this case of Belgian media, sentiment analysis assigns a numerical value to each digital text, usually -1 (negative), 0 (neutral), or 1 (positive), so that non-numerical data can be quantified.
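To make the scoring idea concrete, here is a minimal sketch of a lexicon-based scorer. The word lists and the function names polarity and label are illustrative assumptions; they are not the lexicon or the model the students used, which the project description does not specify.

```python
# Toy lexicon-based sentiment scorer.
# ASSUMPTION: these word lists are made up for illustration only.
POSITIVE = {"secure", "benefit", "trust", "innovative", "useful"}
NEGATIVE = {"breach", "leak", "risk", "misuse", "fine"}


def polarity(text: str) -> float:
    """Return a score in [-1, 1]: (positive hits - negative hits) / total hits."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    hits = pos + neg
    # Text with no lexicon words at all is treated as neutral.
    return 0.0 if hits == 0 else (pos - neg) / hits


def label(score: float) -> int:
    """Collapse a continuous score to -1 (negative), 0 (neutral) or 1 (positive)."""
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0
```

Real sentiment models are considerably more sophisticated (they handle negation, context, and multiple languages), but the output has the same shape: a number per text that can then be counted, averaged, and plotted.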

Why is this important?

Most readers will expect this kind of research to produce a bell curve, the graphical depiction of a normal probability distribution, with most articles clustered around a neutral score. While the exercise would most likely not produce groundbreaking results, what matters to our story is the process: seeing how things are done so that we can go deeper in the future. That is learning.

Project: Sentiment Analysis on Data-related topics

During their training with BeCode, students work together on various capstone projects that test their newly acquired skills as Data Scientists, Data Analysts, and Data Engineers. For their collaboration with TDT, students undertook the task of developing a sentiment analysis model. This model scrutinised Belgian newspaper articles from early 2020 to August 2023, specifically articles referencing data, data re-use, and related subjects.

Methodology

Students enrolled in BeCode’s training were given a database built from readily available information on articles from the top 20 Belgian newspapers. Over three million data points were compiled from early 2020 until August 2023, covering URLs, dates, article texts, and titles. From this base, students created a data processing workflow to identify articles for the analysis based on the following topics:

Data re-use

Data reusability

Data use

Data sharing

Data access

Data protection

Data privacy
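The topic filter above can be sketched as a simple keyword match. This is only one plausible approach, not the students' actual workflow: the field names title and text, and the helper names matches_topic and filter_articles, are assumptions, and because the articles were in French and Dutch a real filter would need translated search terms rather than the English ones shown here.

```python
import re

# The seven topics listed above, used as literal search phrases.
# ASSUMPTION: English phrases for illustration; the real articles
# were French and Dutch, so translated terms would be needed.
TOPICS = [
    "data re-use",
    "data reusability",
    "data use",
    "data sharing",
    "data access",
    "data protection",
    "data privacy",
]

# One case-insensitive pattern; \b keeps "data use" from matching "data used".
TOPIC_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in TOPICS) + r")\b",
    re.IGNORECASE,
)


def matches_topic(article: dict) -> bool:
    """True if the article's title or text mentions any topic phrase."""
    combined = f"{article.get('title', '')} {article.get('text', '')}"
    return TOPIC_PATTERN.search(combined) is not None


def filter_articles(articles: list) -> list:
    """Keep only the articles that mention at least one topic."""
    return [a for a in articles if matches_topic(a)]
```

Applied to a few million records, a filter of this kind is what reduces the full database to a much smaller topic-specific subset.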

Once this base was in place, the group proceeded with an Exploratory Data Analysis (EDA) to draw out initial insights. Afterwards, the students performed a sentiment analysis on the filtered dataset. This newly curated dataset contained roughly 9,000 articles, predominantly in French and otherwise in Dutch, that met the topic requirement.

During their analysis, BeCode students unveiled an (expected) finding: Belgian newspaper articles tend to adopt a neutral tone in their portrayal of data-related subjects. The polarity scores, ranging predominantly between -0.25 and 0.25, reflect this equilibrium. While not groundbreaking, this type of learning exercise is of great interest to TDT. It shows the importance of using data to understand and communicate effectively in and about the data ecosystem. This is a priority for TDT. Future projects rely on data and data re-use, and on the ability to work together with technical and specialized partners within such a structure.
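A claim such as "scores range predominantly between -0.25 and 0.25" can be checked by counting how many scores fall inside that band. The sketch below assumes the polarity scores are available as a plain list of floats; the function name neutral_share and the sample values are hypothetical and are not project data.

```python
def neutral_share(scores, band=0.25):
    """Return the fraction of scores lying within [-band, +band]."""
    if not scores:
        raise ValueError("scores must not be empty")
    inside = sum(1 for s in scores if -band <= s <= band)
    return inside / len(scores)


# ASSUMPTION: made-up polarity scores, for illustration only.
sample_scores = [-0.40, -0.10, 0.00, 0.05, 0.20, 0.30, -0.20, 0.10]

# Six of the eight sample scores sit inside the neutral band.
share = neutral_share(sample_scores)
```

A share close to 1 would support the description of a predominantly neutral tone; plotting the same scores as a histogram would show the bell-shaped distribution directly.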

Interestingly, as part of their analysis, students created word clouds that grouped words by their positive or negative associations, which could be used to uncover insights. And there is the story: these word clouds provide a glimpse into how data is talked about in newspaper articles, showing that the discussion is much more complex than a simple “good” or “bad”. The way data is presented in these articles is a mix of different human perspectives. This small detail gives us a deeper understanding of the overall context in which data-related topics are presented in the media, understood, and potentially re-used by people, which is what data re-use is all about.
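The input to a word cloud is essentially a table of word frequencies. One plausible way to build such tables per sentiment group is sketched below, assuming each article comes as a (text, polarity) pair; the function name word_counts_by_polarity and the stopword handling are assumptions, and the students may well have grouped words differently.

```python
from collections import Counter


def word_counts_by_polarity(scored_articles, stopwords=frozenset()):
    """Count words separately for positively and negatively scored texts.

    scored_articles: iterable of (text, polarity) pairs.
    Texts with a polarity of exactly 0 are skipped.
    """
    counts = {"positive": Counter(), "negative": Counter()}
    for text, score in scored_articles:
        if score == 0:
            continue
        bucket = "positive" if score > 0 else "negative"
        words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
        counts[bucket].update(w for w in words if w and w not in stopwords)
    return counts
```

Each Counter can then be handed to a plotting tool of choice to draw the cloud, with more frequent words rendered larger.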

“We are very grateful to The Data Tank for trusting us on this project: sentiment analysis is a delicate subject, especially with the latest years being so dense and eventful. As we are going towards extensively data-driven careers, this was an exceptionally interesting case-study.”

Grégoire Hupin, Data Analyst, BeCode

TDT is also grateful for this opportunity to stretch our wings and to do something we do very well: being data-driven. 

Results

Working with BeCode encourages us to continue working with students in order to better understand and communicate the current data ecosystem. This exercise not only empowers students but is also part of the story of unlocking data’s potential: gathering, accessing, and re-using it responsibly to better equip us to tackle the pressing issues of our time. Access to data and data sharing, crucial components of this analysis, underscore the significance of open data advocacy.

The cooperation between The Data Tank and BeCode not only shed a brief light on the (expected) sentiment within Belgian newspaper articles, but also underscored the opportunities that informed data communication provides. Looking forward, The Data Tank is excited to push ahead with projects like this. Our story is to bring data-driven projects into the spotlight, so we can better understand how people access data and what they think about it, and ensure data is re-used responsibly and sustainably.
