An NLP project that analyzes cyber risk sentiment in cybersecurity discussions on Reddit. The pipeline implements text preprocessing and sentiment analysis with the VADER and DistilBERT models. The analysis generates actionable insights on emerging threats, supporting cyber risk intelligence through sentiment trends.

Nakshjainsonigara/Cyber-Risk-Sentiment-Analysis

Cyber Risk Sentiment Analysis using Natural Language Processing

Project Overview

This project performs sentiment analysis on Reddit posts from cyber risk-related subreddits to gauge public sentiment on cyber risks, vulnerabilities, and incidents. Leveraging Natural Language Processing (NLP) techniques, including entity extraction to identify key cyber threats, it analyzes large volumes of unstructured data and provides insights into trending cyber threats, user sentiments, and potential areas of cyber risk. The results aim to support cybersecurity decision-making, threat intelligence, and risk management for organizations.

Objectives

  • Extract and analyze cybersecurity discussions from Reddit to assess public sentiment on cyber risk topics.
  • Use NLP techniques to process, analyze, and visualize unstructured data on cyber threats.
  • Generate actionable insights on emerging cyber risks, such as ransomware, data breaches, and vulnerabilities, to aid in cyber intelligence and risk assessment.

Methodology

  1. Data Collection
    Using the Reddit API (PRAW), data is collected from selected subreddits focused on cybersecurity, including:

    • cybersecurity
    • netsec
    • infosec
    • AskNetsec
    • ThreatIntel
  2. Data Preprocessing
    Text data undergoes preprocessing steps:

    • Removing stop words
    • Lemmatization to normalize words
    • Text cleaning with regular expressions
  3. Sentiment Analysis
    Two sentiment analysis methods are used:

    • VADER: A rule-based model optimized for social media text, assessing sentiment polarity.
    • BERT (DistilBERT): A transformer model fine-tuned for sentiment analysis, providing more nuanced sentiment classification.
  4. Feature Extraction and Vectorization

    • TF-IDF Vectorization for feature extraction.
    • Count Vectorizer to analyze word frequency and uncover common themes in cybersecurity discussions.
  5. Visualization

    • Word clouds to visualize commonly discussed topics.
    • Bar charts and sentiment distribution graphs to present insights into sentiment trends.
  6. Entity Extraction and Analysis

    • Entity Extraction: Named entities (such as organizations, technologies, and cyber threats) are extracted from the cleaned text using spaCy. This process identifies and isolates key terms relevant to cybersecurity.

    • Entity Analysis: After extraction, entities are categorized, filtered, and analyzed for frequency and associated sentiment (positive, negative, or neutral).
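The entity analysis step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the entities have already been extracted (for example by spaCy) and paired with a sentiment label, and the `mentions` data, the `summarize_entities` helper, and the `min_count` filter are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical input: one (entity, sentiment_label) pair per entity mention.
# In the project these would come from spaCy NER plus a sentiment model.
mentions = [
    ("Pentagon", "negative"),
    ("RTM Locker", "negative"),
    ("Pentagon", "negative"),
    ("Microsoft", "positive"),
    ("Microsoft", "negative"),
    ("pentagon", "neutral"),
]


def summarize_entities(mentions, min_count=2):
    """Count mentions per entity and tally the sentiment labels for each."""
    counts = Counter()
    sentiments = defaultdict(Counter)
    for entity, label in mentions:
        key = entity.strip().lower()  # normalize case so spelling variants merge
        counts[key] += 1
        sentiments[key][label] += 1
    # Keep entities mentioned at least min_count times, most frequent first.
    return [
        (entity, n, dict(sentiments[entity]))
        for entity, n in counts.most_common()
        if n >= min_count
    ]


summary = summarize_entities(mentions)
for entity, n, by_label in summary:
    print(entity, n, by_label)
```

Filtering on a minimum count is one simple way to drop one-off entities before looking at their sentiment; the threshold would need tuning on real data.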


Technology Stack

  • Programming Language: Python
  • Data Collection: PRAW (Python Reddit API Wrapper)
  • Data Processing and NLP: NLTK (WordNetLemmatizer, stopwords), TextBlob, re (regex), spaCy
  • Sentiment Analysis: VADER, Hugging Face Transformers (DistilBERT)
  • Feature Extraction: Scikit-learn (TF-IDF Vectorizer, Count Vectorizer)
  • Visualization: Matplotlib, WordCloud
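As a rough illustration of how the regex cleaning and stop-word removal fit together, here is a minimal sketch using only the standard library. The stop-word set, the two regular expressions, and the `clean_text` helper are assumptions made for the example; the project lists NLTK's stopwords and WordNetLemmatizer for this step, and lemmatization is left out here.

```python
import re

# Tiny illustrative stop-word set; the project lists NLTK's stopwords instead.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "to", "in", "on", "and", "our"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")  # links add noise, not sentiment
NON_ALPHA_RE = re.compile(r"[^a-z\s]")         # digits, punctuation, symbols


def clean_text(text):
    """Lowercase, strip URLs and non-letters, then drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    return [token for token in text.split() if token not in STOP_WORDS]


tokens = clean_text("The NEW ransomware hit our servers!! Details: https://example.com/post?id=1")
print(tokens)  # ['new', 'ransomware', 'hit', 'servers', 'details']
```

Removing URLs before stripping punctuation matters: doing it the other way round would leave fragments such as "https" and "example" behind as tokens.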

Algorithms and Libraries Used

  • VADER Sentiment Analysis: A lexicon and rule-based model designed for social media sentiment analysis.
  • DistilBERT Sentiment Analysis: A transformer-based model fine-tuned on sentiment data (SST-2).
  • TF-IDF Vectorization: Calculates term frequency-inverse document frequency for feature extraction.
  • Count Vectorization: Extracts features based on word frequency, useful for identifying prominent themes in cybersecurity discussions.
  • Named Entity Recognition: Extracts relevant entities (e.g., threats, technologies) for focused analysis.
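To make the TF-IDF step concrete, the sketch below computes the weights by hand with the standard library. It follows the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, but the whitespace tokenizer, the `tfidf` helper, and the sample documents are simplifications assumed for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (idf, vectors) using smoothed IDF and L2-normalized rows."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    idf = {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return idf, vectors


# Hypothetical post titles, used only to exercise the function.
docs = [
    "ransomware attack hits hospital",
    "new ransomware variant found",
    "phishing attack targets hospital staff",
]
idf, vectors = tfidf(docs)
print(round(idf["ransomware"], 3), round(idf["phishing"], 3))
```

A term that appears in fewer documents ("phishing") receives a higher IDF than one that appears in several ("ransomware"), which is why TF-IDF surfaces distinctive terms while a plain count vectorizer surfaces frequent ones.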

Insights and Findings

Sentiment analysis of cyber risk discussions on social media or forums can provide valuable insights for organizations operating in the cybersecurity domain. By analyzing how people talk about cybersecurity and cyber risk, organizations can gain an understanding of current trends, emerging threats, and public perception, helping them make informed decisions.

  • Positive Sentiment: Positive posts highlight confidence in cybersecurity practices and solutions. These discussions often focus on successful implementations of security measures, the adoption of new technologies, or effective responses to cyber incidents. Organizations can track these posts to identify new tools, strategies, or practices that are helping reduce cyber risk. This helps companies stay updated on the latest advancements and share best practices with clients or stakeholders.

  • Negative Sentiment: Negative posts often reflect concerns or frustrations related to cybersecurity challenges, such as data breaches, increasing cyberattacks, or new vulnerabilities. By analyzing these posts, companies can stay informed about potential threats and risks in the industry. This information can be used to adjust security strategies, anticipate new attack vectors, and identify areas that require immediate attention. Negative sentiment also helps organizations understand what concerns the public or professionals, allowing them to address these issues proactively.

  • Neutral Sentiment: Neutral posts provide objective, fact-based information without strong emotions. These could include news articles, updates on cybersecurity policies, or technical discussions about vulnerabilities and solutions. Although neutral posts may not indicate any urgent problems, they offer valuable data about ongoing trends, research, or industry standards. Tracking neutral sentiment helps organizations keep track of non-urgent but important developments that may impact cyber risk in the long run.

From the model comparison, we infer that:

  • VADER tends to categorize a larger portion of posts as Neutral, which might indicate that it is more conservative or cautious in assigning clear sentiment, especially with text that has mixed or subtle tones.

  • BERT, with its deep learning capabilities, provides more decisive sentiment labels, leaning towards Positive and Negative classifications, which suggests that it is better at capturing stronger sentiments or more polarized opinions in the data.

  • This difference highlights that VADER is better suited for general sentiment detection in informal contexts, while BERT’s ability to understand context in greater depth allows it to more confidently classify posts into distinct sentiment categories.
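Part of this contrast comes from how a raw score is turned into a label. The sketch below maps VADER's compound score to three classes using the ±0.05 cutoffs suggested in VADER's own documentation; whether the project used these exact cutoffs is an assumption, and the scores, `label_from_compound`, and `distribution` are hypothetical names and data for illustration.

```python
from collections import Counter


def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a three-way label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


def distribution(labels):
    """Percentage of posts per label, rounded to one decimal place."""
    counts = Counter(labels)
    total = len(labels)
    return {
        label: round(100 * counts[label] / total, 1)
        for label in ("Positive", "Neutral", "Negative")
    }


# Hypothetical compound scores; 0.0 is what a post with no lexicon hits gets.
scores = [0.62, 0.0, -0.41, 0.03, 0.2, -0.01, 0.0, 0.77]
labels = [label_from_compound(s) for s in scores]
shares = distribution(labels)
print(shares)  # {'Positive': 37.5, 'Neutral': 50.0, 'Negative': 12.5}
```

Technical posts with few emotionally loaded words tend to score near zero and fall into the Neutral band, which is consistent with VADER's large Neutral share.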

Model    Positive (%)    Neutral (%)    Negative (%)
VADER    43.3            42.8           14.0
BERT     48.2            N/A            51.8

Note: BERT does not produce a Neutral class by default because the model is fine-tuned for binary sentiment analysis (Positive/Negative). A Neutral class can be added by defining a probability threshold: a post is labeled Neutral when the model's confidence in both Positive and Negative falls below that threshold.
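A minimal sketch of that thresholding idea, assuming the binary model returns a single probability for the Positive class. The `three_way_label` helper, the 0.75 cutoff, and the sample probabilities are hypothetical and would need to be tuned on labeled data.

```python
def three_way_label(p_positive, min_confidence=0.75):
    """Turn a binary classifier's P(positive) into Positive/Neutral/Negative.

    If neither class reaches min_confidence, the post is treated as Neutral.
    """
    p_negative = 1.0 - p_positive
    if max(p_positive, p_negative) < min_confidence:
        return "Neutral"
    return "Positive" if p_positive > p_negative else "Negative"


# Hypothetical model outputs for five posts.
probs = [0.98, 0.55, 0.10, 0.70, 0.30]
labels = [three_way_label(p) for p in probs]
print(labels)  # ['Positive', 'Neutral', 'Negative', 'Neutral', 'Neutral']
```

Raising `min_confidence` widens the Neutral band, so the cutoff directly controls how comparable the resulting distribution is to VADER's.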

The top entities reveal focused areas in cyber risk discussions:

  1. Pentagon: The negative sentiment classification from BERT suggests that discussions about the Pentagon are likely focused on potential cybersecurity risks or vulnerabilities.

  2. Cyber Threat Intelligence (CTI): CTI is primarily discussed with negative sentiment, underscoring industry concerns related to intelligence on potential threats and vulnerabilities.

  3. RTM Locker Ransomware: This specific ransomware variant is mentioned with consistent negative sentiment, highlighting ongoing risks associated with ransomware attacks.


Conclusion

This project demonstrates the use of NLP in understanding public sentiment on cybersecurity risks. By identifying sentiment patterns and commonly discussed threats, organizations can gain insights that support proactive risk management and inform cybersecurity strategies. Future work may explore the integration of additional NLP models, such as BERT-based topic modeling, and expanded data sources for a more comprehensive risk intelligence assessment.

