Real or Not? NLP with Disaster Tweets

Rohan Gupta
4 min readDec 13, 2020

Identifying Tweets About Real Disasters

In recent years, social media has received a great deal of attention for its potential use in tracking spatial and temporal events. Twitter in particular has become an important communication channel in many situations, for instance, in times of emergency. Smartphones enable people to announce an emergency they witness in real time. Because of that, more agencies are interested in monitoring Twitter (e.g., disaster relief organizations such as the NDRF). The idea behind this work is that we can take a large body of tweets and extract a summary that would be useful in disaster situations.

The problem to be solved by this capstone project is identifying which tweets are about real disasters and which ones are not. For that reason, the project is treated as a binary classification problem.

Let's take an analogous example. Say you have a problem with the Zomato service and you tweet about it to complain. There are many tweets thanking Zomato and just as many complaining about it, so it is hard for customer care to find the genuine complaints. Companies therefore use machine learning behind the scenes to identify the users who are actually complaining about their service, so they can resolve those issues.

The tweets above are just an example: the tweet on the left describes a real disaster, while the tweet on the right does not.

The process followed in the Project:

  • Loading Data
  • Data Preview
  • Exploratory Data Analysis
  • Data cleaning
  • Bag of Words — CountVectorizer, TFIDF(Term Frequency Inverse Document Frequency)
  • Text Classification Model
  • Evaluation & Prediction

Data Preview :

Top 5 rows of the Dataset

This dataset contains 7,613 tweets across 5 columns:

  • id — A unique identifier for each tweet
  • text — The text of the Tweet
  • location — The location the tweet was sent from (may be blank).
  • keyword — A particular keyword from the tweet (may be blank).
  • target — In train.csv only; denotes whether a tweet is about a real disaster (1) or not (0).

In this dataset, 3,271 tweets describe a real disaster (target = 1), while 4,342 do not (target = 0). In terms of missing values, the ‘location’ column has the most, with around 2,533 missing rows, while ‘keyword’ is missing in 61 rows.
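The class balance and missing-value counts above can be checked with a few lines of pandas. This is a minimal sketch using toy rows that mirror the competition schema (the real train.csv has 7,613 rows):

```python
import pandas as pd

# Toy rows with the same columns as the competition data:
# id, keyword, location, text, target.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "keyword": ["fire", None, "flood", None],
    "location": ["USA", None, None, "India"],
    "text": [
        "Forest fire near La Ronge Sask. Canada",
        "I love fruits",
        "Flood warning issued for the river basin",
        "My week has been a disaster lol",
    ],
    "target": [1, 0, 1, 0],
})

print(df["target"].value_counts())                # class balance
print(df[["location", "keyword"]].isna().sum())   # missing values per column
```

On the real dataset, the same two calls report the 3,271 / 4,342 split and the 2,533 / 61 missing counts mentioned above.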

Exploratory Data Analysis:

  • Target Label Count in the training dataset :
  • Distribution of text length in comparison to target variable:
  • WordCloud:
The left WordCloud shows the most frequent words in non-disaster tweets; the right one shows the most frequent words in disaster tweets.
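The text-length comparison above boils down to one groupby. A small sketch on two toy tweets (the real analysis plots the full distribution per class):

```python
import pandas as pd

# Two toy tweets, one per class, to illustrate the computation.
df = pd.DataFrame({
    "text": ["Forest fire near La Ronge Sask. Canada", "I love fruits"],
    "target": [1, 0],
})

# Character length of each tweet, then the mean length per target class.
df["length"] = df["text"].str.len()
print(df.groupby("target")["length"].mean())
```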

Data Cleaning :

I performed the following data-cleaning steps:

  • Removing special characters and digits.
  • Converting all text to lowercase.
  • Stopwords Removal.
  • Lemmatization
Top 5 rows of the data after lemmatization

Applying Different Machine Learning Algorithms:

After applying different machine learning algorithms to the dataset, Logistic Regression performed best, with an accuracy of 80.25%.
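The vectorize-then-classify pipeline can be sketched with scikit-learn. This toy version uses TF-IDF and Logistic Regression on a stand-in corpus (the real project fits on the 7,613 cleaned tweets):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stand-in corpus: two disaster tweets (1) and two non-disaster tweets (0).
texts = [
    "forest fire near la ronge",
    "i love fruits",
    "flood warning issued",
    "what a lovely day",
]
labels = [1, 0, 1, 0]

# TF-IDF features feeding a Logistic Regression classifier.
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("lr", LogisticRegression()),
])
clf.fit(texts, labels)

print(clf.predict(["fire and flood warning"]))  # expected: disaster (1)
```

Swapping `TfidfVectorizer` for `CountVectorizer` in the pipeline gives the bag-of-words variant mentioned earlier.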

One of the evaluation results

Future Prediction:

For future predictions, I saved the model and the vectorizer as pickle files.
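Saving and reloading with pickle looks like this. A minimal sketch (filenames and the tiny training set are assumptions for illustration):

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Fit a tiny stand-in vectorizer and model (the real ones are fit on the full data).
texts = ["forest fire near town", "i love fruits"]
vec = TfidfVectorizer().fit(texts)
model = LogisticRegression().fit(vec.transform(texts), [1, 0])

# Persist both objects so prediction needs no retraining.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("vectorizer.pkl", "wb") as f:
    pickle.dump(vec, f)

# Later: load them back and classify a new tweet.
with open("model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
with open("vectorizer.pkl", "rb") as f:
    loaded_vec = pickle.load(f)

print(loaded_model.predict(loaded_vec.transform(["huge forest fire reported"])))
```

Both objects must be saved: the vectorizer is needed to turn a raw tweet into the same feature space the model was trained on.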

Sample of the Future Prediction

Conclusion :

In this project, I classified tweets as describing a real disaster or not.

  • First, I analyzed and explored the provided tweet data to visualize its statistical and other properties.
  • Next, I performed exploratory analysis to check the data types, look for unwanted features, and find missing values.
  • The ‘text’ column is free text containing alphanumeric characters, special characters, and embedded URLs. It needs to be cleaned, preprocessed, and vectorized before it can be used with a machine learning algorithm to classify the tweets.
  • After preprocessing the training data, it was vectorized using CountVectorizer and TF-IDF features.
  • Then various classifiers were fit on the data and predictions were made. Logistic Regression fit the model best, with low time complexity and an accuracy score of 80.25%.

Check out the Project on:

For contact: LinkedIn, Twitter
