Skild AI Raises $300 Million for Scalable AI Robotics Platform

Skild AI, a startup in the robotics arena, has secured $300 million in a Series A funding round led by Lightspeed Venture Partners, Coatue, SoftBank Group and Jeff Bezos (through Bezos Expeditions). Other notable investors include Felicis Ventures, Sequoia Capital, Menlo Ventures, General Catalyst, CRV, Amazon, SV Angel and Carnegie Mellon University. Skild AI is […]

Data-Driven Decisions: Harnessing Big Data Analytics to Drive Business Success

In today’s rapidly evolving business landscape, data is more valuable than ever. It fuels decision-making processes, drives innovation, and provides a competitive edge. As organizations strive to stay ahead, the importance of big data analytics cannot be overstated. This article explores how businesses can harness the power of big data to drive success and whether […]

Startups and Investors Await Modi 3.0’s Union Budget 2024 with High Expectations

The announcement of the Indian Union Budget 2024 is approaching, and the startup ecosystem and investor community are expecting favorable measures. They are looking forward to the scrapping of the contentious angel tax and the launch of new government schemes to boost domestic investment. One of the most important issues in the upcoming budget is the angel tax. It […]

Exploring NLP Preprocessing Techniques: Stopwords, Bag of Words, and Word Cloud

Natural Language Processing (NLP) is a fascinating field that bridges the gap between human communication and machine understanding. One of the fundamental steps in NLP is text preprocessing, which transforms raw text data into a format that can be effectively analyzed and utilized by algorithms. In this blog, we’ll delve into three essential NLP preprocessing techniques: stopwords removal, bag of words, and word cloud generation. We’ll explore what each technique is, why it’s used, and how to implement it using Python. Let’s get started!

Stopwords Removal: Filtering Out the Noise

What Are Stopwords?

Stopwords are common words that carry little meaningful information and are often removed from text data during preprocessing. Examples include “the,” “is,” “in,” “and,” etc. Removing stopwords helps in focusing on the more significant words that contribute to the meaning of the text.

Why Remove Stopwords?

Stopwords are removed to:

  • Reduce the dimensionality of the text data.
  • Improve the efficiency and performance of NLP models.
  • Enhance the relevance of features extracted from the text.

Pros and Cons

Pros:

  • Simplifies the text data.
  • Reduces computational complexity.
  • Focuses on meaningful words.

Cons:

  • Risk of removing words that may carry context-specific importance.
  • Some NLP tasks may require stopwords for better understanding.
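
The first risk above is easiest to see with negation. The sketch below is only an illustration: it uses a tiny hand-picked stopword set (an assumption, not NLTK's list, although NLTK's English list also contains "not") and a hypothetical helper named remove_stopwords.

```python
# A tiny hand-picked stopword set for illustration only (an assumption,
# not NLTK's list, although NLTK's English list also contains "not").
STOP_WORDS = {"the", "is", "was", "not", "a", "this"}


def remove_stopwords(text, stop_words=STOP_WORDS):
    """Return the text with every stopword dropped (case-insensitive)."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


positive = "The movie was good"
negative = "The movie was not good"

print(remove_stopwords(positive))  # movie good
print(remove_stopwords(negative))  # movie good
```

Both reviews collapse to the same text, so a sentiment model trained on the filtered data could not tell them apart. For tasks like this, it may be worth keeping negation words out of the stopword set.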

Implementation

Let’s see how we can remove stopwords using Python:

import nltk
from nltk.corpus import stopwords
# Download the stopwords dataset (only needed once)
nltk.download('stopwords')
# Sample text
text = "This is a simple example to demonstrate stopword removal in NLP."
# Load the set of stopwords in English
stop_words = set(stopwords.words('english'))
# Tokenize the text into individual words
words = text.split()
# Remove stopwords from the text (compare in lowercase so "This" matches "this")
filtered_text = [word for word in words if word.lower() not in stop_words]
print("Original Text:", text)
print("Filtered Text:", " ".join(filtered_text))

Code Explanation

Importing Libraries:

import nltk
from nltk.corpus import stopwords

We import the nltk library and the stopwords module from nltk.corpus.

Downloading Stopwords:

nltk.download('stopwords')

This line downloads the stopwords dataset from the NLTK library, which includes a list of common stopwords for multiple languages.

Sample Text:

text = "This is a simple example to demonstrate stopword removal in NLP."

We define a sample text that we want to preprocess by removing stopwords.

Loading Stopwords:

stop_words = set(stopwords.words('english'))

We load the set of English stopwords into the variable stop_words.

Tokenizing Text:

words = text.split()

The split() method tokenizes the text into individual words.

Removing Stopwords:

filtered_text = [word for word in words if word.lower() not in stop_words]

We use a list comprehension to filter out stopwords from the tokenized words. The lower() method ensures case insensitivity.

Printing Results:

print("Original Text:", text)
print("Filtered Text:", " ".join(filtered_text))

Finally, we print the original text and the filtered text after removing stopwords.
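
One detail worth noticing in the output: split() breaks only on whitespace, so the last token of the sample sentence keeps its period ("NLP."). The sketch below compares it with a simple regular-expression tokenizer. The regex and the small stopword set are assumptions made for illustration, not part of the code above, and the set stands in for NLTK's list so the example runs without a download.

```python
import re

# Small illustrative stopword set (an assumption; the article uses NLTK's list).
STOP_WORDS = {"this", "is", "a", "to", "in"}

text = "This is a simple example to demonstrate stopword removal in NLP."

# split() breaks on whitespace only, so the last token keeps its period.
whitespace_tokens = text.split()

# A simple regex tokenizer: runs of letters, digits, or apostrophes.
regex_tokens = re.findall(r"[A-Za-z0-9']+", text)

filtered = [t for t in regex_tokens if t.lower() not in STOP_WORDS]

print(whitespace_tokens[-1])  # NLP.
print(regex_tokens[-1])       # NLP
print(" ".join(filtered))     # simple example demonstrate stopword removal NLP
```

Attached punctuation matters because a stopword followed by a comma or period (for example "is,") would not match the stopword set and would slip through the filter.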

Bag of Words: Representing Text Data as Vectors

What Is Bag of Words?

The Bag of Words (BoW) model is a technique to represent text data as vectors of word frequencies. Each document is represented as a vector where each dimension corresponds to a unique word in the corpus, and the value indicates the word’s frequency in the document.
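
To make the definition concrete, here is a minimal from-scratch sketch using only the standard library. The helper name bag_of_words and the two sample sentences are hypothetical, and the tokenization is a plain lowercase whitespace split, which is simpler than what a library vectorizer does.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, vectors) for a list of strings.

    Tokenization here is a plain lowercase whitespace split, which is
    simpler than what a library vectorizer does by default.
    """
    tokenized = [doc.lower().split() for doc in documents]
    # One dimension per unique word in the corpus, in sorted order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A word missing from the document gets a count of 0.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


docs = ["the cat sat", "the cat saw the dog"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row has one entry per vocabulary word, and the value 2 in the second row records that "the" appears twice in that document.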

Why Use Bag of Words?

Bag of Words is used to:

  • Convert text data into numerical format for machine learning algorithms.
  • Capture the frequency of words, which can be useful for text classification and clustering tasks.

Pros and Cons

Pros:

  • Simple and easy to implement.
  • Effective for many text classification tasks.

Cons:

  • Ignores word order and context.
  • Can result in high-dimensional sparse vectors.
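
Both drawbacks can be shown in a few lines. This is a sketch with made-up sentences and a hypothetical word_counts helper, using a lowercase whitespace split rather than a library tokenizer.

```python
from collections import Counter


def word_counts(text):
    """Bag-of-words counts using a lowercase whitespace split."""
    return Counter(text.lower().split())


# Same words, different order, opposite meaning: identical counts.
a = "dog bites man"
b = "man bites dog"
print(word_counts(a) == word_counts(b))  # True

# Sparsity: add a document on another topic and most entries become zero.
docs = [a, b, "stock prices fell sharply today"]
vocabulary = sorted({w for d in docs for w in d.lower().split()})
matrix = [[word_counts(d)[w] for w in vocabulary] for d in docs]
zeros = sum(row.count(0) for row in matrix)
total = len(docs) * len(vocabulary)
print(f"{zeros} of {total} entries are zero")  # 13 of 24 entries are zero
```

With only three short documents, more than half of the matrix is already zero. On a real corpus with tens of thousands of unique words the effect is far stronger, which is why vectorizers typically return sparse matrices.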

Implementation

Here’s how to implement the Bag of Words model using Python:

from sklearn.feature_extraction.text import CountVectorizer
# Sample documents
documents = [
'This is the first document',
'This document is the second document',
'And this is the third document.',
'Is this the first document?'
]
# Initialize CountVectorizer
vectorizer = CountVectorizer()
# Fit and transform the documents
X = vectorizer.fit_transform(documents)
# Convert the result to an array
X_array = X.toarray()
# Get the feature names
feature_names = vectorizer.get_feature_names_out()
# Print the feature names and the Bag of Words representation
print("Feature Names:", feature_names)
print("Bag of Words:\n", X_array)

Code Explanation

Importing Libraries:

from sklearn.feature_extraction.text import CountVectorizer

We import the CountVectorizer from the sklearn.feature_extraction.text module.

Sample Documents:

documents = ['This is the first document', 'This document is the second document', 'And this is the third document.', 'Is this the first document?']

We define a list of sample documents to be processed.

Initializing CountVectorizer:

vectorizer = CountVectorizer()

We create an instance of CountVectorizer.

Fitting and Transforming:

X = vectorizer.fit_transform(documents)

The fit_transform method learns the vocabulary from the documents and transforms them into a bag-of-words count matrix.

Converting to an Array:

X_array = X.toarray()

We convert the sparse matrix result to a dense array for easy viewing.

Getting Feature Names:

feature_names = vectorizer.get_feature_names_out()

The get_feature_names_out method retrieves the unique words identified in the corpus.

Printing Results:

print("Feature Names:", feature_names)
print("Bag of Words:\n", X_array)

Finally, we print the feature names and the bag of words.

Word Cloud: Visualizing Text Data

What Is a Word Cloud?

A word cloud is a visual representation of text data where the size of each word indicates its frequency or importance. It provides an intuitive and appealing way to understand the most prominent words in a text corpus.
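
The idea that size follows frequency can be sketched without any plotting library. The linear scaling below is an assumption made purely for illustration: the wordcloud package used later has its own layout and sizing logic, and the helper name font_sizes and its default sizes are hypothetical.

```python
from collections import Counter


def font_sizes(text, min_size=12, max_size=48):
    """Map each word to a font size that grows linearly with its count."""
    counts = Counter(text.lower().split())
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo or 1  # avoid division by zero when all counts are equal
    return {
        word: min_size + (count - lo) * (max_size - min_size) / span
        for word, count in counts.items()
    }


sizes = font_sizes("data data data model model text")
print(sizes)  # {'data': 48.0, 'model': 30.0, 'text': 12.0}
```

The most frequent word gets the largest size and the rarest gets the smallest, which is the visual cue a reader picks up from a word cloud.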

Why Use Word Cloud?

Word clouds are used to:

  • Quickly grasp the most frequent terms in a text.
  • Visually highlight important keywords.
  • Present text data in a more engaging format.

Pros and Cons

Pros:

  • Easy to interpret and visually appealing.
  • Highlights key terms effectively.

Cons:

  • Can oversimplify the text data.
  • May not be suitable for detailed analysis.

Implementation

Here’s how to create a word cloud using Python:

import numpy as np
import pandas as pd
import requests
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, STOPWORDS
# Load the reviews (the CSV is expected to have a "Review" column)
df = pd.read_csv('/content/AmazonReview.csv')
comment_words = ""
stopwords = set(STOPWORDS)
# Lowercase every review and append its words to one long string
for val in df.Review:
    tokens = str(val).lower().split()
    comment_words += " ".join(tokens) + " "
# Download the image used as the mask (it sets the shape of the cloud)
pic = np.array(Image.open(requests.get('https://www.clker.com/cliparts/a/c/3/6/11949855611947336549home14.svg.med.png', stream=True).raw))
# Generate the word cloud (when a mask is given, its shape is used instead of width and height)
wordcloud = WordCloud(width=800, height=800, background_color='white', stopwords=stopwords, mask=pic, min_font_size=12).generate(comment_words)
# Display the word cloud
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

Code Explanation

Importing Libraries:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

We import the WordCloud class from the wordcloud library and matplotlib.pyplot for displaying the word cloud.

Generating Word Clouds:

wordcloud = WordCloud(width=800, height=800, background_color='white').generate(comment_words)

We create an instance of WordCloud with specified dimensions and background color and generate the word cloud using the sample text.

WordCloud Output

Conclusion

In this blog, we’ve explored three essential NLP preprocessing techniques: stopwords removal, bag of words, and word cloud generation. Each technique serves a unique purpose in the text preprocessing pipeline, contributing to the overall effectiveness of NLP tasks. By understanding and implementing these techniques, we can transform raw text data into meaningful insights and powerful features for machine learning models. Happy coding and exploring the world of NLP!

This brings us to the end of this article. I hope you have understood everything clearly. Make sure you practice as much as possible.

If you wish to check out more resources related to Data Science, Machine Learning and Deep Learning, you can refer to my GitHub account.

You can connect with me on LinkedIn — RAVJOT SINGH.

I hope you like my article. As a next step, you can try other techniques or different parameter values to improve your results further. Please feel free to share your thoughts and ideas.

P.S. Claps and follows are highly appreciated.


Exploring NLP Preprocessing Techniques: Stopwords, Bag of Words, and Word Cloud was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.

Maruti Suzuki Accelerator Welcomes Global Startups for Its 9th Cohort

Maruti Suzuki India has launched the 9th cohort of its corporate accelerator programme, titled the Maruti Suzuki Accelerator. The programme has been rebranded this year and is open to global participants including startups. The primary goal is to develop cutting-edge solutions for the automobile manufacturing and mobility market. The programme was previously titled the […]

How to Check if the Content Generated is AI Generated or Human

This is an era of advanced technology. We have stepped into the world of Artificial Intelligence (AI), and it has revolutionized the content creation process. It should not be forgotten, however, that opportunities are never without challenges. The same is true of AI, as models like Natural Language Generation (NLG) and Generative Adversarial Networks […]

MOL Unveils Next-Gen Coastal Tanker with AI and IoT Technology

Sustainability is central to modern life, and sustainable maritime transport is the newest entry in the segment. Mitsui O.S.K. Lines, Ltd. (MOL) has launched a state-of-the-art coastal tanker named Daiichi Meta Maru, designed to operate on environmentally friendly methanol fuel. The vessel is jointly owned by MOL Coastal Shipping, Tabuchi Kaiun […]

Ahmedabad’s QarmaTek Secures $1 Million in Latest Funding Round

The tech world is ever-evolving. Gadgets are becoming outdated at a rapid pace. Ahmedabad-based startup QarmaTek is making waves with a unique approach that gives electronic products a second life. QarmaTek was founded by Krunal Shah and Arun Hattangadi in 2011. It has carved out a niche in the […]

Tips for Marketing Your Startup on a Student Budget

Marketing an entrepreneurial venture as a student could prove challenging. For example, the fact that you have limited resources means you could find it hard to penetrate your market. However, it’s still possible to attain successful marketing even with a limited budget. All you need is to employ the right strategies, and here are some […]

Challenges of Long-Distance Hydrogen Fuel Transportation and Their Solutions

Right now, the entire world is witnessing a global push for cleaner energy. As this push intensifies, hydrogen fuel is emerging as a promising alternative to fossil fuels. This, of course, is understandable given that hydrogen, particularly when derived from renewable energy sources, offers a clean, efficient, and versatile energy solution. However, when it comes […]