Updated February 20, 2023
Introduction to NLTK Stop Words
NLTK stop words are very common words (such as “the,” “a,” “an,” or “in”) that a program, for example a search engine, is configured to ignore when indexing and retrieving entries. Pre-processing transforms raw data into a form that a computer can work with, and filtering out data that carries little information is one of its most common steps. In NLTK, the term stop words refers to these low-information words.
What are NLTK Stop Words?
- We do not want terms that eat up storage space or processing time in our database. We can easily eliminate them by keeping a list of the terms we regard as stop words. NLTK ships stop word lists for 16 languages (the exact number varies with the version of the corpus). They are stored in the nltk data directory.
- NLP is a field of study that deals with various issues, including natural language comprehension.
- We do not have to write this list ourselves: the NLTK corpus includes a list of the words it treats as stop words.
- Stop words are words that appear frequently in a language or corpus but add little information, so they are excluded from many NLP tasks.
- Data pre-processing means preparing the sentences or words that a user inputs or sends before they are analyzed. Removing unnecessary or incomplete data is one of its most critical phases.
- When working on NLP problems, it is crucial to remember that words like ‘the,’ ‘is,’ ‘there,’ and so on usually should not be processed. These words are called stop words. If the code does not ignore or remove them, they are processed like any other word.
- As a result, they take up memory or database space, and the efficiency of the code suffers.
- The stop word lists are a separate data package that is downloaded through NLTK, and lists are available for different languages. After downloading, a list can be passed to our code (for example, as an argument) to indicate which words should be ignored.
- Stop words can usually be ignored without changing the meaning of a sentence. NLTK has already collected such words in its corpus. We begin by installing NLTK in our Python environment.
- To use the NLTK stop words, we first need to download the stopwords data into that environment.
- The code example below does this. It imports the nltk module and then calls its download function for the stopwords package.
import nltk
nltk.download('stopwords')
- If the stopwords package was already downloaded earlier, the download command reports that the package is up to date and downloads nothing further. The stop words are then ready to use in our code.
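Because the stop word data is a separate download, a script can fail on a machine where the data has not been fetched yet. The sketch below is one possible way to guard against that. It is not part of the original example: `load_stop_words` is a hypothetical helper name, and the sketch relies on NLTK reporting missing data with a `LookupError`.

```python
def load_stop_words(language="english"):
    """Return NLTK's stop words for `language` as a set, or None if unavailable.

    Hypothetical helper for illustration; not part of NLTK itself.
    """
    try:
        # Raises ImportError if NLTK is not installed.
        from nltk.corpus import stopwords
        return set(stopwords.words(language))
    except ImportError:
        return None
    except LookupError:
        # NLTK raises LookupError when the 'stopwords' data is not downloaded.
        return None


stop_words = load_stop_words()
if stop_words is None:
    print("Stop words unavailable; install NLTK and run nltk.download('stopwords').")
else:
    print(len(stop_words), "English stop words loaded")
```

Returning `None` instead of raising lets the calling code decide whether to download the data, fall back to its own list, or stop.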
NLTK Stop Words Program
- The input file for the code below is py_file.txt. The code deletes the stop words from its contents and writes the result to file1.txt.
- The code first imports stopwords from nltk.corpus and builds the set of English stop words, stored in stop_words.
- It then opens py_file.txt in read-only mode as the file object py_file.
- Next, it calls the read method on py_file and stores the file contents in py_line.
- Finally, it splits the text into words with the split method, writes every word that is not a stop word to file1.txt, and closes the file.
Code:
from nltk.corpus import stopwords

# Build the stop word set once; set membership tests are fast.
stop_words = set(stopwords.words('english'))

# Read the whole input file; the with block closes it automatically.
with open(r"C:\Users\OMSAI\Desktop\py_file.txt") as py_file:
    py_line = py_file.read()

# Split on whitespace and keep the words that are not stop words.
# The NLTK list is lowercase, so compare the lowercased word.
words = py_line.split()
kept = [r for r in words if r.lower() not in stop_words]

# Open the output file once, in append mode as in the original example.
with open(r'C:\Users\OMSAI\Desktop\file1.txt', 'a') as afile:
    for r in kept:
        afile.write(" " + r)
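The same steps can also be wrapped in a function so that the paths and the stop word set are not hard-coded. The following is a minimal sketch under a few assumptions: `filter_file` is a hypothetical name, the small `demo_stop_words` set stands in for `set(stopwords.words('english'))` so that the example runs without the NLTK data, and temporary files replace the desktop paths used above.

```python
import os
import tempfile


def filter_file(src_path, dst_path, stop_words):
    """Copy src_path to dst_path line by line, dropping stop words.

    Hypothetical helper; returns the number of words written.
    """
    written = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            # The stop word list is lowercase, so compare lowercased words.
            kept = [w for w in line.split() if w.lower() not in stop_words]
            dst.write(" ".join(kept) + "\n")
            written += len(kept)
    return written


# Stand-in for set(stopwords.words('english')); an assumption for this demo.
demo_stop_words = {"the", "is", "a", "in", "of"}

with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "py_file.txt")
    dst_file = os.path.join(tmp, "file1.txt")
    with open(src_file, "w", encoding="utf-8") as f:
        f.write("The cat is in the garden\n")
    count = filter_file(src_file, dst_file, demo_stop_words)
    with open(dst_file, encoding="utf-8") as f:
        filtered_text = f.read()

print(count, repr(filtered_text))
```

Reading line by line keeps memory use low for large files, and opening the output in write mode means repeated runs do not pile up duplicate text.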
NLTK Stop Words Remove
There is no universal list of stop words in NLP, but many Python NLP libraries ship their own. We can also build our own stop word list.
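As one illustration of a custom list, a stop word set can be adjusted with ordinary set operations. This is only a sketch: `base_stop_words` is a small stand-in for a library-provided list, and the words added and removed are arbitrary examples.

```python
# Small stand-in for a library-provided list (assumption for this demo).
base_stop_words = {"the", "a", "an", "in", "not", "is"}

# Domain-specific words we also want to ignore (arbitrary examples).
extra_words = {"please", "thanks"}

# Words we want to keep even though the base list drops them,
# e.g. negations, which can matter for sentiment analysis.
keep_words = {"not"}

custom_stop_words = (base_stop_words | extra_words) - keep_words

tokens = "please do not park in the driveway".split()
filtered = [t for t in tokens if t not in custom_stop_words]
print(filtered)
```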
In this article we do not write our own stop word list; we use the one that comes with NLTK. We therefore need to install the NLTK library first and then download its stopwords data.
The steps to remove stop words with NLTK in Python are as follows.
- To remove stop words, we first import nltk and download the stopwords data. The example below shows both steps.
Code:
import nltk
nltk.download('stopwords')
- We then lowercase the text and split it into a list of words (tokens). After that, we build a new list that contains only the tokens that are not in the stop word list.
- The example below shows how to remove stop words with NLTK in Python.
Code:
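from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# word_tokenize also needs NLTK's Punkt tokenizer data, fetched with
# nltk.download() ('punkt', or 'punkt_tab' in newer NLTK releases).
py_text = "Example of how to remove nltk stopwords in python"
print("Text:", py_text)

# Lowercase first: the NLTK stop word list is all lowercase.
py_token = word_tokenize(py_text.lower())
print("Tokens:", py_token)

# Keep only the tokens that are not English stop words.
eng_stopw = set(stopwords.words('english'))
py_stopw = [t for t in py_token if t not in eng_stopw]
print("Text without stop words:", " ".join(py_stopw))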
- In the above example, we first imported the nltk module and downloaded the stopwords data. We then imported stopwords and the word_tokenize function.
- Next, we defined a sample text and printed it. The actual processing starts after that: the text is lowercased, split into tokens, and filtered against the stop word list.
- In the output, we can see that the stop words of, how, to, and in have been removed from the text.
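The filtering step can be pulled into a small function for reuse. The sketch below makes a few assumptions so that it runs without any downloaded NLTK data: `remove_stop_words` is a hypothetical name, `sample_stop_words` is a hand-picked subset standing in for `stopwords.words('english')`, and a plain `split()` replaces `word_tokenize`, which is enough for this punctuation-free sentence.

```python
def remove_stop_words(text, stop_words):
    """Lowercase `text`, split on whitespace, and drop words in `stop_words`."""
    stop_set = set(stop_words)  # set membership tests are fast
    return [t for t in text.lower().split() if t not in stop_set]


# Hand-picked subset used as a stand-in for NLTK's English list.
sample_stop_words = ["of", "how", "to", "in", "the", "a"]

py_text = "Example of how to remove nltk stopwords in python"
result = remove_stop_words(py_text, sample_stop_words)
print(" ".join(result))
```

With the real NLTK list, the same function would be called as `remove_stop_words(py_text, stopwords.words('english'))`.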
NLTK Stop Words List
- We can view the list of stop words with the following commands. We first import stopwords, then retrieve the English list and convert it to a set.
from nltk.corpus import stopwords
set(stopwords.words('english'))
- The command above returns the full set of English stop words that ships with NLTK. Removing these words is a common first step in Python text-processing code.
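The corpus reader behind `stopwords` can also report which language lists are installed, through its `fileids()` method. The sketch below wraps that call in a hypothetical helper, `stop_word_languages`, which returns an empty list when NLTK or the stopwords data is missing, so it can be run on any machine.

```python
def stop_word_languages():
    """Return the installed stop word languages, or [] if none can be loaded.

    Hypothetical helper for illustration; not part of NLTK itself.
    """
    try:
        from nltk.corpus import stopwords
        # Each file in the stopwords corpus is named after a language.
        return sorted(stopwords.fileids())
    except (ImportError, LookupError):
        return []


languages = stop_word_languages()
print(len(languages), "languages available:", languages)
```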
Conclusion
NLP is a field of study that deals with a range of problems, including natural language understanding. NLTK stop words are very common words (such as “the,” “a,” “an,” or “in”) that a program such as a search engine is configured to ignore when indexing and retrieving entries, and removing them is a routine pre-processing step.