Every Beginner NLP Engineer Must Know These Techniques

Ankush Mulkar
6 min read · Jan 25, 2023



Tokenization:

Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements, known as tokens.

Here is an example of tokenization in Python using the NLTK library:

import nltk
from nltk.tokenize import word_tokenize

# Download the Punkt tokenizer models on first use
nltk.download('punkt')

text = "This is an example of tokenization."
tokens = word_tokenize(text)
print(tokens)
# Output: ['This', 'is', 'an', 'example', 'of', 'tokenization', '.']
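NLTK can also split text into sentences rather than words. Here is a minimal sketch using sent_tokenize, which relies on the same Punkt models downloaded above:

from nltk.tokenize import sent_tokenize

text = "This is the first sentence. This is the second one."
sentences = sent_tokenize(text)
print(sentences)
# Output: ['This is the first sentence.', 'This is the second one.']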

Lemmatization:

Lemmatization is the process of reducing a word to its base or root form, called a lemma. Stemming is a similar process, but it often results in words that are not actual words.

Here is an example of lemmatization in Python using the NLTK library:

import nltk
from nltk.stem import WordNetLemmatizer

# Download the WordNet data on first use
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# The lemmatizer treats words as nouns unless told otherwise,
# so pass pos="v" to lemmatize verb forms correctly
print(lemmatizer.lemmatize("running", pos="v"))
# Output: 'run'
print(lemmatizer.lemmatize("ran", pos="v"))
# Output: 'run'
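Because the lemmatizer needs the part of speech to do its job, a common pattern is to feed it tags from NLTK's POS tagger. Here is a minimal sketch; the helper name get_wordnet_pos is my own, not an NLTK function:

from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to a WordNet POS constant (hypothetical helper)
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
for word, tag in pos_tag(word_tokenize("She was running home")):
    print(word, lemmatizer.lemmatize(word, get_wordnet_pos(tag)))
# 'running' is tagged as a verb, so it is lemmatized to 'run'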

Stemming:

In Natural Language Processing (NLP), stemming refers to the process of reducing a word to its base or root form, called a stem, usually by chopping off affixes. This is often done to group together different forms of a word so they can be analyzed together as a single item.

Here is an example of stemming in Python using the NLTK library:

import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('running'))
# Output: 'run'
print(stemmer.stem('runner'))
# Output: 'runner' (the Porter stemmer leaves this word unchanged)
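To see how stemming differs from the lemmatization above, compare the two on the same words; the stemmer often produces stems that are not real words:

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for word in ['studies', 'cries', 'better']:
    print(word, '->', stemmer.stem(word), '|', lemmatizer.lemmatize(word))
# Output:
# studies -> studi | study
# cries -> cri | cry
# better -> better | better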

Part-of-Speech Tagging:

Part-of-speech (POS) tagging is the process of marking each word in a text with its corresponding POS tag. Here is an example of POS tagging in Python using the NLTK library:

import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
# Download the tagger models on first use
nltk.download('averaged_perceptron_tagger')
text = "I am learning NLP techniques in Python."
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)
print(pos_tags)
# Output: [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'), ('techniques', 'NNS'), ('in', 'IN'), ('Python', 'NNP'), ('.', '.')]
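Since each token's tag is available programmatically, it is easy to filter by word class; for example, keeping only the nouns (Penn Treebank tags starting with 'NN') from the pos_tags list built above:

nouns = [word for word, tag in pos_tags if tag.startswith('NN')]
print(nouns)
# Output: ['NLP', 'techniques', 'Python']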

Named Entity Recognition:

Named Entity Recognition (NER) is the process of identifying and classifying named entities in a text into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages. Here is an example of NER in Python using the NLTK library:

import nltk
from nltk import ne_chunk
from nltk.tokenize import word_tokenize
# Download the NER chunker and its supporting data on first use
nltk.download('maxent_ne_chunker')
nltk.download('words')
text = "Barack Obama was born in Hawaii."
tokens = word_tokenize(text)
tagged_tokens = nltk.pos_tag(tokens)
ner_tree = ne_chunk(tagged_tokens)
print(ner_tree)
# Output:
# (S (PERSON Barack/NNP) (PERSON Obama/NNP) was/VBD born/VBN in/IN (GPE Hawaii/NNP) ./.)

Sentiment Analysis:

Sentiment Analysis is the process of determining the emotional tone behind a piece of text, whether it is positive, negative, or neutral. Here is an example of Sentiment Analysis in Python using the NLTK library:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
# Download the VADER lexicon on first use
nltk.download('vader_lexicon')
text = "I love this product! It's amazing."
sia = SentimentIntensityAnalyzer()
score = sia.polarity_scores(text)
print(score)
# Output: {'neg': 0.0, 'neu': 0.192, 'pos': 0.808, 'compound': 0.6369}
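The compound value is a normalized score between -1 and 1. A common rule of thumb, which is a convention rather than part of NLTK, maps it to a label using cutoffs at plus or minus 0.05:

def label_sentiment(compound):
    # Conventional VADER thresholds; the 0.05 cutoffs are a heuristic, not an NLTK API
    if compound >= 0.05:
        return 'positive'
    if compound <= -0.05:
        return 'negative'
    return 'neutral'

print(label_sentiment(score['compound']))
# Output: 'positive'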

Text Classification:

Text Classification is the process of assigning predefined categories or tags to a piece of text. Here is an example of Text Classification in Python using the scikit-learn library:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Create a dataset
data = {'text': ['This is a positive text.', 'This is a negative text.'], 'label': ['positive', 'negative']}
df = pd.DataFrame(data)
# Create a CountVectorizer object
vectorizer = CountVectorizer()
# Transform the text column
X = vectorizer.fit_transform(df['text'])
# Create a MultinomialNB object
clf = MultinomialNB()
# Fit the model
clf.fit(X, df['label'])
# Test the model on a sentence containing the word 'positive',
# which the classifier has seen in the training data
text = "This is another positive text."
X_test = vectorizer.transform([text])
pred = clf.predict(X_test)
print(pred)
# Output: ['positive']
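The vectorizer and classifier can also be chained into a single object. Here is a minimal sketch using scikit-learn's Pipeline, so raw strings go in and labels come out:

from sklearn.pipeline import Pipeline

pipeline = Pipeline([('vectorizer', CountVectorizer()),
                     ('classifier', MultinomialNB())])
pipeline.fit(df['text'], df['label'])
print(pipeline.predict(['This is a negative text.']))
# Output: ['negative']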

Language Translation:

Language Translation is the process of converting text from one language to another.

Here is an example of Language Translation in Python using the googletrans library:

from googletrans import Translator
# Note: googletrans wraps an unofficial Google endpoint; the release that
# generally works is installed with: pip install googletrans==4.0.0-rc1
translator = Translator()
text = "I am learning NLP techniques in Python."
translated_text = translator.translate(text, dest='fr').text
print(translated_text)
# Output (may vary): "J'apprends des techniques NLP en Python."
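googletrans can also detect the source language of a piece of text; as with translation, the result comes from an unofficial endpoint and may vary:

detected = translator.detect("J'apprends le NLP")
print(detected.lang)
# Output: 'fr'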

Text Summarization:

Text summarization is the process of condensing a piece of text to its main points.

Here is an example of Text Summarization in Python using the gensim library (note that the summarization module was removed in gensim 4.0, so this requires gensim 3.x):

from gensim.summarization import summarize
text = "Text summarization is the process of condensing a piece of text to its main points. The goal of summarization is to create a condensed version that retains the most important information from the original text. There are several methods for summarization including extraction-based methods and abstraction-based methods. Extraction-based methods select a subset of the words from the original text, while abstraction-based methods generate a new summary by using a model trained on the original text."
summary = summarize(text)
print(summary)
# Output (abridged): "There are several methods for summarization including extraction-based methods and abstraction-based methods."

Word Embeddings (e.g. Word2Vec, GloVe):

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.

Here is an example of training a Word2Vec model in Python using the gensim library:

from gensim.models import Word2Vec
# Define a toy dataset (each sentence is a list of tokens)
sentences = [['This', 'is', 'a', 'positive', 'text'],
             ['This', 'is', 'a', 'negative', 'text'],
             ['This', 'is', 'a', 'neutral', 'text']]
# Train the model (the parameter is named vector_size in gensim 4.x; older 3.x releases call it size)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
# Access the trained model's word vector
word_vector = model.wv['positive']
print(word_vector)
# Output (exact values vary per run): array([-1.90734863e-03, -1.52587891e-03, ...], dtype=float32)
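Because similar words receive similar vectors, a trained model can be queried for nearest neighbours. On a three-sentence toy corpus the neighbours are essentially random, but the calls show the API:

# Words with the most similar vectors (noise on a corpus this small)
print(model.wv.most_similar('positive', topn=2))
# Cosine similarity between two specific words
print(model.wv.similarity('positive', 'negative'))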

Here is an example of loading a pre-trained GloVe model in Python using the gensim library (raw GloVe files lack the word2vec header line, so gensim 4.x needs no_header=True; alternatively, convert the file with the glove2word2vec script first):

from gensim.models import KeyedVectors
# Load the model (no_header=True because GloVe files have no header line)
model = KeyedVectors.load_word2vec_format('path/to/glove.6B.100d.txt', binary=False, no_header=True)
# Access the word vector
word_vector = model['word']
print(word_vector)

Dependency Parsing:

Dependency parsing is the process of analyzing the grammatical structure of a sentence, based on the dependencies between the words in the sentence.

Here is an example of Dependency Parsing in Python using the spaCy library:

import spacy
# Load the model (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
# Define a sentence
sentence = "I am learning NLP techniques in Python."
# Apply dependency parsing
doc = nlp(sentence)
for token in doc:
    print(token.text, token.dep_)
# Output:
# I nsubj
# am aux
# learning ROOT
# NLP compound
# techniques dobj
# in prep
# Python pobj
# . punct

Note: while the above example uses the spaCy library, other libraries such as NLTK and the Stanford Parser can also be used for dependency parsing.
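spaCy also exposes higher-level structures derived from the parse; for example, base noun phrases via doc.noun_chunks, each reported here with the dependency role of its head word:

for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.dep_)
# Output:
# I nsubj
# NLP techniques dobj
# Python pobj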

Topic modeling

Topic modeling is a method used in natural language processing (NLP) to identify patterns and topics in a text corpus. One popular technique for topic modeling is Latent Dirichlet Allocation (LDA), which uses a statistical model to discover latent topics in a set of documents.

Here is an example of how to perform topic modeling using LDA and the gensim library in Python:

from gensim.corpora import Dictionary
from gensim.models import LdaModel
# Example text corpus
texts = [["cat", "dog", "rat", "elephant"],
         ["cat", "dog", "rat", "mouse"],
         ["dog", "rat", "mouse"]]
# Create a dictionary from the texts
dictionary = Dictionary(texts)
# Create a Bag-of-Words (BoW) representation of the texts
corpus = [dictionary.doc2bow(text) for text in texts]
# Train an LDA model on the corpus
lda = LdaModel(corpus, num_topics=2, id2word=dictionary)
# Print the topics
for topic_id, topic in lda.print_topics():
    print("Topic:", topic_id + 1)
    print(topic)

This example uses a simple text corpus containing three documents and trains an LDA model with 2 topics. The output will show the two topics learned by the model and the words that are associated with each topic.
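The trained model can also report the topic mixture of each document. Here is a minimal sketch using get_document_topics; the exact weights vary between runs because LDA is randomly initialized:

for i, doc_bow in enumerate(corpus):
    print("Document", i, lda.get_document_topics(doc_bow))
# Example output (weights vary per run):
# Document 0 [(0, 0.12), (1, 0.88)]
# Document 1 [(0, 0.89), (1, 0.11)]
# Document 2 [(0, 0.87), (1, 0.13)]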

Term frequency

Term frequency (tf) is a measure of how often a term appears in a document. It is commonly used in information retrieval and text mining. The tf-idf (term frequency-inverse document frequency) weighting scheme builds on it, assigning each term in a document a weight based on its tf and its idf.

Here is an example of how to calculate the term frequency of a document using python:

from collections import Counter
import string
# Example document
document = "This is an example document. It contains several words, such as 'example' and 'document'."
# Tokenize the document (lowercase and strip punctuation so 'document.' and 'document' count as the same term)
tokens = [token.strip(string.punctuation).lower() for token in document.split()]
# Count the frequency of each token
tf = Counter(tokens)
# Print the term frequency
print(tf)

This prints the frequency of each word in the document as a Counter, a dictionary-like mapping from word to count.
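To go from raw counts to the tf-idf weighting mentioned above, here is a minimal sketch using scikit-learn's TfidfVectorizer, which down-weights terms that appear in many documents:

from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
# Each row is a document, each column a term, each value a tf-idf weight
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))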

Mastering these techniques will give you a solid foundation for more advanced NLP tasks, such as language translation, text summarization, and question answering.

Follow the blog link below to master advanced NLP techniques: https://ankushmulkar.medium.com/top-most-ten-nlp-techniques-used-in-the-industry-34570a29f2f

To learn more about advanced NLP, follow the links below.

AnkushMulkar/Natural-Language-processing (github.com)

Ankush Mulkar Github portfolio
