NLP - A Beginner's Guide to Classification in 4 Steps

abhinaya rajaram · Published in CodeX · Jul 29, 2022

Machine learning and natural language processing (NLP) can uncover previously invisible patterns in text datasets, and they can also automate certain tasks, freeing people up for the higher-value, more creative work that machines can't do. In this article, I will try to explain NLP succinctly in simple words and show how it can be used to help non-profits.

Data Ask - "We've Seen Text Like This Before": Classification

We come to the classic and ubiquitous task that has made machine learning so successful: classification. Classification takes a set of input features and produces an output label, frequently a binary yes/no.

I had scraped information from the websites of various non-profits in the USA (I will discuss web scraping in another article). My starting point was one big Excel file of all the USA-based non-profits. The ask was:

  1. To extract emails and phone numbers from all the content, and
  2. To check whether the column "CJ_Mission?" correctly identified the non-profits that had something to do with criminal justice. Simple!

To filter for emails and phone numbers, these two lines of Python can do a pretty good job given the time constraint.

GitHub link: here is the GitHub link to see what the data looks like:


df['Phone'] = df['Content'].str.extract(r'((?:\(\d{3}\)|\d{3})(?:\s|\s?-\s?)?\d{3}(?:\s|\s?-\s?)?\d{4})', expand=True)
df['E-mail'] = df['Content'].str.extract(r'([a-zA-Z][a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]+)')
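If you want to see what these two lines pull out, here is a minimal, self-contained sketch. The DataFrame and its Content strings below are made up for illustration and simply stand in for the real scraped data.

import pandas as pd

# Hypothetical rows standing in for the scraped 'Content' column
df = pd.DataFrame({'Content': [
    'Contact us at (555) 123-4567 or info@examplejusticeorg.org for details.',
    'Our hotline is 555-987-6543; write to outreach@example-np.org anytime.',
]})

# Same extraction patterns as above
df['Phone'] = df['Content'].str.extract(
    r'((?:\(\d{3}\)|\d{3})(?:\s|\s?-\s?)?\d{3}(?:\s|\s?-\s?)?\d{4})', expand=True)
df['E-mail'] = df['Content'].str.extract(
    r'([a-zA-Z][a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]+)')

print(df[['Phone', 'E-mail']])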

Problem Statement: Very simple. Figure out whether the non-profits operate in the criminal justice arena or not. The data already had a column where an intern had entered whether each non-profit operated in the criminal justice domain (Yes, No). The ask is to cross-check all of this real quick.

EDA on the data

I will use a really small subset of the actual data to demonstrate how this works, and I have removed the extra data columns I had in the original report as well.

See here:

You can see how the target variable "CJ_Mission?" has an almost equal measure of 'Yes' and 'No'. Good.

Criminal justice non-profits seem wordier: on average, their mission statements contain more words. See below.

import matplotlib.pyplot as plt

# Word count per mission statement
data['word_count'] = data['Mission Statement'].apply(lambda x: len(str(x).split()))
print('Avg words used in Mission Statement column of Criminal Justice non-profits is',
      data[data['CJ_Mission?'] == 'Yes']['word_count'].mean())
print('Avg words used in Mission Statement column of NON-Criminal Justice non-profits is',
      data[data['CJ_Mission?'] == 'No']['word_count'].mean())

# PLOTTING WORD-COUNT
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
cj_words = data[data['CJ_Mission?'] == 'Yes']['word_count']
ax1.hist(cj_words, color='red')
ax1.set_title('Criminal Justice')
non_cj_words = data[data['CJ_Mission?'] == 'No']['word_count']
ax2.hist(non_cj_words, color='green')
ax2.set_title('Non-Criminal-Justice')
fig.suptitle('Words per Mission Statement')
plt.show()

Overall Approach:

The approach combines two machine learning classifiers. Each classifier reads a brief description from the "Mission Statement" column of a non-profit and then determines whether it relates to criminal justice or not.

Things you need to know about Text Preprocessing

Data preprocessing is the phase of preparing raw data to make it suitable for a machine learning model. For NLP, that includes text cleaning, stopwords removal, stemming, and lemmatization.

What are stop words? Stop words are words in any language that do not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence. For many search engines, these are some of the most common, short function words, such as "the", "is", "at", "which", and "on".
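As a quick illustration (assuming NLTK and its stopwords corpus are available), you can peek at a published English stop-word list like this:

import nltk
nltk.download('stopwords', quiet=True)   # fetch the corpus if it is not already present
from nltk.corpus import stopwords

english_stops = set(stopwords.words('english'))
print(len(english_stops))                # number of common function words in the list
print(sorted(english_stops)[:10])        # a small sample, e.g. 'a', 'about', 'above', ...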

When to remove stop words? If we have a text classification or sentiment analysis task, we should remove stop words, since they do not provide any information to our model; in other words, we keep unwanted words out of our corpus. There is no hard and fast rule on when to remove stop words.

How to remove stop words? We will stem each word with the Snowball stemmer (also known as the Porter2 stemming algorithm) and then filter out anything that appears in our stop list.

What is stemming? In simple words, stemming is reducing a word to its base word or stem, in such a way that words of a similar kind lie under a common stem. For example, the words care, cared, and caring lie under the same stem, 'care'.
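To make that concrete, here is a tiny sketch with NLTK's Snowball stemmer, the same stemmer used later in this article:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')
for word in ['care', 'cared', 'caring']:
    print(word, '->', stemmer.stem(word))
# all three reduce to the common stem 'care'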

Text cleaning steps vary according to the type of data and the required task. Generally, the string is converted to lowercase, and punctuation is removed before the text gets tokenized. Tokenization is the process of splitting a string into a list of strings (or “tokens”).
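Here is a minimal sketch of those cleaning steps on a made-up sentence, just to show the order of operations (lowercase, strip punctuation, tokenize by whitespace):

import string

text = "Our Mission: to support returning citizens, reduce recidivism, and promote justice."
text = text.lower()                                                # lowercase
text = text.translate(str.maketrans('', '', string.punctuation))  # strip punctuation
tokens = text.split()                                              # tokenize on whitespace
print(tokens)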

Steps to Get There:

Step 1: Create a List of Stop words

So, first, we're going to stem the words to reduce them to their roots, in order to limit differences based on tense or whether they appear in plural or possessive form. Then, we're going to strip out a custom list of stop words.

Custom Stop Words

You might ask: why custom stop words? While it is fairly easy to use a published set of stop words, in many cases such as this one a published list is insufficient for the application at hand. For example, in clinical texts, terms like "mcg", "dr.", and "patient" occur in almost every document you come across, so those terms may be regarded as potential stop words for clinical text mining and retrieval. Let us find our own. There are three ways to do this, and I will choose a combination of the first two.

1. Most frequent terms as stop words

You can take the top N terms to be your stop words. You can also eliminate common English words (using a published stop list) prior to sorting, so that you are sure you target the domain-specific stop words. (A short sketch covering this and the next approach appears after point 2.)

2. Least frequent terms as stop words

Terms that are extremely infrequent may also not be useful for text mining and retrieval. If, despite all the normalization, a term still has a frequency count of one, you could remove it. This can significantly reduce your overall feature space.
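Both frequency-based approaches (1 and 2) can be sketched in a few lines with collections.Counter; the top-N cutoff of 20 below is purely illustrative:

from collections import Counter

# Flatten every mission statement into one token list
# (using the same `data` DataFrame as in the EDA section)
all_tokens = " ".join(data['Mission Statement'].astype(str)).lower().split()
counts = Counter(all_tokens)

# 1. Most frequent terms as stop-word candidates (top N, here N = 20)
most_frequent = [term for term, _ in counts.most_common(20)]

# 2. Least frequent terms as stop-word candidates (terms that occur only once)
singletons = [term for term, count in counts.items() if count == 1]

print(most_frequent)
print(len(singletons), 'terms occur only once')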

3. Low IDF terms as stop words

Inverse document frequency (IDF) is based on the fraction of documents in your collection that contain a specific term ti. Say you have N documents, and term ti occurs in M of those N documents. The IDF of ti is then computed as:

IDF(ti) = log(N / M)

So the more documents ti appears in, the lower its IDF score. This means terms that appear in every document have an IDF score of 0 (since log(N/N) = 0). If you rank each term in your collection by its IDF score in descending order, you can treat the bottom K terms, those with the lowest IDF scores, as your stop words.
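The formula translates directly into code. Here is a minimal sketch over three toy "documents" (both the documents and the cutoff K are made up for illustration):

import math

docs = [
    "we support returning citizens and reduce recidivism",
    "we provide meals to the community",
    "we advocate for criminal justice reform in the community",
]
N = len(docs)
doc_tokens = [set(d.split()) for d in docs]

# IDF(ti) = log(N / M), where M is the number of documents containing ti
vocab = set().union(*doc_tokens)
idf = {t: math.log(N / sum(t in toks for toks in doc_tokens)) for t in vocab}

# Terms with the lowest IDF (appearing in the most documents) are stop-word candidates
K = 3
low_idf = sorted(idf, key=idf.get)[:K]
print(low_idf)   # e.g. 'we' appears in every document, so its IDF is 0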

# Define a standard Snowball stemmer (from NLTK)
from nltk.stem.snowball import SnowballStemmer
STEM = SnowballStemmer('english')
# Make a list of stopwords, including the stemmed versions
# These are words that have no impact on the classification, and
# can even occasionally mess up the classifier.
STOP = ['in',
'non profit',
'from',
'the',
'and',
'their',
'after',
'for',
'in',
'that',
'our',
'we',
'that',
'mission',
'our',
'in ',
'of',
'to',
'a', 'people', 'IN', 'social', 'community', 'BY', 'OF', 'IN',
'and',
'to',
'the',
'of ',
'AND',
'the',
'THE',
'and',
'in ',
'as',
'is',
'by',
'of',
'to',
'a', 'CIVIL', 'organization'
]
STOP += [STEM.stem(i) for i in STOP]
print(STOP)
STOP = list(set(STOP))
print(STOP)

Note: I just came up with my own list based on some manual tuning, checking, and adjusting. Running the snippet below on this small subset will give you some idea of what is important, what is not, the frequency of words, etc.

from collections import Counter
# The 100 most common words across all mission statements
Counter(" ".join(data["Mission Statement"]).split()).most_common(100)

Step 2: Tokenize

This is a function that takes a description and breaks it up into the individual "features" we're going to use to classify it. We separate the description into individual words, stem them, and remove stop words. From there, we pair the remaining words into bigrams.

from nltk.util import ngrams

def tokenize(desc):
    """
    Takes description text, strips out unwanted words and text,
    and prepares it for the trainer.
    """
    # first lower-case and strip leading/trailing whitespace
    desc = desc.lower().strip()
    # kill the 'do-'s and any stray punctuation
    desc = desc.replace('do-', '').replace('.', '').replace(',', '')
    # make a list of words by splitting on whitespace
    words = desc.split(' ')
    # Make sure each "word" is a real string (accounts for odd whitespace),
    # then stem each word
    words = [STEM.stem(i) for i in words if i]
    # remove the custom stop words from Step 1
    words = [i for i in words if i not in STOP]
    # let's see if adding bigrams improves the accuracy
    bigrams = ngrams(words, 2)
    bigrams = ["%s|%s" % (i[0], i[1]) for i in bigrams]
    # bigrams = [i for i in bigrams if STEMMED_BIGRAMS.get(i)]
    # The NLTK trainer expects a dict of {feature: True}
    output = dict([(i, True) for i in words + bigrams])
    return output
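A quick sanity check of the tokenizer on a made-up mission statement (the exact keys you get depend on your stop list):

sample = "Our mission is to support people returning from prison and to reform the criminal justice system."
print(tokenize(sample))
# roughly: {'support': True, 'return': True, 'prison': True, ..., 'crimin|justic': True, ...}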

Step 3: Pulling features

import csv

# open our sample file and use the CSV module to parse it
f = open('NP_CJ_train_.csv', 'r')
data = list(csv.DictReader(f))
# Make an empty list for our processed data
qualities = []
# Loop through all the lines in the CSV
for i in data:
    desc = i.get('Mission Statement')
    classify = i.get('CJ_Mission?')
    feats = tokenize(desc)
    qualities.append((feats, classify))
f.close()

Step 4: Train the classifiers

For this analysis we used two machine learning classifiers. The first is a linear support vector machine from the scikit-learn Python library. The second is a maximum entropy classifier from NLTK.

# Train our classifiers. Let's start with Linear SVC
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from nltk.classify.scikitlearn import SklearnClassifier

# Make a data prep pipeline
pipeline = Pipeline([
    ('tfidf', TfidfTransformer()),
    ('linearsvc', LinearSVC()),
])
# make the classifier
linear_svc = SklearnClassifier(pipeline)
# Train it
linear_svc.train(qualities)

Now let's do Maximum Entropy:

from nltk.classify import MaxentClassifier
maxent = MaxentClassifier.train(qualities)
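Before scoring the held-out test file, a quick sanity check on a single made-up description (purely illustrative) shows how both trained classifiers are called:

sample = tokenize("We provide re-entry services and legal aid to people leaving prison.")
print('linear_svc says:', linear_svc.classify(sample))
print('maxent says:', maxent.classify(sample))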

Testing the classifiers

Now let's test these out! For this example, we're only using a training sample of a few non-profits; for the official analysis, I used a training sample with many more data points. We also chose to use two classifiers because, though they agreed on the vast majority of criminal justice topics, with edge cases one classifier would sometimes do a better job than the other. Let's check the results.

test_data = list(csv.DictReader(open('NP_CJ_test_.csv', 'r')))
d = []
for i in test_data:
    desc = i.get('Mission Statement')
    classify = i.get('CJ_Mission?')
    tokenized = tokenize(desc)
    # now grab the results of our classifiers
    maxent_class = maxent.classify(tokenized)
    svc_class = linear_svc.classify(tokenized)
    # Print the actual label next to both predictions
    print('actual: %s | maxent: %s | linear_svc: %s |' % (classify, maxent_class, svc_class))
    d.append((classify, maxent_class, svc_class))
    # print(d[-1])
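Since the second classifier is kept precisely because the two occasionally disagree on edge cases, a simple follow-up (a sketch, not part of the original analysis) is to flag the rows where they disagree for manual review:

# Flag rows where the two classifiers disagree; these are the edge cases worth a manual look
# (d holds tuples of (actual, maxent, svc) from the loop above)
disagreements = [row for row in d if row[1] != row[2]]
print(len(disagreements), 'mission statements where maxent and linear_svc disagree')
for actual, maxent_class, svc_class in disagreements:
    print('actual: %s | maxent: %s | linear_svc: %s' % (actual, maxent_class, svc_class))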

Quick Check on Results

For the SVC, there is just one wrong prediction here.

#print(d)
import pandas as pd
df2 = pd.DataFrame(d)
df2.columns = ['Correct_label', 'maxent_predict', 'svc_predict']
#print(df2)
contingency_matrix = pd.crosstab(df2['svc_predict'], df2['Correct_label'])
print(contingency_matrix)

from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
import seaborn as sn

# classification_report lists labels alphabetically, so 'No' comes first
target_names = ['No', 'Yes']
# Plot the contingency table as a heatmap
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_aspect(1)
res = sn.heatmap(contingency_matrix.T, annot=True, fmt='.2f', cmap="YlGnBu", cbar=False)
plt.title('Confusion Matrix for SVC')
plt.savefig("crosstab_pandas.png", bbox_inches='tight', dpi=100)
plt.show()
print("SVC", classification_report(df2.Correct_label, df2.svc_predict, target_names=target_names))
print("Maxent", classification_report(df2.Correct_label, df2.maxent_predict, target_names=target_names))

Conclusion

This article demonstrated how to analyze text data with NLP and extract features for a machine learning model. Of course, this is just the beginning, and there is a lot more that can be done to improve this model, but I hope it provides a good starting point for ML aspirants.
