Text Analysis in Python: Key Points

Pre-Alpha

Introduction to Natural Language Processing

NLP comprises models that perform different tasks.
Our workflow for an NLP project consists of designing the project, preprocessing the data, representing it, running a model, creating output, and interpreting that output.
NLP tasks can be adapted to suit different research interests.

Corpus Development: Text Data Collection

You will need to evaluate the suitability of data for inclusion in your corpus, taking into account issues such as legal and ethical restrictions and data quality.
It is important to think critically about data sources and the context of how they were created or assembled.
Becoming familiar with your data and its characteristics can help you prepare your data for analysis.

Preparing and Preprocessing Your Data

Tokenization breaks strings into smaller parts for analysis.
Lowercasing converts capital letters to lowercase so that words such as "The" and "the" are treated as the same token.
Stopwords are common words that do not contain much useful information.
Lemmatization reduces words to their dictionary form, or lemma.
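
The four steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the lesson's own pipeline: the stopword list and the lemma lookup table are tiny, hypothetical stand-ins for the much larger resources an NLP library would supply.

```python
import re

# Hypothetical, deliberately tiny resources used only for illustration.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "were", "is"}
LEMMAS = {"cats": "cat", "running": "run", "ran": "run", "mice": "mouse"}

def preprocess(text):
    """Tokenize, lowercase, drop stopwords, then map each token to its lemma."""
    tokens = re.findall(r"[A-Za-z]+", text)              # tokenization
    tokens = [t.lower() for t in tokens]                 # lowercasing
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return [LEMMAS.get(t, t) for t in tokens]            # toy lemmatization

print(preprocess("The cats were running in the garden."))
```

A real lemmatizer relies on a vocabulary and part-of-speech information rather than a fixed lookup table, so treat the last step as a placeholder.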

Vector Space and Distance

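We model documents by plotting them in high-dimensional space.
Euclidean distance between documents is highly dependent on document length.
Because documents are modeled as vectors, cosine similarity, which compares the direction of vectors rather than their length, can be used as a similarity metric.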
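
To make the effect of document length concrete, the sketch below compares a short document with a copy of itself repeated ten times. It assumes simple word-count vectors and uses only the standard library; the function names are illustrative.

```python
import math
from collections import Counter

def count_vector(tokens, vocab):
    """Represent a document as word counts over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

short_doc = "whale sea whale".split()
long_doc = short_doc * 10          # same content, ten times longer
vocab = sorted(set(short_doc))

u = count_vector(short_doc, vocab)
v = count_vector(long_doc, vocab)

print("Euclidean distance:", euclidean(u, v))
print("Cosine similarity:", cosine_similarity(u, v))
```

The two documents use words in identical proportions, so their vectors point in the same direction and the cosine similarity is 1 (up to floating-point rounding), even though the Euclidean distance between them is large.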

Document Embeddings and TF-IDF

Some words convey more information about a corpus than others.
One-hot encodings treat all words equally.
TF-IDF encodings weight overly common words lower.
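
The weighting can be sketched with one common textbook variant of TF-IDF: term frequency is a word's count divided by document length, and inverse document frequency is log(N / df). This variant is an assumption made for illustration; libraries typically apply smoothed or normalized versions, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration; each document is already tokenized.
docs = [
    ["the", "whale", "swam"],
    ["the", "ship", "sailed"],
    ["the", "whale", "sank", "the", "ship"],
]

def tf_idf(docs):
    """Return one {word: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
for doc_weights in weights:
    print({word: round(w, 3) for word, w in doc_weights.items()})
```

Under this variant, "the" appears in every document, so its weight is zero, while "swam", which appears in only one document, receives the highest weight in the first document.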

Latent Semantic Analysis

Topic modeling helps explore and describe the content of a corpus.
LSA defines topics as spectra that the corpus is distributed over.
Each dimension (topic) in LSA corresponds to a contrast between positively and negatively weighted words.
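
The idea of a topic as a contrast can be sketched on a toy document-term matrix. LSA is normally computed with a truncated singular value decomposition from a numerical library; to stay dependency-free, this sketch swaps in plain power iteration with deflation on the term-by-term matrix, which recovers the same leading term directions. The data and all function names are invented for illustration.

```python
import math

terms = ["said", "whale", "sea", "bank", "loan"]
# Rows are documents, columns are term counts (toy data).
doc_term = [
    [1, 2, 2, 0, 0],  # a document about the sea
    [1, 0, 0, 2, 2],  # a document about finance
]

def gram_matrix(a):
    """Return A^T A; its eigenvectors are the term-side singular vectors of A."""
    n = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(n)] for i in range(n)]

def top_eigenvector(m, iterations=200):
    """Power iteration: repeatedly multiply by m and renormalize."""
    n = len(m)
    v = [float(i + 1) for i in range(n)]  # asymmetric starting vector
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * m[i][j] * v[j] for i in range(n) for j in range(n))
    return eigenvalue, v

def deflate(m, eigenvalue, v):
    """Remove a found component so the next power iteration finds the following one."""
    n = len(m)
    return [[m[i][j] - eigenvalue * v[i] * v[j] for j in range(n)] for i in range(n)]

m = gram_matrix(doc_term)
value1, topic1 = top_eigenvector(m)
value2, topic2 = top_eigenvector(deflate(m, value1, topic1))

for name, topic in [("topic 1", topic1), ("topic 2", topic2)]:
    print(name, {t: round(w, 2) for t, w in zip(terms, topic)})
```

In this toy corpus the first dimension weights every term in the same direction, while the second gives the sea words and the finance words opposite signs and leaves the shared word "said" near zero. The overall sign of each dimension is arbitrary; only the contrast between the two groups is meaningful.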

Intro to Word Embeddings

Word embeddings can help us derive additional meaning stored in text at the level of individual words.
Word embeddings have many use cases in text analysis and NLP-related tasks.

The Word2Vec Algorithm

Artificial neural networks (ANNs) are powerful models that can, in principle, approximate a wide range of functions given sufficient capacity and training data.
The best way to choose between the two training methods (CBOW and Skip-gram) is to try both and see which one works best for your specific application.
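
The difference between the two training methods lies in how training examples are built from a window of text: Skip-gram predicts each context word from the center word, while CBOW predicts the center word from its context. The sketch below only generates those examples; it does not train a network, and the function names and window size are illustrative assumptions.

```python
def context_indices(i, length, window):
    """Indices within `window` positions of i, excluding i itself."""
    return [j for j in range(max(0, i - window), min(length, i + window + 1)) if j != i]

def skipgram_pairs(tokens, window=1):
    """Skip-gram: one (center, context_word) pair per neighboring word."""
    return [
        (center, tokens[j])
        for i, center in enumerate(tokens)
        for j in context_indices(i, len(tokens), window)
    ]

def cbow_examples(tokens, window=1):
    """CBOW: one (context_words, center) example per position."""
    return [
        ([tokens[j] for j in context_indices(i, len(tokens), window)], center)
        for i, center in enumerate(tokens)
    ]

tokens = ["the", "whale", "swam"]
print(skipgram_pairs(tokens))
print(cbow_examples(tokens))
```

Skip-gram produces more, smaller training examples from the same text, and CBOW produces fewer examples that each pool several context words.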

Training Word2Vec

As an alternative to using a pre-trained model, training a Word2Vec model on a specific dataset allows you to use Word2Vec for NER-related tasks.

Finetuning LLMs

HuggingFace has many examples of LLMs you can fine-tune.
Examine preexisting examples to get an idea of what your model expects.
Label Studio and other tagging software allow you to easily tag your own data.
Looking at commonly used metrics and at other models' performance in your subject area will give you an idea of how well your model did.
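
The key points above do not name specific metrics. As an assumption, the sketch below uses precision, recall, and F1, which are widely reported for tagging tasks, and computes them per token for a single label. The labels and predictions are invented for illustration.

```python
def precision_recall_f1(gold, predicted, positive):
    """Token-level precision, recall, and F1 for one label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical gold tags and model predictions for five tokens.
gold = ["PER", "O", "PER", "O", "PER"]
predicted = ["PER", "PER", "O", "O", "PER"]

precision, recall, f1 = precision_recall_f1(gold, predicted, positive="PER")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Published results for entity tagging are often scored on whole entity spans rather than on individual tokens, so check how a benchmark computes its scores before comparing your numbers with it.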

Ethics and Text Analysis

Text analysis is a tool and can’t assign meaning to results.
As researchers, we are responsible for understanding and explaining our methods and results.