Introduction to Natural Language Processing


  • NLP comprises models that perform different tasks.
  • Our workflow for an NLP project consists of designing the project, preprocessing the data, choosing a representation, running the model, creating output, and interpreting that output.
  • NLP tasks can be adapted to suit different research interests.

Corpus Development: Text Data Collection


  • You will need to evaluate whether data are suitable for inclusion in your corpus, taking into account issues such as legal and ethical restrictions and data quality.
  • It is important to think critically about data sources and the context of how they were created or assembled.
  • Becoming familiar with your data and its characteristics can help you prepare your data for analysis.

Preparing and Preprocessing Your Data


  • Tokenization breaks strings into smaller parts for analysis.
  • Lowercasing converts capital letters to lowercase so that words such as "The" and "the" are treated as the same token.
  • Stopwords are common words that do not contain much useful information.
  • Lemmatization reduces words to their dictionary (base) form, known as the lemma.
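
The four steps above can be sketched in a few lines of plain Python. This is only an illustration: the `STOPWORDS` set and the `LEMMAS` lookup table are tiny, hypothetical stand-ins for the resources an NLP library would supply, and the regular expression is one simple way to tokenize among many.

```python
import re

# Hypothetical, deliberately tiny resources for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "were", "was", "is", "are"}
LEMMAS = {"cats": "cat", "sitting": "sit", "mats": "mat"}

def preprocess(text):
    # Tokenization: split the string into runs of letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text)
    # Lowercasing: "The" and "the" become the same token.
    tokens = [t.lower() for t in tokens]
    # Stopword removal: drop common words that carry little information.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Lemmatization: map each token to its dictionary form when we know it.
    return [LEMMAS.get(t, t) for t in tokens]

print(preprocess("The cats were sitting on the mats."))
```

Here the sentence is reduced to the three content words `cat`, `sit`, and `mat`. In practice a library lemmatizer would replace the lookup table.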

Vector Space and Distance


  • We model documents by plotting them in high dimensional space.
  • Euclidean distance is highly dependent on document length.
  • Because documents are modeled as vectors, cosine similarity, which compares direction rather than magnitude, can be used as a similarity metric.
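
A small sketch may help show why document length matters. The two count vectors below are made up for illustration: `long_doc` uses the same words in the same proportions as `short_doc`, just ten times as often.

```python
import math

def euclidean(u, v):
    # Straight-line distance between two points in vector space.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors; ignores their lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

short_doc = [1, 2, 0]    # hypothetical word counts for a short document
long_doc = [10, 20, 0]   # same proportions, ten times longer

print(euclidean(short_doc, long_doc))
print(cosine_similarity(short_doc, long_doc))
```

The Euclidean distance is large (about 20.1) even though the documents have identical word proportions, while the cosine similarity is 1 up to floating-point rounding.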

Document Embeddings and TF-IDF


  • Some words convey more information about a corpus than others.
  • One-hot encodings treat all words equally.
  • TF-IDF encodings assign lower weights to words that are common across the corpus.
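
As a rough sketch of the weighting idea, the function below computes one common textbook form of TF-IDF: term frequency (count divided by document length) multiplied by `log(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the term. This formula is an assumption for illustration; libraries typically use smoothed or normalized variants, so their numbers will differ. The three toy documents are made up.

```python
import math
from collections import Counter

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

def tf_idf(docs):
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
print(weights[0])
```

In the first document, "the" receives a weight of 0 because it appears in every document, "cat" receives a small weight because it appears in two of three, and "sat" receives the largest weight because it is unique to that document.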

Latent Semantic Analysis


  • Topic modeling helps explore and describe the content of a corpus.
  • LSA defines topics as spectra that the corpus is distributed over.
  • Each dimension (topic) in LSA corresponds to a contrast between positively and negatively weighted words.
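
The contrast between positively and negatively weighted words can be seen in a toy example. The sketch below assumes NumPy is available and applies a plain singular value decomposition to a made-up document-term count matrix; a real LSA pipeline would usually start from TF-IDF weights and keep only the first few dimensions.

```python
import numpy as np

terms = ["the", "cat", "dog", "stock", "market"]
# Rows are documents, columns are term counts (a toy, made-up corpus):
# two documents about pets and two about finance, all containing "the".
X = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],
    [1, 0, 0, 1, 1],
], dtype=float)

# Rows of Vt are the topic dimensions, ordered by singular value.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# The second dimension separates the two groups of documents.
topic = dict(zip(terms, Vt[1]))
for term, weight in topic.items():
    print(f"{term:>7s} {weight:+.2f}")
```

In this dimension the pet words and the finance words receive weights of opposite sign, while "the" sits near zero. Which group comes out positive is arbitrary, since the sign of a singular vector is not determined.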

Intro to Word Embeddings


  • Word embeddings can help us derive additional meaning stored in text at the level of individual words.
  • Word embeddings have many use cases in text analysis and NLP-related tasks.

The Word2Vec Algorithm


  • Artificial neural networks (ANNs) are powerful models that can approximate a very wide range of functions given sufficient model capacity and training data.
  • The best way to choose between the two training methods (CBOW and Skip-gram) is to try both and see which one works best for your specific application.
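
To make the difference between the two training methods concrete, the sketch below builds the training examples each one would see: CBOW predicts a target word from its surrounding context, while Skip-gram predicts each context word from the target. The helper names and the `window` parameter are hypothetical choices for this illustration, and no neural network is trained here.

```python
def context_windows(tokens, window=2):
    # Yield each token together with its neighbors within the window.
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

def cbow_examples(tokens, window=2):
    # CBOW: input is the whole context, output is the target word.
    return [(context, target) for target, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window=2):
    # Skip-gram: input is the target word, output is one context word at a time.
    return [(target, c) for target, context in context_windows(tokens, window) for c in context]

tokens = ["the", "quick", "brown", "fox"]
print(cbow_examples(tokens, window=1))
print(skipgram_examples(tokens, window=1))
```

With a window of 1, CBOW produces one example per word (four here), whereas Skip-gram produces one example per word-neighbor pair (six here).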

Training Word2Vec


  • As an alternative to using a pre-trained model, training a Word2Vec model on a specific dataset allows you to use Word2Vec for named entity recognition (NER) related tasks.

Finetuning LLMs


  • HuggingFace has many examples of LLMs you can fine-tune.
  • Examine preexisting examples to get an idea of what your model expects.
  • Label Studio and other tagging software allow you to easily tag your own data.
  • Looking at commonly used metrics and other models' performance in your subject area will give you an idea of how well your model performed.

Ethics and Text Analysis


  • Text analysis is a tool and can’t assign meaning to results.
  • As researchers, we are responsible for understanding and explaining our methods and results.