Summary and Schedule
Welcome to the Text Analysis workshop for Python! Below is the list of lessons, each with a brief summary of the questions it covers.
Prerequisites
Python experience is required for this workshop.
| Time | Episode | Questions |
| --- | --- | --- |
|  | Setup Instructions | Download files required for the lesson |
| 00h 00m | 1. Introduction to Natural Language Processing | What is Natural Language Processing? What tasks can be done by Natural Language Processing? What does a workflow for an NLP project look like? |
| 00h 10m | 2. Corpus Development: Text Data Collection | How do I evaluate what kind of data to use for my project? What do I need to consider when building my corpus? |
| 00h 50m | 3. Preparing and Preprocessing Your Data | How can I prepare data for NLP? What are tokenization, casing, and lemmatization? |
| 01h 10m | 4. Vector Space and Distance | How can we model documents effectively? How can we measure similarity between documents? What’s the difference between cosine similarity and distance? |
| 01h 50m | 5. Document Embeddings and TF-IDF | What is a document embedding? What is TF-IDF? |
| 02h 20m | 6. Latent Semantic Analysis | What is topic modeling? What is Latent Semantic Analysis (LSA)? |
| 02h 50m | 7. Intro to Word Embeddings | How can we extract vector representations of individual words rather than documents? What sort of research questions can be answered with word embedding models? |
| 03h 35m | 8. The Word2Vec Algorithm | How does the Word2Vec model produce meaningful word embeddings? How is a Word2Vec model trained? |
| 04h 20m | 9. Training Word2Vec | How can we train a Word2Vec model? When is it beneficial to train a Word2Vec model on a specific dataset? |
| 05h 25m | 10. Fine-tuning LLMs | How can I fine-tune preexisting LLMs for my own research? How do I pick the right data format? How do I create my own labels? How do I put my data into a model for fine-tuning? How do I evaluate success at my task? |
| 07h 25m | 11. Ethics and Text Analysis | Is text analysis artificial intelligence? How can training data influence results? What are the risk zones to consider when using text analysis for research? |
| 08h 05m | Finish |  |
The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
Clone the git project
There are a number of files we need to conduct the workshop. By downloading the git repository, you will have a copy of all of the necessary files and functions required to run through the exercises. (If you prefer, you can also clone the repository directly; see the sketch after these steps.)
- Click the link below to open the GitHub page.
- Click the green “Code <>” button.
- Click “Download ZIP.”
- Unzip the archive to your desktop or working directory.
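As an alternative to downloading a ZIP, you can clone the repository from a Colab cell using IPython’s `!` shell escape. This is a minimal sketch, not part of the official setup: the repository URL below is an assumption inferred from the “python-text-analysis-gh-pages” folder name mentioned in the Colab setup, so confirm it against the GitHub page linked above before running.

```python
# Run in a Colab cell: the leading '!' hands the line to the shell.
# ASSUMPTION: repository URL inferred from the "python-text-analysis-gh-pages"
# folder name; verify it on the lesson's GitHub page first.
!git clone https://github.com/carpentries-incubator/python-text-analysis.git
```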
Google Colab Setup
We will be using Google Colab to run Python code in our browsers. Colab was chosen to ensure all learners have similar processing power (using Google’s servers), and to streamline the setup required for the workshop.
- If you’ve never opened a Colab notebook before, first visit the Google Colab website and click “New notebook” in the pop-up that appears. When you open your first notebook, a “Colab Notebooks” folder will automatically be created in Google Drive.
- Visit Google Drive and find the newly created “Colab Notebooks” folder stored under My Drive: `/My Drive/Colab Notebooks`.
- Create a folder named `text-analysis` inside the Colab Notebooks folder on Google Drive. The path should look like this: `/My Drive/Colab Notebooks/text-analysis/`.
- Upload the “data” and “code” folders that were downloaded from git (inside the “python-text-analysis-gh-pages” folder) to the `text-analysis` folder you created in Google Drive: `/My Drive/Colab Notebooks/text-analysis/data` and `/My Drive/Colab Notebooks/text-analysis/code`. (You can verify the layout with the short snippet after this list.)
- At the start of each episode during the workshop, you can create a fresh Colab notebook within the `text-analysis` folder by navigating to `/My Drive/Colab Notebooks/text-analysis/` in Google Drive and clicking `New -> More -> Google Colaboratory`.
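Once the folders are uploaded, you can sanity-check the layout from any Colab notebook. The snippet below is a minimal sketch: `drive.mount` and the `/content/drive/MyDrive` mount point are standard Colab behavior, and the `workshop_dir` path simply mirrors the folder structure described above.

```python
# In a fresh Colab notebook: mount Google Drive so the runtime can see your files.
from google.colab import drive

drive.mount('/content/drive')  # follow the authorization prompt

import os

# Once mounted, "My Drive" is exposed at /content/drive/MyDrive.
workshop_dir = '/content/drive/MyDrive/Colab Notebooks/text-analysis'
print(sorted(os.listdir(workshop_dir)))  # expect ['code', 'data'] if the upload worked
```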