Summary and Schedule
Welcome to the Text Analysis workshop for Python! Below is the list of lessons, each with a brief summary of the questions it covers.
Prerequisites
Python experience is required for this workshop.
| Time | Episode | Questions |
| --- | --- | --- |
|  | Setup Instructions | Download files required for the lesson |
| 00h 00m | 1. Introduction to Natural Language Processing | What is Natural Language Processing? What tasks can be done by Natural Language Processing? What does a workflow for an NLP project look like? |
| 00h 10m | 2. Corpus Development: Text Data Collection | How do I evaluate what kind of data to use for my project? What do I need to consider when building my corpus? |
| 00h 50m | 3. Preparing and Preprocessing Your Data | How can I prepare data for NLP? What are tokenization, casing, and lemmatization? |
| 01h 10m | 4. Vector Space and Distance | How can we model documents effectively? How can we measure similarity between documents? What’s the difference between cosine similarity and distance? |
| 01h 50m | 5. Document Embeddings and TF-IDF | What is a document embedding? What is TF-IDF? |
| 02h 20m | 6. Latent Semantic Analysis | What is topic modeling? What is Latent Semantic Analysis (LSA)? |
| 02h 50m | 7. Intro to Word Embeddings | How can we extract vector representations of individual words rather than documents? What sort of research questions can be answered with word embedding models? |
| 03h 35m | 8. The Word2Vec Algorithm | How does the Word2Vec model produce meaningful word embeddings? How is a Word2Vec model trained? |
| 04h 20m | 9. Training Word2Vec | How can we train a Word2Vec model? When is it beneficial to train a Word2Vec model on a specific dataset? |
| 05h 25m | 10. Fine-tuning LLMs | How can I fine-tune preexisting LLMs for my own research? How do I pick the right data format? How do I create my own labels? How do I put my data into a model for fine-tuning? How do I evaluate success at my task? |
| 07h 25m | 11. Ethics and Text Analysis | Is text analysis artificial intelligence? How can training data influence results? What are the risk zones to consider when using text analysis for research? |
| 08h 05m | Finish |  |
The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
Clone the git project
There are a number of files we need to conduct the workshop. By downloading the git repository, you will have a copy of all of the necessary files and functions required to run through the exercises. (If you prefer, you can also clone the repository directly; see the sketch after these steps.)
- Click the link below to open the GitHub page.
- Click the green “Code <>” button.
- Click “Download ZIP.”
- Unzip the archive to your desktop or working directory.
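As an alternative to downloading a ZIP, you can clone the repository from a Colab cell using IPython’s `!` shell escape. This is a minimal sketch, not part of the official setup: the repository URL below is an assumption inferred from the “python-text-analysis-gh-pages” folder name mentioned in the Colab setup, so confirm it against the GitHub page linked above before running.

```python
# Run in a Colab cell: the leading '!' hands the line to the shell.
# ASSUMPTION: repository URL inferred from the "python-text-analysis-gh-pages"
# folder name; verify it on the lesson's GitHub page first.
!git clone https://github.com/carpentries-incubator/python-text-analysis.git
```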
Google Colab Setup
We will be using Google Colab to run Python code in our browsers. Colab was chosen to ensure all learners have similar processing power (using Google’s servers), and to streamline the setup required for the workshop.
- If you’ve never opened a Colab notebook before, first visit the Google Colab website and click “New notebook” in the pop-up that appears. When you open your first notebook, a “Colab Notebooks” folder will automatically be created in Google Drive.
- Visit Google Drive and find the newly created “Colab Notebooks” folder stored under My Drive: `/My Drive/Colab Notebooks`.
- Create a folder named `text-analysis` inside the Colab Notebooks folder on Google Drive. The path should look like this: `/My Drive/Colab Notebooks/text-analysis/`.
- Upload the “data” and “code” folders that were downloaded from git (inside the “python-text-analysis-gh-pages” folder) to the `text-analysis` folder you created in Google Drive: `/My Drive/Colab Notebooks/text-analysis/data` and `/My Drive/Colab Notebooks/text-analysis/code`. (You can verify the layout with the short snippet after this list.)
- At the start of each episode during the workshop, you can create a fresh Colab notebook within the `text-analysis` folder by navigating to `/My Drive/Colab Notebooks/text-analysis/` in Google Drive and clicking `New -> More -> Google Colaboratory`.
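Once the folders are uploaded, you can sanity-check the layout from any Colab notebook. The snippet below is a minimal sketch: `drive.mount` and the `/content/drive/MyDrive` mount point are standard Colab behavior, and the `workshop_dir` path simply mirrors the folder structure described above.

```python
# In a fresh Colab notebook: mount Google Drive so the runtime can see your files.
from google.colab import drive

drive.mount('/content/drive')  # follow the authorization prompt

import os

# Once mounted, "My Drive" is exposed at /content/drive/MyDrive.
workshop_dir = '/content/drive/MyDrive/Colab Notebooks/text-analysis'
print(sorted(os.listdir(workshop_dir)))  # expect ['code', 'data'] if the upload worked
```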