Resources
Tutorials and learning materials from the lab.
Biomedical Text Mining for Knowledge Extraction Tutorial @ ISMB 2025
A hands-on introduction to the core tasks in biomedical natural language processing (BioNLP) for extracting knowledge from text. Topics include named entity recognition, entity linking, relation extraction, and the strengths and weaknesses of large language models for information extraction. Visit the tutorial website for slides and Jupyter notebooks.
Mini Code Tutorials
Short code tutorials on small concepts, hosted as Jupyter notebooks on GitHub.
NLP
- HuggingFace tokenizer tricks — options for using and customizing HuggingFace tokenizers
- Creating a custom token classification model — creating a custom model in HuggingFace
- Further pretraining a language model and generating text with it — training a causal language model and generating text
- Creating a HuggingFace dataset object — creating a Dataset object that works with the HuggingFace Trainer
Knowledge Graphs
- Two-Hop Link Prediction for a Knowledge Graph — link prediction in a knowledge graph using a basic rule-based approach
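As a rough illustration of the rule-based idea, the sketch below proposes a new link from A to C whenever A links to B and B links to C but A does not yet link to C, and ranks proposals by the number of connecting two-hop paths. This is not the notebook's code: the toy graph, the function name `two_hop_candidates`, and scoring by path count are all assumptions made for illustration.

```python
from collections import defaultdict


def two_hop_candidates(edges):
    """Score unseen (head, tail) pairs by the number of two-hop paths joining them."""
    neighbors = defaultdict(set)
    for head, tail in edges:
        neighbors[head].add(tail)

    scores = defaultdict(int)
    for head, mids in list(neighbors.items()):
        for mid in mids:
            # .get avoids creating empty entries for nodes with no outgoing edges
            for tail in neighbors.get(mid, ()):
                # skip self-links and links that already exist
                if tail != head and tail not in mids:
                    scores[(head, tail)] += 1
    # highest path count first, ties broken alphabetically
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy graph with placeholder node names
toy_edges = [
    ("drugA", "gene1"),
    ("drugA", "gene2"),
    ("gene1", "diseaseX"),
    ("gene2", "diseaseX"),
    ("gene2", "diseaseY"),
]
predictions = two_hop_candidates(toy_edges)
for (head, tail), path_count in predictions:
    print(head, "->", tail, "supported by", path_count, "path(s)")
```

Here `drugA -> diseaseX` ranks above `drugA -> diseaseY` because two intermediate nodes support it rather than one.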
Biomedical
- Getting MeSH Tags with Entrez API — using the Entrez API to access MeSH tags on biomedical documents
- Loading BioC XML files plus dealing with an archive of them — using the bioc package to store and access biomedical documents and annotations
- Loading a PubMed XML file — loading a PubMed XML bulk-download file and extracting title, abstract, and metadata
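For orientation, here is a minimal sketch of pulling the title, abstract, and a little metadata out of PubMed-style XML with the standard library alone. It is not taken from the notebook: the inline sample record is a placeholder, and the helper name `extract_articles` is hypothetical. The element names (`PubmedArticle`, `MedlineCitation`, `ArticleTitle`, `AbstractText`) follow the PubMed XML layout.

```python
import xml.etree.ElementTree as ET

# Placeholder record, not a real citation
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000001</PMID>
      <Article>
        <Journal><Title>Placeholder Journal</Title></Journal>
        <ArticleTitle>A placeholder title</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">First part.</AbstractText>
          <AbstractText Label="RESULTS">Second <i>part</i>.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_articles(root):
    """Return one dict per PubmedArticle element under root."""
    records = []
    for article in root.iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        title_el = citation.find("Article/ArticleTitle")
        # itertext() keeps text nested inside inline markup such as <i>
        abstract_parts = [
            "".join(el.itertext()).strip()
            for el in citation.findall("Article/Abstract/AbstractText")
        ]
        records.append({
            "pmid": citation.findtext("PMID"),
            "journal": citation.findtext("Article/Journal/Title"),
            "title": "".join(title_el.itertext()).strip() if title_el is not None else "",
            "abstract": " ".join(abstract_parts),
        })
    return records


records = extract_articles(ET.fromstring(SAMPLE_XML))
print(records[0]["title"])
```

For a gzip-compressed bulk-download file, one option is to open it with `gzip.open(path)` and pass the file object to `ET.parse(...)`, then call `.getroot()` on the result. Very large files may be better handled with `ET.iterparse`.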
Entity Linking and Information Retrieval
- Candidate Generation for Entity Linking — methods for identifying candidate entities given a mention
- Training a Bi-encoder End-to-End for Entity Linking — making a transformer output similar vectors for texts referring to the same entity
- Vector similarity with a neural network — training a neural network for dot-product similarity using PyTorch
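To give a flavor of candidate generation, the sketch below ranks entity names against a mention by character trigram overlap (Jaccard similarity) and keeps the top few. This is only one simple lexical approach and is not necessarily what the notebooks use: the entity IDs, the names, and the functions `char_ngrams` and `generate_candidates` are hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of text, lowercased and space-padded."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def generate_candidates(mention, entity_names, top_k=2):
    """Rank entities by Jaccard overlap of character n-grams with the mention."""
    mention_grams = char_ngrams(mention)
    scored = []
    for entity_id, name in entity_names.items():
        name_grams = char_ngrams(name)
        union = mention_grams | name_grams
        score = len(mention_grams & name_grams) / len(union) if union else 0.0
        scored.append((entity_id, score))
    # best score first, ties broken by entity ID for determinism
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Hypothetical toy vocabulary with made-up identifiers
toy_entities = {
    "E1": "breast cancer",
    "E2": "lung cancer",
    "E3": "diabetes mellitus",
}
candidates = generate_candidates("breast cancers", toy_entities)
for entity_id, score in candidates:
    print(entity_id, round(score, 3))
```

A lexical pass like this is cheap and tolerant of small spelling variations, but it misses synonyms that share few characters, which is one motivation for learned approaches such as a bi-encoder.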
Deep Learning
- Example of PyTorch Training Loop with Custom Hugging Face Model — a standard PyTorch training loop with custom batch setup