Resources

Tutorials and learning materials from the lab.

Biomedical Text Mining for Knowledge Extraction Tutorial @ ISMB 2025

A hands-on introduction to the core tasks in biomedical natural language processing (BioNLP) for extracting knowledge from text. Topics include named entity recognition, entity linking, relation extraction, and the strengths and weaknesses of large language models for information extraction. Visit the tutorial website for slides and Jupyter notebooks.

Mini Code Tutorials

Short code tutorials on small concepts, hosted as Jupyter notebooks on GitHub.

NLP

HuggingFace tokenizer tricks — options for using and customizing HuggingFace tokenizers
Creating a custom token classification model — creating a custom model in HuggingFace
Further pretraining a language model and generating text with it — training a causal language model and generating text
Creating a HuggingFace dataset object — creating a Dataset object that works with the HuggingFace Trainer
Sentence-level Relation Extraction with BERT — turning relation extraction into a text classification problem and training a classifier
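
As a rough illustration of the last entry above, one common way to turn relation extraction into text classification is to mark the two entity mentions in the sentence and classify the marked string. The sketch below covers only that input-preparation step. The marker strings (`[E1]`, `[E2]`) and the function name `mark_entities` are assumptions made for illustration and may differ from what the notebook does.

```python
def mark_entities(text, head, tail):
    """Wrap two character spans in marker tokens.

    `head` and `tail` are (start, end) character offsets. The marker
    strings are hypothetical; any pair of distinct markers would do.
    """
    # Order the spans by position, remembering which one is the head (E1).
    first, second = sorted([(head, "E1"), (tail, "E2")])
    (s1, e1), tag1 = first
    (s2, e2), tag2 = second
    if e1 > s2:
        raise ValueError("entity spans must not overlap")
    return (
        text[:s1] + f"[{tag1}] " + text[s1:e1] + f" [/{tag1}]"
        + text[e1:s2]
        + f"[{tag2}] " + text[s2:e2] + f" [/{tag2}]"
        + text[e2:]
    )


sentence = "Aspirin inhibits COX-1 in platelets."
marked = mark_entities(sentence, head=(0, 7), tail=(17, 22))
# marked == "[E1] Aspirin [/E1] inhibits [E2] COX-1 [/E2] in platelets."
```

Each marked sentence can then be paired with a relation label and passed to any sequence classifier, so one sentence with several candidate entity pairs yields several classification examples.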

Knowledge Graphs

Two-Hop Link Prediction for a Knowledge Graph — link prediction in a knowledge graph using a basic rule-based approach
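
A minimal sketch of one possible reading of "two-hop" rule-based link prediction: propose a link between two nodes that are not yet connected but share at least one neighbour, and rank proposals by the number of shared neighbours. The notebook may use typed relations or a different rule. The function name, the undirected treatment of edges, and the toy node names are all assumptions.

```python
from collections import defaultdict


def two_hop_candidates(edges):
    """Score unlinked node pairs by how many two-hop paths connect them."""
    neighbours = defaultdict(set)
    for a, b in edges:
        # Assumption: edges are treated as undirected and untyped.
        neighbours[a].add(b)
        neighbours[b].add(a)

    scores = defaultdict(int)
    for linked in neighbours.values():
        # Every pair of nodes sharing this intermediate node is two hops apart.
        for a in linked:
            for c in linked:
                if a < c and c not in neighbours[a]:
                    scores[(a, c)] += 1

    # Highest score first, ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy graph with made-up node names.
toy_edges = [
    ("drugA", "geneX"), ("drugB", "geneX"),
    ("drugA", "geneY"), ("drugB", "geneY"), ("drugC", "geneY"),
]
predictions = two_hop_candidates(toy_edges)
```

Here `drugA` and `drugB` share two neighbours and so rank above `drugA` and `drugC`, which share one. Pairs that are already linked are never proposed.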

Biomedical

Getting MeSH Tags with Entrez API — using the Entrez API to access MeSH tags on biomedical documents
Loading BioC XML files plus dealing with an archive of them — using the bioc package to store and access biomedical documents and annotations
Loading a PubMed XML file — loading a PubMed XML bulk-download file and extracting title, abstract, and metadata
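
For orientation, a small sketch of extracting title, abstract, and metadata from PubMed-style XML with the Python standard library. The element names (`PubmedArticle`, `PMID`, `ArticleTitle`, `AbstractText`, `Journal/Title`) follow the PubMed XML format, but the sample record is a placeholder, the helper names are hypothetical, and the notebook may take a different approach.

```python
import xml.etree.ElementTree as ET

# Placeholder record, not a real article.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">0</PMID>
      <Article>
        <Journal><Title>Example Journal</Title></Journal>
        <ArticleTitle>A placeholder <i>title</i>.</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">First part.</AbstractText>
          <AbstractText Label="RESULTS">Second part.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def text_of(elem):
    """Return all text inside an element, including inline markup such as <i>."""
    return "".join(elem.itertext()).strip() if elem is not None else ""


def parse_pubmed(xml_text):
    """Yield one dict per PubmedArticle with a few commonly used fields."""
    root = ET.fromstring(xml_text)
    for article in root.iter("PubmedArticle"):
        yield {
            "pmid": text_of(article.find("MedlineCitation/PMID")),
            "title": text_of(article.find(".//ArticleTitle")),
            # Structured abstracts have several AbstractText sections.
            "abstract": " ".join(text_of(p) for p in article.iter("AbstractText")),
            "journal": text_of(article.find(".//Journal/Title")),
        }


records = list(parse_pubmed(SAMPLE))
```

Bulk-download files are typically large and gzip-compressed, so for real data it would likely be preferable to open the file with `gzip.open` and stream it with `ET.iterparse` rather than load it into memory in one go.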

Entity Linking and Information Retrieval

Candidate Generation for Entity Linking — methods for identifying candidate entities given a mention
Multicrossencoders for Entity Linking — a full entity linking pipeline with a fast reranker
Training a Bi-encoder End-to-End for Entity Linking — making a transformer output similar vectors for texts referring to the same entity
Vector similarity with a neural network — training a neural network for dot-product similarity using PyTorch

Deep Learning

Example of PyTorch Training Loop with Custom Hugging Face Model — a standard PyTorch training loop with custom batch setup