2026 Cartel Presence Automation Project
About: This project develops an automated system to generate a fine-grained dataset on cartel presence, power, and modus operandi across Mexico. Building on prior research that manually compiled data on organized crime activity, the current phase seeks to leverage Large Language Models (LLMs) and machine learning (ML) techniques to extract, classify, and validate information from diverse textual and media sources. The goal is to create a state-of-the-art, reproducible dataset that enhances both the temporal and spatial precision of our existing cartel presence data.
Research Tasks: The undergraduate Research Assistant (RA) will support the automation and refinement of our cartel presence dataset. Under the supervision of the research team, the RA will implement and test computational methods for information extraction, entity recognition, and classification using both supervised and unsupervised approaches. The RA will also help evaluate and document the outputs of language models, ensuring accuracy, transparency, and replicability of results.
Key Responsibilities:
- Data Collection and Preprocessing
  - Gather and clean textual data from public sources (e.g., news archives, government reports, social media, or NGO bulletins).
  - Format and preprocess text data for machine learning pipelines (tokenization, normalization, metadata tagging).
  - Assist in developing scripts for automated data retrieval or web scraping (where ethically and legally permissible).
- LLM and ML Model Development
  - Implement and fine-tune LLM-based methods for entity recognition (e.g., identifying cartels, locations, events).
  - Develop or assist in training ML models for text classification and clustering related to cartel activity.
  - Compare performance across models (e.g., GPT, Claude, fine-tuned BERT variants) and document evaluation metrics.
- Validation and Quality Assurance
  - Conduct manual validation of model outputs to ensure reliability and reduce false positives/negatives.
  - Contribute to the creation of a benchmark or “gold standard” subset of validated data.
  - Use visualization and descriptive statistics to summarize results and identify model biases or gaps.
- Documentation and Collaboration
  - Maintain clear documentation of workflows, data preprocessing steps, and model configurations.
  - Contribute to GitHub repositories, including version control, code commenting, and README updates.
  - Participate in weekly research meetings to discuss progress, challenges, and methodological improvements.
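To give a sense of the validation work described above, the following is a minimal sketch of how model outputs might be scored against a gold-standard subset. It is an illustration only, not the project's actual pipeline: the function name `score_against_gold`, the record format (group, municipality pairs), and the placeholder records are all assumptions made for this example.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical record format: (group name, municipality name).
Record = Tuple[str, str]


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float


def score_against_gold(predicted: Set[Record], gold: Set[Record]) -> Scores:
    """Compare model-extracted records with manually validated records."""
    true_pos = len(predicted & gold)   # extracted and confirmed by validators
    false_pos = len(predicted - gold)  # extracted but not in the gold standard
    false_neg = len(gold - predicted)  # in the gold standard but missed

    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Scores(precision, recall, f1)


# Placeholder data for illustration; not real observations.
gold = {
    ("Group A", "Municipality 1"),
    ("Group A", "Municipality 2"),
    ("Group B", "Municipality 3"),
}
predicted = {
    ("Group A", "Municipality 1"),
    ("Group B", "Municipality 3"),
    ("Group B", "Municipality 4"),
}

scores = score_against_gold(predicted, gold)
print(f"precision={scores.precision:.2f} recall={scores.recall:.2f} f1={scores.f1:.2f}")
```

Running the same comparison for each candidate model would produce the kind of side-by-side evaluation metrics the RA is expected to document.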
Qualifications:
- Strong programming skills in Python (preferred libraries: pandas, scikit-learn, transformers, spaCy, or the OpenAI API).
- Familiarity with machine learning, natural language processing, or data science workflows.
- Coursework or demonstrated interest in computational social science, political science, data analysis, or Latin American studies.
- Attention to detail, strong organizational skills, and commitment to research ethics.
The base stipend is $8,500, with up to $1,500 in additional stipend based on financial need.
- This is a full-time position; students are expected to participate 35+ hours/week for 10 consecutive weeks. Participants are not permitted to engage in another full-time internship, job, or volunteer opportunity (whether funded by Stanford or otherwise). They also may not hold a part-time internship, job, or volunteer opportunity unless their faculty mentors or program mentors have approved these arrangements before the start of the summer. Students also cannot receive an additional VPUE part-time grant within the same quarter.
- Students must be current undergraduates in good standing at Stanford during the summer quarter; those graduating in June are not eligible.
- Students may not receive both academic units and a stipend for any single project activity.
- Students pursuing a coterminal MA degree are eligible ONLY IF 1) their undergraduate degree has not been conferred AND 2) they are in the undergraduate (not graduate) tuition group.
- Students may not be serving a suspension or be on a Leave of Absence (LOA) while using grant funding.
