Portfolio

Linguist (PhD, Boston University) with expertise in sociolinguistic research, dataset curation, and applied data science.

My portfolio highlights experience creating and curating linguistic datasets, designing annotation schemes, and supporting machine learning workflows. I specialize in bridging research and application, translating complex language data into insights and tools that advance both academic projects and industry technologies.

Language/Linguistic Data Creation and Analysis

Spanish in Boston Project (Boston University, PhD Research)

Managed collection, curation, and quality assurance (QA) of sociolinguistic datasets.
Designed annotation guidelines for novel variables to standardize workflows and improve data quality.
Built datasets for academic research (e.g., variation in Spanish liquids).
Supervised and trained student assistants in annotation and QA workflows as Lab Manager for the Spanish in Boston Project.
Led the full lifecycle of my dissertation project, from data design and collection to statistical modeling and visualization, demonstrating end-to-end research and data management skills.

Mirror Principle Violations Project

Surveyed descriptive materials across various languages.

Cogito Corporation (2022–2023, Data Annotator — Machine Learning Annotation)

Processed speech and language data for machine learning model development.
Created unique annotated datasets for internal and external clients.
Conducted prompt engineering for AI models to improve task accuracy.
Annotated audio for emotional engagement, rate of speech, energy level, and customer/agent experience.
Tested pre-trained ML language models and provided calibration suggestions.
Handled dynamic annotation requests across teams and contributed to workflow improvements.

Research Methods

Qualitative

Conducted linguistic fieldwork with Puerto Rican Spanish speakers in Puerto Rico and Louisiana.
Designed interview protocols for both exploratory research and hypothesis testing.

Quantitative

Designed and ran an online study on Spanish word order (Qualtrics • Prolific); results published in conference proceedings (2023, DOI).
Conducted coding, extraction, and statistical analysis of various sociolinguistic datasets.
Applied probabilistic methods to investigate specific linguistic variables (e.g., liquid use in Spanish), forming the quantitative foundation of my dissertation project, which involved managing and analyzing a 24K-token dataset.

Applied UX & Industry Insights

Provided sociolinguistic insights to conversational AI teams, informing user experience (UX) and model design.
Delivered research-based recommendations (e.g., conversational pause-fillers, speech patterns) that influenced annotation strategies and model training.

Technical Experience

Statistical & Data Tools: R • regex • Python (developing proficiency) • Excel/Sheets

Linguistic & Writing Tools: Praat (app & scripting) • LaTeX • ELAN

Workflow & Version Control: bash/terminal • Git/GitHub

Scripting: Wrote scripts for dataset processing, QA, and workflow optimization.

Regex testing: Built and refined patterns for clarity/accuracy in text processing.

ML workflows: Annotated, tested, and QA’d speech & language datasets for model development.

Lee-Ann Vidal Covas, PhD

[li.'aŋ bi.'ðal 'ko.βas]
