Be at the forefront of AI evaluation by joining the Copilot Offline Evaluation Platform team and helping us deliver the platform that makes Copilot innovation fast, reliable, and regression-free.
As a Principal Applied Scientist, you will help transform how Copilot features are evaluated and improved. Your team will deliver end-to-end experimentation, evaluation, and insights to Copilot engineers, PMs, and fellow scientists. You'll lead the development of a robust data generation platform that simulates realistic user behaviors, curate representative datasets, and develop comprehensive query sets and evaluation tooling. You'll work on scalable pipelines that support offline evaluations for Copilot Search, BizChat, Connectors, and Agents (DAs). This opportunity will allow you to dive deep into Copilot technologies, shape how AI quality is measured at scale, build critical skills in the AI era, and rapidly grow your career.
Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Responsibilities
Drive the technical strategy for offline evaluation, ensuring alignment with product goals and scientific rigor in methodology, metrics, and automation.
Lead the design and implementation of robust evaluation pipelines that simulate real-world usage and reflect diverse user intents and preferences.
Develop and own critical quality metrics that capture user satisfaction, model fidelity, and regression risks, serving as trusted signals for product decisions.
Guide the creation of high-fidelity synthetic data using LLMs to simulate complex user interactions, enabling scenario coverage at scale.
Perform deep analysis of evaluation results to surface actionable insights, diagnose weaknesses, and influence prioritization across Copilot canvases.
Mentor scientists and partner with engineering and PM leads to integrate insights into experimentation, product loops, and model iteration cycles.
Establish best practices and influence standards through documentation, internal forums, and contributions to the broader applied science community (e.g., talks, publications, cross-org collaboration).
Qualifications
Required Qualifications:
Bachelor's Degree in Statistics, Econometrics, Computer Science, Electrical or Computer Engineering, or related field AND 6+ years related experience (e.g., statistics, predictive analytics, research)
OR Master's Degree in Statistics, Econometrics, Computer Science, Electrical or Computer Engineering, or related field AND 4+ years related experience (e.g., statistics, predictive analytics, research)
OR Doctorate in Statistics, Econometrics, Computer Science, Electrical or Computer Engineering, or related field AND 3+ years related experience (e.g., statistics, predictive analytics, research)
OR equivalent experience.
Minimum of 6 years of industry experience in applied machine learning, data science, or AI systems.
Proven analytical mindset with a data-driven approach to problem-solving, consistently upholding high standards of quality and engineering rigor.
Collaborative and team-oriented, skilled at working across disciplines, levels, and product areas to drive alignment and shared success.
Other Requirements:
Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include but are not limited to the following specialized security screenings:
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
Preferred Qualifications:
Master's Degree in Statistics, Econometrics, Computer Science, Electrical or Computer Engineering, or related field AND 9+ years related experience (e.g., statistics, predictive analytics, research)
OR Doctorate in Statistics, Econometrics, Computer Science, Electrical or Computer Engineering, or related field AND 6+ years related experience (e.g., statistics, predictive analytics, research)
OR equivalent experience.
3+ years of experience developing and deploying live production systems as part of a product team.
Experience with synthetic data generation, data ingestion, and management, especially for evaluation or training purposes.
Experience designing or implementing evaluation metrics and methodologies for LLMs or generative AI systems.
Experience developing agentic solutions using LLMs or multi-agent frameworks.
Applied Sciences IC5 - The typical base pay range for this role across the U.S. is USD $139,900 - $274,800 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $188,000 - $304,200 per year.
Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay
Microsoft will accept applications for the role until September 1, 2025.
Microsoft is an equal opportunity employer. Consistent with applicable law, all qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations (https://careers.microsoft.com/v2/global/en/accessibility.html).