How District Technology Directors Should Evaluate AI Vendors
District technology directors face a procurement environment unlike anything in the history of educational technology. Between January 2023 and December 2024, the number of AI-powered education products tripled, according to EdSurge's product index. LearnPlatform's 2024 EdTech Top 40 report found the average U.S. school district now uses 1,403 unique digital tools — a 24% increase from the previous year — and AI tools represent the fastest-growing category.
The fundamental challenge isn't finding AI products. It's evaluating them. Traditional edtech procurement focused on features, pricing, and compliance certifications. AI tools require those same checks plus a fundamentally new set of questions: Where does our data go when the AI processes it? Does student input train the model? What happens when the AI is wrong? Who is liable for AI-generated errors in instructional content?
A CoSN 2024 survey found that only 23% of districts had a formal evaluation process specifically designed for AI tools — the remaining 77% were either applying traditional procurement frameworks (which miss critical AI-specific risks) or making ad hoc decisions without systematic evaluation.
This guide provides a structured evaluation framework built for the specific demands of AI in education.
Why Traditional Procurement Frameworks Fall Short
Standard edtech evaluation asks: Does the tool work? Is it accessible? Does it meet security requirements? Is the price reasonable? Those are necessary questions — but they're insufficient for AI.
| Traditional Evaluation | AI-Specific Evaluation (Also Required) |
|---|---|
| Does it work as described? | How does it work? (What AI model? What training data?) |
| Is student data secure? | Is student data used to train AI models? |
| Is it accessible (WCAG, Section 508)? | Does the AI output reflect bias that could harm students? |
| Is the cost reasonable? | What's the total cost including training, integration, and AI compute scaling? |
| Does it integrate with our SIS/LMS? | Does it share data with third-party AI providers (OpenAI, Google, Anthropic)? |
| Is the vendor financially stable? | Can the vendor survive the AI market consolidation happening right now? |
| Does the content meet standards? | Does AI-generated content contain errors, hallucinations, or inappropriate material? |
The right side of this table represents what most districts are not currently evaluating — and where the greatest risks lie.
Phase 1: Initial Screening (Week 1)
Before investing time in deep evaluation, screen vendors against non-negotiable requirements. This step typically eliminates 40-60% of candidates quickly.
Non-Negotiable Requirements Checklist
INITIAL SCREENING — PASS/FAIL
□ PRIVACY AND COMPLIANCE
□ Vendor willing to sign your state's DPA
(or SDPC National DPA)
□ FERPA-compliant (school official exception
documented)
□ COPPA-compliant (if student-facing, K-8)
□ SOC 2 Type II certification (or equivalent
security audit)
□ No student data used for AI model training
(must be explicit, not implied)
□ Clear data deletion policy upon contract
termination
□ TECHNICAL
□ Runs on your existing infrastructure
(bandwidth, devices, browsers)
□ SSO/SAML integration available
□ WCAG 2.1 AA accessibility compliance
□ SIS/LMS integration via standard protocols
(LTI 1.3, OneRoster, SIF)
□ VENDOR VIABILITY
□ Vendor has been operating for 2+ years
(or has strong backing/revenue)
□ Vendor has existing K-12 district customers
(references available)
□ Pricing is within budget range
Any FAIL = STOP. Do not proceed to Phase 2.
Questions for Vendor Sales Calls
During initial discovery calls, ask these five questions. How a vendor answers often tells you as much as the answer itself:
| Question | What a Good Answer Sounds Like | Red Flag |
|---|---|---|
| "What AI model powers your product, and where is it hosted?" | "We use [specific model] hosted on [AWS/Azure/GCP]. Student data is processed in the US. We do not send data to third-party AI APIs without DPA coverage." | "We use proprietary AI" (no specifics) or unable to name the underlying model |
| "Is any student data — including prompts, interactions, and generated content — used to train or fine-tune your AI models?" | "No. Student data is not used for model training. This is documented in our DPA, Section X." | "Data may be used to improve our product" or vague non-answer |
| "What happens to our data if we cancel the contract?" | "All district data is deleted within 30 days. We provide a data export in standard format and written certification of deletion." | "We retain aggregated/anonymized data" without clear definition of anonymization |
| "Can you provide references from 3 districts of similar size and demographics?" | Readily provides references from districts you can verify | "Our customers prefer not to be contacted" or only provides testimonials |
| "What is your AI's error rate on educational content, and how do you measure accuracy?" | Provides specific accuracy metrics from internal testing or third-party evaluation | "Our AI is very accurate" with no data, or "Teachers review everything so errors don't matter" |
Phase 2: Deep Evaluation (Weeks 2-3)
For vendors that pass initial screening, conduct a structured deep evaluation across six dimensions.
Evaluation Scoring Framework
Score each dimension on a 1-5 scale. Weight the dimensions according to your district's priorities.
| Dimension | Weight (Suggested) | Evaluation Method |
|---|---|---|
| Educational effectiveness | 25% | Teacher pilot feedback, content quality review, alignment to standards |
| Privacy and data governance | 25% | DPA review, privacy policy analysis, security documentation |
| Technical integration | 15% | IT staff testing, SSO integration test, bandwidth assessment |
| Usability and adoption | 15% | Teacher/admin usability testing, training requirements assessment |
| Total cost of ownership | 10% | 3-year cost projection including hidden costs |
| Vendor stability and support | 10% | Financial review, support responsiveness testing, roadmap review |
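Once each dimension is rated 1-5, the composite is simple arithmetic. A minimal sketch of the weighted scoring, using the suggested weights from the table above (the vendor's 1-5 scores below are purely illustrative):

```python
# Weighted vendor score: each dimension is rated 1-5, then combined
# using the district's weights. Weights are the suggested values from
# the table; adjust them to your district's priorities.

WEIGHTS = {
    "educational_effectiveness": 0.25,
    "privacy_and_data_governance": 0.25,
    "technical_integration": 0.15,
    "usability_and_adoption": 0.15,
    "total_cost_of_ownership": 0.10,
    "vendor_stability_and_support": 0.10,
}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine 1-5 dimension scores into a single weighted score."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

# Illustrative vendor: strong on pedagogy and privacy, weak on cost.
vendor_a = {
    "educational_effectiveness": 4,
    "privacy_and_data_governance": 5,
    "technical_integration": 3,
    "usability_and_adoption": 4,
    "total_cost_of_ownership": 2,
    "vendor_stability_and_support": 3,
}
print(round(weighted_score(vendor_a), 2))  # → 3.8
```

Keeping the weights in one place makes it easy to rerun the same scoring across every vendor under evaluation and compare composites side by side.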
1. Educational Effectiveness (25%)
This is the dimension that most evaluation frameworks under-examine. An AI tool can be technically excellent, fully compliant, and beautifully designed — and still produce mediocre educational content.
Content quality assessment protocol:
Select 10 representative generation tasks:
- 3 tasks in your most-taught subjects
- 2 tasks requiring differentiation (IEP
accommodations, ELL modifications)
- 2 tasks at different grade bands (K-2, 3-5,
6-8, 9-12)
- 1 task with specific state standards alignment
- 1 task that requires cultural sensitivity
- 1 intentionally vague task (tests how tool
handles ambiguity)
For each generated output, evaluate:
□ Factual accuracy (verified against standards
and subject knowledge)
□ Grade-level appropriateness (reading level,
cognitive demand)
□ Alignment to specified standards
□ Differentiation quality (not just simplified
language but genuinely different approaches)
□ Bias check (representation, cultural
sensitivity, assumptions)
□ Usability (can a teacher use this output
directly, or does it require significant
editing?)
Scoring: A tool that produces output teachers can use with minimal editing (10-15 minutes) on 7+ of 10 tasks scores 4-5. A tool requiring significant rework on most outputs scores 1-2. Platforms like EduGenius demonstrate the standard by providing Bloom's Taxonomy-aligned content with automatic differentiation and answer keys — benchmark vendor claims against these capabilities.
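One way to make this scoring reproducible across reviewers is to record, per task, whether the output was usable with minimal editing, then map the count onto the 1-5 scale. A sketch under stated assumptions: the article only pins down the top (7+ of 10 usable scores 4-5) and bottom (significant rework on most scores 1-2) of the scale, so the mid-band cutoffs below are assumptions to adjust to taste:

```python
def content_quality_score(usable_with_minimal_edit: list[bool]) -> int:
    """Map per-task results (True = usable with <=15 min of editing)
    onto the 1-5 educational-effectiveness scale."""
    if len(usable_with_minimal_edit) != 10:
        raise ValueError("protocol calls for exactly 10 representative tasks")
    usable = sum(usable_with_minimal_edit)
    if usable >= 9:
        return 5   # nearly every output usable as-is
    if usable >= 7:
        return 4   # 7+ of 10 usable: the 4-5 band per the rubric
    if usable >= 5:
        return 3   # assumed mid-band (not specified in the rubric)
    if usable >= 3:
        return 2   # assumed cutoff
    return 1       # significant rework on most outputs

# Example: 8 of 10 tasks produced directly usable output.
results = [True] * 8 + [False] * 2
print(content_quality_score(results))  # → 4
```

Having reviewers log a simple usable/not-usable verdict per task also gives you concrete examples to share with the vendor when negotiating improvements.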
2. Privacy and Data Governance (25%)
Beyond the pass/fail screening, deep evaluation examines the specifics of data handling.
| Assessment Area | What to Examine | How to Verify |
|---|---|---|
| Data flow mapping | Exactly what data enters the AI, where it's processed, what's stored, what's returned | Request data flow diagram from vendor; verify against DPA |
| Sub-processor disclosure | Does the vendor use third-party AI providers (OpenAI, Anthropic, Google)? If so, under what terms? | Request complete sub-processor list; verify DPAs extend to sub-processors |
| Model training isolation | Is district data used in any form of model training, fine-tuning, or reinforcement learning? | Verify in DPA; request written attestation |
| Data residency | Where is data stored and processed? (Matters for GDPR-affected districts and state laws) | Request data center locations; verify against compliance requirements |
| Incident history | Has the vendor had previous data breaches or privacy incidents? | Check vendor's breach disclosure history; search for news reports |
See Legal Considerations for AI in Education — FERPA, COPPA, and GDPR for the full legal framework governing these requirements.
3. Technical Integration (15%)
| Test | Method | Pass Criteria |
|---|---|---|
| SSO integration | Configure SAML/OAuth with your IdP (Azure AD, Google Workspace, Clever) | Auto-provisioning works; teacher/student roles sync correctly |
| SIS/LMS integration | Test OneRoster or LTI 1.3 connection with your systems | Roster sync accurate; grade passback functional (if applicable) |
| Bandwidth impact | Monitor network during concurrent use (simulate 50+ users) | No significant impact on other services; acceptable latency (<3 seconds per AI response) |
| Device compatibility | Test on representative devices (Chromebooks, iPads, Windows laptops) | Full functionality on all district-supported devices |
| Offline/low-bandwidth | Test on degraded network (simulate rural/low-bandwidth conditions) | Graceful degradation; no data loss; clear error messages |
See Comparing AI Deployment Models — Cloud, On-Premise, and Hybrid for how deployment architecture affects these technical requirements.
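The bandwidth-impact row can be approximated with a simple concurrent load probe before involving real classrooms. A hedged sketch: the `send_request` stub below merely sleeps to simulate a vendor response and stands in for whatever call actually hits the vendor's endpoint; swap in a real request during testing:

```python
import time
import statistics
from concurrent.futures import ThreadPoolExecutor

LATENCY_BUDGET_S = 3.0  # "acceptable latency" threshold from the pass criteria

def send_request(_: int) -> float:
    """Stub for one AI request. Replace the sleep with a real call to
    the vendor's endpoint; returns observed latency in seconds."""
    start = time.perf_counter()
    time.sleep(0.05)  # simulated vendor response time (assumption)
    return time.perf_counter() - start

def concurrent_latency_probe(users: int = 50) -> dict[str, float]:
    """Fire `users` simultaneous requests and summarize latencies."""
    with ThreadPoolExecutor(max_workers=users) as pool:
        latencies = list(pool.map(send_request, range(users)))
    return {
        "p50": statistics.median(latencies),
        "max": max(latencies),
        "over_budget": sum(1 for l in latencies if l > LATENCY_BUDGET_S),
    }

summary = concurrent_latency_probe(50)
print(summary)
assert summary["over_budget"] == 0, "some responses exceeded the 3 s budget"
```

Run the probe during the school day while monitoring your network, so the measurement reflects real contention rather than idle conditions.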
4. Usability and Adoption (15%)
Test with 5-8 representative teachers covering different comfort levels with technology:
USABILITY TEST PROTOCOL
Task 1: First use (no training)
- Give teacher the tool with no instruction
- Ask them to complete a typical task
- Measure: time to first successful output,
number of errors, frustration indicators
Task 2: After 15-minute orientation
- Provide brief orientation video or guide
- Ask teacher to complete 3 varied tasks
- Measure: success rate, output quality,
confidence level
Task 3: After 1 week of daily use
- Teacher uses tool independently for 1 week
- Interview: What works? What doesn't? Would
you keep using this?
- Measure: continued use rate, reported time
savings, output quality over time
SCORING:
5 — Teachers productive within first use
4 — Teachers productive after brief orientation
3 — Teachers productive after structured training
2 — Teachers require ongoing support
1 — Teachers resist using despite training
5. Total Cost of Ownership (10%)
Vendor pricing tells you the license cost. Total cost of ownership tells you what you'll actually spend.
| Cost Category | Year 1 | Year 2 | Year 3 |
|---|---|---|---|
| License/subscription fees | Quote | Quote + escalation | Quote + escalation |
| Implementation/setup | Vendor fees + IT staff time | — | — |
| Training | Initial PD days + substitute costs | Refresher + new hire training | Refresher + new hire training |
| Integration | SSO/LMS setup (IT staff time or consultant) | Maintenance | Maintenance |
| Support overhead | Help desk tickets, IT troubleshooting | Ongoing | Ongoing |
| Usage scaling | Base usage | Growth (10-20% more users?) | Full adoption |
| TOTAL | Sum | Sum | Sum |
Hidden costs to watch for:
- Per-seat pricing that escalates with adoption ("starts at $3/teacher" but grows)
- AI compute charges based on usage volume
- Premium features locked behind higher tiers
- Integration costs not included in base price
- Training costs for new staff each year
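The TCO table above reduces to a short 3-year projection. A sketch with placeholder figures (every dollar amount, escalation rate, and growth rate below is an illustrative assumption, not a benchmark; substitute your actual quotes and staff-time estimates):

```python
# 3-year total cost of ownership projection. All figures are
# illustrative placeholders -- replace with your district's numbers.

LICENSE_YEAR1 = 30_000       # vendor quote (assumption)
ESCALATION = 0.05            # 5% annual increase (cap this in the contract)
IMPLEMENTATION = 8_000       # setup fees + IT staff time (Year 1 only)
TRAINING = {1: 12_000, 2: 4_000, 3: 4_000}    # initial PD, then refreshers
INTEGRATION = {1: 5_000, 2: 1_500, 3: 1_500}  # SSO/LMS setup, then maintenance
SUPPORT_OVERHEAD = 3_000     # help desk / troubleshooting, per year
USAGE_GROWTH = 0.15          # 15% more seats per year as adoption grows

def year_cost(year: int) -> float:
    """Full cost for one year: escalated, growth-adjusted license
    plus one-time and recurring overhead."""
    license_fee = LICENSE_YEAR1 * (1 + ESCALATION) ** (year - 1)
    license_fee *= (1 + USAGE_GROWTH) ** (year - 1)  # seat growth
    setup = IMPLEMENTATION if year == 1 else 0
    return license_fee + setup + TRAINING[year] + INTEGRATION[year] + SUPPORT_OVERHEAD

three_year_tco = sum(year_cost(y) for y in (1, 2, 3))
license_only = sum(LICENSE_YEAR1 * (1 + ESCALATION) ** (y - 1) for y in (1, 2, 3))
print(f"3-year TCO: ${three_year_tco:,.0f} vs. license-only: ${license_only:,.0f}")
```

With these placeholder inputs the projection lands at roughly 1.6× the license-only cost, consistent with the 1.5-2.5× range cited in the takeaways; your ratio will depend heavily on training and integration overhead.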
6. Vendor Stability and Support (10%)
| Factor | How to Assess | Why It Matters |
|---|---|---|
| Funding/revenue | Ask about funding stage, revenue model, path to profitability | 30-40% of AI startups will not exist in 3 years (CB Insights, 2024) |
| Customer retention | Ask for retention rate and average contract duration | High churn = red flag for product quality or support |
| Support responsiveness | Submit a test support ticket during evaluation; measure response time | You will need support; test it before you commit |
| Product roadmap | Request roadmap for next 12 months; assess realism | Overpromising is common; grounded roadmaps indicate maturity |
| Exit strategy | Verify data portability, export formats, deletion timeline | If they fail, can you get your data out? |
Phase 3: Pilot Program (Weeks 4-8)
Don't go straight from evaluation to full deployment. A structured pilot reduces risk and generates the evidence you need for a deployment decision.
Pilot Design Template
PILOT PROGRAM STRUCTURE
Duration: 4-6 weeks (minimum 4 instructional
weeks)
Participants:
- 8-12 teachers (mix of tech-comfortable and
tech-hesitant)
- 2-3 grade bands or subject areas
- At least 1 school site (2 if possible for
comparison)
Clear Success Criteria (define BEFORE pilot):
□ X% of pilot teachers report net time savings
□ AI-generated content meets quality threshold
on Y% of outputs
□ No privacy incidents or policy violations
□ Teacher satisfaction score ≥ X on 1-5 scale
□ Technical issues resolved within X hours
Data Collection:
- Weekly 5-minute teacher survey (quantitative)
- Bi-weekly focus group or interview (qualitative)
- Usage analytics from vendor dashboard
- IT support ticket log
- Content quality samples (3-5 per teacher)
Decision Framework:
- Meets ALL success criteria → Proceed to
deployment planning
- Meets MOST criteria with manageable gaps →
Negotiate improvements; conditional deployment
- Fails multiple criteria → Do not deploy; share
specific feedback with vendor
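The decision framework at the bottom of the template can be encoded so the go/no-go call is mechanical once criteria are defined. A sketch (the criterion names and the "manageable gap" threshold of at most one missed criterion are assumptions; use whatever criteria your district fixed before the pilot started):

```python
def pilot_decision(criteria_met: dict[str, bool],
                   max_manageable_gaps: int = 1) -> str:
    """Apply the go/no-go framework: all criteria met -> deploy;
    a small number of misses -> conditional; otherwise do not deploy."""
    misses = [name for name, met in criteria_met.items() if not met]
    if not misses:
        return "proceed to deployment planning"
    if len(misses) <= max_manageable_gaps:
        return ("conditional deployment; negotiate improvements on: "
                + ", ".join(misses))
    return "do not deploy; share feedback on: " + ", ".join(misses)

# Example: one manageable gap (support response time missed target).
results = {
    "time_savings_reported": True,
    "content_quality_threshold": True,
    "no_privacy_incidents": True,
    "teacher_satisfaction": True,
    "support_resolution_time": False,
}
print(pilot_decision(results))
```

Writing the rule down before the pilot (in code or on paper) keeps the decision anchored to the pre-agreed criteria rather than to post-hoc impressions.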
Pilot timing matters: Don't pilot during the first or last two weeks of school, during testing windows, or during a week with multiple half-days. You need normal instructional conditions to get meaningful results.
Contract Negotiation
For vendors that pass evaluation and pilot, negotiate contracts deliberately. This is where technology directors have more leverage than they often realize.
Negotiation Points
| Area | What to Negotiate | Typical Outcome |
|---|---|---|
| Multi-year pricing | Lock rates for 2-3 years; cap annual increases at 3-5% | Most vendors will agree to caps in exchange for commitment |
| Usage tiers | Negotiate based on realistic adoption (not 100% Day 1) | Start with conservative tier; negotiate upgrade path |
| Exit clause | 90-day termination notice; no penalty after Year 1 | Protects against vendor underperformance |
| Data portability | Explicit right to export all district data in standard format | Essential; some vendors make export difficult |
| SLA guarantees | 99.5%+ uptime; 4-hour response for critical issues | Get specific commitments, not vague promises |
| Training inclusion | Initial training + annual refresher included in contract | Many vendors will include training to close the deal |
| Most-favored pricing | If vendor offers a lower price to a comparable district, you get the same rate | Uncommon but worth requesting for large districts |
Key Takeaways
- Only 23% of districts have AI-specific evaluation processes (CoSN, 2024). Traditional procurement frameworks miss critical AI-specific risks including model training data use, AI-generated content accuracy, and sub-processor data sharing. See AI for School Leaders — A Strategic Guide to Transforming Education Administration for strategic context.
- The most important privacy question is not "Is student data secure?" but "Is student data used to train AI models?" Many AI vendors use customer data for model improvement unless explicitly prohibited in the DPA. See Legal Considerations for AI in Education — FERPA, COPPA, and GDPR for the full legal framework.
- Evaluate educational effectiveness with the same rigor as technical compliance. A tool that passes every security check but produces mediocre content wastes money. Test with 10 representative tasks spanning subjects, grade levels, and differentiation needs. See Building a Culture of Innovation — Leading AI Adoption in Schools for adoption strategy.
- Always pilot before full deployment. 4-6 weeks with 8-12 teachers using the tool under normal conditions generates better evidence than any demo or sales call. Define success criteria before the pilot starts — not after.
- Total cost of ownership is 1.5-2.5× the license cost. Training, integration, support overhead, and usage scaling are real costs that vendor quotes don't include. Budget for the full picture. See AI for Student Enrollment Forecasting and Resource Planning for resource planning.
- Negotiate from strength. Vendors want multi-year contracts; you want flexibility. Trade commitment for rate locks, training inclusion, and exit clauses. Your existing DPA compliance and pilot data are negotiation assets. See Best AI Content Generation Tools for Educators — Head-to-Head Comparison for tool comparison.
Frequently Asked Questions
Should we evaluate free AI tools with the same rigor as paid ones?
Yes — potentially with even more rigor. Free tools have a business model somewhere, and if you're not paying with money, you may be paying with data. Free tiers of commercial AI tools often have weaker privacy protections than paid tiers (less restrictive data use policies, model training on free-tier inputs, limited DPA availability). Apply the same privacy screening and DPA requirements regardless of cost. Some genuinely free educational tools (Khan Academy, certain Google Workspace features) have strong privacy practices — but verify, don't assume.
How do we evaluate vendors that use OpenAI or Anthropic as their underlying AI?
Many educational AI vendors are wrappers around major AI models (GPT-4, Claude, Gemini). This isn't inherently problematic, but it means your data flows through both the vendor AND the underlying AI provider. Ask: (1) Does the vendor have a DPA with their AI provider that extends your district's privacy protections? (2) Is your data excluded from model training at both the vendor level AND the underlying AI provider level? (3) Where is the data processed — the vendor's infrastructure or the AI provider's? The vendor's DPA with you is only as strong as their agreement with their sub-processor.
What's the minimum pilot duration?
Four instructional weeks is the minimum for meaningful results. Shorter pilots don't capture the learning curve — teachers need at least 2 weeks to get past initial friction, and you need at least 2 more weeks of data on actual productive use. If the vendor pushes for a 1-2 week pilot, they may be trying to capture the honeymoon period before teachers encounter the tool's limitations. Insist on 4 weeks minimum, and include at least one full instructional cycle (unit or grading period) if possible.
How do we handle teachers who want to buy AI tools with personal funds or classroom budgets?
This is a governance issue, not a technology issue. Any AI tool that accesses student data must go through the district's privacy review regardless of who pays for it. Establish a clear policy: all AI tools used in instructional settings require IT approval, even free tools, even personally purchased tools. This isn't bureaucracy — it's liability management. A teacher using an unapproved AI tool that experiences a data breach creates liability for the district, not just the teacher. Provide a streamlined approval path (48-72 hours for tools on an approved list) so teachers don't feel the process is a barrier.