In the fall of 2021, a headline crossed the education press that read, in effect, that the average SAT score had ticked upward during the worst disruption to American schooling in a century. Read at face value, it was a small miracle: schools shuttered, juniors and seniors learning algebra over patchy video calls, entire testing windows wiped off the calendar, and yet the reported mean held steady or nudged higher. The natural reading is that students got better, or at least held the line, under impossible conditions. That reading is almost certainly wrong, and understanding why it is wrong is the single most useful piece of statistical literacy a test-taker can carry into a conversation about SAT score trends from 2020 to 2026.

Here is the claim this article defends, stated plainly so you can hold the rest of the analysis against it. The headline averages from this period are a poor measure of whether American students learned more or less, because the group of people who sat the exam changed faster than the performance of any individual within it. When colleges dropped the requirement, the students most likely to skip the test were the ones expecting lower marks, and their absence lifted the remaining mean without a single candidate answering one more question correctly. Layer the pandemic’s cancelled administrations on top of that, then the move to a digital format that subtly changed the testing experience, and you get a reported number that moves for at least three reasons that have nothing to do with the thing everyone assumes it measures. The InsightCrunch population-versus-performance distinction is the lens this article hands you: before you read any year-over-year change in a national average as a verdict on ability, separate the change in who tested from the change in how well testers did. Almost nobody in the popular coverage does this, and it is why so much of what gets written about this period is confidently mistaken. By the end you will be able to read a trend line, your own score report included, the way a careful analyst reads it rather than the way a headline writer does.
What “SAT Score Trends” Actually Measures
When people search for SAT average score history, they usually picture a single line moving up or down across years, the way a stock chart moves. The reality is messier and far more interesting, because a national average is a quotient: a sum of every reported result divided by the number of people who produced one. Both the numerator and the denominator move every year, and they move for different reasons. The numerator shifts when individual performance shifts, when the difficulty of the form shifts, or when the mix of strong and weak candidates shifts. The denominator shifts when participation shifts. A trend line that bundles all of these into one number is not lying, but it is answering a question almost nobody is actually asking.
The period from 2020 through 2026 is the most analytically hazardous stretch in the modern history of the assessment precisely because all of these levers moved at once, and several of them moved hard. The coronavirus pandemic shut down testing sites for months. Hundreds of colleges suspended their requirement that applicants submit a result at all. The College Board retired the optional essay and the subject-area exams. And then, capping the period, the entire exam migrated from paper and pencil to an adaptive digital form delivered through a dedicated application. Each of these events changed either who tested or how they tested or both, and each left a fingerprint on the reported mean. To read the period honestly you have to attribute the movement to its real cause rather than to the convenient story.
Why is the test-taker population the first thing to check?
The population is the first thing to check because a change in who sits the exam can move the reported average more than any plausible change in actual skill. If the lowest-scoring fifth of a cohort simply does not show up, the mean of everyone who remains rises mechanically, even though no remaining candidate improved. During the 2020 to 2026 window the participating group churned dramatically, so population change is the prime suspect for almost every headline swing.
That single idea, that the denominator and the composition of the numerator can move independently of performance, is the analytical key to the whole period. Hold onto it, because every worked illustration below is a variation on it. Counselors building a reading list and parents trying to interpret a child’s report both stumble on the same rock: they treat the national figure as if it were a thermometer reading the ability of the country’s seventeen-year-olds, when it is closer to a survey whose respondent pool was reshuffled every single year.
It helps to fix the cast of characters. The College Board administers the exam and publishes the annual data, most visibly in its yearly report on the full group of graduating seniors who took the test at least once. That report is the canonical source for any figure in this article, and every number quoted here should be confirmed against the edition of that report covering the graduating class in question, because the published values are periodically revised and because the cohort definition matters. Colleges set the demand side: their requirement, or the suspension of it, determines how many students bother to register. Students respond to that demand with their feet, deciding whether the exam is worth the Saturday. State education departments add a wrinkle, because several states pay for every public-school junior to sit the exam during the school day, which floods the pool with universal participation that behaves very differently from the self-selected pool of college-bound volunteers. When a state adds or drops a school-day contract, its entire cohort’s average can lurch overnight for reasons that have nothing to do with teaching or learning.
How does universal school-day testing distort a state average?
Universal school-day testing distorts a state average by removing the volunteer filter. In a state where only college-bound students pay to test, the pool is pre-selected toward higher performers and the average sits high. When a state mandates testing for every junior, the pool now includes students with no college plans, and the state average falls sharply, not because instruction worsened but because the measured group expanded to the whole class.
This is why comparing two states’ averages tells you almost nothing about the quality of their schools, and why comparing one state across a policy change tells you almost nothing about a trend in learning. The participation rate is the hidden variable, and it has to sit beside every average before that average means anything. Across the 2020 to 2026 window, participation rates moved in both directions for different reasons in different places, which is one more strand to untangle before the national figure becomes interpretable. The same caution that applies to comparing states applies to comparing years, and it is the discipline this article is built to instill.
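The arithmetic behind that kind of drop is a weighted average, and a minimal Python sketch makes it concrete. The headcounts and group means below are invented for illustration, not drawn from any state's published data, and `pooled_mean` is a hypothetical helper name.

```python
def pooled_mean(groups):
    """Weighted mean of a list of (headcount, group_mean) pairs."""
    total_students = sum(count for count, _ in groups)
    total_points = sum(count * mean for count, mean in groups)
    return total_points / total_students


# Hypothetical state before a school-day mandate: only volunteers test.
volunteers = (20_000, 1150)
# Hypothetical juniors who test only because the mandate requires it.
newly_tested = (30_000, 950)

before = pooled_mean([volunteers])
after = pooled_mean([volunteers, newly_tested])

# Neither group's own mean changed; only the measured pool grew.
print(f"before mandate: {before:.0f}, after mandate: {after:.0f}")
# prints "before mandate: 1150, after mandate: 1030"
```

In this toy state the average falls 120 points although no student's result changed, which is why the participation rate has to be read alongside the average.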
The Mechanics of an Average, a Percentile, and an Equated Score
To read the trend you have to know what the reported number is built from, so this section pulls apart the three quantities people casually conflate: the raw performance, the scaled score, and the percentile. Each behaves differently across years, and confusing them produces most of the bad takes about the period.
A scaled total on this exam runs from 400 to 1600, the sum of two section results that each run from 200 to 800. The scaled figure is not a raw count of correct answers. It is the raw count passed through an equating process that adjusts for the difficulty of the particular form a student saw, so that a given scaled result is meant to represent the same level of demonstrated skill regardless of whether that day’s questions ran slightly harder or slightly easier than average. Equating is the reason a scaled result is comparable across test dates within a year and, in principle, across years. It is also frequently misunderstood: students imagine equating as a curve that rewards them when peers do poorly, but it is keyed to the statistical difficulty of the items, established in advance through pretesting, not to the live performance of the room. Equating is what lets the College Board claim, with justification, that a 1300 means roughly the same demonstrated proficiency in one administration as in another.
Does a percentile measure the same thing as a scaled score?
A percentile does not measure the same thing as a scaled score. A scaled result is anchored to demonstrated proficiency through equating, so it is meant to be stable in meaning over time. A percentile is purely relative: it reports the share of a reference group that scored at or below a given result. Change the reference group, and the percentile attached to a fixed scaled result changes even though the proficiency it represents has not.
Sit with that distinction, because it is the hinge of one of the most important worked illustrations later in this article. The scaled result is supposed to be a fixed yardstick. The percentile is a ranking against a crowd, and the crowd is exactly the thing that churned most violently across this period. When the lower portion of the crowd stays home, every fixed scaled result slips in user-percentile terms, because the same result now sits above a smaller share of a smaller, stronger field. When the lower portion returns, the same result climbs again. A student who scored a 1200 in one year and a friend who scored an identical 1200 three years later did equivalent work in the equated sense, yet they can occupy meaningfully different percentile ranks, and the later student may rank higher precisely because the comparison group had shifted back toward universal participation. This is not a flaw in the exam. It is what a percentile is. But it is invisible to anyone reading only the headline average, and it quietly distorts how families judge whether a result is competitive.
The College Board complicates the picture further by publishing more than one percentile. There is a nationally representative percentile, built from a research sample meant to model all students in the relevant grade whether or not they tested, and there is a user percentile, built from the actual population of recent test-takers. The two can differ by several points at the same scaled result, and they answer different questions. The representative percentile asks how a student stacks up against all seventeen-year-olds in the country; the user percentile asks how they stack up against the self-selected group who chose to take the exam. During a period when the self-selected group was shrinking and strengthening, the user percentile for any fixed result was under particular pressure, while the representative percentile, anchored to a stable model of the whole age group, moved far less. A family comparing a child’s result against the wrong percentile, or against a percentile from a different year, can badly misjudge where that result actually stands. When you read your own report, confirm which percentile you are looking at and which reference year it draws on before you let it shape a decision.
What is the difference between a mean and a median in score reporting?
The mean is the arithmetic average, the sum of all results divided by the count, and it is sensitive to extremes and to shifts in the tails of the distribution. The median is the middle value, the point at which half the field scored higher and half lower, and it does not depend on how far below the middle the lowest results sit. It still moves when a slice of low scorers disappears, because the middle of the remaining field is higher, but often by less than the mean. National reporting leans on the mean, which is one reason the headline figure moves so readily when the composition of the pool changes; a median would have told a somewhat steadier story across this turbulent stretch.
That asymmetry matters for the period under study. Because the reported figure is a mean, the disappearance of low scorers in the test-optional surge lifted it by an amount that depended on both how many stayed home and how low their results would have been; a median would have absorbed some, though not all, of that movement. Whenever you see a national average jump in a year when participation dropped, the mean’s sensitivity to tail composition should be the first explanation you reach for, ahead of any story about students suddenly learning more. The mechanics of the average, not the talent of the cohort, drive a surprising amount of the visible trend.
The InsightCrunch Score-Trend Timeline, 2020 to 2026
What follows is the findable artifact at the center of this piece: a year-by-year reading of the period that pairs the direction of participation with the direction of the reported average and, crucially, names the dominant cause for each year. Every figure in the surrounding discussion should be confirmed against the College Board’s annual report for the matching graduating class, because the published values are revised over time and because the exact mean depends on the cohort definition. The table flags this throughout. Treat the directions as the reliable signal and the specific numbers as pointers to verify, never as quotations to repeat.
| Graduating class | Test-taker volume (direction) | Reported average (direction) | Dominant cause that year (flag for verification) |
|---|---|---|---|
| 2019 (baseline) | Near the historic peak, around 2.2 million | Baseline, roughly in the high 1050s | Pre-pandemic steady state; broad participation including many school-day states |
| 2020 | Slight dip from the peak | Roughly flat to slightly lower | Class had mostly tested before March closures; spring administrations cancelled late |
| 2021 | Sharp drop, to roughly 1.5 million | Flat to slightly higher | Test-optional surge plus cancelled sites; low scorers opt out, lifting the mean |
| 2022 | Partial rebound in volume | Slightly lower than the prior year | Participation recovering, pulling the lower tail back into the pool |
| 2023 | Continued recovery, to roughly 1.9 million | Lower, a visible decline | Pool re-broadening toward universal participation; tail returns in force |
| 2024 | Near or above 1.9 million | Roughly flat to slightly lower | First US digital administrations in spring; fatigue reduced but pool still broad |
| 2025 | Stable to slightly higher | Roughly flat | Digital format settled; participation policies stabilizing |
| 2026 (provisional) | Stable | To be confirmed against the published report | Settled digital era; analyze composition before reading any change |
Read the table as a whole and a pattern jumps out that the year-by-year headlines obscured at the time. The average rose in the year participation collapsed and fell in the years participation recovered, which is precisely the inverse of what you would expect if the average tracked the ability of the country’s students. Ability does not swing that fast in a single graduating class, and certainly not in a way that happens to move opposite to a policy change. The cleaner explanation is that the composition of the pool was doing the work. This is the population-versus-performance distinction rendered in data, and it is why the InsightCrunch reading of the period treats every annual change as a composition question first and a performance question only after the composition has been accounted for.
Worked illustration one: how opt-outs raise an average with nobody improving
Take the cleanest version of the effect so the mechanism is undeniable. Imagine a small cohort of ten students whose results, in a year when everyone tested, were 900, 1000, 1050, 1100, 1150, 1200, 1250, 1300, 1400, and 1500. The sum is 11,850 and the mean is 1185. Now suppose the requirement is dropped, and the four lowest scorers, the 900, 1000, 1050, and 1100, decide the exam is not worth their Saturday because no college they are applying to demands it. Six students remain, with results of 1150, 1200, 1250, 1300, 1400, and 1500. Their sum is 7800 and their mean is 1300. The reported average has climbed from 1185 to 1300, a jump of 115 points, and not one student answered a single additional question correctly. Every individual performed exactly as before. The entire movement came from who chose to participate.
The principle that generalizes is the one to carry into every headline you read about this stretch: when participation is voluntary and the people most likely to opt out are the lower scorers, the reported mean rises as a pure artifact of self-selection. The size of the lift depends on how many opt out and how far below the mean they sat. In the real cohorts of 2021, the opt-out was not as extreme as four in ten, and the scores were not as tidy, but the direction was identical and the cause was the same. Whenever you see an average rise in a year participation fell, reach for this illustration before you reach for any story about resilience or improvement.
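The ten-student illustration can be checked in a few lines of Python, along with a closed-form version of the statement that the lift depends on how many opt out and how far below the mean they sat. This is a sketch of the arithmetic only; the variable names are illustrative, and the closed form is ordinary algebra on the mean rather than anything published by the College Board.

```python
from statistics import mean

# The ten-student cohort from the illustration, with everyone testing.
full_cohort = [900, 1000, 1050, 1100, 1150, 1200, 1250, 1300, 1400, 1500]
# The four lowest scorers skip the exam once the requirement is dropped.
opted_out = [900, 1000, 1050, 1100]
remaining = [score for score in full_cohort if score not in opted_out]

mean_before = mean(full_cohort)   # 1185
mean_after = mean(remaining)      # 1300
lift = mean_after - mean_before   # 115

# Closed form: lift = k * (mean_before - mean_of_opt_outs) / (N - k),
# where k students opt out of a cohort of N.
k, n = len(opted_out), len(full_cohort)
lift_formula = k * (mean_before - mean(opted_out)) / (n - k)

print(mean_before, mean_after, lift, lift_formula)
# prints "1185 1300 115 115.0"
```

The closed form shows the two dials directly: the lift grows as more students opt out, and it grows with the distance between their average and the full cohort's average.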
Worked illustration two: the same score, two different percentiles
Now take a student, call her the 2021 tester, who earned a 1350. Suppose that in the thin, self-selected pool of 2021, a 1350 placed her at roughly the 92nd user percentile, meaning about 92 percent of that year’s test-takers scored at or below her. Three years later her younger sibling earns an identical, equated 1350. But by then participation has recovered toward universal levels, the lower tail is back in the pool, and a 1350 now sits against a broader, lower-skewed field. The equated proficiency is the same, yet the user percentile attached to it will generally have shifted. With more low scorers back in the field, a fixed high result tends to sit at a somewhat higher rank than it did in the thin pool, because a larger share of the crowd now falls at or below it. The two siblings did equivalent work and hold the same scaled number, but the percentile, the relative ranking, is not guaranteed to match.
The principle here is that a percentile is a statement about a crowd, not about the student, and the crowd was a moving target across this period. When a family digs up an older percentile chart, or compares a current result against a relative’s result from a different year, they are comparing rankings drawn from different populations. The fix is mechanical: always read a percentile against the reference group and year printed on the same report, never against a chart from another era. The scaled result is your stable yardstick; the percentile needs its year attached to mean anything at all. This is the same-score-different-percentile effect, and it is one of the quietest sources of family anxiety in the whole period.
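A small Python sketch shows how a fixed result changes rank when only the reference pool changes. The pools are toy data, reusing the ten-student cohort from the first illustration, and the score examined is 1300 rather than 1350 simply because it appears in that cohort. `percentile_rank` is a hypothetical helper that applies the "at or below" definition used in this article, not the College Board's published percentile procedure.

```python
def percentile_rank(score, pool):
    """Percent of the pool that scored at or below `score`."""
    at_or_below = sum(1 for other in pool if other <= score)
    return 100 * at_or_below / len(pool)


# Toy broad pool: everyone tests.
broad_pool = [900, 1000, 1050, 1100, 1150, 1200, 1250, 1300, 1400, 1500]
# Toy thin pool: the four lowest scorers stayed home.
thin_pool = [score for score in broad_pool if score >= 1150]

rank_broad = percentile_rank(1300, broad_pool)  # 8 of 10 at or below
rank_thin = percentile_rank(1300, thin_pool)    # 4 of 6 at or below

print(f"broad pool: {rank_broad:.0f}, thin pool: {rank_thin:.0f}")
# prints "broad pool: 80, thin pool: 67"
```

The same 1300 ranks at the 80th percentile in the broad pool and near the 67th in the thin one, with no change in the work behind it, which is why a percentile needs its reference group and year attached.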
Worked illustration three: did the digital format move scores, and which way?
The migration to a digital, section-adaptive form is the third lever, and it is the one most prone to wishful or fearful thinking. The honest analysis starts by separating what the format change plausibly affects from what it does not. The digital form is shorter in clock time and shorter in item load than the paper version it replaced, and it delivers each section in two stages where the second stage adapts in difficulty to first-stage performance. The shorter sitting reduces fatigue, and reduced fatigue tends to help performance at the margin, especially on the back half of a long section where tired students historically bled points. So the directional expectation, before looking at any data, is a small upward nudge from reduced fatigue, not a large swing.
What the data should show, subject like every figure here to verification against the published reports, is exactly that kind of modest effect rather than a dramatic one. The equating process is designed to hold the meaning of a scaled result constant across the paper-to-digital transition, so a 1200 on the digital form is meant to represent the same proficiency as a 1200 on the paper form. If the published averages had leapt the year the digital form arrived in US administrations, the right suspicion would not be that students suddenly improved but that either the composition shifted or the equating was still settling. The cleaner reading is that the format change contributed a small, fatigue-driven improvement at the individual level that is easily swamped by the much larger composition effects from participation swings. The principle: a format change that preserves equated meaning should move a national average only modestly, and any large movement coinciding with it is far more likely to be a population story wearing a format costume.
Worked illustration four: reading a demographic gap without editorializing
The period’s data also speaks to gaps between demographic groups, and this is where careful, neutral reading matters most, because the topic is freighted and the statistics are easy to misuse in either direction. The factual frame is this: average results have long differed across groups defined by family income, parental education, and race or ethnicity, and these differences predate the pandemic by decades. The analytical question for this period is narrow and answerable: did those gaps widen, narrow, or hold across 2020 to 2026, and how much of any movement is composition rather than performance?
The same self-selection logic that governs the overall mean governs the gaps. If opt-outs during the test-optional surge were not evenly distributed across groups, and there is reason to think they were not, since access to testing sites and information about which colleges still required scores varied, then a measured narrowing or widening of a gap in 2021 could reflect differential participation rather than differential learning. A group whose lower scorers opted out at a higher rate would show an inflated average that year, narrowing the visible gap without any change in underlying instruction. As participation recovered, that artifact would unwind, and the gap could appear to widen again purely as the pool rebroadened. The neutral, sourced reading is therefore cautious: report the direction of the published gap figures, attribute as much of the movement as the participation data can explain to composition, and reserve any claim about changes in opportunity or instruction for evidence that controls for who tested. The principle that generalizes is the discipline of the whole article applied to a sensitive case: separate population change from performance change before drawing any conclusion, and do it most rigorously where the stakes of getting it wrong are highest.
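The mechanism can be sketched with two invented groups, labeled only A and B. Every number below is hypothetical and chosen to make the arithmetic visible; nothing here is an estimate of any real group's results or opt-out rate, and `drop_lowest` and `gap` are illustrative helper names.

```python
from statistics import mean


def drop_lowest(scores, k):
    """Return the scores that remain after the k lowest scorers opt out."""
    return sorted(scores)[k:]


def gap(group_a, group_b):
    """Difference between the two groups' mean results."""
    return mean(group_a) - mean(group_b)


# Hypothetical full-participation results for two groups.
group_a = [1000, 1100, 1200, 1300, 1400]
group_b = [800, 900, 1000, 1100, 1200]

gap_full = gap(group_a, group_b)  # 1200 - 1000

# Hypothetical test-optional year: opt-outs are uneven across groups.
tested_a = drop_lowest(group_a, 1)
tested_b = drop_lowest(group_b, 3)
gap_optional = gap(tested_a, tested_b)  # 1250 - 1150

print(gap_full, gap_optional)
# prints "200 100"
```

The measured gap halves in this toy example although no individual result changed, and it would reopen as soon as participation evened out again.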
Reading Your Own Result the Way an Analyst Would
The point of all this is not to win an argument at a dinner party about national trends. It is to read your own report correctly, because the same artifacts that distort the headline distort the way you judge whether your result is good enough for the colleges on your list. The InsightCrunch population-versus-performance distinction has a personal version, and applying it well is worth real points of clarity in your planning.
Start with the scaled result, because it is the stable quantity. Your equated total is meant to represent the same proficiency regardless of which form you sat or which Saturday you chose, and it is the number a college’s admissions reader actually evaluates against their own band. Do not let a percentile, yours or anyone else’s, tempt you into thinking the scaled number means more or less than it does. The detailed mechanics of how that scaled total is assembled from the two sections, and how the adaptive second module shapes your ceiling, are worth understanding in their own right. The full treatment lives in our complete guide to preparing for the exam, “how to read and act on a Digital SAT score report,” which sets the scaled result in the context of a whole preparation plan.
How should the test-optional era change how I read my percentile?
The test-optional era should make you treat any percentile as provisional until you confirm its reference year and reference group. A percentile from a thin, self-selected pandemic-era pool ranks you against a stronger, smaller field than a percentile from a recovered, near-universal pool. If you are comparing your rank to an older sibling’s or to a chart you found online, you are likely comparing against the wrong crowd. Read only the percentile printed on your own current report, and even then, weight the scaled result more heavily in your decisions.
The practical move is to anchor your planning to scaled bands rather than to ranks. When you research a college’s published middle fifty percent, the 25th to 75th percentile range of admitted students’ results, you are looking at scaled numbers, and your scaled number is what you compare against them. That comparison is stable in a way the national percentile is not, because the college’s band is itself a scaled range and shifts slowly. The submit-or-withhold decision that test-optional admissions now demands of every applicant turns on exactly that comparison, and getting the percentile noise out of your head is the precondition for making it well.
Now turn the trend analysis into a study plan, because reading the period correctly also tells you what to actually do. The composition story means you cannot use the national average as a target; it is not a goal, it is a moving artifact. Your target is the scaled band of your colleges, full stop. To move toward that band, you need to know where your own points are leaking, and that is a diagnosis the national data cannot give you. The method that works is the same one that works in any era: take official, full-length practice under realistic conditions, then analyze every miss by cause. Our walkthrough of how the math section’s question patterns have evolved across recent years, “the recurring math item types worth drilling first,” shows how to turn pattern recognition into a prioritized study list, and the companion analysis of the verbal section, “the reading and writing patterns that repeat across forms,” does the same for the other half of the exam. Pattern analysis is the individual-level version of trend analysis: instead of asking what the country’s results are doing, you ask what your own results are doing across forms, and you act on the answer.
What should I actually do with a practice result before test day?
With a practice result, sort every missed item into one of three buckets before doing anything else: a content gap you genuinely did not know, a careless error you would catch on a calm second look, and a timing casualty you rushed or never reached. The bucket counts tell you what to fix. A content-heavy profile means drilling the weak topic; a careless-heavy profile means slowing your verification; a timing-heavy profile means rebuilding your pacing. The national trend is irrelevant to this; your own error profile is everything.
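The three-bucket sort is easy to keep honest with a simple tally. The sketch below uses the bucket names from the paragraph above; the question labels and the sample misses are made up, and `error_profile` is a hypothetical helper, not part of any official score report.

```python
from collections import Counter

# What each bucket tells you to fix, following the three buckets above.
ADVICE = {
    "content": "drill the weak topic",
    "careless": "slow down and verify",
    "timing": "rebuild pacing",
}


def error_profile(misses):
    """Count misses per bucket; `misses` is a list of (question, bucket)."""
    counts = Counter(bucket for _, bucket in misses)
    unknown = set(counts) - set(ADVICE)
    if unknown:
        raise ValueError(f"unknown bucket(s): {sorted(unknown)}")
    return counts


# Hypothetical misses from one practice test.
misses = [
    ("M07", "content"), ("M12", "careless"), ("M19", "timing"),
    ("M21", "content"), ("RW05", "content"), ("RW18", "careless"),
]

profile = error_profile(misses)
top_bucket, top_count = profile.most_common(1)[0]
print(f"{top_bucket}: {top_count} misses -> {ADVICE[top_bucket]}")
# prints "content: 3 misses -> drill the weak topic"
```

Re-running the same tally after each practice test gives you a personal trend line whose movement you can actually explain.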
That is where realistic, feedback-rich rehearsal earns its place, and it is the obvious next action once you have a diagnosis. The fastest way to convert a study list into points is to drill the exact item types you are missing under timed conditions and check your reasoning against worked solutions immediately, while the question is still fresh. Practicing on a tool that delivers section-targeted question sets with full explanations across both Math and Reading and Writing, available through the ReportMedic SAT practice hub, turns your error analysis into rehearsal rather than leaving it as a list of good intentions. The discipline is to practice the diagnosed weakness, not the comfortable strength, and to re-run the error analysis every few sessions so you can watch your own trend line move for reasons you actually control.
The contrast with the national figure is the lesson worth keeping. The country’s average moved across 2020 to 2026 for reasons no student controlled: who showed up, which colleges required a result, when the digital form arrived. Your own scaled result moves for reasons you do control: which topics you master, how carefully you verify, how well you pace. Reading the period correctly frees you from chasing a number that was never about you, and points you at the small set of levers that actually are.
Edge Cases That Separate a Complete Account From a Tidy One
The clean version of the self-selection story is true, but the period has wrinkles that a careful reader should know, because admissions offices and statisticians do, and because a few of them cut against the simple narrative.
Superscoring is the first wrinkle. Many colleges combine a student’s best section results across multiple sittings into a single composite, which means the result a college records for an applicant can exceed any single-sitting total that student ever earned. When superscoring became more common and students took the exam more times to feed it, the composites colleges reported in their own admitted-student bands drifted upward independently of any change in single-sitting performance. This is yet another composition effect, but operating at the college level rather than the national level: the bands you research may sit slightly higher than the underlying single-sitting reality, because they reflect best-of-multiple assembly. When you compare your single-sitting result to a college’s published band, remember the band may be built from superscored composites, which argues for the same multiple-sitting strategy on your side.
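The best-of-multiple assembly described above is a one-line maximum per section. The sketch below uses three invented sittings; individual colleges set their own superscoring rules, so treat `superscore` as an illustration of the general idea rather than any institution's exact policy.

```python
def superscore(sittings):
    """Best section results across sittings; each sitting is (rw, math)."""
    best_rw = max(rw for rw, _ in sittings)
    best_math = max(m for _, m in sittings)
    return best_rw + best_math


# Hypothetical sittings as (Reading and Writing, Math) section results.
sittings = [(620, 680), (660, 650), (640, 700)]

best_single = max(rw + m for rw, m in sittings)  # best single-sitting total
composite = superscore(sittings)                 # 660 + 700

print(best_single, composite)
# prints "1340 1360"
```

Here the composite sits 20 points above the best single sitting, which is the kind of gap to keep in mind when reading a published band.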
Does test-optional mean the same thing as test-blind?
Test-optional and test-blind are not the same, and conflating them produces real errors. Test-optional means a college will consider a result if you submit one but does not require it; a strong result still helps. Test-blind means the college will not look at a result even if you send it; submitting is pointless. During this period most colleges that dropped the requirement went test-optional, not test-blind, which means the exam retained real value for applicants whose results strengthened their case. The distinction governs whether testing is worth your time at all for a given school.
That distinction matters for the self-selection story too. Under test-optional policies, the students who submitted were disproportionately those with strong results, while those who tested and scored lower simply withheld, which is a second layer of self-selection on top of the decision to test at all. So the results colleges actually saw in applications skewed even higher than the results students actually earned, because withholding filtered the visible pool a second time. The national test-taker average and the average of submitted results are different numbers, and the gap between them widened in the test-optional era. A complete account keeps these two populations separate: everyone who tested, and the subset who chose to submit.
International cohorts are a third wrinkle, and they intersect with the digital transition in a specific way. The digital form rolled out to international administrations before it reached US administrations, so for a stretch the international and domestic pools were sitting different forms. International test-takers are also a heavily self-selected group to begin with, since they are applying into a system from outside it and the exam is, for them, both optional and expensive, which means their averages have always run differently from the domestic pool and respond to different pressures. Any analysis that blends international and domestic results without flagging the difference is mixing two populations that moved on different schedules for different reasons, and the blended figure can mislead in both directions.
A fourth and easily missed wrinkle is the school-day participation churn already introduced. Several states added, expanded, or in some cases reconsidered universal school-day administration across this period, and each policy change moved that state’s participating pool and therefore its average. Because these changes were staggered across states and years, they inject state-level noise into the national figure that has nothing to do with national learning trends. When a national average wobbles in a year with no obvious pandemic or format cause, a shift in which states were running universal testing is a candidate explanation worth checking before reaching for any story about ability.
Why did some reported figures get revised after first publication?
Some reported figures were revised after first publication because the cohort and the data are finalized over time, late results are incorporated, and definitions are occasionally adjusted. An early figure for a graduating class can differ from the settled figure in the final report. This is ordinary statistical practice, not error, but it means any number you cite should come from the finalized report for that class, which is exactly why every figure in this article is flagged for verification rather than asserted as fixed.
Wider Significance: What the Period Tells Us About the Exam and Admissions
Zoom out from the individual report and the period carries a larger lesson about how the assessment fits the admissions system, and about a debate that the pandemic forced into the open and that is still unresolved as the digital era settles.
The test-optional movement did not begin with the pandemic; a slow accumulation of colleges had been dropping the requirement for years on the argument that results correlate with family income and therefore embed an advantage that has nothing to do with a student’s potential. What the pandemic did was convert a gradual trend into an overnight near-universal policy, because cancelled administrations made it physically impossible for many applicants to test even if they wanted to. Colleges that had debated the question for a decade resolved it in a single admissions cycle out of necessity. That forced experiment is the most valuable thing the period produced, because it generated several years of admissions outcomes under optional policies, and those outcomes are now the evidence base for the debate about whether to keep, drop, or reinstate the requirement.
Is the test-optional shift permanent or temporary?
The test-optional shift is neither cleanly permanent nor cleanly temporary; the honest answer is that it fractured. Some colleges committed to optional or test-blind policies for the long term, some quietly let temporary suspensions lapse back into requirements, and a visible group of selective institutions reinstated the requirement after concluding that results added predictive value their applications otherwise lacked. As the digital era settled, the landscape was mixed rather than uniform, which means an applicant must check each college’s current policy individually rather than assuming a single national rule.
That reinstatement wave is the part of the story the early test-optional coverage did not anticipate, and it is where the debate gets genuinely interesting, so it deserves a fair hearing on both sides before any verdict. The case for keeping the requirement, made by several selective colleges that reinstated it, rests on evidence that a result adds predictive information about college performance beyond what grades and essays carry, and that this is especially true for identifying high-potential students from under-resourced schools whose transcripts are hard to interpret without a common yardstick. On this view a standardized result, read in the context of a student’s circumstances, can surface talent that a grade-inflated or unfamiliar transcript hides, and dropping it can paradoxically disadvantage exactly the students test-optional policies meant to help. The argument is not that the exam is unbiased; it is that grades and essays are also shaped by resources, and that a result read in context adds signal rather than noise.
The case against the requirement, made by colleges that kept optional or test-blind policies, rests on the strong and well-documented correlation between results and family income, on the coaching and retesting advantages that money buys, and on the observation that dropping the requirement widened and diversified applicant pools without obviously degrading the quality of admitted classes. On this view the exam measures opportunity as much as aptitude, and removing it lets a college evaluate the whole student without anchoring on a number that tracks a family’s resources. Both sides marshal real data, and the disagreement is partly empirical and partly about values, about how much predictive signal justifies how much access cost.
The measured verdict the evidence supports is narrower than either camp’s rhetoric. A result clearly carries predictive signal, and read in the context of a student’s school and circumstances it can help rather than harm equity, which is the strongest argument for the reinstaters. At the same time the access barriers are real, and a requirement imposed without fee waivers, accessible testing sites, and contextual reading would reproduce exactly the inequities critics fear. The defensible position is that a result is valuable when read in context and ruinous when read as a raw threshold, and that the policy question is less whether to require a result than whether a college has the contextual-reading machinery to use one fairly. The exam is a tool whose effect depends entirely on how it is read, which is, fittingly, the same lesson the trend data teaches: a number is only as good as the reading applied to it.
For a student navigating this fractured landscape, the practical implication is unambiguous even though the policy debate is not. Because a strong result helps at most colleges and hurts at none that are merely test-optional, and because a growing set of selective colleges have reinstated the requirement outright, the sensible applicant prepares, tests, and then decides school by school whether to submit. The decision rule is the submit-or-withhold comparison against each college’s scaled band, applied individually. The trend debate plays out at the level of national policy, but the applicant’s job is local: know your scaled result, know each college’s current policy and band, and act accordingly.
How does this period connect to the PSAT and to long-term planning?
This period connects to long-term planning through the related assessment taken earlier in high school, which serves as a low-stakes preview and, for the highest scorers, a scholarship qualifier. The same composition cautions apply to its reported figures, since who sits it also shifted across the pandemic, but its role in a student’s timeline is unchanged: it is an early diagnostic that flags strengths and gaps in time to act on them. A student reading the national trend correctly understands that their preview result, like the national average, must be read against its own reference group and treated as a starting diagnosis rather than a verdict.
The longer arc here is that the exam survived a period that many predicted would end it, emerged in a digital form, and settled into an admissions system that uses it more selectively and, at the better-run colleges, more thoughtfully than before. The trend data is the record of that turbulence, and reading it correctly is a small piece of the larger literacy this series argues for: the exam is a learnable, analyzable system, and so is the data about it. A student who can separate population change from performance change in a national average is exercising exactly the analytical discipline that, turned on their own practice results, raises a score.
The Equating Question and Why Concordance Is Not a Curve
One technical strand deserves its own treatment because it generates persistent confusion that warps how people read the whole period: the relationship between the paper exam, the digital exam, and the idea of concordance. When the format changed, a natural worry arose that a digital result and a paper result were not comparable, and that the trend line therefore broke at the transition. The reassurance and the caution both matter here.
The reassurance is that the scaled result is engineered to mean the same thing across the transition. The College Board’s equating and the published concordance work exist precisely so that a 1300 on the digital form represents the same demonstrated proficiency as a 1300 on the paper form, which keeps the trend line interpretable across the break. A college reading a digital result against a band built partly from paper-era admitted students is, by design, comparing like with like. This is why the format change should not, on its own, produce a discontinuity in the national average; the equating absorbs the format difference.
Why is equating not the same as grading on a curve?
Equating is not grading on a curve because it is keyed to the established difficulty of the questions, not to how the live test-room performed. The difficulty of each item is determined in advance through pretesting, and equating adjusts for the fact that one form ran slightly harder or easier than another so that the same proficiency yields the same scaled result. A curve, by contrast, would rank you against the people who happened to sit beside you that day. Your result does not improve because your particular room struggled; it reflects your demonstrated proficiency against a pre-established difficulty standard.
This distinction matters for trend reading because it kills a tempting bad explanation. People sometimes attribute a rising average to an easier form or a falling average to a harder one, imagining that the exam got easier or harder year to year and dragged the mean with it. Equating is specifically designed to prevent that: if a form runs easier, the equating raises the raw count needed for a given scaled result, neutralizing the difficulty difference. So a genuine movement in the equated national average cannot be explained away as a difficulty drift, which throws the explanatory weight back onto the two causes that actually move the number: composition and, far more modestly, real performance change. The cleaner you are about equating, the more clearly you see that composition is doing the heavy lifting across this period.
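A toy illustration of that neutralizing step, in the raw-count terms used above, might look like the following. The two conversion tables are invented for this sketch, and actual scoring procedures are more involved than a lookup table; the point is only that the form's pre-established difficulty, not the room's performance, sets the conversion.

```python
# Hypothetical raw-to-scaled conversions for two forms of one section.
# Form B is assumed slightly easier, so it demands more correct answers
# for the same scaled result. All values are illustrative only.
FORM_A = {40: 600, 42: 620, 44: 640, 46: 660}
FORM_B = {42: 600, 44: 620, 46: 640, 48: 660}


def scaled(form_table, raw_correct):
    """Look up the scaled result for a raw count on a given form."""
    return form_table[raw_correct]


# A student of fixed proficiency is assumed to answer two more questions
# correctly on the easier form than on the harder one.
on_form_a = scaled(FORM_A, 44)
on_form_b = scaled(FORM_B, 46)

print(on_form_a, on_form_b)
```

Nothing in the lookup refers to how other test-takers did that day, which is the difference between equating and a curve.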
There is a real caveat to hold alongside the reassurance. Equating across a major format change is harder than equating across two paper forms, because the digital form differs in length, in adaptivity, and in the testing experience, and the bridging studies that establish the cross-format comparison carry more uncertainty than within-format equating. The responsible reading treats the cross-format comparison as solid enough to interpret the trend but acknowledges a band of uncertainty around the transition year specifically, which is one more reason to confirm transition-year figures against the finalized reports and to avoid over-interpreting a single year’s movement right at the format break.
The Same Story Played Out on the ACT, Which Confirms the Mechanism
A useful way to test whether the self-selection reading is right rather than just plausible is to look at whether the same dynamic appears on the other major American college-entrance exam, the ACT. If the pandemic-era rise in the average were really about students improving, you would expect the two exams to tell different stories, since they draw somewhat different populations and run on different schedules. If instead the rise were about composition, you would expect both exams to show the same pattern, because the same colleges dropped requirements for both at the same time, and the same lower-scoring students opted out of both. The parallel is close enough that our broader comparison of how the two American exams differ in structure and scoring, set out alongside the international alternatives in the ACT and SAT differences explained section of the complete guide, is worth keeping in view as you read.
The ACT reports its results on a different scale, a composite from 1 to 36 rather than 400 to 1600, but the trend logic is identical. Across the pandemic years the ACT’s reported national average behaved the way the SAT’s did: it was buffeted by the same collapse and recovery in participation, and the same test-optional surge that thinned the SAT pool thinned the ACT pool too. The fact that two exams with different content, different scales, and partly different geographic footprints moved in the same direction at the same time is strong evidence that the common cause, the change in who tested, was driving both, rather than some content-specific or exam-specific story. When two independent measurements move together, the explanation usually lies in what they share, and what these two exams shared during this period was the population of students deciding whether to test at all.
Does the ACT comparison change the conclusion about ability?
The ACT comparison reinforces rather than changes the conclusion. Both exams showed participation-driven movement in their averages across the same years, which is exactly what the self-selection mechanism predicts and exactly what a genuine ability-change story would not predict, since there is no reason a real shift in student learning would respect the boundary between two different exams so neatly. The convergence of the two trend lines is corroborating evidence that composition, not performance, was the dominant cause of the period’s headline movements.
There is a further analytical bonus in the comparison. Because some students take both exams and some take only one, and because the choice between them is itself somewhat self-selecting, a careful analyst can use the relationship between the two pools to sanity-check claims about either. A trend that appears on one exam but not the other deserves an exam-specific explanation; a trend that appears on both points to a shared cause in the population. This kind of cross-instrument checking is exactly the discipline a researcher or education journalist should apply before publishing a claim about either exam’s trend, and it is the kind of move the thin coverage of the period almost never made. The takeaway for a student is smaller but still useful: do not read either exam’s national average as a verdict on your generation’s ability, because both numbers were measuring a shifting crowd, not a fixed group of learners.
How to Verify These Numbers Yourself, the Researcher’s Method
Because every figure in this article is flagged for verification rather than asserted as fixed, it is worth spelling out exactly how a careful reader confirms a trend claim, since the ability to do this is what separates a citable account from a repeated rumor. The method is not complicated, but almost nobody applies it, which is why so much bad analysis circulates.
Begin with the canonical source. The College Board publishes an annual report on the full group of graduating seniors who took the exam at least once, and that report is the authoritative source for the national average, the participation count, and the percentile tables for that graduating class. When you encounter a claim about a year’s average, the first move is to find the edition of that report covering the matching class and read the figure directly rather than trusting a secondhand summary. Secondhand coverage frequently quotes provisional figures, misattributes a class year, or conflates the full group with a subgroup, and the only protection is going to the primary document.
How do I tell whether a reported figure is final or provisional?
You tell whether a figure is final or provisional by checking whether it comes from the finalized annual report for a completed graduating class or from an early release issued before the class was complete. Early releases circulate before all results are incorporated and before late definitional adjustments are made, so they can differ from the settled number. A figure is trustworthy when it comes from the finalized report for a class whose testing window has closed; a figure quoted mid-cycle, or attributed vaguely to a year without a class designation, should be treated as provisional and confirmed against the final document.
Once you have the right figure, the second move is to read it beside its participation count, never alone. A national average means almost nothing without the count of who produced it and the participation rate, because those are the variables that let you tell a composition story from a performance story. If the average rose while participation fell, you have a self-selection candidate; if both rose together, the opt-out mechanism cannot explain the rise, so something else is at work and deserves a closer look; if the average moved while participation held steady, then a performance or equating explanation becomes more plausible. The pairing of average with participation is the single most important analytical habit for this period, and it is the one most consistently omitted in popular coverage that quotes the average as a naked number.
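The pairing habit can be written down as a small classifier. This is a sketch only: the function name, the two percent band for "steady" participation, and the final catch-all case for a falling average are assumptions added here, not part of any published method.

```python
def read_movement(avg_change, participation_change, steady_band=0.02):
    """Classify a year-over-year move in the average.

    avg_change: change in the mean scaled result, in points.
    participation_change: fractional change in test-taker count (-0.15 = down 15%).
    steady_band: assumed threshold below which participation counts as steady.
    """
    if avg_change == 0:
        return "no movement in the average"
    if abs(participation_change) <= steady_band:
        return "performance or equating explanation more plausible"
    if avg_change > 0 and participation_change < 0:
        return "self-selection candidate"
    if avg_change > 0 and participation_change > 0:
        return "both rose: investigate further"
    # Assumed extra case: the average fell while participation moved.
    return "average fell as participation moved: check who entered or left the pool"


print(read_movement(9, -0.15))
print(read_movement(-4, 0.01))
```

The output is a prompt for further checking rather than a verdict; the labels only say which explanation to test first.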
The third move is to confirm which percentile and which reference group any ranking claim uses. Recall that there is a nationally representative percentile and a user percentile, and they answer different questions and can differ by several points at the same scaled result. A claim that a given result sits at a given percentile is incomplete until you know which percentile and which year, because the answer changes with both. A researcher who quotes a percentile without these qualifiers is, strictly, not making a verifiable claim. The same applies to you reading your own report: confirm which percentile it shows before you let it shape a decision.
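To see why the reference group matters, the sketch below ranks one scaled result against two different groups. Both groups are invented, and the percentile definition used (share of the group scoring strictly below) is one common convention assumed for the example.

```python
from bisect import bisect_left


def percentile_rank(score, reference_scores):
    """Percent of the reference group scoring strictly below `score`."""
    ordered = sorted(reference_scores)
    below = bisect_left(ordered, score)
    return 100 * below / len(ordered)


# Hypothetical reference groups, not real distributions.
all_students = [800, 880, 950, 1000, 1050, 1100, 1150, 1200, 1300, 1400]
test_takers = [1000, 1050, 1100, 1150, 1200, 1250, 1300, 1350, 1400, 1500]

vs_all = percentile_rank(1200, all_students)
vs_takers = percentile_rank(1200, test_takers)

print(vs_all, vs_takers)
```

The scaled result is identical in both calls; only the crowd it is ranked against changes, and the ranking changes with it.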
The fourth move, for anyone comparing across the format transition, is to treat transition-year figures with the extra caution the cross-format equating warrants and to look for the bridging-study documentation rather than assuming the comparison is as tight as a within-format one. This is the point where responsible analysts add an explicit uncertainty band and where careless ones declare a clean break or a clean continuity without evidence. The honest position acknowledges the comparison is solid enough to interpret the trend while flagging that the transition year carries more uncertainty than its neighbors.
Applied together, these four moves turn you from a consumer of trend claims into someone who can check them, which is exactly the literacy this article argues a test-taker should carry. The same skepticism you bring to a national headline you should bring to your own report, reading the scaled result as the stable quantity, the percentile as a year-stamped ranking, and any comparison across years or formats as a claim that needs its population and its uncertainty made explicit before it means anything.
Turning the Trend Reading Into a Submit-or-Withhold Decision
The most consequential use of all this analysis is the decision every applicant now faces in the fractured test-optional landscape: whether to submit a result to a given college. The trend reading feeds directly into that decision, because it tells you to compare the right quantities and to ignore the wrong ones.
The decision rule is built from scaled numbers, not percentiles or national averages, for all the reasons the trend analysis establishes. For each college on your list, find its published middle fifty percent, the 25th to 75th percentile range of admitted students’ results, which is itself a scaled range and is the most useful single figure a college publishes about how a result fits. Then place your own scaled result against that range. The InsightCrunch submit-or-withhold rule, in its simplest form, is that a result at or above a college’s median, the rough midpoint of its band, clearly helps and should be submitted; a result below the 25th percentile of the band generally does not help at a test-optional school and is usually better withheld; and a result between the 25th percentile and the median is a judgment call that depends on the rest of your application and on whether the college is test-optional or has reinstated a requirement.
Where exactly in a college’s band does my result help or hurt?
Your result clearly helps when it sits at or above the college’s median admitted result, because it places you in the upper half of admitted students on that measure and reinforces your case. It is usually neutral to mildly helpful between the 25th percentile and the median, and the decision there turns on the strength of the rest of your file. It generally does not help, and is often better withheld at a merely test-optional college, when it sits below the 25th percentile, because it places you in the bottom quarter of admitted students and adds a weak signal where silence would have been neutral. At a college that has reinstated a requirement, of course, you submit regardless, because withholding is not an option.
Work a quick example to make the rule concrete. Suppose a college publishes a middle fifty percent of 1350 to 1520, with a rough median near 1440. A student with a 1480 sits above the median, in the upper half of admitted students, and should submit without hesitation; the result strengthens the file. A student with a 1300 sits below the 25th percentile of that band, so at a test-optional college withholding is the stronger play, since the result would place them in the bottom quarter and add a weak signal where none was required. A student with a 1400 sits between the 25th percentile and the median, a judgment call that tips toward submitting if the rest of the application is modest and the result is the strongest element, and toward withholding if the application is already strong and the result would slightly drag down the overall impression. The rule is mechanical at the extremes and deliberative in the middle, which is exactly where good judgment earns its place.
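The same rule can be expressed as a short function, shown here as a sketch. The function and parameter names are hypothetical, and falling back to the band midpoint when no median is supplied is an assumption of this example rather than something colleges publish.

```python
def submit_or_withhold(score, p25, p75, test_required=False, median=None):
    """Apply the submit-or-withhold rule against a college's middle-50% band.

    median defaults to the band midpoint, a rough stand-in used only
    when the college does not publish a median (an assumption here).
    """
    if test_required:
        return "submit (required)"
    if median is None:
        median = (p25 + p75) / 2
    if score >= median:
        return "submit"
    if score < p25:
        return "withhold"
    return "judgment call"


# The three students from the worked example, band 1350-1520, median near 1440.
for result in (1480, 1300, 1400):
    print(result, submit_or_withhold(result, 1350, 1520, median=1440))
```

The function deliberately returns "judgment call" for the middle of the band rather than forcing an answer, since that region depends on the rest of the application.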
Notice what this decision does not use: the national average, which is a composition artifact irrelevant to any individual college’s band, and the national percentile, which is a year-stamped ranking against a shifting crowd rather than against the specific college’s admitted students. The trend reading clears these distractions away and leaves you comparing the one stable quantity that matters, your scaled result, against the one external quantity that matters, the college’s scaled band. Everything else is noise the period generated in abundance. A student who internalizes the population-versus-performance distinction makes this decision cleanly, while a student who is still anchored on national figures and old percentile charts makes it badly, submitting weak results out of misplaced confidence or withholding strong ones out of misplaced anxiety. Reading the trend correctly is, in the end, a tool for reading yourself correctly, and the submit-or-withhold decision is where that self-reading pays off in admissions terms.
A Worked Example at the State Level, Where the Effect Is Largest
The composition effect that moves the national average gently can move a single state’s average violently, because a state’s participation policy can flip its entire pool from self-selected to universal in one year. Working a concrete example at this level makes the mechanism impossible to miss and shows why state-to-state comparisons of averages are close to meaningless without participation context.
Picture two neighboring states. In the first, only college-bound students pay to test, so its pool is heavily pre-selected toward higher performers, and its reported average sits high, as a self-selected group’s average will. In the second, the state education department contracts to administer the exam to every public-school junior during the school day at no cost to families, so its pool includes the entire grade, college-bound or not, and its reported average sits markedly lower, because the measured group now contains students who would never have volunteered. A naive reader compares the two averages and concludes the first state’s schools are far stronger. That conclusion is an artifact of participation policy, not a finding about instruction. The first state measured its top slice; the second measured everyone. They are not the same measurement, and comparing them is comparing a volunteer pool to a census.
Why can a state’s average drop the year it adopts universal testing?
A state’s average drops the year it adopts universal testing because the policy adds the lower-scoring students who previously stayed home, expanding the measured group from a self-selected slice to the whole grade. No student performed worse; the denominator and the composition simply changed to include everyone. The drop is mechanical and should never be read as a decline in the state’s schools. The same logic in reverse explains why a state that drops universal testing sees its average jump: the lower tail stops being measured, and the remaining volunteer pool reports a higher mean.
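The arithmetic of that mechanical drop is easy to check. The counts and averages below are hypothetical, chosen only so the numbers come out round.

```python
def pooled_mean(groups):
    """Mean over several (count, average) groups."""
    total = sum(n for n, _ in groups)
    return sum(n * avg for n, avg in groups) / total


# Hypothetical state: 40,000 volunteers averaging 1120.
volunteers = (40_000, 1120)
# Universal testing adds 60,000 students who previously stayed home,
# assumed here to average 920. No volunteer's result changes.
newly_tested = (60_000, 920)

before = pooled_mean([volunteers])
after = pooled_mean([volunteers, newly_tested])

print(before, after)
```

Under these assumed figures the reported average falls by 120 points in one year while every student who tested in both scenarios scores exactly as before.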
Now connect this to the national figure across the period. Several states added, expanded, paused, or reconsidered universal school-day administration during these years, and because those policy changes were staggered across different states in different years, they injected movement into the national average that had nothing to do with national learning. A year in which a large state added universal testing would pull the national average down as that state’s whole grade entered the pool; a year in which one paused it would nudge the national average up. These state-level policy shifts are a real and underappreciated source of the national figure’s wobble, and they are invisible to anyone reading only the headline. The careful analyst checks the participation landscape state by state before attributing a national movement to anything about students, which is laborious and exactly why so few do it. The reward for doing it is the only honest reading of the trend.
The practical lesson for a student or parent is to distrust any ranking of states by average result, because such rankings overwhelmingly reflect participation rates rather than school quality. A state where a small, college-bound slice volunteers will always outrank a state that tests its entire grade, regardless of how well either educates. If you want to compare educational outcomes across states you need participation-matched comparisons or entirely different measures; the raw average comparison is one of the most common and most misleading uses of this data, and the same self-selection logic that governs the national trend governs it completely.
What High-Stakes Exams Elsewhere Reveal About Population Effects
The population-versus-performance problem is not unique to the American exam, and a brief look at how other systems handle it sharpens the analysis. Many countries run a single high-stakes university-entrance exam taken by nearly the entire relevant cohort rather than by a self-selected pool, and that structural difference changes what the average means. Where participation is effectively universal and mandatory, the average is a far better measure of the whole cohort’s performance, because there is no opt-out filter distorting who is measured. The trade-off is that universal high-stakes testing brings its own pressures, but the statistical point stands: a near-census average is more interpretable as a performance measure than a volunteer average is.
Our overview of how the American exam compares to the systems used elsewhere develops this structural contrast in detail, and it is worth understanding because it explains why the self-selection problem is so acute for the SAT specifically. The American exam sits in a system where testing is voluntary, where colleges can and increasingly do make it optional, and where states vary in whether they mandate it, so the participating pool is the product of countless individual and institutional choices rather than a fixed cohort. That voluntariness is exactly what makes the average so sensitive to composition, and it is why the pandemic-era policy changes moved the number so much. An exam taken by everyone would have shown the pandemic’s effect on actual learning far more cleanly, because the measured group would not have churned. The voluntary structure is a feature for access and a bug for measurement, and the period exposed the bug vividly.
Does a near-universal exam avoid the self-selection problem?
A near-universal, mandatory exam largely avoids the self-selection problem because there is no meaningful opt-out: nearly the whole cohort is measured, so the average reflects the cohort’s performance rather than the choices of a volunteer subset. Year-to-year comparisons on such an exam are correspondingly cleaner, since the measured group is stable in composition. The cost is the intense pressure universal high-stakes testing places on students, but on the narrow question of whether the average measures performance, a census beats a volunteer pool every time.
The lesson for reading the American trend is that the self-selection problem is a direct consequence of a voluntary, increasingly optional structure, and that the cure other systems use, universal participation, is not available here and carries costs of its own. So the American analyst is stuck with a volunteer average and must do the harder work of correcting for composition rather than trusting a census. That harder work is precisely the population-versus-performance discipline this article teaches, and the international contrast clarifies why it is indispensable for the SAT in particular: the very flexibility that makes the American exam accessible is what makes its average so easy to misread.
The Settled Digital Era and Reading Trends Going Forward
As the digital form settles and the test-optional landscape stabilizes into its fractured steady state, the question becomes how to read trends going forward, when the violent composition swings of the pandemic years give way to a quieter baseline. The good news is that the same discipline applies; the better news is that it gets easier to apply as the pool stabilizes.
In a settled digital era with stable participation policies, year-to-year movements in the average should shrink, because the largest driver of the pandemic-era swings, the churn in who tested, calms down. When participation holds roughly steady, a movement in the average becomes more interpretable as a genuine signal about performance or about a real shift in the pool’s composition, rather than as a pandemic artifact. This does not mean composition stops mattering; it means composition changes become smaller and slower, so a careful reader can attribute movement with more confidence. The analyst’s job shifts from disentangling violent swings to detecting subtle real signals against a quieter background.
How should I read SAT trends in the settled digital era?
In the settled digital era you should still pair every average with its participation count, but you can place somewhat more weight on a movement as a real signal once you have confirmed participation held steady. With the pandemic churn behind us, a sustained drift in the average across several stable years is more plausibly a genuine trend in performance or in the slowly shifting composition of the college-bound pool. The discipline is unchanged, check who tested first, but the conclusions become firmer as the population stabilizes, because there are fewer artifacts to strip away.
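One way to operationalize "check who tested first" is sketched below. The three percent participation band and the three-year minimum are assumed thresholds picked for illustration, and the input figures are hypothetical.

```python
def sustained_drift(averages, counts, max_swing=0.03, min_years=3):
    """Return 'up', 'down', or None for a run of yearly figures.

    A drift is reported only if every year-over-year change in the average
    has the same sign and every participation change stays within max_swing.
    """
    if len(averages) != len(counts) or len(averages) < min_years + 1:
        return None
    if any(abs(b - a) / a > max_swing for a, b in zip(counts, counts[1:])):
        return None  # participation moved: strip out composition first
    diffs = [b - a for a, b in zip(averages, averages[1:])]
    if all(d > 0 for d in diffs):
        return "up"
    if all(d < 0 for d in diffs):
        return "down"
    return None


# Hypothetical four-year run with stable participation (counts in thousands).
print(sustained_drift([1030, 1027, 1024, 1020], [1000, 1010, 1005, 1000]))
```

Returning None when participation swings is the design choice that matters: the function refuses to name a trend until the pool has held still.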
The forward-looking caution is that the test-optional landscape is not finished moving. As more selective colleges weigh reinstating requirements, the size and composition of the participating pool will keep adjusting, and each policy shift will leave a smaller version of the pandemic-era fingerprint on the average. A college that reinstates a requirement pulls more students, including some lower scorers, back into the pool, which can nudge the average down for reasons that, once again, have nothing to do with learning. So the reader of future trends should track the policy landscape as closely as the score data, because the two are inseparable: the average is downstream of who tests, and who tests is downstream of what colleges require. The population-versus-performance distinction will remain the master key for as long as the American exam stays voluntary, which is to say indefinitely.
The deepest takeaway from the whole period is that a number’s meaning lives in its method, not on its surface. The national average looked like a measure of student ability and behaved like a measure of participation policy, and only a reader who understood the method could tell the difference. That same gap between surface and method runs through every score report, every college band, and every percentile a student will ever read, and learning to see it on the national scale is practice for seeing it on the personal one. The reader who finishes this article able to ask, of any score statistic, who was measured and against whom, has acquired the single most durable piece of test literacy the period has to teach.
What is a cohort effect, and does it apply here?
A cohort effect is a genuine difference in a graduating class that follows it through its testing, distinct from a composition artifact. The classes that sat the exam during and just after the pandemic experienced real disruptions to instruction, including extended remote learning and uneven access to support, and to the extent those disruptions affected what students actually learned, they could produce a real, not artifactual, depression in the proficiency of those classes. The honest reading holds both possibilities at once: composition artifacts dominated the visible average swings, and a real cohort effect on learning may sit underneath, partly masked by the same self-selection that lifted the average.
Disentangling a real cohort effect from a composition artifact is the hardest analytical task the period poses, and it is where even careful analysts should hedge. The participation-adjusted approach is the only route to an answer: hold the composition of the pool constant, by comparing participation-matched subgroups across years where the data allows, and any residual movement is a candidate for a real cohort effect. Where that adjustment cannot be made cleanly, the responsible verdict is that the data underdetermines the question, that the average is consistent with a real learning loss masked by self-selection but does not by itself establish one. This is a case where the evidence genuinely does not support a confident verdict, and saying so plainly is more honest than forcing a number to mean more than it can. The reader who can hold that uncertainty, neither denying a possible cohort effect nor asserting it beyond the evidence, is reading the period as carefully as it can be read.
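The participation-adjusted comparison described above can be sketched in a few lines. This is a minimal illustration, not an official method: the two subgroups, their counts, and their means are hypothetical values invented for the example, and the function names `raw_mean` and `adjusted_mean` are assumptions. The idea is to re-weight a later year's subgroup means by the base year's subgroup sizes, so that any remaining movement cannot come from a shift in the mix between subgroups.

```python
# Hypothetical subgroup data: group -> (number of testers, subgroup mean).
# These figures are invented for illustration; they are not published values.
base_year = {"group_a": (500, 1100), "group_b": (500, 1000)}
later_year = {"group_a": (700, 1090), "group_b": (300, 990)}


def raw_mean(year):
    """Ordinary reported average: subgroup means weighted by that year's own counts."""
    total = sum(count * mean for count, mean in year.values())
    return total / sum(count for count, _ in year.values())


def adjusted_mean(year, reference):
    """Average of `year`'s subgroup means, weighted by the reference year's counts."""
    total = sum(reference[g][0] * year[g][1] for g in reference)
    return total / sum(count for count, _ in reference.values())


raw_change = raw_mean(later_year) - raw_mean(base_year)
adjusted_change = adjusted_mean(later_year, base_year) - raw_mean(base_year)

print("raw change:", raw_change)            # mix and performance together
print("adjusted change:", adjusted_change)  # mix held fixed at the base year
```

In this made-up case the raw average rises while the adjusted figure falls, because every subgroup slipped slightly while the higher-scoring subgroup grew as a share of the pool. The adjustment only controls for shifts between the subgroups you can observe; self-selection inside a subgroup would still be hidden, which is why the residual is a candidate for a cohort effect and not a proof of one.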
The reason this matters beyond statistical pedantry is that policy decisions ride on the answer. If the visible swings were pure composition artifacts, then nothing about instruction needs a response, whereas a real cohort effect would call for genuine remediation. Conflating the two, in either direction, leads to bad policy: panicking over an artifact or ignoring a real loss. The population-versus-performance distinction is therefore not only a tool for reading a headline; it is the precondition for any sound decision about what, if anything, the period’s data demands in response. Getting the measurement right is the first step toward getting the response right, and it is the step the popular coverage most consistently skipped.
Common Mistakes and Myths About the Period, Corrected
The popular account of this period is dense with confident errors, and naming them precisely is the fastest way to inoculate yourself against repeating them.
The first and largest myth is that the pandemic-era rise in the average shows students were resilient or even improved under remote learning. The mechanism that actually produced the rise, demonstrated in the first worked illustration, is self-selection: lower scorers opted out when colleges dropped the requirement, lifting the mean of those who remained without anyone improving. The myth survives because the alternative, a comforting story about resilience, is more pleasant than the truth that the number simply measured a different, smaller, stronger group. Whenever you see this claim, ask what happened to participation that year, and the resilience story collapses.
The second myth is that the digital format made the exam easier and inflated scores. Equating is designed precisely to prevent a format change from inflating results, and the directional expectation from the shorter, less fatiguing sitting is a small individual nudge, not a large inflation. Anyone claiming the digital form handed out free points is ignoring the equating machinery and, usually, attributing a composition effect to the format. The format change is real and worth understanding for how it shapes test-day strategy, but it is not a score giveaway.
What is the single most common error people make reading SAT trends?
The single most common error is reading a change in the national average as a change in student ability, ignoring that the test-taking population churned faster than any individual’s performance could. This error drives almost every bad take about the period, from the resilience myth to the format-inflation myth. The correction is the population-versus-performance distinction: before attributing any movement to learning, account for who tested. Once you internalize that the denominator and the composition of the numerator move on their own, the headlines stop fooling you.
The third myth is that test-optional policies prove the exam is worthless, since colleges admitted classes without it. The evidence is more divided than that: a visible group of selective colleges reinstated the requirement after finding the result added predictive signal they missed, particularly for identifying talent at under-resourced schools. Dropping the requirement did not prove the exam useless; it produced a natural experiment whose results split, which is why the landscape is now mixed rather than uniformly optional. The honest statement is that the exam’s value depends on how it is read, not that it has none.
The fourth myth is that a percentile is a fixed property of a result, so that a 1300 is always, say, the 88th percentile. Percentiles are relative to a reference group and a year, both of which moved across this period, so the same scaled result carried different ranks in different years. Families who compare a current result against an old percentile chart, or against a relative’s rank from a different year, are comparing against the wrong crowd and routinely misjudge where a result stands. Read only the percentile printed on the current report, against its stated reference year.
The fifth myth is that you can use the national average as a personal target. The national figure is a composition artifact that moves for reasons no student controls; it is not a goal. Your target is the scaled band of the specific colleges you are applying to, and your job is to move your own scaled result toward that band by diagnosing and fixing your own error profile. Treating the national average as a finish line is a category error, mistaking a population statistic for a personal objective.
Closing Direction: Read the Number, Then Read Yourself
Return to the 2021 headline that opened this article, the one that read a rising average as a small miracle of resilience. You can now see it for what it was: a composition artifact, the predictable result of lower scorers opting out when the requirement vanished, dressed up as a story about students rising to a challenge. Nothing about that average told you whether any student learned more, because the average was measuring a different, smaller, stronger group than the year before. The whole period is full of numbers like that, true as far as they go and badly misleading if you read them as verdicts on ability.
The discipline that protects you is small and portable: before you read any change in a national average as a change in performance, separate the change in who tested from the change in how well testers did. Apply it to the headline and the resilience myth dissolves. Apply it to your own report and the percentile noise quiets, leaving the scaled result, the stable yardstick you compare against your colleges’ bands. The population-versus-performance distinction is not a trick for arguing about trends; it is the core of reading any score, your own included, correctly.
So here is the exact next action. Pull your most recent practice or official result, write down the scaled total and ignore the percentile for a moment, then sort your missed items into content, careless, and timing, and build your next week of study from whichever bucket is heaviest. Then go run timed, feedback-rich practice on the diagnosed weakness so the analysis becomes points rather than a list. The country’s average will keep moving for reasons you do not control. Your scaled result will move for reasons you do. Spend your attention on the second number, because it is the only one that was ever about you.
Frequently Asked Questions
How have SAT scores changed from 2020 to 2026?
Across the period the reported national average moved up and down in a pattern that tracks participation far more than ability. It held roughly flat through 2020, ticked up in 2021 when the test-optional surge and cancelled administrations thinned the pool of lower scorers, then declined as participation recovered toward pre-pandemic levels in 2022 and 2023, before settling once the digital form arrived and policies stabilized. The honest summary is that the headline average changed mostly because the group taking the exam changed, not because individual students learned dramatically more or less. Every specific figure should be confirmed against the College Board’s finalized report for the matching graduating class, since published values are revised and the cohort definition affects the exact number. The direction is the reliable signal; the precise points are pointers to verify.
Did average SAT scores change after the digital transition?
The reported average moved only modestly around the digital transition, which is what the design predicts. The digital form is shorter and less fatiguing, so the directional expectation at the individual level is a small upward nudge from reduced exhaustion, not a large swing. More importantly, the equating process is built to hold the meaning of a scaled result constant across the paper-to-digital change, so a given scaled total represents the same proficiency on either form. Any large movement coinciding with the transition is therefore far more likely to be a composition effect, a change in who tested, than a format effect. Treat the transition year with extra caution, confirm its figures against the finalized report, and avoid over-reading a single year’s movement right at the format break, where cross-format equating carries a wider band of uncertainty than within-format equating does.
How did COVID affect SAT testing and scores?
The pandemic’s largest effect was on participation, not on the difficulty or scoring of the exam. Test centers closed for months in 2020, cancelling administrations and making it physically impossible for many students to sit the exam. Colleges responded by suspending their requirement almost universally for the next admissions cycle, which removed the reason many lower-scoring students had to test at all. The combined result was a sharp drop in volume and a thinning of the lower tail of the pool, which lifted the reported average through self-selection rather than through any improvement in learning. The College Board also retired the optional essay and the subject-area exams during this stretch. So the pandemic reshaped who tested and how the admissions system used results, and those changes, not a change in student ability, drive almost every visible movement in the period’s averages.
What is the test-optional self-selection effect on averages?
The self-selection effect is the mechanical rise in a reported average that occurs when participation is voluntary and the students most likely to opt out are the lower scorers. When a college drops the requirement, a student expecting a low result has little reason to spend a Saturday testing, so they stay home. Their absence removes low values from the calculation, raising the mean of those who remain even though no remaining student answered one more question correctly. The size of the lift depends on how many opt out and how far below the mean they sat. During the test-optional surge this effect was the dominant cause of the rising average, and reading that rise as evidence of improvement is the central error the period invites. The fix is to check what happened to participation before attributing any change in the average to performance.
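The dependence on "how many" and "how far below" can be written out as one line of algebra. The symbols here are introduced for illustration and do not come from any official report: $N$ is the number of students who would test under a requirement, $\mu$ is their mean, $k$ is the number who opt out, and $\mu_{\text{out}}$ is the mean those $k$ students would have earned.

```latex
\mu_{\text{rem}} = \frac{N\mu - k\,\mu_{\text{out}}}{N - k},
\qquad
\mu_{\text{rem}} - \mu = \frac{k\,(\mu - \mu_{\text{out}})}{N - k}
```

The lift is positive whenever the students who opt out sat below the original mean, and it grows both with the number who leave and with their distance below that mean. No term in the expression involves any remaining student's performance.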
Why might a 1300 mean a different percentile than it used to?
A percentile is a ranking against a reference group, not a fixed property of a result, so the same scaled total carries different percentiles when the reference group changes. A 1300 sitting against the thin, self-selected pool of the pandemic years ranks against a smaller, stronger field than the same 1300 sitting against the recovered, broader pool of later years. The equated proficiency is identical, but the share of the field at or below it differs because the shape of the surrounding distribution changed. This is why comparing a current result to an old percentile chart, or to a relative’s rank from a different year, misleads: the crowd moved. Always read the percentile printed on the current report against its stated reference year, and weight the scaled result, which is stable, more heavily than the percentile in any decision.
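A short sketch shows the same 1300 landing at different ranks against two different crowds. Both pools are hypothetical and tiny, chosen only to make the arithmetic visible, and the helper `percentile_rank` uses one common definition (the percent of the pool scoring at or below the result); published percentiles follow the test maker's own definition and reference group.

```python
from bisect import bisect_right


def percentile_rank(score, pool):
    """Percent of `pool` scoring at or below `score` (one common definition)."""
    ordered = sorted(pool)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


# Hypothetical pools, invented for illustration only.
broad_pool = [900, 950, 1000, 1050, 1100, 1150, 1200, 1300, 1400, 1500]
thin_pool = [1100, 1150, 1200, 1300, 1400, 1500]  # the four lowest scorers stayed home

score = 1300
broad_rank = percentile_rank(score, broad_pool)
thin_rank = percentile_rank(score, thin_pool)

print(f"{score} against the broad pool: {broad_rank:.1f}")
print(f"{score} against the thin pool:  {thin_rank:.1f}")
```

The scaled result never changes in this example; only the field around it does, and the rank drops when the lower tail is missing.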
Why can a rising average not mean rising ability?
A rising average cannot be assumed to mean rising ability because the average is a quotient whose denominator and composition move independently of any individual’s performance. If the lowest scorers stop participating, the mean of everyone who remains climbs with nobody improving, which is exactly what the self-selection effect describes. Ability does not swing fast enough from one graduating class to the next to explain the movements seen in this period, and the average rose in the very year participation collapsed, the inverse of what a performance story predicts. So a rising average is consistent with rising ability, falling ability, or no change in ability at all, depending entirely on what happened to who tested. Until you account for the composition of the pool, the average tells you nothing reliable about whether students got better.
Did the digital format make scores go up or down?
The digital format most plausibly produced a small upward nudge at the individual level, driven by reduced fatigue from a shorter sitting, rather than any large movement in either direction. The equating process is specifically designed to hold the meaning of a scaled result constant across the format change, so the format itself should not inflate or deflate the equated average. Claims that the digital form handed out free points usually mistake a composition effect for a format effect, and claims that it made the exam harder ignore the equating that neutralizes difficulty differences. The responsible reading is that the format change matters a great deal for test-day strategy and experience but only modestly for the national average, with the bulk of the period’s movement coming from participation swings rather than from anything about paper versus screen.
How did test-taker numbers change over this period?
Volume fell sharply and then recovered. Participation sat near a historic peak of roughly 2.2 million before the pandemic, dropped steeply toward roughly 1.5 million in the cycle most affected by closures and the test-optional surge, then climbed back toward and past 1.9 million as testing sites reopened and universal school-day administration resumed in several states. These figures should be confirmed against the College Board’s finalized reports, but the pattern of a steep drop followed by a strong recovery is the reliable shape. The volume swing is the engine behind most of the period’s score movement, because the students who left and returned were disproportionately from the lower part of the distribution, so their absence and return moved the average without moving any individual’s performance.
Why are year-to-year score comparisons sometimes misleading?
Year-to-year comparisons mislead whenever the composition of the test-taking pool changed between the two years, because the average reflects who tested as much as how well they did. Across this period the pool churned every single year through cancelled administrations, shifting college requirements, recovering participation, and changing state school-day contracts, so almost every adjacent pair of years involves comparing different populations. A comparison is only clean when the groups are comparable, and they rarely were here. The discipline is to put the participation rate and the composition of the pool beside every average before comparing two years, and to treat a difference as a population question first and a performance question only after composition has been accounted for. Skip that step and you will read population artifacts as performance trends.
How have colleges adapted their use of SAT scores?
Colleges responded to the period in divergent ways that left the landscape fractured rather than uniform. Most suspended their requirement during the pandemic out of necessity, then split: some committed to optional or test-blind policies long term, some quietly let suspensions lapse back into requirements, and a visible group of selective institutions reinstated the requirement after concluding a result added predictive signal, particularly for identifying talent at under-resourced schools. Superscoring also spread, with more colleges combining best section results across sittings. The practical consequence for an applicant is that there is no single national rule; each college’s current policy and its published scaled band must be checked individually, and the submit-or-withhold decision is made school by school against that college’s data rather than against any national figure.
Did score gaps widen or narrow over this period?
Gaps between groups defined by income, parental education, and background predate the pandemic by decades, and reading their movement across this period requires the same self-selection caution as the overall average. If opt-outs during the test-optional surge were unevenly distributed across groups, which is likely given uneven access to testing sites and information, then a measured narrowing or widening in a given year could reflect differential participation rather than differential learning. A group whose lower scorers opted out more would show an inflated average that year, distorting the visible gap. The neutral reading reports the direction of the published gap figures, attributes as much movement as the participation data can explain to composition, and reserves any claim about changes in opportunity or instruction for analysis that controls for who tested. Any specific gap figure should be confirmed against official distribution data.
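The distortion from uneven opt-outs can be made concrete with a toy case. Every number below is hypothetical, and the groups are deliberately abstract labels, not real populations; the sketch only shows that a measured gap can shrink when one group's lower scorers stay home and nothing else changes.

```python
from statistics import mean

# Hypothetical score lists for two abstract groups; invented for illustration.
group_a = [1000, 1100, 1200, 1300, 1400]
group_b = [800, 900, 1000, 1100, 1200]


def visible_gap(a, b):
    """Difference between the two groups' reported averages."""
    return mean(a) - mean(b)


# Everyone tests: the gap reflects the full groups.
full_gap = visible_gap(group_a, group_b)

# Assumption for the sketch: only group B's two lowest scorers opt out.
group_b_tested = sorted(group_b)[2:]
selected_gap = visible_gap(group_a, group_b_tested)

print("gap with full participation:", full_gap)
print("gap after uneven opt-outs:  ", selected_gap)
```

No score changed between the two calculations, yet the visible gap is cut in half, which is why a narrowing in a low-participation year should be checked against participation by group before it is read as progress.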
What does the recovery period after COVID show?
The recovery period, as participation climbed back toward pre-pandemic levels, shows the self-selection effect running in reverse. As lower scorers returned to the pool, the lower tail of the distribution refilled and the reported average declined, not because instruction worsened but because the measured group broadened back toward the whole college-bound population and beyond. This is the clearest evidence that the pandemic-era rise was a composition artifact: the average moved up when the tail left and down when the tail returned, tracking participation rather than ability. The recovery years are therefore the control case that validates the self-selection reading of the earlier rise, and they are a useful reminder that a falling average during a recovery is a sign of a normalizing pool, not of declining students.
How does opting out change the reported average?
Opting out changes the reported average by removing specific values from the calculation, and because the students who opt out under voluntary policies are disproportionately lower scorers, their removal raises the mean of those who remain. The effect is purely arithmetic and requires no change in anyone’s performance. A simple illustration makes it concrete: a ten-student group averaging 1185 becomes a six-student group averaging 1300 if the four lowest scorers stay home, a jump of more than a hundred points with nobody answering an additional question correctly. The magnitude depends on how many opt out and how far below the mean they were. This is why a rising average in a year of falling participation should trigger the self-selection explanation before any story about improvement, and why participation must always be read beside the average.
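The ten-student arithmetic is easy to reproduce. The individual scores below are hypothetical, picked only so that the group means match the 1185 and 1300 quoted above; any ten values with the same two means would tell the same story.

```python
from statistics import mean

# Hypothetical scores chosen to reproduce the means quoted in the text.
scores = [950, 1000, 1050, 1050, 1150, 1200, 1250, 1350, 1400, 1450]

everyone = mean(scores)          # all ten students test
remaining = sorted(scores)[4:]   # the four lowest scorers stay home
after_opt_out = mean(remaining)  # only the six who remain are averaged

lift = after_opt_out - everyone

print("average with all ten:   ", everyone)
print("average with six left:  ", after_opt_out)
print("lift from opting out:   ", lift)
```

Every score in `remaining` is identical to what it was in `scores`; the 115-point rise comes entirely from which values entered the calculation.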
Is the digital SAT scored the same as the paper SAT?
The digital and paper versions use the same scaled range, 400 to 1600 built from two section results of 200 to 800 each, and the equating and published concordance work are designed so that a given scaled total represents the same demonstrated proficiency on either form. In that sense the scoring is intended to be the same, and a college reading a digital result against a band built partly from paper-era students is comparing like with like by design. The caveat is that bridging two formats that differ in length and adaptivity carries more uncertainty than equating two paper forms, so the cross-format comparison is solid enough to interpret the trend while warranting extra caution right at the transition year. Confirm transition-year figures against finalized reports and avoid over-reading a single year’s movement at the format break.
What is the biggest misconception about SAT score trends?
The biggest misconception is that a movement in the national average reflects a movement in student ability. In reality the average is a quotient whose denominator and composition shift with participation, and across this period participation churned faster than any individual’s performance could. The average rose when lower scorers opted out and fell when they returned, the opposite of what an ability story would predict, which is the signature of a composition artifact. Reading the average as a verdict on learning produces the resilience myth, the format-inflation myth, and the habit of using the national figure as a personal target. The correction is the population-versus-performance distinction: account for who tested before attributing any change to learning. Master that one habit and nearly every confident error about the period stops fooling you.