Optimizing a SQL query to calculate averages from large data sets

Shah Nawas Khan
6 min read · Dec 21, 2022


We have a page that displays a score report for students after they have taken an online exam. In the score report, we need to show not only the student's own scores but also the average score of other students across classes. The average is calculated from the existing data in the database, using all previous attempts by other students. To calculate the average, there are a couple of conditions we need to consider.

Here are the conditions:

  • A minimum of 100 attempts must already be completed for this exam.
  • At least 5 classrooms must have taken this exam, each with a minimum of 20 attempts.
  • Do not display the average if the above conditions are not met.

The average value is used only for display, to give students a sense of the average score other students are getting. So it does not have to be accurate in real time.

Here is what the existing database model looks like.

Based on the model above, to calculate the average score for a particular exam, we could run a SQL query to get each attempt for a given exam_id.

SELECT
  students.class_room_id AS class_room_id,
  exam_attempts.id AS exam_attempt_id,
  COUNT(DISTINCT questions.id) AS total_questions,
  SUM(CASE WHEN answers.is_correct THEN 1 ELSE 0 END) AS correct_answers
FROM exam_attempts
INNER JOIN students ON exam_attempts.student_id = students.id
INNER JOIN questions ON exam_attempts.exam_id = questions.exam_id
INNER JOIN answers
  ON exam_attempts.id = answers.exam_attempt_id
  AND questions.id = answers.question_id
WHERE exam_attempts.exam_id = 301
GROUP BY students.class_room_id, exam_attempts.id

This query produces one row per attempt on that particular exam, as shown in the table below.

Now we can use the above query result as a sub-query in another query that aggregates the average score and also makes sure it satisfies the three conditions mentioned above, as sketched below.
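Here is a rough sketch of what that outer query might look like, assuming PostgreSQL syntax. The CTE names attempt_scores and class_stats are hypothetical, the inner query is the one shown above, and the second condition is read as "at least 5 classrooms with 20 or more attempts each".

WITH attempt_scores AS (
  SELECT
    students.class_room_id AS class_room_id,
    exam_attempts.id AS exam_attempt_id,
    COUNT(DISTINCT questions.id) AS total_questions,
    SUM(CASE WHEN answers.is_correct THEN 1 ELSE 0 END) AS correct_answers
  FROM exam_attempts
  INNER JOIN students ON exam_attempts.student_id = students.id
  INNER JOIN questions ON exam_attempts.exam_id = questions.exam_id
  INNER JOIN answers
    ON exam_attempts.id = answers.exam_attempt_id
    AND questions.id = answers.question_id
  WHERE exam_attempts.exam_id = 301
  GROUP BY students.class_room_id, exam_attempts.id
),
class_stats AS (
  SELECT class_room_id, COUNT(*) AS attempts_in_class
  FROM attempt_scores
  GROUP BY class_room_id
)
SELECT AVG(correct_answers::numeric / total_questions) AS average_score
FROM attempt_scores
-- Only return a row when the display conditions are met:
--   at least 100 attempts overall, and
--   at least 5 classrooms with 20 or more attempts each.
HAVING COUNT(*) >= 100
  AND (SELECT COUNT(*) FROM class_stats WHERE attempts_in_class >= 20) >= 5

If the HAVING conditions are not met, the query returns no rows, which maps directly to not displaying the average.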

There is one big problem with this approach. The query scans the whole exam_attempts table and returns every attempt ever taken for a particular exam. Loading this result into a sub-query would need a huge amount of memory, proportional to the size of the data in this table.

We need to limit the number of results from the query. So instead of querying every attempt, we can group all attempts by class_rooms and exams. This way we can reduce the number of results from the query by aggregating all attempts for a particular exam and classroom.

Let's modify the previous query to achieve this optimization, as shown below. The major difference is that we are now grouping by class_rooms and exams instead of class_rooms and exam_attempts.

SELECT
  students.class_room_id AS class_room_id,
  COUNT(DISTINCT exam_attempts.id) AS exam_attempts_count,
  COUNT(DISTINCT questions.id) AS total_questions,
  SUM(CASE WHEN answers.is_correct THEN 1 ELSE 0 END) AS total_correct
FROM exams
INNER JOIN exam_attempts ON exam_attempts.exam_id = exams.id
INNER JOIN students ON exam_attempts.student_id = students.id
INNER JOIN questions ON exam_attempts.exam_id = questions.exam_id
INNER JOIN answers
  ON exam_attempts.id = answers.exam_attempt_id
  AND questions.id = answers.question_id
WHERE exam_attempts.exam_id = 301
GROUP BY students.class_room_id, exams.id

This new query produces one aggregated row per classroom for that particular exam, as shown in the table below.

We can be sure that this new query is more efficient. Let's say there are C classrooms, each of which can take up to E exams; then the maximum number of rows this query can return is C * E.

Let's say there are N exam attempts in total; then C * E will always be less than N, except in the worst-case scenario where there are N classrooms with one student per classroom and one attempt per student.

In this worst case, C * E = N.

This would hardly ever be the case, since a classroom normally has many students; in practice, the number of result rows is proportional to the number of classrooms.

Now that we have improved the efficiency of the solution, another issue is raised.

Would the average calculated from the aggregated rows be the same as the average calculated from each individual attempt?

In the first query, the average is calculated by summing the score of each attempt row and dividing it by the total number of attempts (rows):

SUM(correct_answers/total_questions) / total_attempts

That is:

(
(3/5) + (3/5) + (0/5) + (2/5) + (3/5) + (1/5) + (3/5) +
(3/5) + (2/5) + (3/5) + (1/5) + (3/5) + (2/5) + (3/5)
) / 14 = 0.4571

In the second query, the average score is calculated by:

SUM(total_correct) / ( total_attempts * total_questions )

That is:

(9 + 8 + 6 + 5 + 4) / (14 * 5) = 0.4571

Based on the small data set above, both queries produce the same average score. However, we need to be sure this holds for any other data set.

We can prove that with a simple mathematical deduction: we need to show that the first query's average calculation is the same as the second query's.

Let's represent the first query's average calculation with the formula below, where ci is the number of correct answers in attempt i, qi is the number of questions in attempt i, and n is the total number of attempts:

average_1 = ( c1/q1 + c2/q2 + … + cn/qn ) / n

and the optimized query's average calculation as follows, where Cj is the total of correct answers for classroom j (with k classrooms in total) and q is the number of questions in the exam:

average_2 = ( C1 + C2 + … + Ck ) / ( n * q )

We can now expand the first formula. If every attempt has the same number of questions, q1 = q2 = … = qn = q, it becomes

average_1 = ( c1 + c2 + … + cn ) / ( n * q )

and expand the second formula. Each classroom total is just the sum of the correct answers of the attempts in that classroom, so C1 + C2 + … + Ck = c1 + c2 + … + cn, which gives

average_2 = ( c1 + c2 + … + cn ) / ( n * q )

We can see that the end results of both formulas match, provided the number of questions is the same for every attempt, that is, q1 = q2 = … = qn in the first formula must all equal q in the second.

Since the query filters by a particular exam, the total number of questions (q) is indeed the same for every attempt on that exam, so the two calculations are equivalent.
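Putting this together, here is a sketch of how the displayed average (and the three display conditions) could be computed directly from the aggregated rows, again assuming PostgreSQL; class_scores is a hypothetical CTE name wrapping the optimized query above.

WITH class_scores AS (
  SELECT
    students.class_room_id AS class_room_id,
    COUNT(DISTINCT exam_attempts.id) AS exam_attempts_count,
    COUNT(DISTINCT questions.id) AS total_questions,
    SUM(CASE WHEN answers.is_correct THEN 1 ELSE 0 END) AS total_correct
  FROM exams
  INNER JOIN exam_attempts ON exam_attempts.exam_id = exams.id
  INNER JOIN students ON exam_attempts.student_id = students.id
  INNER JOIN questions ON exam_attempts.exam_id = questions.exam_id
  INNER JOIN answers
    ON exam_attempts.id = answers.exam_attempt_id
    AND questions.id = answers.question_id
  WHERE exam_attempts.exam_id = 301
  GROUP BY students.class_room_id, exams.id
)
SELECT
  -- total_questions is the same on every row (same exam), so MAX simply picks it up
  SUM(total_correct)::numeric
    / (SUM(exam_attempts_count) * MAX(total_questions)) AS average_score
FROM class_scores
-- Same three conditions, now checked against the aggregated per-classroom rows;
-- no rows are returned (nothing is displayed) when they are not met.
HAVING SUM(exam_attempts_count) >= 100
  AND COUNT(*) FILTER (WHERE exam_attempts_count >= 20) >= 5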

Now that we are clear about the SQL query, let's look at other considerations for implementing this feature.

Since we have to display this average value on the score report page, we could attempt to calculate this value each time the page is loaded. But this is highly inefficient for the following reasons:

  • Running this query every time the score report page is loaded would put too much strain on the database and server.
  • Each page load would run the same query and produce the same result at that point in time. As the data set grows, the average value is not going to vary much between loads.

We don't need the value to be accurate in real time, which means we can pre-calculate it once every t hours and store it in a database table. Then, when the page is loaded, we can just pick up this pre-calculated value and display it, as sketched below.
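As a minimal sketch of that approach (assuming PostgreSQL), the pre-calculated value could live in a small cache table that a scheduled job refreshes every t hours; the exam_average_scores table and its columns are hypothetical, and the scheduler (cron, pg_cron, a background worker, etc.) is not shown.

CREATE TABLE IF NOT EXISTS exam_average_scores (
  exam_id BIGINT PRIMARY KEY,
  average_score NUMERIC,              -- NULL when the display conditions are not met
  calculated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- The scheduled job runs the aggregated query above and upserts its result.
INSERT INTO exam_average_scores (exam_id, average_score, calculated_at)
VALUES (301, 0.4571, now())           -- 0.4571 stands in for whatever the job computed
ON CONFLICT (exam_id)
DO UPDATE SET average_score = EXCLUDED.average_score,
              calculated_at = EXCLUDED.calculated_at;

-- The score report page then only needs a cheap single-row lookup:
SELECT average_score FROM exam_average_scores WHERE exam_id = 301;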
