(Database Query Processing)
A Worked Example with SQL and algorithmic processing
using primitives (with analysis of performance)
Consider a database with 3 tables,
STUDENT-INFO,
COURSE-INFO, and
ENROLMENT.
Assume
•
the STUDENT-INFO table has 30,000 (3x104) rows,
•
the COURSE-INFO table has 1,000 (103) rows,
[BiYing checked CORS & said 1365 for Spr 2009. Thx]
•
the ENROLMENT table has 100,000 (105) rows.
Student-ID | Name | NRIC-ID | Address | Tel-No | Faculty | Major |
... |
... |
... |
... |
... |
... |
... |
|
|
Question Q4 from Spring 2012 Quiz -- SOLUTION SKETCH
(a) List the Course-ID, Day, Hour of all courses taught in the venue "USP-SR1".
We first give the declarative SQL-query: (that declares what, but not how.)
SELECT Course-ID, Day, Hour FROM CI WHERE (Venue = 'USP-SR1'); |
And here is the algorithmic version (using e-project, e-select, e-join): (the how)
S1 <-- e-select from CI WHERE (Venue = 'USP-SR1'); Ans <-- e-project Course-ID, Day, Hour from S1; |
Analysis: (Assume m = number of courses taught in USP-SR1 = 20)
- First step (e-select) takes 1,000 row ops;
table S1 has m = 20 rows.
- 2nd step (e-project) takes O(m) = 20 row ops;
Total: About 1,020 row operations!
(b) List the Student-ID, Major of all FASS students who have classes
in "USP-SR1".
We first give the declarative SQL-query: (that declares what, but not how.)
SELECT Student-ID, Major FROM CI, SI, EN WHERE (Venue = 'USP-SR1') AND (Faculty = "FASS' ) AND (SI.SID = EN.SID ) AND (CI.CID - EN.CID ); |
There are many different algorithmic solutions to
process this query with the primitives: (using e-project, e-select, e-join)
(the how). They have different computational efficiencies.
Call this Algorithm Sp-2010-Q4-X (slow algorithm).
X1 <-- e-join SI and EN WHERE (SI.Student-ID = EN.Student-ID); X2 <-- e-join X1 and CI WHERE (CI.Course-ID = X1.Course-ID); X3 <-- e-select from X2 WHERE (Faculty='FASS') and (Venue='USP-SR1'); Ans <-- e-project Student-ID, Major from X3; |
Analysis: (Assume q = # of FASS students taking classes in USP-SR1 = 50)
- First step (e-join) takes (30,000*100,000) row ops;
table X1 has at least 100,000 rows.
- 2nd step (e-join) takes (1,000*100,000) row ops;
table X2 has at least 100,000 rows.
- 3rd step (e-select) takes 100,000 row ops;
table X3 has q = 50 rows.
- 4th step (e-project) takes 50 row ops;
Total: About 3,100,100,050 (about 3.1 x 109) row operations!
(dominated by the first join operations!)
A more efficient algorithm is Algorithm Sp-2010-Q4-G (faster algorithm).
G1 <-- e-select from SI WHERE (Faculty='FASS'); G2 <-- e-select from CI WHERE (Venue='USP-SR1'); G3 <-- e-join G2 and EN WHERE (G2.Course-ID = EN.Course-ID); G4 <-- e-join G3 and G1 WHERE (G3.Student-ID = G1.Student-ID); Ans <-- e-project Student-ID, Major from G4; |
Analysis: (Assume # of FASS students = 3000, # of classes in USP-SR1 = 20)
- 1st step (e-select) takes 30,000 row ops;
table G1 has 3,000 rows.
- 2nd step (e-select) takes 1,000 row ops;
table G2 has 20 rows.
- 3rd step (e-join) takes (20*100,000) row ops;
table G3 has about 200 rows. [updated number]
- 4th step (e-join) takes (200*3,000) row ops;
table G4 has about 50 rows.
- 5th step (e-project) takes 50 row ops;
Observation:
There is about 1,200 times difference between algorithm Sp-2010-Q4-X and Sp-2010-Q4-G. (3.1 x 109 / 2.63 x 106 = 1,187). That's a difference of (1sec vs 20min) or (1 day vs 3.25 years)!! |
[Note added after tutorials on Wed, 06-Mar:
I thank everyone for the corrections, questions, discussions on this worked example -- especially on the estimation of the sizes of the intermediate tables. It is very good and I thank all of you for it. I have made changes to some of the estimated sizes as a result. So, while the final conclusion does not change (a further testament to Enrico Fermi's claim on Fermi-type problems), it is still better to have more reasonable estimates. [End of note] |
Still another possible algorithm is Algorithm Sp-2010-Q4-H.
H1 <-- e-select from SI WHERE (Faculty='FASS'); H2 <-- e-select from CI WHERE (Venue='USP-SR1'); H3 <-- e-join H1 and EN WHERE (H1.Student-ID = EN.Student-ID); H4 <-- e-join H3 and H2 WHERE (H3.Course-ID = H2.Course-ID); Ans <-- e-project Student-ID, Major from H4; |
Analysis: (Assume # of FASS students = 3000, # of classes in USP-SR1 = 20)
- 1st step (e-select) takes 30,000 row ops;
table H1 has 3,000 rows.
- 2nd step (e-select) takes 1,000 row ops;
table H2 has 20 rows.
- 3rd step (e-join) takes (3,000*100,000) row ops;
table H3 has about 15,000 rows.
- 4th step (e-join) takes (20*15,000) row ops;
table H4 has about 50 rows.
- 5th step (e-project) takes 50 row ops;
Note:
There are other algorithms that you can also invent for processing
this query with performance somewhere between Sp-2010-Q4-G and Sp-2010-Q4-X.
Q: Can you find two of them?
Final Lesson:
Always do e-join last (wherever possible). And e-join with smallest tables first, whenever possible. |