Welcome to TDBSoverflow, Our class's own StackOverflow. Our rules:
  1. Use only meaningful and self-explanatory titles
  2. Tag your questions with meaningful keywords
  3. Use upvotes and downvotes to rate the answers
  4. When you receive a satisfying answer - Click the "V" button
Remember: you may get up to 5 bonus points to your final grade!

Sort Merge Join I/O Complexity

+2 votes
38 views

In Presentation 13, slide 24, about the complexity of sort merge join, it's mentioned that in the worst case (where all the values in both columns are equal), the join phase can take up to B(S)*B(R) IOs.

Then, it says in brackets: (But we can avoid it! How?)

So, how can we avoid it?

Thanks

asked Feb 4, 2018 by Assaf (31,090 points)

1 Answer

+1 vote

The base algorithm shown in the presentation iterates over the tuples of S and R. That means, if it encounters equality of S.a=R.b, let the equal tuples be numbered as S[i], R[j], then it continues checking equality for:

S[i].a = R[j+1].b

S[i].a = R[j+1].b

and so on, breaking only when reaching the end of R relation \ inequality.

Thus for the next tuple of S, we may need to re-read the already read blocks of R.

A simple improvement of the algorithm may use Page Oriented equality check:

Let M denote the number of blocks our memory can contain. We need 1 buffer block for R, 1 buffer block for the output and we can use the remaining M-2 blocks for loading S blocks.

Thus the number of I/O operations is: B(R) * (B(S) / (M - 2)) + B(S)

Hope it helps.

answered Feb 4, 2018 by oz (12,430 points)
edited Feb 4, 2018 by oz
Sorry oz, that doesn't answer my question -
Still, when using page oriented equality check, Assuming values are the same in this column for the entire relations (same value for both relations), you'll end up doing B(R) * B(S) IOs.
And in class we said it can take only B(R) + B(S) IOs.
So this suggestion (if I got it correctly) is not enough
Let M denote the number of blocks our memory can contain. In the method I've mentioned we need 1 buffer block for R, 1 buffer block for the output and we can use the remaining M-2 blocks for loading S blocks.

Thus the number of I/O operations is: B(R) * (B(S) / (M - 2)) + B(S)

It is an improvement, yet not the optimal solution as you commented here. I didn't know there is a way to completely overcome this edge case; let's wait for another answer :)
Good question! I think in that case (when the number of tuples having the same value in the join column exceeds the memory size)  you should use a different join algorithm which is less sensitive to such skewness, e.g. nested loop.
but It won't be B(R) + B(S) even with nested loop, won't it?
Sadly, no.............
...