Unlike me, my wife uses computers for serious calculations to answer important questions. One of these problems involves taking the cross-correlations of the continuous seismic data of hundreds of measuring stations, a computation that can take days or even weeks. Interestingly, it turns out that a large part of this time is spent loading data from disk into memory. To better understand the problem, let us take a high-level look at the calculation of cross-correlations in Pythonesque pseudo-code:
# given is some list of stations, say a list of names of the data files
stations = [..]

# correlate, one by one, each unique pair of stations
for i in range(len(stations)):
    for j in range(i + 1):
        # the following correlates both the pair (i,j) and (j,i)
        computation(stations[i], stations[j])
The pseudo-code traverses all possible combinations of two stations and computes the so-called cross-correlation. Each station’s signal consists of a large data file that is stored on disk and must be loaded into memory to compute the correlation. Loading data from disk is an expensive operation, because even modern disks are slow compared to the processor and memory. Luckily the system caches data, i.e., it reuses data that is still stored in memory instead of reloading it. But memory capacity is limited and, on a shared system, varying. When the memory is full, newly loaded data replaces the least recently used data.
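This least-recently-used (LRU) behavior is easy to model. Below is a minimal sketch of such a cache that counts disk loads; the class and its names (LRUCache, access) are my own, chosen for illustration, and are not taken from the Elm code.

from collections import OrderedDict

class LRUCache:
    """A minimal model of a cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # keys ordered from least to most recently used
        self.misses = 0

    def access(self, key):
        """Touch key; returns True if this access caused a disk load (a miss)."""
        if key in self.entries:
            self.entries.move_to_end(key)      # hit: mark as most recently used
            return False
        self.misses += 1                       # miss: the data must be loaded from disk
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
        self.entries[key] = True
        return True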
Thus the order in which we access the stations’ data – the so-called schedule – determines the number of disk accesses. The question is: which schedule minimizes the number of disk reads and maximizes the use of the cache? Moreover, what can we do without knowing the size of the memory?
As I found out two years ago, the Sierpinski fractal curve provides a good, and probably asymptotically optimal, schedule. For my and (hopefully) your amusement, I’ve now rewritten some of the experiments in the programming language Elm to present the elegant Sierpinski curve and its discretized brother, the H-index.
Computing the Cost of a Schedule
A schedule can be visualized as a path through a square matrix where each cell corresponds to a combination of two stations. Such a combination is called a job. Only the lower (or upper) triangle of the matrix needs to be visited, as we consider (3,4) a duplicate of (4,3). The pseudo-code above defines the following schedule, which we call the scanline schedule.
The displayed schedule thus starts correlating station 1 with itself, then correlates 1 with 2, 2 with 2, etc. The cost of this schedule – in terms of the number of disk accesses or cache misses – can be computed by simulating the execution: we keep track of the cache’s contents and count the number of times we load a station’s data that is not yet in memory. Running such a simulation with memory capacity for five stations gives the following cumulative cost for the successive jobs.
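In code, such a simulation could look roughly as follows, building on the LRUCache sketch above; scanline_schedule and schedule_cost are hypothetical names of my own, not the functions used in the Elm experiments.

def scanline_schedule(n):
    """All jobs (i, j) with j <= i, row by row, as in the pseudo-code above."""
    return [(i, j) for i in range(n) for j in range(i + 1)]

def schedule_cost(schedule, capacity):
    """Total number of disk loads when executing the given schedule with an
    LRU cache that holds the data of at most `capacity` stations."""
    cache = LRUCache(capacity)   # the LRU model sketched above
    for i, j in schedule:
        cache.access(i)
        cache.access(j)
    return cache.misses

# e.g. 16 stations with room for the data of 5 stations in memory
print(schedule_cost(scanline_schedule(16), 5))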
The cumulative cost shows that the scanline schedule initially (up to job 15) performs quite well, but once the horizontal stretches of the schedule are longer than 5, each new job results in a cache miss. In fact, one can do quite a bit better by making shorter stretches.
Give it a try in the scheduler below! Click on the circles in the matrix to schedule a job and on + and – in the upper right corner to increase or decrease memory capacity. The goal is to minimize total cost of the schedule.
To help you make a path, unscheduled elements in the path have been marked and colored with their cost.
The Sierpinski Curve or H-Index
It turns out to be quite challenging to find the schedule with the fewest disk accesses. Some experimenting also shows that the performance of a schedule changes quite a bit for different memory sizes. That complicates our task, especially considering that memory capacity may change during execution, for example when the correlations are done on a supercomputer that is shared with other scientists running similar experiments.
However, one quickly notices that it is generally better not to make large jumps, that is, to make a schedule where the most recent jobs are close together in the matrix. This observation leads to a divide-and-conquer strategy, where we subdivide the triangle of jobs into smaller triangles and schedule the jobs in the first triangle before those in the second. Applying this pattern recursively, we get the following sequence.
In addition to the subdivision of triangles, a line is drawn that visits each of the triangles exactly once. The line is the (lower triangle of the) Sierpinski curve, a fractal curve that, when the subdivision is repeated infinitely, visits each point in the triangle without ever crossing itself.
Now, consider the triangle of jobs to be scheduled and subdivide it recursively into smaller triangles until all triangles consist of exactly one job. The Sierpinski curve through the triangles now describes a schedule!
Well, almost: the jobs cannot be subdivided indefinitely into ever smaller triangles, because individual jobs cannot be split. Therefore, we use the discrete incarnation of the Sierpinski curve that is called the H-index. In this context “index” is another word for schedule, and the H stands for the shape the pattern takes: there’s an H on its side in the lower right corner. The H-index defines the following schedule:
The strength of the H-index is its very good locality: any two jobs that are scheduled close together in time are also close together in the matrix of jobs. Locality is important because jobs that are close in the matrix involve the same or recently used stations, whose data is more likely to still be in the cache.
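To make the divide-and-conquer idea concrete, here is a simplified sketch of such a recursive ordering in the same Pythonesque style as before. It is not the exact H-index construction from the paper (which carefully orients the sub-curves so that consecutive pieces join up), but it illustrates the recursion: a triangle splits into two half-size triangles and a square, and the square recurses over its four quadrants. The function names are mine.

def triangle_schedule(row, col, size):
    """Jobs (i, j) with row <= i < row + size and col <= j <= col + (i - row),
    i.e. the lower-triangular block anchored at (row, col); size is assumed
    to be a power of two."""
    if size == 1:
        return [(row, col)]
    half = size // 2
    return (triangle_schedule(row, col, half)                   # small triangle at the top
            + square_schedule(row + half, col, half)            # square below it
            + triangle_schedule(row + half, col + half, half))  # triangle to its right

def square_schedule(row, col, size):
    """All jobs of a size x size block, ordered by recursing over its quadrants
    in a U-shape so that consecutive quadrants stay adjacent."""
    if size == 1:
        return [(row, col)]
    half = size // 2
    return (square_schedule(row, col, half)                     # top-left
            + square_schedule(row, col + half, half)            # top-right
            + square_schedule(row + half, col + half, half)     # bottom-right
            + square_schedule(row + half, col, half))           # bottom-left

# the first few jobs of the schedule for 8 stations
print(triangle_schedule(0, 0, 8)[:6])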
How much better does the H-index do? The following graph plots the cost, in terms of cache misses, of each schedule for a case with 64 stations and therefore 2080 jobs, for increasing memory capacity.
The range is limited from 2 to 64: one needs to be able to store the data of at least two stations, and if memory has capacity for all 64 stations every schedule has cost 64, because each station’s data then has to be loaded exactly once. The cost of the scanline schedule is dominated by a linear relation, also seen in the last part (after job 15) of the previous plot. One may also note that the H-index performs best when the memory capacity is a power of two; note in particular the low points at 4, 8 and 16. In between those points the H-index becomes slightly irregular, but the plotted curve generally approaches the minimum number of loads quickly.
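For the curious, a comparison along these lines could be produced with a sweep like the following, reusing the schedule_cost, scanline_schedule and triangle_schedule sketches from above (and therefore inheriting all of their simplifying assumptions).

n = 64
scanline = scanline_schedule(n)        # row-by-row order, as in the pseudo-code
h_like = triangle_schedule(0, 0, n)    # recursive, H-index-like order
for capacity in range(2, n + 1):
    print(capacity, schedule_cost(scanline, capacity), schedule_cost(h_like, capacity))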
Conclusion
Hopefully the final plot convinces you that the Sierpinski curve, or more accurately the H-index, performs much better than the scanline schedule. However, I have not proven that the H-index is the optimal schedule. In fact, one can easily show that for certain memory sizes, such as in the manual scheduling exercise, it is possible to do (slightly) better than the H-index. Rolf Niedermeier, Klaus Reinhardt and Peter Sanders, who invented the H-index, have given strong evidence that it is asymptotically optimal.
Does this mean we should remove all matrix traversals based on nested for loops – the scanline schedule – and replace them with an H-index-based traversal? Probably not. Much work has gone into compilers to optimize such loops and fine-tune their caching behavior. The experiment presented here only considered a very special case and makes many assumptions about the behavior of the cache that will not hold for many systems.
For example, CPU caches typically use different replacement strategies; they do not simply replace the least recently used element when they are full. Yet the Sierpinski curve and H-index have found their way into many computing-intensive applications, from image processing to databases, and may also benefit you.
This post is too short to introduce a practical incarnation of the curve. An attentive reader may have noticed that I’ve only used examples where the number of stations is a power of two. This is no coincidence: the H-index is really only defined for those sizes. One can, however, adapt the index to other sizes in such a way that most of the schedule still follows the H-index. Practical applications and their empirical evaluation are left to future work.
Post Scriptum
As with my previous project in Elm, I had fun writing this. Working with such a new language is an interesting experience: on one hand you are exploring new ways to do things, but on the other hand you have to reinvent the wheel all the time. It certainly makes you appreciate the time and effort spent by others on libraries in more mature languages. For example, I spent an inordinate amount of time on the plots, at times resulting in atrocious code.
You are free to look at/steal the code but you’re going against doctor’s orders.