Recently, I came across a problem to operate on a large matrix stored in the form of csv. I needed to split this large csv file into smaller files so that each of those smaller files can fit into main memory for any kind of processing. In this article, I am going to walk through the process that I followed to achieve the same.
Let’ s start with the considerations:
- The matrix must be split in such a way that the recreation of the original matrix is possible i.e. it must be a non destructive split.
- In the process of splitting/merging, at no point can we bring the entire original matrix into the main memory.
With the above 2 points in mind, there are two possible ways that we can work this out.
- Splitting by Row/Column: Here, we extract a certain number of rows/columns into a separate file and store the individual files in the correct order of the row/column number that they represent. Realistically, reading by rows from a file is more convenient as it is a O(n) operation w.r.t time complexity, whereas reading by column would be a O(n²) operation, since we can only read files by lines.
- Splitting into smaller matrices: Here, we split a N x N matrix into k² smaller M x M matrices where N = kM. These smaller M x M matrices are then stores as individual files numbered by the position of the smaller matrix in the original larger matrix (Example to follow).
The problem with the first approach is that a single row/ column of the original matrix might itself be too large to fit into a computer’s main memory. Hence, I chose the second approach.
Now, let’s get into the details of the second approach. First of all we need to decide reasonable values for M and k such that M² elements can fit into main memory at a time.
Let’s assume the number of bytes in the main memory available for us to use is x and the original matrix contains only integers of size 4 bytes. Hence, to fit M² elements in x bytes of memory with each element being of size 4 bytes, we land up with the following calculated value of M.
M² = (x/4)
Correspondingly, k can be calculated as:
k = N/M
Once, we have the values for M and k, we are good to start with the actual implementation of the algorithm.
Let’s take the following matrix as example:
Assuming I can have only 9 elements in my main memory at a time, we break it into smaller 3 x 3 matrices as follows:
Here, each colored section will be stored in a separate file. Now, the question is that how do we number these sections or blocks.
Let’s overlay the above matrix with a 3 x 3 matrix like below:
Now, we just follow the above numbering i.e. the first 3 x 3 matrix would be numbered (0,0), the second would be numbered (0,1) and so on.
That’s about it for the storing the split sections of the original matrix. The following snippet can be used to implement it:
Now to recreate the original matrix, all that we really need to do is arrange the smaller 3 x 3 matrices according to the order decided by their numbering and print those into a new file, which would be identical to the original matrix. The following snippet may be used to achieve this:
The thing to note in the snippet above is that a row of the original matrix can be created by just merging rows across horizontal blocks, which is where the inner loop must start from
For example, row 5 of the matrix given above would be present in blocks (1,0), (1,1) and (1,2), and this row number can be determined by the above function i.e.
floor(5/3) = 1
As a part of the same problem, I needed to handle efficient storage of sparse matrices, which has to be a separate topic altogether.
Thanks for the read and please feel free to drop in any feedback that you might have.