# Splitting a large matrix into smaller files

Recently, I came across a problem to operate on a large matrix stored in the form of csv. I needed to split this large csv file into smaller files so that each of those smaller files can fit into main memory for any kind of processing. In this article, I am going to walk through the process that I followed to achieve the same.

Let’ s start with the considerations:

- The matrix must be split in such a way that the recreation of the original matrix is possible i.e. it must be a non destructive split.
- In the process of splitting/merging, at no point can we bring the entire original matrix into the main memory.

With the above 2 points in mind, there are two possible ways that we can work this out.

- Splitting by Row/Column: Here, we extract a certain number of rows/columns into a separate file and store the individual files in the correct order of the row/column number that they represent. Realistically, reading by rows from a file is more convenient as it is a O(n) operation w.r.t time complexity, whereas reading by column would be a O(n²) operation, since we can only read files by lines.
- Splitting into smaller matrices: Here, we split a
*N x N*smaller*k²**M x M*. These smaller*N = kM**M x M*

The problem with the first approach is that a single row/ column of the original matrix might itself be too large to fit into a computer’s main memory. Hence, I chose the second approach.

Now, let’s get into the details of the second approach. First of all we need to decide reasonable values for *M* and *k* such that *M²* elements can fit into main memory at a time.

Let’s assume the number of bytes in the main memory available for us to use is *x *and the original matrix contains only integers of size *4 bytes. *Hence, to fit *M²* elements in *x* bytes of memory with each element being of size *4* bytes, we land up with the following calculated value of *M*.

M² = (x/4)

Correspondingly, k can be calculated as:

k = N/M

Once, we have the values for *M* and *k*, we are good to start with the actual implementation of the algorithm.

Let’s take the following matrix as example:

Assuming I can have only *9* elements in my main memory at a time, we break it into smaller *3 x 3* matrices as follows:

Here, each colored section will be stored in a separate file. Now, the question is that how do we number these sections or blocks.

Let’s overlay the above matrix with a *3 x 3* matrix like below:

Now, we just follow the above numbering i.e. the first *3 x 3* matrix would be numbered* (0,0)*, the second would be numbered *(0,1)* and so on.

That’s about it for the storing the split sections of the original matrix. The following snippet can be used to implement it:

Now to recreate the original matrix, all that we really need to do is arrange the smaller 3 x 3 matrices according to the order decided by their numbering and print those into a new file, which would be identical to the original matrix. The following snippet may be used to achieve this:

The thing to note in the snippet above is that a row of the original matrix can be created by just merging rows across horizontal blocks, which is where the inner loop must start from

floor(row/BLOCK_ROW_COUNT)

For example, row 5 of the matrix given above would be present in blocks (1,0), (1,1) and (1,2), and this row number can be determined by the above function i.e.

floor(5/3) = 1

As a part of the same problem, I needed to handle efficient storage of sparse matrices, which has to be a separate topic altogether.

Thanks for the read and please feel free to drop in any feedback that you might have.