Feature Addition: Updating `scipy.io.loadmat` to extract data from MATLAB objects/classes

Hi All!
Recently I was working with some .mat files in Python, using scipy.io.loadmat to load the variables from the files. However, scipy currently cannot extract the contents of classes/objects defined in MATLAB. These may be either user-defined classes or built-in datatypes such as string and datetime, which are implemented as class objects.

Background

Currently, the loadmat method is able to extract some header information from objects and wrap it in a MatlabOpaque object. This data is placed in the returned dictionary under the key 'None', because loadmat does not extract the variable header for this particular datatype. Since every object ends up under the same key, you can only obtain the MatlabOpaque data for the last object in a MAT file. Even then, the MatlabOpaque data contains only headers, and none of the properties of the class itself.

All object data in MAT files are contained in a separate portion called the subsystem data. This part of the data is structured like a mini-MAT file, but without a variable name. Currently, loadmat parses this binary data as uint8 integers, and saves it to the key __function_workspace__.
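
To make the current behaviour concrete, here is a small sketch that summarises the object-related entries in the dictionary loadmat returns. The helper name summarize_loadmat_result is my own and not part of scipy; it only relies on the two keys described above ('None' and __function_workspace__).

```python
def summarize_loadmat_result(mat):
    """Summarise the object-related entries of a dict returned by loadmat.

    Hypothetical helper, not a scipy function.
    """
    summary = {
        # Ordinary variables: everything except the dunder keys and 'None'.
        "variables": sorted(
            k for k in mat if not k.startswith("__") and k != "None"
        ),
        # True if an opaque object placeholder was stored under 'None'.
        "has_opaque_placeholder": "None" in mat,
        "subsystem_nbytes": 0,
    }
    workspace = mat.get("__function_workspace__")
    if workspace is not None:
        # loadmat exposes the subsystem as a uint8 array (use nbytes);
        # fall back to len() for plain bytes.
        nbytes = getattr(workspace, "nbytes", None)
        summary["subsystem_nbytes"] = len(workspace) if nbytes is None else nbytes
    return summary
```

With scipy installed, this would be called as summarize_loadmat_result(scipy.io.loadmat("file.mat")), where file.mat is a placeholder for a file that contains MATLAB objects.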

Proposed Feature Addition to Loadmat

I believe loadmat's functionality can be easily extended to parse and extract object data from MAT files as well. I have spent some time reverse engineering how MAT files store object contents, with help from mahalex’s guide on GitHub, which gives a detailed breakdown of how MAT files are structured for objects.

I am now quite familiar with the structure, though I’m still working on decoding certain elements of it, as the format is not publicly documented.

Implementation

Having gone through the existing code in scipy.io.matlab, I think this functionality can be added on top without changing any of the existing codebase. As loadmat already returns the subsystem data under __function_workspace__, only two additions are required:

  1. Decode the MatlabOpaque object and extract the object metadata from it. This contains information like object_id, class_id, array_name, n_dims and dims. This information can then be saved to the header in VarHeader5.
  2. Parse the subsystem data stored in __function_workspace__ as another MAT-file. The subsystem data contains information about every object in the MAT-file and its properties. Each unique object and class in a MAT-file is assigned a unique ID, and the subsystem data essentially contains instructions on how to link each object's properties to its ID.
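
As a rough illustration of what step 2 starts from: every data element in a version 5 MAT-file begins with a tag giving its data type and byte count, and the same would apply to the elements inside the subsystem data. The sketch below reads one such tag from raw bytes. The function name read_tag, the default little-endian byte order, and the starting offset are my assumptions; this is not scipy's internal reader, and where the first element actually begins inside the subsystem stream still has to be taken from the format notes.

```python
import struct

# Data type codes from the published MAT-file version 5 format.
MI_UINT8 = 2
MI_INT32 = 5
MI_MATRIX = 14
MI_COMPRESSED = 15


def read_tag(buf, offset=0, byte_order="<"):
    """Return (data_type, n_bytes, data_offset) for the element at offset.

    Handles both the normal 8-byte tag and the 4-byte "small data
    element" tag, where the byte count sits in the upper 16 bits.
    """
    (first,) = struct.unpack_from(byte_order + "I", buf, offset)
    small_n_bytes = first >> 16
    if small_n_bytes:
        # Small element: type in the low 16 bits, data follows immediately.
        return first & 0xFFFF, small_n_bytes, offset + 4
    # Normal element: 32-bit type followed by 32-bit byte count.
    data_type, n_bytes = struct.unpack_from(byte_order + "II", buf, offset)
    return data_type, n_bytes, offset + 8
```

Walking the subsystem would then mean calling read_tag repeatedly, skipping n_bytes (plus padding to an 8-byte boundary) between elements.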

Once this is done, every object extracted from the subsystem data can be assigned to the variable extracted in Step 1. A detailed algorithmic plan can be drawn up if required.
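
The final linking step could look roughly like the sketch below. Everything here is hypothetical: OpaqueRef, link_objects and the two lookup tables are names I made up to show the ID-based join, and the real subsystem layout is more involved than a pair of dictionaries.

```python
from dataclasses import dataclass


@dataclass
class OpaqueRef:
    """Metadata decoded from one MatlabOpaque entry in step 1 (hypothetical)."""
    array_name: str
    class_id: int
    object_id: int
    dims: tuple


def link_objects(refs, class_names, object_props):
    """Attach class names and properties from step 2 to the step 1 variables.

    class_names maps class_id -> class name; object_props maps
    object_id -> dict of property values. Both are assumed to have been
    built while parsing the subsystem data.
    """
    linked = {}
    for ref in refs:
        linked[ref.array_name] = {
            "class": class_names[ref.class_id],
            "dims": ref.dims,
            # Objects with no stored properties get an empty dict.
            "properties": object_props.get(ref.object_id, {}),
        }
    return linked


# Made-up example data: one variable "obj" of a user-defined class.
refs = [OpaqueRef(array_name="obj", class_id=1, object_id=7, dims=(1, 1))]
linked = link_objects(refs, {1: "MyClass"}, {7: {"value": 42}})
```

Keying the result by array_name is what would let several objects coexist in the returned dictionary instead of all landing under 'None'.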

Conclusion

Adding this feature will improve scipy.io.loadmat by allowing it to extract class objects from MAT files, making it more versatile for users working with MATLAB data in Python.

I would be happy to contribute this feature myself. I am confident in my understanding of the MAT file structure and I’m also sufficiently familiar with the scipy.io.matlab codebase. However, I’m not too familiar with code optimization techniques, particularly in Cython, which most of the codebase is written in.

Any guidance or suggestions from the community would be really appreciated! Looking forward to hearing your thoughts!