Scientific Utilities

Erotemic · June 17, 2023, 2:50am

Hi all,

I’m the developer of a package called ubelt, which is a curated utility library I’ve been writing / maintaining for over 6 years now, and I’ve been thinking about its future and its relationship to the Python ecosystem. At some point I had aspirations of merging pieces of it into the standard library (and in a few cases I still do), but I’ve come to the conclusion that most of it should live as a small curated package. I’ve also been thinking that I’d like to cut out some of the dead weight to focus its scope, but I have to be careful not to break my users, so I was thinking that its next iteration would live under a new name.

My thought was that ubelt might be a good candidate to transform into a scikit, specifically scikit-utils or skutil. Utility libraries are difficult to scope, and there are a lot of utility libraries out there, but mine was built with a bias towards scientific applications (as evidenced by: argmax, argunique, argsort, SetDict). Things like the download and hash_data functions are things other scientific users of Python might benefit from having faster 1-function interfaces to.

I’ve written things like Cacher and CacheStamp to have the most concise usage patterns possible. A concurrent.futures-based JobPool lets you use 1 variable to toggle between thread, process, and no-parallelism. I’ve extended pathlib with a Path that has copy, move, delete, and ensuredir methods. The IndexableWalker (perhaps better called Indexable) provides efficient iteration through nested List/Dict. The cmd function expresses subprocess.run / subprocess.call invocations with a tee option that subprocess doesn’t support. There are on the order of 100 functions / classes in the module, and many are useful (see the ubelt README for a quantitative measure of this). I think a lot of these would be useful to generic Python users, but maybe especially so for other scientists who use Python.
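To make the "1 variable to toggle" idea concrete, here is a minimal sketch built only on the standard library. It is not ubelt's JobPool implementation; `SerialExecutor`, `make_executor`, and `run_squares` are hypothetical names used for illustration.

```python
from concurrent.futures import (
    Executor, Future, ProcessPoolExecutor, ThreadPoolExecutor)


class SerialExecutor(Executor):
    """Runs each submitted job immediately in the calling thread."""

    def submit(self, fn, /, *args, **kwargs):
        future = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as ex:
            # Store the error so it surfaces at .result(), like a real pool.
            future.set_exception(ex)
        return future


def make_executor(mode="serial", max_workers=None):
    """Hypothetical helper: a single string selects the backend."""
    if mode == "thread":
        return ThreadPoolExecutor(max_workers=max_workers)
    if mode == "process":
        return ProcessPoolExecutor(max_workers=max_workers)
    if mode == "serial":
        return SerialExecutor()
    raise KeyError(mode)


def square(x):
    return x * x


def run_squares(mode):
    # The calling code is identical for every backend.
    with make_executor(mode, max_workers=2) as executor:
        futures = [executor.submit(square, i) for i in range(5)]
        return [f.result() for f in futures]
```

Calling `run_squares("serial")` or `run_squares("thread")` runs the same code path with a different backend. The "process" mode additionally needs picklable functions and, on some platforms, an `if __name__ == "__main__":` guard.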

What I’d like to know is the community’s thoughts on:

creating a scikit-utilities: skutil module in general.
initializing that module as some modified subset of ubelt
tools from other libraries (e.g. boltons) that also might be included
other ideas related to the topic

martinberoiz · June 19, 2023, 4:54pm

Hi Jon,

Have you considered suggesting some additions to the Python forum itself? https://discuss.python.org

Maybe they’d be interested in adding some of the ideas or give you feedback on why not.

stefanv · June 19, 2023, 5:52pm

This is an interesting challenge: how to scope and package utilities. In JS land, you’d simply publish each function as a package, and actually that makes sense from a user perspective. But given that the whole library fits in a few K, I don’t think scope creep is too problematic.

I can already think of some places we use these, like hashing by chunk in skimage when checksumming images.
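As a rough illustration of hashing by chunk, the sketch below uses only hashlib. It is not the code used in skimage or ubelt; `hash_stream`, `hash_file`, and the 64 KiB default chunk size are assumptions made for the example.

```python
import hashlib
import io


def hash_stream(stream, hasher_name="sha256", chunk_size=65536):
    """Hash a binary stream without loading it into memory at once."""
    hasher = hashlib.new(hasher_name)
    # iter(callable, sentinel) keeps reading until read() returns b"".
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()


def hash_file(path, hasher_name="sha256", chunk_size=65536):
    """Convenience wrapper for files on disk."""
    with open(path, "rb") as file:
        return hash_stream(file, hasher_name, chunk_size)


# The chunked digest matches the one-shot digest of the same bytes.
payload = b"pixel data " * 10000
assert hash_stream(io.BytesIO(payload), chunk_size=1024) == \
    hashlib.sha256(payload).hexdigest()
```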

Thinking about the type of utility functions I keep around: a dictionary type that allows chained access, e.g. d['x.y'], takeN from an iterator, groupby… There are numerous others lying around too, but I don’t want to make the scope creep problem worse.
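For concreteness, a minimal sketch of two of these helpers might look like the following. `DottedDict` and `take` are hypothetical names, and the dotted lookup assumes that keys never need a literal "." unless they exist as exact top-level keys.

```python
from itertools import islice


class DottedDict(dict):
    """dict that resolves d['x.y'] as d['x']['y'] when 'x.y' is not a key."""

    def __getitem__(self, key):
        if isinstance(key, str) and "." in key and key not in self:
            node = self
            for part in key.split("."):
                node = node[part]  # raises KeyError if a level is missing
            return node
        return super().__getitem__(key)


def take(n, iterable):
    """Return the first n items of any iterable as a list."""
    return list(islice(iterable, n))


d = DottedDict({"x": {"y": 1}, "a.b": 2})
assert d["x.y"] == 1   # chained access through nested dicts
assert d["a.b"] == 2   # exact keys still win over the dotted path
assert take(3, iter(range(10))) == [0, 1, 2]
```

Splitting the string on every lookup is part of why this is hard to make efficient; caching the split path or accepting a tuple of keys are possible alternatives.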

In terms of what to include, a survey of libraries’ existing usage may be the place to start.

Erotemic · June 24, 2023, 3:30am

I’ve considered releasing each tool as its own package. I’ve done that with some of them (progiter, timerit), but it’s a lot of work to maintain all of those packages. I’ve gotten good at maintaining dozens of packages, but hundreds is another order of magnitude. I’m still searching for a way to pare it down (this is the original motivation for the post), and I think scoping is exactly the question that has to be answered for a “utils” module to not get out of hand.

Having a small KB library was a motivation for ubelt. My first attempt at an “extra-batteries” library - utool - got out of hand. Older versions did have dependencies, but the install time was noticeable and I wanted something that could be quickly installed in places with unreliable internet. Even though it is dependency free, it does have lazy and optional integration with some numpy / pandas data structures - things like ub.urepr and ub.hash_data can handle those cases via a plugin / singledispatch system. It also supports things like xxhash / blake3 for hashes if you have those libraries at runtime.
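The general pattern of optional, lazily registered type support can be sketched with functools.singledispatch. This is an illustration of the idea and not ubelt's actual code: `to_bytes`, `hash_data`, and `_register_numpy_if_imported` are hypothetical, and the sketch assumes numpy support should only activate when the caller has already imported numpy.

```python
import hashlib
import sys
from functools import singledispatch


@singledispatch
def to_bytes(data):
    raise TypeError(f"no hashing rule for {type(data).__name__}")


@to_bytes.register(bytes)
def _(data):
    return b"bytes:" + data


@to_bytes.register(str)
def _(data):
    return b"str:" + data.encode("utf8")


@to_bytes.register(int)
def _(data):
    return b"int:" + str(data).encode()


@to_bytes.register(list)
@to_bytes.register(tuple)
def _(data):
    return b"seq:[" + b",".join(to_bytes(item) for item in data) + b"]"


_numpy_registered = False


def _register_numpy_if_imported():
    """Add an ndarray rule only if numpy is already in sys.modules."""
    global _numpy_registered
    np = sys.modules.get("numpy")
    if _numpy_registered or np is None:
        return

    @to_bytes.register(np.ndarray)
    def _(data):
        header = f"{data.dtype.str}{data.shape}".encode()
        return b"ndarray:" + header + np.ascontiguousarray(data).tobytes()

    _numpy_registered = True


def hash_data(data, hasher_name="sha256"):
    _register_numpy_if_imported()
    return hashlib.new(hasher_name, to_bytes(data)).hexdigest()
```

With this layout the core module never imports numpy itself, so import time and install size stay small, and users without numpy never pay for the integration.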

I think these traits are desirable:

Small size
All dependencies are optional (e.g. you have to install numpy for it to acknowledge it)
Decoupled (modules are largely independent; only simple dependencies, e.g. util_cache depends on util_hash)
Concise API
Shallow limited width tree API
Useful / intuitive / simple
Fast
Pure Python - (maybe optional binary extensions)
100% coverage - (or at least close to it)
good docstrings / doctests

I also agree that a survey would be useful. More can be done on that front. I have a blurb about related work in the README. Currently, ubelt has vendored a few libraries that did what I wanted better than I could (e.g. orderedset, the download function from pytorch).

The dictionary with chained access is tricky to get working efficiently. The closest thing I have is ubelt.IndexableWalker, which lets you pass lists of nested indexes, so you could do something like ub.IndexableWalker(mydict)["my.nested.key".split(".")]. I think we could implement a wrapper for that. This is one of the cases I would like to improve in a hypothetical skutil module: rename IndexableWalker to Indexable, or integrate the nested lookup into something like ubelt.UDict - which is another discussion.
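A stripped-down sketch of the path-list idea, independent of ubelt, could look like this. `walk` and `lookup` are hypothetical stand-ins and only handle dict and list containers.

```python
def walk(data, prefix=()):
    """Yield (path, value) for every entry in nested dicts / lists."""
    if isinstance(data, dict):
        items = data.items()
    elif isinstance(data, list):
        items = enumerate(data)
    else:
        return  # leaf value: nothing to descend into
    for key, value in items:
        path = prefix + (key,)
        yield path, value
        yield from walk(value, path)


def lookup(data, path):
    """Follow a sequence of keys / indexes down into nested containers."""
    node = data
    for key in path:
        node = node[key]
    return node


config = {"my": {"nested": {"key": [10, 20]}}}
assert lookup(config, "my.nested.key".split(".")) == [10, 20]
assert lookup(config, ["my", "nested", "key", 1]) == 20
```

Using a list of keys rather than a dotted string means integer list indexes and keys containing "." need no special handling.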

I like UDict a lot. It’s a simple subclass of dict that adds set methods + convenience methods. (The set methods are actually in their own SetDict class, which UDict inherits from, but I digress). Being able to select subsets of dictionaries with key-based set intersections is so nice and concise. Adding a nested lookup method for it seems reasonable - maybe a special case in __getitem__? Even though UDict is new, the string “ub.udict” has one of the highest frequencies in my other projects. IMO, key-based set methods should be added to the stdlib.
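To show what key-based set operations on a dict can look like, here is a minimal sketch. It is not the SetDict / UDict implementation; `KeySetDict` is a hypothetical class, and it assumes the right-hand operand supports repeated `in` checks (a set, list, or dict rather than a one-shot generator).

```python
class KeySetDict(dict):
    """dict with key-based intersection (&) and difference (-)."""

    def __and__(self, other):
        # Keep only the items whose key also appears in `other`.
        return type(self)((k, v) for k, v in self.items() if k in other)

    def __sub__(self, other):
        # Drop the items whose key appears in `other`.
        return type(self)((k, v) for k, v in self.items() if k not in other)


params = KeySetDict(lr=0.1, epochs=10, seed=0, name="run1")
assert params & {"lr", "epochs"} == {"lr": 0.1, "epochs": 10}
assert params - {"name"} == {"lr": 0.1, "epochs": 10, "seed": 0}
```

Values always come from the left operand and insertion order is preserved, which is one reasonable convention; other choices are possible for union-style operations.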

Erotemic · June 24, 2023, 3:45am

I think that’s a good idea. I made a post: Dictionaries should have key-based set operations - Ideas - Discussions on Python.org