Hi all,
I’m the developer of a package called ubelt, which is a curated utility library I’ve been writing / maintaining for over 6 years now, and I’ve been thinking about its future and its relationship to the Python ecosystem. At some point I had aspirations of merging pieces of it into the standard library (and in a few cases I still do), but I’ve come to the conclusion that most of it should live as a small curated package. I’ve also been thinking that I’d like to cut out some of the dead weight to focus its scope, but I have to be careful not to break my users, so I was thinking that its next iteration would live under a new name.
My thought was that ubelt might be a good candidate to transform into a scikit, specifically scikit-utils or skutil. Utility libraries are difficult to scope, and there are a lot of utility libraries out there, but mine was built with a bias towards scientific applications (as evidenced by: argmax, argunique, argsort, SetDict). Things like the download and hash_data functions are things other scientific users of Python might benefit from having faster 1-function interfaces to.
I’ve written things like Cacher and CacheStamp to have the most concise usage patterns possible. A concurrent.futures-based JobPool lets you use 1 variable to toggle between thread, process, and no-parallelism. I’ve extended pathlib with a Path that has copy, move, delete, and ensuredir methods. The IndexableWalker (perhaps better called Indexable) provides efficient iteration through nested List/Dict. The cmd function expresses subprocess.run / subprocess.call invocations with a tee option that subprocess doesn’t support. There are on the order of 100 functions / classes in the module, and many are useful (see the ubelt README for a quantitative measure of this). I think a lot of these would be useful to generic Python users, but maybe especially so for other scientists who use Python.
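To make the "1 variable to toggle" idea concrete, here is a rough sketch using only the standard library. This is not ubelt's actual JobPool; the names make_executor, SerialExecutor, and square_all are hypothetical and only illustrate the pattern.

```python
from concurrent.futures import (
    Executor,
    Future,
    ProcessPoolExecutor,
    ThreadPoolExecutor,
)


class SerialExecutor(Executor):
    """Hypothetical stand-in: runs each job immediately in the calling thread."""

    def submit(self, fn, /, *args, **kwargs):
        future = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as ex:
            # Store the error so it surfaces on .result(), like a real pool.
            future.set_exception(ex)
        return future


def make_executor(mode="thread", max_workers=None):
    """Return an executor for mode in {'serial', 'thread', 'process'}."""
    if mode == "serial":
        return SerialExecutor()
    if mode == "thread":
        return ThreadPoolExecutor(max_workers=max_workers)
    if mode == "process":
        return ProcessPoolExecutor(max_workers=max_workers)
    raise KeyError(mode)


def square_all(values, mode="thread"):
    """Calling code stays the same no matter which mode is chosen."""
    with make_executor(mode, max_workers=2) as executor:
        futures = [executor.submit(pow, value, 2) for value in values]
        return [future.result() for future in futures]
```

Switching mode from "thread" to "serial" is handy when debugging, because tracebacks then come from the calling thread rather than from a worker.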
What I’d like to know is the community’s thoughts on:
- creating a scikit-utilities: skutil module in general.
- initializing that module as some modified subset of ubelt
- tools from other libraries (e.g. boltons) that also might be included
- other ideas related to the topic
Hi Jon,
Have you considered suggesting some additions on the Python forum itself? https://discuss.python.org
Maybe they’d be interested in adding some of the ideas or give you feedback on why not.
This is an interesting challenge: how to scope and package utilities. In JS land, you’d simply publish each function as a package, and actually that makes sense from a user perspective. But given that the whole library fits in a few K, I don’t think scope creep is too problematic.
I can already think of some places we use these, like hashing by chunk in skimage when checksumming images.
Thinking about the type of utility functions I keep around: a dictionary type that allows chained access (e.g. d['x.y']), takeN from an iterator, groupby… There are numerous others lying around too, but I don’t want to make the scope creep problem worse.
In terms of what to include, a survey of libraries’ existing usage may be the place to start.
I’ve considered releasing each tool as its own package. I’ve done that with some of them (progiter, timerit), but it’s a lot of work to maintain all of those packages. I’ve gotten good at maintaining dozens of packages, but hundreds is another order of magnitude. I’m still searching for a way to pare it down (this is the original motivation for the post), and I think scoping is exactly the question that has to be answered for a “utils” module to not get out of hand.
Having a small KB library was a motivation for ubelt. My first attempt at an “extra-batteries” library - utool - got out of hand. Older versions did have dependencies, but the install time was noticeable and I wanted something that could be quickly installed in places with unreliable internet. Even though it is dependency free, it does have lazy and optional integration with some numpy / pandas data structures - things like ub.urepr and ub.hash_data can handle those cases via a plugin / singledispatch system. It also supports things like xxhash / blake3 for hashes if you have those libraries at runtime.
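For anyone unfamiliar with the pattern, here is a minimal sketch of lazy, optional integration built on functools.singledispatch. It is not how ub.hash_data is actually implemented; to_bytes, _lazy_register_numpy, and this hash_data are hypothetical names, and the byte encoding is just one simple choice.

```python
import hashlib
import sys
from functools import singledispatch


@singledispatch
def to_bytes(data):
    """Convert data to unambiguous bytes; unknown types are an error."""
    raise TypeError(f"no hashing rule registered for {type(data).__name__}")


@to_bytes.register(bytes)
def _(data):
    # Length prefix keeps adjacent items from blurring together.
    return b"BYTES" + str(len(data)).encode() + b":" + data


@to_bytes.register(str)
def _(data):
    return b"STR" + to_bytes(data.encode("utf8"))


@to_bytes.register(int)
def _(data):
    return b"INT" + to_bytes(str(data).encode())


@to_bytes.register(list)
@to_bytes.register(tuple)
def _(data):
    # In this sketch, lists and tuples with equal items hash the same.
    return b"SEQ[" + b"".join(to_bytes(item) for item in data) + b"]"


def _lazy_register_numpy():
    """Register an ndarray rule only if the caller already imported numpy."""
    np = sys.modules.get("numpy")
    if np is None or np.ndarray in to_bytes.registry:
        return

    def ndarray_to_bytes(arr):
        header = f"{arr.dtype}{arr.shape}".encode()
        return b"NDARRAY" + to_bytes(header) + to_bytes(arr.tobytes())

    to_bytes.register(np.ndarray, ndarray_to_bytes)


def hash_data(data, hasher="sha256"):
    """Hash nested builtin data (hypothetical sketch)."""
    _lazy_register_numpy()
    return hashlib.new(hasher, to_bytes(data)).hexdigest()
```

The point of checking sys.modules instead of importing numpy is that the library never pays the import cost, or fails, when numpy is absent.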
I think these traits are desirable:
- Small size
- All dependencies are optional (e.g. you have to install numpy for it to acknowledge it)
- Decoupled (modules are largely independent; only simple dependencies, e.g. util_cache depends on util_hash)
- Concise API
- Shallow limited width tree API
- Useful / intuitive / simple
- Fast
- Pure Python - (maybe optional binary extensions)
- 100% coverage - (or at least close to it)
- good docstrings / doctests
I also agree that a survey would be useful. More can be done on that front. I have a blurb about related work in the README. Currently, ubelt has vendored a few libraries that did what I wanted better than I could (e.g. orderedset, the download function from pytorch).
The dictionary with chained access is tricky to get working efficiently. The closest thing I have is ubelt.IndexableWalker, which lets you pass lists of nested indexes, so you could do something like ub.IndexableWalker(mydict)["my.nested.key".split(".")]. I think we could implement a wrapper for that. This is one of the cases I would like to improve in a hypothetical skutil module: rename IndexableWalker to Indexable, or integrate the nested lookup into something like ubelt.UDict - which is another discussion.
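A wrapper along those lines could be as small as the sketch below. It uses only the standard library and does not touch IndexableWalker; nested_get is a hypothetical name.

```python
from functools import reduce
from operator import getitem


def nested_get(data, path, sep="."):
    """Look up a value in nested dicts / lists.

    ``path`` is either a separator-joined string of dict keys or an
    explicit sequence of keys / indexes (needed for integer list indexes).
    """
    keys = path.split(sep) if isinstance(path, str) else list(path)
    # Apply data[key] once per key, descending one level each time.
    return reduce(getitem, keys, data)


config = {"my": {"nested": {"key": 42}}, "items": [10, 20]}
value = nested_get(config, "my.nested.key")
second_item = nested_get(config, ["items", 1])
```

The string form cannot tell the key "1" from the index 1, which is one reason the explicit list form is worth keeping.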
I like UDict a lot. It’s a simple subclass of dict that adds set methods + convenience methods. (The set methods are actually in their own SetDict class, which UDict inherits from, but I digress.) Being able to select subsets of dictionaries with key-based set intersections is so nice and concise. Adding a nested lookup method for it seems reasonable - maybe a special case in __getitem__? Even though UDict is new, the string “ub.udict” has one of the highest frequencies in my other projects. IMO, key-based set methods should be added to the stdlib.
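To show what I mean by key-based set methods, here is a toy sketch. It is not the UDict / SetDict implementation; KeySetDict is a hypothetical name and only intersection and difference are shown.

```python
class KeySetDict(dict):
    """Toy dict subclass with key-based intersection and difference."""

    def __and__(self, keys):
        # Keep only the items whose key appears in ``keys``.
        wanted = set(keys)
        return type(self)((k, v) for k, v in self.items() if k in wanted)

    def __sub__(self, keys):
        # Drop the items whose key appears in ``keys``.
        unwanted = set(keys)
        return type(self)((k, v) for k, v in self.items() if k not in unwanted)


settings = KeySetDict(lr=0.01, epochs=10, seed=0, name="run1")
train_kwargs = settings & {"lr", "epochs"}
without_seed = settings - {"seed"}
```

Because iterating a dict yields its keys, the right-hand side can also be another dict, e.g. settings & other_dict.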