Standard way of sharing architectures, e.g., to PyPI.org

There are a huge number of new statistical, machine-learning and artificial intelligence solutions being released every month.

Most are open-source and written in a popular Python framework like TensorFlow, JAX, or PyTorch.

In order to ‘guarantee’ you are using the best solution [for given metric(s)] for your dataset, some way of automatically adding these new statistical, machine-learning, and artificial-intelligence solutions to your automated pipeline needs to be created.

(additionally: useful for testing your new optimiser, loss function, &etc. across a zoo of datasets)

Ditto for transfer learning models. A related problem is automatically putting ensemble networks together. Something like:

import keras

import some_broke_arch  # pip install some_broke_arch
import other_neat_arch  # pip install other_neat_arch
import horrible_v_arch  # built in to keras

# each package exposes its piece behind the same standard interface
model   = some_broke_arch.get_arch(   **standard_arch_params  )
metrics = other_neat_arch.get_metrics(**standard_metric_params)
loss    = horrible_v_arch.get_loss(   **standard_loss_params  )

model.compile(loss=loss, optimizer=keras.optimizers.RMSprop(), metrics=metrics)
model.summary()  # summary() prints directly; no need to wrap it in print()
# &etc.

In summary, I am petitioning for standard ways of:

  1. exposing algorithms for consumption;
  2. combining algorithms;
  3. comparing algorithms.

To that end, I would recommend encouraging the PyPI folk to add a few new classifiers, and that a bunch of us trawl through GitHub every month sending PRs to repositories associated with academic papers, wiring them up with CI/CD so that they become installable with pip install and searchable by classifier on PyPI.

Related:

  • my open-source multi-ML meta-framework;
    • uses the built-in ast and inspect modules to traverse the module, class, and function hierarchy of 10 popular open-source ML/AI frameworks (see the sketch after this list);
    • will enable experimentation with the entire ‘search-space’ of all these ML frameworks (every transfer learning model, optimiser, loss function, &etc.);
    • with a standard way of sharing architectures, it will be able to expand the ‘search-space’ with community-contributed solutions.
  • this issue from Jul 20, 2019: Standard way of sharing architectures? · Issue #30896 · tensorflow/tensorflow · GitHub
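
To make that traversal concrete, here is a much-simplified sketch of the inspect half (the real meta-framework also uses ast; keras.optimizers below is just an example module):

import inspect

import keras


def walk_module(module):
    """Yield (qualified_name, kind) for public classes and functions in `module`."""
    for name, obj in inspect.getmembers(module):
        if name.startswith("_"):
            continue
        if inspect.isclass(obj):
            yield "{}.{}".format(module.__name__, name), "class"
        elif inspect.isfunction(obj):
            yield "{}.{}".format(module.__name__, name), "function"


for qualname, kind in walk_module(keras.optimizers):
    print(kind, qualname)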

From a more general point of view it reminds me of:

https://colinraffel.com/blog/a-call-to-build-models-like-we-build-open-source-software.html


/cc @Jason @8bitmp3 @thea @yarri-oss Who could be interested in this thread?

For tf-hub-tagged posts, maybe @lgusm has thoughts.


cc @mihaimaruseac

Thank you for the tag but I might be missing something. I don’t know how the security policy applies to the thread. Can you please expand?

Coming back to the thread, @SamuelMarks, can we use TensorFlow’s DOI, which is now attached to each release (and differs per release)? Perhaps other ML frameworks could do similar DOI tagging of their releases. Or do we need additional metadata not captured by the DOI reference?

@Bhack Interesting find; I emailed the A/Professor, and my last reply was:

I think it is possible to do what you propose, with a pip install-style approach, whether or not it is released on PyPI.

The idea is to think of the ML, AI, and data-science tasks not as their own pipeline but as just part of the regular CI/CD pipeline (commonly with additional constraints, such as deploying to better hardware).

A huge problem in industry and academia is the sheer quantity of new research coming out. This problem is exacerbated by the lack of standardisation in how these new transfer learning models, optimisers, loss functions, &etc. are exposed for others to consume. Usually it’s Jupyter Notebooks or, if you’re lucky, maybe some .py files with absolute paths.

Almost never does one find a random research paper with an actually decent package, let alone documentation and tests.

By standardising how each new model/optimiser/&etc. is exposed for consumption (e.g., packaged with setuptools and published to PyPI, implementing an abstract class that is ML-framework independent), one would solve two entire classes of problems, enabling the ML frameworks and new algorithms to be used as a database from which to experiment with new datasets, models, optimisers, &etc.

That, IMHO, would usher in a new era of data science / machine learning / artificial intelligence for both research and industry.
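
To make the ‘ML-framework-independent abstract class’ idea concrete, here is a minimal sketch (the class and method names are illustrative only, not a settled specification):

from abc import ABC, abstractmethod
from typing import Any, Sequence


class ArchitectureProvider(ABC):
    """Interface each pip-installable research package would implement."""

    @abstractmethod
    def get_arch(self, **standard_arch_params: Any) -> Any:
        """Return an uncompiled model for the caller's framework of choice."""

    @abstractmethod
    def get_loss(self, **standard_loss_params: Any) -> Any:
        """Return a loss function/object."""

    @abstractmethod
    def get_metrics(self, **standard_metric_params: Any) -> Sequence[Any]:
        """Return metric callables/objects."""

A package implementing such an interface could then be discovered, combined, and compared mechanically.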

@mihaimaruseac Not sure how the DOI would apply. If you’re thinking about the PyPI classifiers… I envision simple version-number classifiers à la Django, Python, &etc. (Classifiers · PyPI), and additionally something more specific, like Machine Learning :: Optimizers.
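
For illustration, a hypothetical setup.py excerpt with the kind of classifiers I have in mind; the ML-specific entries are proposed only and do not exist on PyPI today:

from setuptools import setup

setup(
    name="some_broke_arch",
    version="0.1.0",
    packages=["some_broke_arch"],
    classifiers=[
        "Programming Language :: Python :: 3",    # existing classifier
        "Framework :: TensorFlow :: 2",           # proposed
        "Machine Learning :: Optimizers",         # proposed
        "Machine Learning :: Transfer Learning",  # proposed
    ],
)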

I don’t understand the security pointer. I don’t know if it means that it is too hard to collectively identify security threats in tensor programs the way we normally do for other open-source software/packages.
If that is the case, I suppose it is more a question of having a sandboxed runtime; e.g., TF.js in the browser is probably more sandboxed than regular TensorFlow.

Also, security can be considered later. I’ve got a bunch of ideas to do with parsing the codebase and removing anything that executes arbitrary code, downloads to places outside of its own directory, & related.
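
A rough sketch of what that parsing pass could look like (the suspect-call lists are just a starting point, not a complete policy):

import ast

SUSPECT_NAMES = {"exec", "eval", "compile", "__import__"}
SUSPECT_MODULES = {"os", "subprocess", "urllib", "requests"}


def flag_suspect_calls(source):
    """Yield (line_number, description) for calls to the suspect names/modules above."""
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name) and func.id in SUSPECT_NAMES:
            yield node.lineno, "call to {}".format(func.id)
        elif (isinstance(func, ast.Attribute)
              and isinstance(func.value, ast.Name)
              and func.value.id in SUSPECT_MODULES):
            yield node.lineno, "call to {}.{}".format(func.value.id, func.attr)


with open("candidate_module.py") as fh:  # hypothetical file from a repo being vetted
    for lineno, what in flag_suspect_calls(fh.read()):
        print(lineno, what)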

Happy for the first pass to be Google repos exclusively (orgs: keras-team, tensorflow, google-research), just to prove the concept and so you can ignore internal security concerns.

This still doesn’t make sense to me. For citations, academics use DOIs. Now code can also get a DOI for each release, so each algorithm would be tagged automatically.

Adding trove classifiers means more work. You need to get them added to PyPI’s classifier list, wait for them to be deployed, and only then can you update your own repository. This is slow and, given that it requires a lot of manual work, will be ignored or stall.

@mihaimaruseac I shared the doc with everyone to raise awareness, just in case 🙂

RE: DOI - Yes, I am (ahem, to some degree) academically inclined also.

One DOI wouldn’t be enough; oftentimes algorithms are made up of a multitude of them. Versioning would also still be needed, both for support of new dependent frameworks (e.g., a TensorFlow 2 port of a previously TF 1 implementation of the same DOI) and for bug fixes (sometimes even using numeric types with a different number of bits).

As for PEP 301, I don’t see why new classifiers couldn’t be added for machine learning, artificial intelligence, specific machine-learning frameworks (& versions), and where in the hierarchy the given pip-installable package sits. It doesn’t hurt that Google is funding some PyPI work now!

Again, this is just one proposal for how to handle the problem of the high velocity of new research and its low consumption by industry and academia.

A related solution is to release packages to a packages [dot] tensorflow [dot] org [or similar] domain (which pip install supports via its --index-url / --extra-index-url options). A short-term, development-mode solution is to release to the packages section of your source repository (e.g., GitHub Releases).

I am open to unrelated solutions too, and would welcome ideas.

@SamuelMarks thanks again for joining the SIG Addons meeting to discuss this. Please do keep working on this idea, and while I agree with Sean that SIG Addons is not the right venue for this topic, an RFC might be, or you could just develop a POC and share it with this forum. I agree with Sean, who tried to encourage you: if you can show it works, I think you’ll find a lot of support in this forum and in the ML community in general!

Summarised it [a different way] on a SIG-Addons call just now:

I want to build a package manager for machine learning models, with support for building ensembles from [fairly] arbitrary models and for mixing-and-matching components. When a new research article comes out, I will send a PR to its repository to make it conform to this package-manager-compliant specification.

Over time this will expose an ecosystem of models and approaches, useful, e.g., for double-checking that your approach is [still] best-of-breed.
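
As a hypothetical sketch of the mixing-and-matching, reusing the placeholder package names from earlier in the thread (and assuming get_arch returns a callable Keras model):

import keras

import some_broke_arch  # placeholder, as earlier in the thread
import other_neat_arch  # placeholder, as earlier in the thread

# assumed parameter schema, purely for illustration
standard_arch_params = {"input_shape": (224, 224, 3), "num_classes": 10}

inputs = keras.Input(shape=standard_arch_params["input_shape"])
out_a = some_broke_arch.get_arch(**standard_arch_params)(inputs)
out_b = other_neat_arch.get_arch(**standard_arch_params)(inputs)
ensemble = keras.Model(inputs, keras.layers.Average()([out_a, out_b]))
ensemble.summary()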

I know that it is a different approach, but there are some Google researchers working on network-component reusability from a different point of view:

https://arxiv.org/abs/2004.03898