What are the limits of client-side loading with tfjs?

I have a web app client on a nuxt/vue/vuetify/firestore/gcp stack. We are porting our traditional deep-net NLP predictions from offline Python/TF/Keras to tfjs in the web client.

I know we will retrain the models in tfjs on a back-end node.js server. The question is where to load the models for prediction. I see my options as:

  1. a) serve the models from somewhere so they can be loaded via URL
    b) then use the node.js load call to load one or more of the 25 models
    c) make one-to-many predictions on the client side (see the browser-side sketch after this list)
    Q?) Is this too much load on the client? How much can it handle (in general terms)?

  2. a) build a simple node.js backend
    b) house the models locally in this stack
    c) use a REST API to load the model(s) on the back-end server
    d) make one-to-many predictions (see the Node.js sketch after this list)
    Q?) This seems more feasible, but how do we manage the listener, uptime, etc., and support long-running calls?
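
For option 1, here is a minimal browser-side sketch of what the load-and-predict flow could look like, assuming the Keras models have been converted with tensorflowjs_converter and are hosted somewhere as model.json plus weight shards. The URL and the input shape are placeholders, not a prescribed setup:

```ts
import * as tf from '@tensorflow/tfjs';

// Placeholder URL; point this at wherever you host the converted model.
const MODEL_URL = 'https://storage.googleapis.com/my-bucket/my-model/model.json';

let cached: tf.LayersModel | null = null;

// Load once and keep the model in memory; the weights are fetched over HTTP
// only on the first call.
async function getModel(): Promise<tf.LayersModel> {
  if (!cached) {
    cached = await tf.loadLayersModel(MODEL_URL);
  }
  return cached;
}

// Run a batch of predictions entirely on the client.
async function predict(batch: number[][]): Promise<number[]> {
  const model = await getModel();
  return tf.tidy(() => {
    const input = tf.tensor2d(batch);               // shape [batchSize, featureLen]
    const output = model.predict(input) as tf.Tensor;
    return Array.from(output.dataSync());           // pull the results back into JS
  });
}
```

After the first load you could also cache the model locally with `model.save('indexeddb://my-model')` so repeat visits skip the download, which matters if each of the 25 models is more than a few MB.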
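For option 2, a minimal Node.js REST sketch, assuming @tensorflow/tfjs-node and Express, with the converted models stored on local disk. The directory layout, route, and port below are placeholders:

```ts
import * as tf from '@tensorflow/tfjs-node';
import express from 'express';

// Placeholder layout: one subdirectory per model, e.g. ./models/intent/model.json
const MODEL_DIR = './models';
const models = new Map<string, tf.LayersModel>();

// Lazy-load each model the first time it is requested, then keep it warm in memory.
async function getModel(name: string): Promise<tf.LayersModel> {
  let model = models.get(name);
  if (!model) {
    model = await tf.loadLayersModel(`file://${MODEL_DIR}/${name}/model.json`);
    models.set(name, model);
  }
  return model;
}

const app = express();
app.use(express.json());

// POST /predict/:model with a JSON body like { "inputs": [[...], [...]] }
app.post('/predict/:model', async (req, res) => {
  try {
    const model = await getModel(req.params.model);
    const predictions = tf.tidy(() => {
      const input = tf.tensor2d(req.body.inputs);
      const output = model.predict(input) as tf.Tensor;
      return Array.from(output.dataSync());
    });
    res.json({ predictions });
  } catch (err) {
    res.status(500).json({ error: String(err) });
  }
});

app.listen(3000, () => console.log('tfjs-node inference server listening on :3000'));
```

Keeping each loaded model in memory avoids the per-request load cost; for genuinely long-running predictions, one common pattern is to queue the work and return a job id rather than holding the HTTP connection open, but that is a design choice rather than anything tfjs requires.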

If anyone has some practical experience, suggestions, or benchmarks that will help us find a path, I really appreciate it. :slight_smile:

Sorry for the delayed reply (better late than never); somehow this did not reach my inbox.

  1. You have two options for loading/using a model:
    a) Server side via Node.js (or whatever backend you are using), executing with CUDA acceleration etc., as many folks do.
    b) Client side in the browser on the user’s device, in which case performance will depend on the type of device they have: a mobile phone will likely be slower than a desktop PC with a GPU, etc.

That being said, we get some really fast results on the client side for many things, so it will come down to how efficient your ML model architecture is, which is very specific to whatever you are running. These days we are even seeing folks port LLMs to run in the browser, and they can run pretty fast on, say, a mid-range laptop. So I would encourage you to survey what devices your end users typically use and then test the performance on that sort of hardware to see whether offloading to the client side is feasible (a rough timing sketch is below).
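
If it helps, here is a rough way to measure that on a candidate device. It is just a sketch: the model and sample tensor are assumed to come from your own loading code, and the first prediction is excluded because backends such as WebGL compile shaders on the first call.

```ts
import * as tf from '@tensorflow/tfjs';

// Average prediction latency in milliseconds for a given model and sample input.
async function benchmark(model: tf.LayersModel, sample: tf.Tensor, runs = 20): Promise<number> {
  // Warm-up: the first call is not representative (e.g. WebGL shader compilation).
  (model.predict(sample) as tf.Tensor).dispose();
  await tf.nextFrame();

  const start = performance.now();
  for (let i = 0; i < runs; i++) {
    const output = model.predict(sample) as tf.Tensor;
    await output.data();   // wait for the backend to finish before timing the next run
    output.dispose();
  }
  return (performance.now() - start) / runs;
}
```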

You could also take a hybrid approach whereby, if the user has a modern, fast device, you offload to the client side, and if they are on an old, slow machine, you simply use cloud-based inference via Node or whatever you are using (a routing sketch is below). At least that way you save some costs, and more over time as people get faster machines.
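
A sketch of that hybrid routing, assuming the `predict()` helper from the earlier browser sketch and a hypothetical cloud endpoint; checking which tfjs backend was selected is just one possible heuristic for "fast enough device":

```ts
import * as tf from '@tensorflow/tfjs';

// Hypothetical cloud endpoint backed by something like the Node.js sketch above.
const CLOUD_ENDPOINT = 'https://api.example.com/predict/my-model';

async function predictHybrid(inputs: number[][]): Promise<number[]> {
  await tf.ready();
  const backend = tf.getBackend(); // e.g. 'webgpu', 'webgl', 'wasm', 'cpu'

  if (backend === 'webgl' || backend === 'webgpu') {
    // GPU-backed browser backend available: run locally
    // (predict() is the helper from the earlier browser sketch).
    return predict(inputs);
  }

  // Otherwise fall back to server-side inference.
  const res = await fetch(CLOUD_ENDPOINT, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ inputs }),
  });
  const body = await res.json();
  return body.predictions;
}
```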