Cluster images based on visual elements like color, lighting and composition

Hi everyone,

I am trying to automatically cluster a bunch of images (200K) based on visual elements like color, composition, lighting, layout, etc.

What are some state-of-the-art techniques and models that I can use to achieve this?

  1. I already explored clustering on CLIP embeddings, but that groups images by semantic topic.
  2. I am now trying to extract features like dominant color, brightness, hue/saturation, etc. and cluster on a combination of these attributes.
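
For concreteness, approach 2 could be sketched roughly as below. This is an illustration under stated assumptions, not the exact pipeline: the names `color_features` and `kmeans`, the choice of five features, the farthest-point initialization, and the tiny synthetic "images" (plain lists of RGB tuples) are all made up for the example. It uses only the Python standard library; at 200K images a real image loader and a library clustering implementation (for example scikit-learn's `MiniBatchKMeans`) would likely be a better fit.

```python
import colorsys
import math
from statistics import mean, pstdev


def color_features(pixels):
    """Summarize an image as [hue_x, hue_y, saturation, brightness, contrast].

    `pixels` is a list of (r, g, b) tuples with values in 0..255 (hypothetical
    input format; real images would first be decoded with an image library).
    """
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]
    # Hue is an angle, so average it on the unit circle. Weighting by
    # saturation keeps near-grey pixels (whose hue is meaningless) from
    # dominating the result.
    hue_x = mean(s * math.cos(2 * math.pi * h) for h, s, _ in hsv)
    hue_y = mean(s * math.sin(2 * math.pi * h) for h, s, _ in hsv)
    saturation = mean(s for _, s, _ in hsv)
    values = [v for _, _, v in hsv]
    # Mean of V approximates brightness; its spread is a crude contrast proxy.
    return [hue_x, hue_y, saturation, mean(values), pstdev(values)]


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means with a deterministic farthest-point start.

    The initialization is an assumption chosen for reproducibility;
    k-means++ is the more common choice in practice.
    """
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [mean(dim) for dim in zip(*members)]
    return labels


# Synthetic stand-ins for real images: two warm/bright and two cool/dark.
images = {
    "warm_bright_1": [(250, 120, 40)] * 16,
    "warm_bright_2": [(240, 130, 60)] * 16,
    "cool_dark_1": [(10, 30, 90)] * 16,
    "cool_dark_2": [(20, 40, 100)] * 16,
}
features = [color_features(pixels) for pixels in images.values()]
labels = kmeans(features, k=2)
for name, label in zip(images, labels):
    print(name, label)
```

The features are left unscaled here; with a mix of attributes on different ranges, standardizing each column before clustering would probably matter.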

I would like to learn about any other approaches that this community can suggest.

Thanks
Srini