CPU goes out of memory when I increase the batch size from 2048 to 4096

I am creating a Dataproc cluster as follows:
gcloud dataproc clusters create gputestsingleworker1 \
  --service-account svceuclid@wmt-euclid-dev.iam.gserviceaccount.com \
  --region us-central1 \
  --single-node \
  --enable-component-gateway \
  --subnet projects/shared-vpc-admin/regions/us-central1/subnetworks/prod-us-central1-02 \
  --no-address \
  --num-masters 1 \
  --master-accelerator type=nvidia-tesla-t4,count=1 \
  --num-master-local-ssds=1 \
  --master-machine-type n1-standard-48 \
  --scopes cloud-platform \
  --project wmt-euclid-dev \
  --optional-components=JUPYTER \
  --initialization-actions gs://my-bucket-1322/requirements_geodemand.sh \
  --image onedemand-base-20230421 \
  --properties="^#^dataproc:dataproc.logging.stackdriver.job.driver.enable=true#dataproc:dataproc.logging.stackdriver.enable=true#dataproc:jobs.file-backed-output.enable=true#dataproc:dataproc.logging.stackdriver.job.yarn.container.enable=true#dataproc:dataproc.logging.stackdriver.enable=true#dataproc:dataproc.logging.stackdriver.job.yarn.container.enable=true#dataproc:dataproc.logging.stackdriver.enable=true#dataproc:startup.component.service-binding-timeout.hive-server2=15000#hive:hive.metastore.schema.verification=false#hive:javax.jdo.option.ConnectionURL=jdbc:mysql://bfd-mysql.gcp-prod.glb.us.walmart.net:3306/metastore#hive:javax.jdo.option.ConnectionUserName=gensmrtfrcst#hive:javax.jdo.option.ConnectionPassword=mdpKJZJ5jY350CYG"

@prakhar_agrawal Welcome to the TensorFlow Forum!
Here are some strategies to address the CPU out-of-memory issue when increasing batch size:

1. Reduce the batch size: If possible, stick with a batch size that fits within your memory constraints, and experiment with smaller batch sizes to find the best balance between memory usage and training throughput.
2. Optimize data types: Use lower-precision data types (e.g., float16 instead of float32) to reduce the memory footprint, and consider mixed-precision training where your hardware supports it (see the mixed-precision sketch after this list).
3. Use gradient accumulation: Split a large batch into smaller chunks and accumulate gradients across several iterations to simulate a larger effective batch size without increasing the memory needed for any single step (see the gradient-accumulation sketch after this list).
4. Use gradient checkpointing: Store only a subset of intermediate activations during the forward pass and recompute the rest during backpropagation. This trades extra compute for lower memory use and is particularly helpful for large models (see the checkpointing sketch after this list).
5. Free up memory: Close unnecessary applications and processes so more RAM is available to your TensorFlow job, and monitor usage with system-level tools such as free, top, or htop (nvidia-smi for GPU memory) to identify bottlenecks.
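Below is a minimal mixed-precision sketch, assuming TensorFlow 2.x and a small Keras model (the layer sizes here are placeholders). Note that mixed precision mainly reduces memory and speeds things up on GPUs/TPUs; float16 compute support on CPUs is limited.

```python
import tensorflow as tf

# Compute in float16 while keeping variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(128,)),
    # Keep the output layer in float32 so the loss stays numerically stable.
    tf.keras.layers.Dense(10, dtype="float32"),
])

# Loss scaling guards against float16 gradient underflow.
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam())
model.compile(optimizer=optimizer,
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
```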

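For gradient accumulation, here is a rough sketch of a custom training loop; `model`, `optimizer`, `loss_fn`, and `dataset` are placeholders for your own objects, and the dataset is assumed to yield micro-batches (e.g. 2048 examples each):

```python
import tensorflow as tf

ACCUM_STEPS = 2  # e.g. two micro-batches of 2048 behave roughly like one batch of 4096

def train_epoch(model, optimizer, loss_fn, dataset):
    """Accumulate gradients over ACCUM_STEPS micro-batches, then apply them once."""
    accum = [tf.zeros_like(v) for v in model.trainable_variables]
    for step, (x, y) in enumerate(dataset, start=1):
        with tf.GradientTape() as tape:
            # Scale the loss so the summed gradients match a full-batch update.
            loss = loss_fn(y, model(x, training=True)) / ACCUM_STEPS
        grads = tape.gradient(loss, model.trainable_variables)
        accum = [a + g for a, g in zip(accum, grads)]
        if step % ACCUM_STEPS == 0:
            optimizer.apply_gradients(zip(accum, model.trainable_variables))
            accum = [tf.zeros_like(v) for v in model.trainable_variables]
```

Each micro-batch only needs the memory of a 2048-example step, while the weight update approximates a 4096-example step.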
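And for gradient checkpointing, a rough sketch using tf.recompute_grad (the variables and shapes are made up for illustration): the wrapped block's intermediate activations are recomputed during the backward pass instead of being kept in memory.

```python
import tensorflow as tf

w1 = tf.Variable(tf.random.normal([512, 1024]))
w2 = tf.Variable(tf.random.normal([1024, 1024]))

def block(x):
    # Intermediate activations in here are recomputed during backprop
    # rather than stored for the entire forward pass.
    h = tf.nn.relu(tf.matmul(x, w1))
    return tf.nn.relu(tf.matmul(h, w2))

checkpointed_block = tf.recompute_grad(block)

x = tf.random.normal([32, 512])
with tf.GradientTape() as tape:
    loss = tf.reduce_mean(checkpointed_block(x))
grads = tape.gradient(loss, [w1, w2])
```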
Let us know if this helps!