How can I load the NSynth dataset locally?

I’ve been attempting to load the NSynth dataset for use in TensorFlow on my local machine. Google Colab is very powerful, but to my knowledge it can’t really be used to write full Python applications.

However, when I use the standard load call:

import tensorflow_datasets as tfds

ds = tfds.load('nsynth', split='train', shuffle_files=False, download=True,
               data_dir="data")

The data downloads fine, but the script then exits silently and unexpectedly, seemingly due to a lack of disk space, even though there is over 250 GB available before running the script and the dataset isn’t larger than that.

I’m not certain disk space is actually the issue, though, as the script fails silently after 30 minutes or so, and there is no verbose option for the load function.
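The closest thing to a verbose mode I’m aware of is raising the log verbosity before the call; a sketch, assuming TFDS routes its messages through absl and standard Python logging:

import logging
from absl import logging as absl_logging

# Surface INFO-level messages from TFDS/TensorFlow before calling tfds.load;
# this may or may not reveal anything about the silent exit.
absl_logging.set_verbosity(absl_logging.INFO)
logging.getLogger().setLevel(logging.INFO)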

How can I load it locally without freeing up more space?

Hi @GBDev, could you please try providing the path where you want the dataset downloaded via the data_dir argument of tfds.load()? Let us know whether it helps. Thank you.
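For example, something like the following, where the absolute path is a placeholder and just needs to point at a drive with plenty of free space:

import tensorflow_datasets as tfds

# 'D:/tfds_data' is a hypothetical absolute path; substitute any directory
# on a drive with enough free space.
ds = tfds.load('nsynth', split='train', shuffle_files=False, download=True,
               data_dir='D:/tfds_data')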

What exactly do you mean? I’ve specified a data_dir, as shown in the post, and it seems to use that.

Furthermore, a huge problem seems to be that the dataset needs to be processed with Apache Beam once downloaded. The data seems to download fine, but there isn’t enough space on my computer to run Apache Beam properly (over 400 GB free). Why isn’t the dataset ready for use on download? Why does it need processing?
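For what it’s worth, TFDS runs that Beam step with Beam’s DirectRunner by default, and you can pass pipeline options to it through a DownloadConfig. A minimal sketch; the DirectRunner flags below are my assumption about what might reduce memory and temp-disk pressure, not a confirmed fix:

import tensorflow_datasets as tfds
from apache_beam.options.pipeline_options import PipelineOptions

# Run the DirectRunner single-threaded and in memory. These flags are an
# attempt to lower resource usage during processing, not a verified fix.
beam_options = PipelineOptions(['--direct_num_workers=1',
                                '--direct_running_mode=in_memory'])

builder = tfds.builder('nsynth', data_dir='data')
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(beam_options=beam_options))

ds = builder.as_dataset(split='train', shuffle_files=False)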

I tried loading the smaller GANSynth subset and hit a similar issue, but this time with a stack trace, though honestly it’s not very descriptive.

Fatal Python error: Aborted

Thread 0x00002f08 (most recent call first):
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\threading.py", line 324 in wait
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\threading.py", line 607 in wait
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\apache_beam\runners\worker\data_plane.py", line 226 in run
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\threading.py", line 1016 in _bootstrap_inner
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\threading.py", line 973 in _bootstrap

Thread 0x000025bc (most recent call first):
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\tqdm\std.py", line 104 in acquire
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\tqdm\std.py", line 113 in __enter__
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\tqdm\_monitor.py", line 66 in run
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\threading.py", line 1016 in _bootstrap_inner
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\threading.py", line 973 in _bootstrap

Current thread 0x00002db0 (most recent call first):
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\apache_beam\runners\worker\data_plane.py", line 408 in add_to_inverse_output
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\apache_beam\runners\worker\data_plane.py", line 103 in close
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 1178 in _send_input_to_worker
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 1300 in process_bundle
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 999 in _run_bundle
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 770 in _execute_bundle
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 442 in run_stages
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 212 in run_via_runner_api
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\apache_beam\runners\portability\fn_api_runner\fn_runner.py", line 199 in run_pipeline
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\apache_beam\runners\direct\direct_runner.py", line 131 in run_pipeline
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\apache_beam\pipeline.py", line 574 in run
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\apache_beam\pipeline.py", line 597 in __exit__
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\tensorflow_datasets\core\split_builder.py", line 191 in maybe_beam_pipeline
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\contextlib.py", line 142 in __exit__
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\tensorflow_datasets\core\dataset_builder.py", line 1232 in _download_and_prepare
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\tensorflow_datasets\core\dataset_builder.py", line 523 in download_and_prepare
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\tensorflow_datasets\scripts\cli\build.py", line 397 in _download_and_prepare
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\tensorflow_datasets\scripts\cli\build.py", line 224 in _build_datasets
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\tensorflow_datasets\scripts\cli\main.py", line 99 in main
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\absl\app.py", line 254 in _run_main
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\absl\app.py", line 308 in run
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\site-packages\tensorflow_datasets\scripts\cli\main.py", line 104 in launch_cli
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\Scripts\tfds.exe\__main__.py", line 7 in <module>
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86 in _run_code
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196 in _run_module_as_main

Hi, I’m currently running into the same issue when trying to run NSynth from Magenta. It recommends downloading NSynth using tensorflow_datasets.scripts.download_and_prepare --datasets=nsynth/gansynth_subset, and while it downloaded fine, it can’t finish processing the data, failing with the same-looking fatal Python error. Did you ever figure out a fix for this?
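(For anyone reproducing this, the rough Python-API equivalent of that script is below; the data_dir value is a placeholder.)

import tensorflow_datasets as tfds

# Download and then run the Beam processing step for the GANSynth subset;
# 'data' is a placeholder directory.
builder = tfds.builder('nsynth/gansynth_subset', data_dir='data')
builder.download_and_prepare()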

By the way, for anyone who may stumble across this post in the future:

I solved this, but only by running the same command on Linux. It seems like this is a bug either in Apache Beam or in how TensorFlow Datasets interacts with Apache Beam on Windows. Either way, on Linux it’s either not used or used differently, such that the error didn’t occur. Luckily I was downloading to an external drive, so I could just bring it to a Linux machine, run the command there, then bring the prepared dataset back, and it worked just fine.
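If it helps anyone: once the prepared dataset is back on the original machine, it can be loaded without re-running the processing step. A sketch, assuming the external drive shows up at a hypothetical path like E:/tfds_data:

import tensorflow_datasets as tfds

# download=False skips download_and_prepare and reads the already-prepared
# data from data_dir; 'E:/tfds_data' is a hypothetical external-drive path.
ds = tfds.load('nsynth/gansynth_subset', split='train', download=False,
               data_dir='E:/tfds_data')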