trainML Persistent Datasets allow you to reuse training data across multiple notebooks or training jobs. You can populate them directly from your local computer or another cloud provider.
Network storage makes it easy to share common files across many instances concurrently. Unfortunately, network speeds are not fast enough to keep modern GPUs busy. To achieve maximum performance, deep learning jobs must use data stored on local NVMe storage, but constantly moving data in and out of network storage or replicating datasets across numerous instances is time-consuming, error-prone, and (on many cloud providers) expensive.
trainML Persistent Datasets give you NVMe performance while making it seamless to share datasets across multiple jobs or job workers. You can populate a dataset from a variety of source types on the Datasets page. Once created, you can attach the dataset to any number of Notebooks or Training Jobs. The data will always be cached locally on NVMe storage, no matter how many jobs you use it with.
Datasets count only once toward your monthly storage charges, no matter how many jobs use them. With a 50 GB/month free tier, you can realize significant cost savings while enjoying the fastest storage technology around. Any storage used in excess of 50 GB/month incurs a charge of $0.20 per GB/month.
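As a rough illustration of the pricing above, the following sketch estimates a monthly storage charge. The function name and constants are hypothetical and are not part of any trainML API. It assumes the only inputs are the total GB stored, the 50 GB free tier, and the $0.20 per GB/month rate, and it ignores any proration or rounding rules that actual billing may apply.

```python
# Illustrative only: names below are hypothetical, not trainML API calls.
FREE_TIER_GB = 50.0          # free tier stated in the pricing text
RATE_PER_GB_MONTH = 0.20     # charge per GB/month beyond the free tier


def estimate_monthly_storage_cost(total_gb_stored: float) -> float:
    """Estimate the monthly storage charge in dollars.

    Each dataset is counted once in total_gb_stored, regardless of how
    many jobs it is attached to.
    """
    if total_gb_stored < 0:
        raise ValueError("total_gb_stored must be non-negative")
    # Only storage above the free tier is billable.
    billable_gb = max(0.0, total_gb_stored - FREE_TIER_GB)
    return round(billable_gb * RATE_PER_GB_MONTH, 2)


if __name__ == "__main__":
    for gb in (30, 50, 120):
        print(f"{gb} GB stored: ${estimate_monthly_storage_cost(gb):.2f}/month")
```

For example, storing 120 GB leaves 70 GB above the free tier, which at $0.20 per GB comes to $14.00 per month, whether the data is attached to one job or twenty.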
A significant percentage of trainML's customers use common, publicly available datasets in their jobs. Because of this, we store and maintain some of the most popular public datasets for all customers to use. You never incur storage charges for using a public dataset, and you can attach one to both Notebooks and Training Jobs.
You can see the list of public datasets currently available here. If you would like a public dataset added, please contact us with a link to the dataset and a brief description of what you need it for.
trainML Datasets can be easily populated from the most popular cloud storage providers. However, customers frequently clean and transform datasets prior to training and store the results on their local network or laptop. In this case, you can use the local storage option to populate a dataset directly from a file path on your local computer. This advanced function requires the trainML local connection capability, so ensure you satisfy its prerequisites before attempting to use the local data option.