Develop accurate machine learning models with the most comprehensive and continuous set of data. Train and refine machine learning algorithms across geographies. Avoid having to accept the limitations of sample data that has been shipped to a central location.
Machine learning models are only as good as the data used to build them. For machine learning to be highly accurate, it requires algorithms that have been tested on very large sets of data. Data scientists are faced with an unwanted trade-off on deciding which data to sample for training and refinement – because it is often impractical to move it all to a central location.
A new approach, distributed and federated at the edge, exploits the potential of machine learning to its fullest. Train and refine algorithms using the most recent and complete data-set. Apply large amounts of geographically distributed data regardless of whether it is seconds or years old. Perform machine learning on sensitive data that can’t be moved as a result of regulatory laws or privacy concerns.
Integrated Apache MADlib serves as the foundation for performing geographically distributed machine learning. This allows machine learning to be performed in a massively parallel manner across a distributed set of edge locations. Data is operated upon locally, within the database. The result is access to a complete set of data without having to sample, transfer elsewhere and compromise accuracy.
Developers and data scientists can start quickly by leveraging open source algorithms. Supervised learning, unsupervised learning, time series, nearest neighbors and other methods can be performed on a system that scales to petabytes of data stored. The open source community continues to add new analytical methods. Data transformations, statistics and graphical capabilities are also available.
Developers and data scientists who know R but very little SQL can leverage the performance and scalability benefits of MADlib. The system translates R model formulas into corresponding SQL statements, executes these statements in the database, and returns the summarized model output to R. Alternatively, machine learning can be performed directly in SQL for those with expertise in this area.
Distributed data stored within the platform works together with machine learning in a unified design. There’s no need to export data from a data warehouse to a separate machine learning environment prior to analyzing data. Spend more time developing accurate models instead of moving data cross disparate systems. Pricing is predictable and consistent across data warehouse and machine learning to help avoid the economic inhibitors and uncertainty that are often associated with cloud-based services.