IBM hopes to create an analogue of ImageNet for intelligent development tools. ImageNet has become the de facto standard image dataset for training AI models. At its THINK conference, the company announced that it has assembled a large collection of source code for this purpose.
The dataset, called Project CodeNet, contains 14 million samples totaling 500 million lines of code in more than 55 programming languages, from Java, C and Go to COBOL, Pascal and FORTRAN. However, more than three quarters of the code is in C++ and Python.
The code comes from two Japanese programming contests, Aizu and AtCoder. Under the contest rules, participants had to write code that turns a given set of inputs into the desired outputs for 4,000 different problems. This yielded 14 million code samples, about half of which work; the rest are labeled as failing to compile, producing incorrect results or containing errors.
IBM wants Project CodeNet to follow in ImageNet's footsteps and become the de facto standard dataset for training AI models capable of recognizing program structure. CodeNet is expected to be usable for building intelligent development tools that search for the right routines in applications and libraries, translate from one programming language to another, select correct implementations and filter out faulty ones, classify code, and so on.