From a Summer Internship to a World-Class Conference: First-Year FEE CTU Student Points Out Flaws in the ImageNet Dataset

Illia Volkov, a first-year student in the Open Informatics (OI) bachelor's programme at the Faculty of Electrical Engineering, CTU, participated in the prestigious International Conference on Learning Representations (ICLR) in Singapore at the end of April 2025. This conference ranks among the world’s leading events in artificial intelligence. Illia presented his contribution, which focused on flaws in the ImageNet dataset and was developed in collaboration with Nikita Kisel, a master's student in the same programme.

ImageNet is a database containing over 14 million images and is commonly used for training machine learning algorithms in the field of computer vision. To serve this purpose, each image is labeled with a category that tells the algorithm what the image depicts. This way, the algorithm learns to recognize images and, ideally, identifies what is shown in a new image it has never encountered before. It is therefore crucial that the labels assigned to images are accurate. Otherwise, the algorithm may learn to associate images with the wrong categories.

Illia and Nikita focused on ImageNet-1k, a reduced version of the full dataset, containing “only” around 1.5 million images divided into 1,000 categories. In their paper, they revealed that the dataset contains mislabeled images, duplicate entries, and redundant categories. While these issues are relatively easy to fix, doing so can be time-consuming. “Before the conference, we spent a long time working on an application for annotators who will correct the dataset,” says Illia Volkov. However, annotators still need proper training and clearly defined instructions. Moreover, the two OI students discovered deeper issues for which there are no straightforward solutions.

The dataset used to train a machine learning algorithm is typically divided into three subsets: training, validation, and test datasets. The algorithm learns from the training dataset, its performance is then assessed on the validation dataset while its parameters are still being tuned, and final evaluation occurs on the test dataset. Each subset must contain distinct images, which has both advantages and drawbacks.
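The three-way split described above can be illustrated with a minimal sketch. It is not how ImageNet-1k was actually partitioned; the function name `split_dataset`, the 80/10/10 proportions, and the image identifiers are assumptions chosen purely for illustration.

```python
import random

def split_dataset(items, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle items and cut them into disjoint train/validation/test lists.

    Hypothetical helper for illustration; the fractions are assumptions.
    """
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test

# Made-up image identifiers standing in for real files.
image_ids = [f"img_{i:04d}" for i in range(1000)]
train, val, test = split_dataset(image_ids)
```

Because every identifier ends up in exactly one list, no image seen during training can leak into validation or testing, which is the property the paragraph above requires.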

One drawback identified by Illia and Nikita in ImageNet-1k is the so-called distribution shift. This occurs when the distribution of image types differs across the training, validation, and test datasets. For instance (not necessarily in ImageNet-1k), 90% of the “cat” images in the training dataset might depict orange cats, while in the validation and test datasets the percentage of orange cats is significantly lower. As a result, the algorithm may learn to associate the “cat” label exclusively with orange fur, and then misclassify cats of other colors as dogs or rabbits during validation.
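One simple way to put a number on such a shift is to compare how often each attribute value occurs in two subsets. The sketch below uses the total variation distance for this; it is an illustrative choice, not the method used in the paper, and the fur-color counts are invented to mirror the cat example above.

```python
from collections import Counter

def attribute_distribution(attributes):
    """Return the relative frequency of each attribute value."""
    counts = Counter(attributes)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def total_variation_distance(p, q):
    """Half the sum of absolute differences: 0 means identical, 1 means disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Invented fur colors for the "cat" category (assumption, not real data).
train_colors = ["orange"] * 90 + ["black"] * 5 + ["grey"] * 5
val_colors = ["orange"] * 20 + ["black"] * 40 + ["grey"] * 40

train_dist = attribute_distribution(train_colors)
val_dist = attribute_distribution(val_colors)
shift = total_variation_distance(train_dist, val_dist)
```

With these made-up numbers the distance comes out at roughly 0.7, far from the 0 that two identically distributed subsets would give.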

That said, distribution shift is not inherently bad. Real-world data are rarely independent and identically distributed. For example, image data from Arctic regions naturally feature more polar bears than images from southern regions, where black and brown bears are more common. Nonetheless, it’s generally beneficial for training datasets to be more balanced to avoid such problems. But modifying ImageNet-1k in this way would substantially alter the dataset, complicating comparisons with earlier research, since this dataset is considered a benchmark in the field. Hence, addressing this issue will not be easy.

Summer Internship at the Visual Recognition Group

This research was carried out under the supervision of Ing. Klára Janoušková and Prof. Jiří Matas from the Visual Recognition Group (VRG) at the Department of Cybernetics at FEE CTU. Illia joined the research group even before starting his first year, thanks to a summer internship offered by the Open Informatics programme to incoming students. He applied for the internship in hopes of earning money while working in his chosen field. “I was pleasantly surprised to be invited to a two-week internship in a lab that focuses on something I’m genuinely passionate about,” Illia recalls. After two weeks with VRG, he was offered the chance to stay and continue his work. In addition to gaining new knowledge and the flexibility to work during the semester and exam period, Illia also appreciates the friendliness of the team, which is always ready to help.

It was during this internship in the summer of 2024 that the foundation was laid for the paper Illia later presented at ICLR in Singapore. His work received highly positive feedback, even from leading figures in the AI community. “We often hear from experts how important this project is and that they had no idea how extensive the flaws in the dataset are. They also ask when the corrections will be ready,” says Ing. Janoušková. The whole team is currently working on a full-scale correction of the dataset, which is proving to be a complex and time-consuming process.

The accepted paper is available online at: https://iclr-blogposts.github.io/2025/blog/imagenet-flaws/

Photo: Illia Volkov

Responsible person Ing. Mgr. Radovan Suk