Basics of classifiers and how to train them.
Use confusion matrices to determine which image classifier is better: the one that is safer, or the one that leads to fewer unnecessary stops of the car (see pdf).
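A minimal sketch of that comparison, with made-up counts (the real ones are in the pdf). It assumes a binary task, obstacle vs. no obstacle, where a missed obstacle is the safety problem and a false alarm is an unnecessary stop. The names `ConfusionMatrix`, `miss_rate` and `false_alarm_rate` are chosen here for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    tp: int  # obstacle present, classifier said "obstacle" (correct stop)
    fn: int  # obstacle present, classifier said "clear" (dangerous miss)
    fp: int  # no obstacle, classifier said "obstacle" (unnecessary stop)
    tn: int  # no obstacle, classifier said "clear"

    def miss_rate(self) -> float:
        # Share of real obstacles that were missed (lower is safer).
        return self.fn / (self.tp + self.fn)

    def false_alarm_rate(self) -> float:
        # Share of clear scenes that still triggered a stop.
        return self.fp / (self.fp + self.tn)


# Invented counts, purely for illustration.
classifiers = {
    "A": ConfusionMatrix(tp=95, fn=5, fp=40, tn=860),
    "B": ConfusionMatrix(tp=80, fn=20, fp=10, tn=890),
}

safer = min(classifiers, key=lambda name: classifiers[name].miss_rate())
fewer_stops = min(classifiers, key=lambda name: classifiers[name].false_alarm_rate())

for name, cm in classifiers.items():
    print(name, f"miss rate={cm.miss_rate():.3f}",
          f"false alarm rate={cm.false_alarm_rate():.3f}")
print("safer:", safer, "| fewer unnecessary stops:", fewer_stops)
```

With these invented numbers, classifier A misses fewer obstacles while classifier B stops less often for nothing, so which one is "better" depends on which error matters more.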
Even in the “age of AI”, with Neural Networks, Transformers, etc., more traditional and simpler methods such as Naive Bayes and kNN are still used, and they sometimes perform nearly as well as, or better than, the newer approaches. These simple methods are less affected by some of the issues that come with classification tasks, or sidestep them entirely. Note: this was redacted with the help of
Examples:
- This paper shows that kNN with a creative choice of distance (based on gzip) comes close to neural approaches (including Transformers) for sentence classification, and does better on out-of-domain datasets, without requiring any training, tuning, or parameters. Intuition for the distance: if two texts are similar, concatenating them barely increases the gzip-compressed size.
- VarDial, a shared task where the goal is to build classifiers for languages that are very close (e.g. dialects of the same language), and where the training data is often Wikipedia, has historically seen the simplest models win.
- This paper, which tries to classify Perso-Arabic scripts, shows that a variation of Naive Bayes performs about as well as a Multi-Layer Perceptron, and significantly better than previous state-of-the-art methods.
- This paper uses n-gram models (no neural network involved) for language identification. It performed better than the state of the art at the time, while also making it possible to add languages to the classifier after training without needing to retrain from scratch.
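The gzip intuition from the first example can be sketched as follows. This uses the normalized compression distance, a common way to turn "concatenation barely increases the compressed size" into a number; whether it matches the paper's exact formula is an assumption, and the function names and toy training texts are invented for illustration.

```python
import gzip
from collections import Counter


def clen(text: str) -> int:
    # Length in bytes of the gzip-compressed text.
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    # Normalized compression distance: small when compressing a and b
    # together costs little more than compressing one of them alone.
    ca, cb = clen(a), clen(b)
    cab = clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def knn_predict(query: str, train: list[tuple[str, str]], k: int = 1) -> str:
    # No training step: sort the labelled texts by distance to the query
    # and take a majority vote among the k nearest.
    nearest = sorted(train, key=lambda item: ncd(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy labelled data (made up).
train = [
    ("the team scored a late goal and won the football match", "sports"),
    ("the striker missed a penalty in the second half of the match", "sports"),
    ("simmer the tomato sauce and stir in the chopped garlic", "cooking"),
    ("bake the bread in a hot oven until the crust is golden", "cooking"),
]

print(knn_predict("the goalkeeper saved a penalty late in the match", train, k=1))
```

On texts this short the gzip header dominates the compressed size, so the toy data only shows the mechanics; the distance becomes more meaningful on longer texts.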
Some issues with classification, in particular with models using Deep Neural Networks and similar, include:
* Usually, the bigger the models, the more data they need (though other variables have to be taken into account, such as the quality and variability of the data). Depending on the application, there might not be enough data for the models to learn and converge.
* The data might need to be labelled, and labels are not always available or feasible to produce, especially at the scale of the data needed (e.g. tens of thousands of images, or languages that are poorly represented on the internet).
* Black box: it is difficult to get a detailed idea of what is going on inside the model. There are ways to analyse it, but given the tendency towards ever bigger models with millions of parameters, this is basically impossible to do at such scales.
* Usually, the models are EXPENSIVE, both in terms of the (labelled?) data needed, which can itself be expensive to generate, and in terms of infrastructure, facilities, computing power and time, man-hours of preparation and fine-tuning, etc.
* This is often also linked to pollution and extensive use of resources, for improvements on the order of a mere 0.01%.
* Scalability issues come in many forms. For example, modifying the classes usually requires a full retraining of the model (you recognize dogs and cats, now you want to add giraffes, so you have to redo things from scratch with a new dataset that includes giraffes, and that might need different fine-tuning). This might also mean needing a bigger facility and more computing power, amplifying the related problems.
* Reproducibility: because only those with access to a lot of computing power can use computationally heavy models, it is difficult, if not outright impossible, to reproduce the results and validate the authors' claims. It also means there will be little actual impact, since the model cannot be used in real-life applications unless the authors provide a way to access the trained model (e.g. ChatGPT website and APIs).
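On the point about modifying the classes: a per-class model avoids the full retraining, because each class has its own statistics. A minimal sketch, assuming a toy character n-gram classifier with add-one smoothing; this is not the method of any of the papers above, and `NgramClassifier`, `add_class` and the `ASSUMED_VOCAB` constant are hypothetical names chosen for illustration.

```python
import math
from collections import Counter

# Assumed n-gram vocabulary size, used only for smoothing (arbitrary value).
ASSUMED_VOCAB = 10_000


class NgramClassifier:
    """Toy per-class character n-gram model (illustration only)."""

    def __init__(self, n: int = 3):
        self.n = n
        self.profiles = {}  # label -> Counter of n-gram counts

    def _ngrams(self, text: str) -> list:
        padded = f" {text.lower()} "
        return [padded[i:i + self.n] for i in range(len(padded) - self.n + 1)]

    def add_class(self, label: str, texts: list) -> None:
        # Only the new class is fitted; existing profiles are not touched.
        counts = Counter()
        for text in texts:
            counts.update(self._ngrams(text))
        self.profiles[label] = counts

    def _log_likelihood(self, grams: list, counts: Counter) -> float:
        # Add-one smoothing so unseen n-grams get a small, non-zero probability.
        total = sum(counts.values())
        return sum(math.log((counts[g] + 1) / (total + ASSUMED_VOCAB)) for g in grams)

    def predict(self, text: str) -> str:
        grams = self._ngrams(text)
        return max(self.profiles,
                   key=lambda label: self._log_likelihood(grams, self.profiles[label]))


clf = NgramClassifier()
clf.add_class("en", ["the quick brown fox jumps over the lazy dog"])
clf.add_class("fr", ["le renard brun rapide saute par dessus le chien paresseux"])

# Later: add a third class without refitting the first two.
en_before = clf.profiles["en"].copy()
clf.add_class("de", ["der schnelle braune fuchs springt auf den faulen hund"])
print(sorted(clf.profiles))
```

Adding "de" only computes counts for the new texts; the "en" and "fr" profiles stay exactly as they were, which is the property that an end-to-end trained network usually lacks.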
Work on the machine learning assignment.