Classifier models specialize in categorizing data into predefined groups or classes, and they play a crucial role in optimizing data processing pipelines for fine-tuning and pretraining generative AI models. Their value lies in enhancing data quality by filtering out low-quality or toxic data, ensuring that only clean and relevant information feeds downstream processes.
Beyond filtering, classifier models add value through data enrichment, annotating data with metadata such as domain, type, or other content specifics, and creating quality-specific data blends. These capabilities not only streamline data preparation but also provide insight into how models are used in production. For example, classifiers can help determine the complexity and domain of user prompts so that developers can route those prompts to the most suitable models.
The NVIDIA NeMo Curator team has previously released two classifier models:
- Domain Classifier: A text classification model to classify documents into one of 26 domain classes
- Quality Classifier DeBERTa: A text classification model that classifies documents into one of three classes (High, Medium, or Low) based on the quality of the document
In addition to these BERT-style classifier models, NeMo Curator also supports n-gram-based bag-of-words classifiers such as fastText, as well as data labeling using large language models (LLMs) and reward models.
In this post, we discuss the four new NeMo Curator classifier models:
- Prompt Task and Complexity Classifier: A multiheaded model that classifies English text prompts across 11 task types such as Open QA, Chatbot, and Text Generation, as well as six complexity dimensions, including Creativity, Domain Knowledge, and Reasoning. Developers can leverage this model for tasks such as prompt routing and understanding user prompts.
- Instruction Data Guard: A deep learning classification model that helps identify LLM poisoning attacks in datasets, generates a score, and predicts whether the input data is benign or poisonous.
- Multilingual Domain Classifier: A multilingual text classification model that categorizes content in 52 languages, including English, Chinese, Arabic, Spanish, and Hindi, across 26 domains such as Arts, Business, Science, and Technology.
- Content Type Classifier DeBERTa: A text classification model designed to categorize documents into one of 11 distinct speech types based on their content, such as Blogs, News, and Reviews.
Overview of NVIDIA NeMo Curator
NVIDIA NeMo Curator improves generative AI model accuracy by processing text, image, and video data at scale for training and customization. It also provides prebuilt pipelines for generating synthetic data to customize and evaluate generative AI systems.
NeMo Curator leverages RAPIDS libraries like cuDF, cuML, and cuGraph, paired with Dask to scale workloads across multinode, multi-GPU environments, significantly reducing data processing time. High-quality data processed with NeMo Curator enables you to achieve higher accuracy with less data and faster model convergence, reducing training time.
The classifier models are part of the text-processing pipeline to curate high-quality data. Figure 1 highlights the quality filtering module of NeMo Curator.
Accelerated large-scale inference with NeMo Curator
NeMo Curator provides an out-of-the-box solution to scale inference pipelines for these models to a multinode, multi-GPU setup, while also accelerating inference through the CrossFit library from RAPIDS. This approach improves throughput by leveraging intelligent batching and utilizing cuDF for efficient IO operations, ensuring both scalability and performance optimization.
As shown in Figure 2, a key feature of CrossFit, used in NeMo Curator, is the sorted sequence data loader (illustrated in simplified form after this list), which optimizes throughput for offline processing by:
- Sorting input sequences by length
- Grouping sorted sequences into optimized batches
- Efficiently allocating batches to available GPU memory by estimating the memory footprint for each sequence length and batch size
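To make the idea concrete, here is a minimal Python sketch of sorted sequence batching. It is not the CrossFit implementation; the memory budget heuristic (`max_tokens_per_batch`) is an assumption used purely for illustration.

```python
# Minimal illustration of sorted-sequence batching (not the CrossFit implementation).
# Sorting by length keeps sequences of similar size together, which reduces padding
# and allows larger batch sizes for shorter sequences.

def sorted_length_batches(token_lengths, max_tokens_per_batch=16384):
    """Group example indices into batches after sorting by sequence length.

    token_lengths: list of token counts, one per input document.
    max_tokens_per_batch: crude stand-in for a per-batch GPU memory budget.
    """
    # Sort indices by sequence length (shortest first)
    order = sorted(range(len(token_lengths)), key=lambda i: token_lengths[i])

    batches, current = [], []
    for idx in order:
        current.append(idx)
        # Estimate the padded footprint: batch size * longest sequence in the batch
        padded_tokens = len(current) * token_lengths[idx]
        if padded_tokens >= max_tokens_per_batch:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


lengths = [12, 480, 35, 210, 64, 500, 18]
for batch in sorted_length_batches(lengths, max_tokens_per_batch=1024):
    print(batch, [lengths[i] for i in batch])
```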
Let’s dive deeper into each of these classifier models and learn more about how you can leverage these models in your data-processing pipelines.
Prompt Task and Complexity Classifier
This classifier is a multiheaded model that evaluates English text prompts across task types and complexity dimensions. A “prompt” in this case is the input text sent to an LLM to elicit a desired response.
As shown in Figure 3, the model categorizes an input prompt into one of 11 common prompt types, such as Summarization or Code Generation. Prompt complexity is defined along six dimensions, such as Creativity and Domain Knowledge. The model scores an input prompt on each dimension (0-1 scale) and ensembles the scores into a single complexity score.
Example input
Write a mystery set in a small town where an everyday object goes missing, causing a ripple of curiosity and suspicion.
Follow the investigation and reveal the surprising truth behind the disappearance.
Output
| Task | Overall complexity | Creativity | Reasoning | Contextual knowledge | Domain knowledge | Constraints | # of few shots |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Text generation | 0.472 | 0.867 | 0.056 | 0.048 | 0.226 | 0.786 | 0 |
The model is unique in that it can be leveraged in a variety of use cases across the LLM development and deployment lifecycle where a deeper understanding of prompts is required. As a developer, you can use it during dataset generation for post-training or alignment workflows to ensure high-quality and diverse datasets. In a context where multiple fine-tuned LLMs are deployed, the model can be used to route prompts accordingly to minimize cost and optimize performance.
Built on the DeBERTa v3 Base architecture, this classifier can process text up to 512 tokens in length. The model was trained on a set of English prompts with a diverse distribution of task types. Human annotators labeled the training data according to the task and complexity taxonomy, with each prompt validated by multiple annotators. The resulting model shows strong performance across the defined classification categories, making it a valuable tool for LLM developers across many use cases.
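As a sketch of what this looks like in a data-processing pipeline, the following snippet runs the classifier over a set of prompts with NeMo Curator. The class and helper names used here (PromptTaskComplexityClassifier, DocumentDataset, get_client) are assumptions based on NeMo Curator's classifier API and may differ across versions; the example notebooks in the NVIDIA/NeMo-Curator repo show the exact, current usage.

```python
# Sketch: classify prompts at scale with NeMo Curator (names may differ by version;
# see the example notebooks in the NVIDIA/NeMo-Curator repo for exact usage).
from nemo_curator.classifiers import PromptTaskComplexityClassifier
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.distributed_utils import get_client

client = get_client(cluster_type="gpu")  # start a local Dask GPU cluster

# Read JSONL files with a "text" column containing the prompts, backed by cuDF on GPU
dataset = DocumentDataset.read_json("prompts/", backend="cudf")

# The classifier appends task-type and complexity columns to each record
classifier = PromptTaskComplexityClassifier()
classified = classifier(dataset)

# Persist the annotated prompts for routing or dataset analysis
classified.to_json("prompts_classified/")

client.close()
```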
Instruction Data Guard
Pretrained LLMs can be compromised through malicious fine-tuning on harmful data, a process often referred to as poisoning. One popular method involves trigger-word attacks, where specific cues prompt the model to exhibit malicious behavior.
Once poisoned, an attacker can exploit the compromised model at will, putting users and hosting servers at risk. This alarming vulnerability has been highlighted in published research, such as Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.
To combat these threats, Instruction Data Guard is trained to detect poisoning by analyzing the hidden states of the Aegis AI Content Safety LlamaGuard Defensive model. By identifying malicious prompts embedded in instruction data used for fine-tuning, it addresses a key tactic of attackers: injecting minimal but potent malicious prompts that aim to compromise a model. The model supports English-language inputs. The following example input text is taken from the Databricks Dolly 15K dataset.
Example input
### Instruction
What is the average lifespan of a Golden Retriever?
### Context
Golden Retrievers are a generally healthy breed; they have an average lifespan of 12 to 13 years. Irresponsible breeding to meet high demand has led to the prevalence of inherited health problems in some breed lines, including allergic skin conditions, eye problems and sometimes snappiness. These problems are rarely encountered in dogs bred from responsible breeders.
### Response
The average lifespan of a Golden Retriever is 12 to 13 years.
Output
score=0.000792806502431631
prediction = (score>0.5) = 0
Action:
The threshold for the model score is 0.5, and the prediction is set to 0 below it and to 1 above it.
- prediction 0 means the prompt was classified as benign.
- prediction 1 means the prompt is suspected to be poisoned and needs to be reviewed.
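Because the rule above is a simple threshold, it is easy to express in code. The helper below is hypothetical (not part of the model or NeMo Curator); it only encodes the 0.5 threshold and the benign/poisoned interpretation described above.

```python
def classify_instruction_sample(score: float, threshold: float = 0.5) -> str:
    """Map an Instruction Data Guard score to a benign/poisoned decision.

    Uses the 0.5 threshold described above: prediction 0 (benign) below the
    threshold, prediction 1 (suspected poisoned, needs review) above it.
    """
    prediction = 1 if score > threshold else 0
    return "poisoned (review required)" if prediction == 1 else "benign"


print(classify_instruction_sample(0.000792806502431631))  # -> benign
```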
Multilingual Domain Classifier
Multilingual Domain Classifier is a powerful tool designed to help developers automatically categorize text content in 52 languages, including English and other widely spoken languages such as Chinese, Arabic, Spanish, and Hindi. The model can classify text into 26 different domains, ranging from Arts and Entertainment to Business, Science, and Technology, making it particularly valuable for content organization and metadata tagging at scale.
Example input
最年少受賞者はエイドリアン・ブロディの29歳、最年少候補者はジャッキー・クーパーの9歳。最年長受賞者、最年長候補者は、アンソニー・ホプキンスの83歳。
最多受賞者は3回受賞のダニエル・デイ=ルイス。2回受賞経験者はスペンサー・トレイシー、フレドリック・マーチ、ゲイリー・クーパー、ダスティン・ホフマン、トム・ハンクス、ジャック・ニコルソン(助演男優賞も1回受賞している)、ショーン・ペン、アンソニー・ホプキンスの8人。なお、マーロン・ブランドも2度受賞したが、2度目の受賞を拒否している。最多候補者はスペンサー・トレイシー、ローレンス・オリヴィエの9回。
死後に受賞したのはピーター・フィンチが唯一。ほか、ジェームズ・ディーン、スペンサー・トレイシー、マッシモ・トロイージ、チャドウィック・ボーズマンが死後にノミネートされ、うち2回死後にノミネートされたのはディーンのみである。
非白人(黒人)で初めて受賞したのはシドニー・ポワチエであり、英語以外の演技で受賞したのはロベルト・ベニーニである。
Output
Built on the DeBERTa v3 Base architecture, this classifier can process text up to 512 tokens in length, making it suitable for analyzing paragraphs or short documents. Its versatility is particularly valuable in practical applications. You can use it to automatically tag content for better organization, create domain-specific content collections, or add structured metadata to multilingual datasets. For instance, a news aggregator could use this model to automatically categorize articles across different languages into topics like Business, Sports, or Technology.
The model was trained on a diverse dataset of over 1.5 million samples, including content from Common Crawl and Wikipedia. The training approach is notable: English training data was translated into 51 other languages, with the model randomly selecting different language versions during training. This methodology helps ensure robust performance across all supported languages. For developers working on multilingual applications, this means you can confidently deploy a single model that handles content classification across numerous languages, potentially streamlining your development pipeline and reducing the complexity of managing multiple language-specific models.
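As with the other classifiers, NeMo Curator can run this model at scale. The sketch below assumes the MultilingualDomainClassifier class and a domain_pred output column from NeMo Curator's classifier API; consult the example notebooks for the current names and defaults.

```python
# Sketch: tag a multilingual corpus with domain labels and inspect the distribution.
# Class name and prediction column ("domain_pred") are assumptions; check the
# NeMo Curator example notebooks for the current defaults.
from nemo_curator.classifiers import MultilingualDomainClassifier
from nemo_curator.datasets import DocumentDataset

dataset = DocumentDataset.read_json("multilingual_docs/", backend="cudf")

classifier = MultilingualDomainClassifier()
classified = classifier(dataset)

# classified.df is a Dask-cuDF dataframe; count documents per predicted domain
domain_counts = classified.df["domain_pred"].value_counts().compute()
print(domain_counts)
```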
Content Type Classifier DeBERTa
Content Type Classifier DeBERTa is an advanced text analysis model that enables automatic categorization of documents into 11 distinct content types, ranging from news articles and blog posts to product websites and analytical pieces. Built using the DeBERTa v3 Base architecture, the model can handle substantial text inputs with a context length of 1,024 tokens, making it suitable for analyzing longer documents.
The model demonstrates strong capabilities in distinguishing between different writing styles and purposes. It can identify content types as diverse as explanatory articles, online comments, reviews, and even boilerplate content. This makes it especially useful for content management systems, digital publishers, and developers working on content organization or recommendation systems. For example, a digital media platform could leverage this model to automatically sort user-generated content or organize archives by content type.
What sets this model apart is its careful development process. It was trained on a dataset of 19,604 samples that were human-annotated, with each sample validated by multiple annotators. The model shows particularly strong performance in classifying news content, blogs, and explanatory articles, achieving high accuracy rates especially on content where annotators showed strong agreement. This level of reliability makes it a valuable tool for developers looking to implement automated content classification in their applications.
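Here is a hedged sketch of using the model to keep only one content type. The ContentTypeClassifier class, the content_pred column, and the "News" label string are assumptions based on NeMo Curator's classifier API and the model card; verify them before use.

```python
# Sketch: classify documents by content type and keep only news-like documents.
# Class name, prediction column ("content_pred"), and label string ("News") are
# assumptions; verify against the model card and example notebooks.
from nemo_curator.classifiers import ContentTypeClassifier
from nemo_curator.datasets import DocumentDataset

dataset = DocumentDataset.read_json("web_docs/", backend="cudf")

classifier = ContentTypeClassifier()
classified = classifier(dataset)

# Filter on the predicted label and write the subset back out as JSONL
news_only = DocumentDataset(classified.df[classified.df["content_pred"] == "News"])
news_only.to_json("news_docs/")
```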
Example input
Beloved English Teacher
Gerard Butler can act, but can't teach English.
(picture credit to collider.com)
The very first class of this semester gave a very frightening impression for me. I won't get above C in my English class. Why? Because my lecturer looks similar to Gerard Butler in 300. Yeah, except he did not the sword. With his beard and sharp eyes, he gazed around the class while talking, making the class more silent than ever. He insists of endeavoring hard for the class, but how can I achieve it in a class lead by Spartan? Unless I go to war against Persian, I will never win the war against ENG 101. What a mess.
Output
Get started
These four new classifier models are now available on Hugging Face. Additionally, the example notebooks are hosted in the NVIDIA/NeMo-Curator GitHub repo, providing step-by-step guidance for using these classifier models. Don’t forget to bookmark the repository to stay updated on future releases and improvements.