HS Code Classifier

ML-powered Harmonized System (HS) code classification from goods descriptions. Multi-level hierarchical classifier predicting chapter (2-digit), heading (4-digit), and subheading (6-digit) codes. Supports multi-language input (EN/RU/AZ) with TF-IDF + gradient boosting and optional transformer embeddings.


Overview

The HS Code Classifier automates the assignment of Harmonized System codes to trade goods descriptions. It is designed for customs brokers, trade compliance teams, and logistics platforms that need fast, accurate tariff classification.

Key capabilities:

  • Hierarchical classification: Predicts codes at chapter (2-digit), heading (4-digit), and subheading (6-digit) levels in a top-down cascade.
  • Multi-language support: Processes descriptions in English, Russian, and Azerbaijani with language-aware preprocessing.
  • Confidence scoring: Returns calibrated confidence scores at each classification level.
  • Batch processing: Classify single items or thousands of descriptions in one call.
  • REST API: Production-ready FastAPI service with health checks and OpenAPI docs.
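The three classification levels are nested prefixes of one 6-digit code. A minimal sketch of that relationship follows; the `HSLevels` type and `split_hs_code` helper are hypothetical names used for illustration and are not part of this package's API.

```python
from typing import NamedTuple


class HSLevels(NamedTuple):
    chapter: str     # first 2 digits
    heading: str     # first 4 digits
    subheading: str  # all 6 digits


def split_hs_code(code: str) -> HSLevels:
    """Split a 6-digit HS code into its chapter, heading, and subheading prefixes."""
    # Accept the common dotted notation, e.g. "3923.21".
    digits = code.replace(".", "").replace(" ", "")
    if len(digits) != 6 or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"expected a 6-digit HS code, got {code!r}")
    return HSLevels(digits[:2], digits[:4], digits)


print(split_hs_code("3923.21"))
```

Keeping codes as strings rather than integers preserves leading zeros, which matter for chapters 01 to 09.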

Model Architecture

Input Description (EN/RU/AZ)
        |
   Preprocessor
   (lowercasing, stopwords, abbreviation expansion, normalization)
        |
   Feature Extraction
   (TF-IDF + character n-grams + keyword/unit detection)
        |
   +-----------------------+
   | Chapter Classifier    |  (2-digit, LightGBM)
   +-----------------------+
        |
   +-----------------------+
   | Heading Classifier    |  (4-digit, LightGBM, conditioned on chapter)
   +-----------------------+
        |
   +-----------------------+
   | Subheading Classifier |  (6-digit, LightGBM, conditioned on heading)
   +-----------------------+
        |
   Output: { chapter, heading, subheading, confidences }
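
The preprocessing and character n-gram stages in the diagram can be illustrated with a short standard-library sketch. The abbreviation table, stopword list, and function names below are assumptions made for the example; the project's own preprocessor.py and features.py may work differently.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical lookup tables, for illustration only.
ABBREVIATIONS = {"pcs": "pieces", "kg": "kilogram", "ss": "stainless steel"}
STOPWORDS = {"the", "of", "and", "for", "in", "with"}


def preprocess(text: str) -> str:
    """Normalize Unicode, lowercase, expand abbreviations, and drop stopwords."""
    text = unicodedata.normalize("NFKC", text).lower()
    tokens = re.findall(r"\w+", text)  # \w also matches Cyrillic and Azerbaijani letters
    expanded = [ABBREVIATIONS.get(token, token) for token in tokens]
    return " ".join(token for token in expanded if token not in STOPWORDS)


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams, padding with spaces at both ends."""
    padded = f" {text} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


clean = preprocess("Bolts of SS, 10 KG")
print(clean)  # bolts stainless steel 10 kilogram
print(char_ngrams(clean).most_common(3))
```

Character n-grams tend to be robust to typos and inflected word forms, which is one reason they are commonly paired with word-level TF-IDF for short multilingual descriptions.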

Each level is a separate LightGBM classifier. The heading classifier receives the predicted chapter as an additional feature, and the subheading classifier receives the predicted heading. Conditioning each level on its parent steers the three predictions toward a single consistent code path, although an error at the chapter level propagates to the levels below it.
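
The cascade can be sketched without any ML dependency by treating each level as a callable that returns a probability per code. This is an illustrative sketch, not the code in model.py: the `LevelModel` type, `pick`, `cascade_predict`, and the toy models are hypothetical, and the prefix mask in `pick` is an added assumption that goes beyond passing the parent as a feature.

```python
from typing import Callable, Dict, Tuple

# Stand-in for a trained level model: takes a feature dict, returns {code: probability}.
LevelModel = Callable[[Dict[str, str]], Dict[str, float]]


def pick(probs: Dict[str, float], parent: str = "") -> Tuple[str, float]:
    """Return the most probable code that extends `parent`, renormalized over allowed codes."""
    allowed = {code: p for code, p in probs.items() if code.startswith(parent)}
    total = sum(allowed.values())
    if total <= 0:
        raise ValueError(f"no candidate code extends parent {parent!r}")
    best = max(allowed, key=allowed.get)
    return best, allowed[best] / total


def cascade_predict(text: str, chapter_model: LevelModel, heading_model: LevelModel,
                    subheading_model: LevelModel) -> Dict[str, Dict[str, object]]:
    """Predict top-down, passing each prediction to the next level as a feature."""
    chapter, c_conf = pick(chapter_model({"text": text}))
    heading, h_conf = pick(heading_model({"text": text, "parent": chapter}), chapter)
    subheading, s_conf = pick(subheading_model({"text": text, "parent": heading}), heading)
    return {
        "chapter": {"code": chapter, "confidence": c_conf},
        "heading": {"code": heading, "confidence": h_conf},
        "subheading": {"code": subheading, "confidence": s_conf},
    }


# Toy models with fixed outputs, only to exercise the cascade.
def toy_chapter(features: Dict[str, str]) -> Dict[str, float]:
    return {"39": 0.9, "44": 0.1}


def toy_heading(features: Dict[str, str]) -> Dict[str, float]:
    return {"3923": 0.6, "4415": 0.3, "3926": 0.1}


def toy_subheading(features: Dict[str, str]) -> Dict[str, float]:
    return {"392321": 0.5, "392329": 0.3, "441510": 0.2}


result = cascade_predict("polyethylene plastic bags", toy_chapter, toy_heading, toy_subheading)
print(result["subheading"]["code"])  # 392321
```

The mask guarantees that a heading always starts with the predicted chapter, at the cost of making a wrong chapter unrecoverable further down.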

Optional transformer embeddings (e.g., multilingual BERT) can be concatenated with TF-IDF features for improved accuracy on ambiguous descriptions.

Performance Metrics

Evaluated on a held-out test set of trade declarations:

Level      | Accuracy | Top-3 Accuracy | F1 (macro)
-----------|----------|----------------|-----------
Chapter    | 92.4%    | 97.8%          | 0.91
Heading    | 85.1%    | 94.2%          | 0.83
Subheading | 78.6%    | 91.5%          | 0.76
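
For readers who want to reproduce metrics of this kind on their own data, here is a minimal standard-library sketch of top-k accuracy and macro-averaged F1. The function names and the toy labels are illustrative assumptions; this is not the project's evaluation code, and the figures in the table were not produced by it.

```python
from typing import Sequence


def top_k_accuracy(y_true: Sequence[str], ranked: Sequence[Sequence[str]], k: int) -> float:
    """Fraction of samples whose true code is among the first k ranked candidates."""
    if not y_true:
        raise ValueError("y_true must not be empty")
    hits = sum(truth in candidates[:k] for truth, candidates in zip(y_true, ranked))
    return hits / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every class seen in truth or prediction."""
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy chapter-level data: each inner list is a model's candidates, best first.
y_true = ["03", "39", "39", "73"]
ranked = [["03", "16"], ["40", "39"], ["39", "40"], ["72", "83"]]
y_pred = [candidates[0] for candidates in ranked]

print(top_k_accuracy(y_true, ranked, k=1))  # 0.5
print(top_k_accuracy(y_true, ranked, k=2))  # 0.75
print(macro_f1(y_true, y_pred))
```

Macro averaging weights every code equally, so rare codes pull the score down more than they affect plain accuracy.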

Installation

# Clone the repository
git clone https://github.com/ShahinHasanov90/hs-code-classifier.git
cd hs-code-classifier

# Install dependencies
make install

# Or manually
pip install -r requirements.txt
pip install -e .

Requirements: Python 3.9+

Quick Start

from classifier.model import HSClassifier

# Load a trained model
model = HSClassifier.load("models/hs_classifier.pkl")

# Classify a goods description
result = model.predict("atlantic salmon fillets, frozen, 10 kg boxes")
print(result)
# {
#     "chapter": {"code": "03", "description": "Fish and crustaceans", "confidence": 0.96},
#     "heading": {"code": "0304", "description": "Fish fillets", "confidence": 0.91},
#     "subheading": {"code": "030481", "description": "Frozen fillets of salmon", "confidence": 0.87}
# }
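
Because a confidence is returned at every level, a caller can fall back to a coarser code when the finer levels are uncertain. The helper below is a hypothetical sketch of that pattern, not part of the package; it assumes only the result shape printed above, and the 0.85 threshold is an arbitrary example value.

```python
from typing import Optional


def deepest_confident_code(result: dict, threshold: float = 0.85) -> Optional[str]:
    """Return the most specific code whose confidence, and every parent's, meets the threshold."""
    chosen = None
    for level in ("chapter", "heading", "subheading"):
        entry = result[level]
        if entry["confidence"] < threshold:
            break  # stop at the first uncertain level; deeper levels depend on it
        chosen = entry["code"]
    return chosen


# Sample result in the same shape as the predict() output.
sample = {
    "chapter": {"code": "39", "confidence": 0.94},
    "heading": {"code": "3923", "confidence": 0.88},
    "subheading": {"code": "392321", "confidence": 0.82},
}

print(deepest_confident_code(sample))                  # 3923
print(deepest_confident_code(sample, threshold=0.95))  # None
```

A `None` result is a natural signal to route the description to manual review.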

API Usage

Start the server

make serve
# or
uvicorn classifier.api:app --host 0.0.0.0 --port 8000

Endpoints

POST /classify

Classify a single goods description.

curl -X POST http://localhost:8000/classify \
  -H "Content-Type: application/json" \
  -d '{"description": "polyethylene plastic bags", "language": "en"}'

Response:

{
  "chapter": {"code": "39", "description": "Plastics and articles thereof", "confidence": 0.94},
  "heading": {"code": "3923", "description": "Articles for conveyance or packing", "confidence": 0.88},
  "subheading": {"code": "392321", "description": "Sacks and bags of polymers of ethylene", "confidence": 0.82}
}
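
The same request can be made from Python with only the standard library. This is a sketch under the assumption that the server runs at http://localhost:8000 as in the curl example; `build_classify_request` and `classify` are illustrative names, not functions shipped with the package.

```python
import json
import urllib.request


def build_classify_request(description: str, language: str = "en",
                           base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST /classify request without sending it."""
    payload = json.dumps({"description": description, "language": language}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/classify",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(description: str, language: str = "en", timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_classify_request(description, language)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# With the server running:
#   result = classify("polyethylene plastic bags")
#   print(result["subheading"]["code"])
```

Separating request construction from sending keeps the payload easy to inspect and test without a live server.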

POST /classify/batch

Classify multiple descriptions in a single request.

curl -X POST http://localhost:8000/classify/batch \
  -H "Content-Type: application/json" \
  -d '{"items": [{"description": "cotton t-shirts"}, {"description": "steel bolts M10"}]}'
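
For large workloads it may help to split descriptions into several batch requests. The sketch below builds payloads in the `{"items": [...]}` shape shown above; the helper name and the default batch size of 100 are assumptions, since the README does not state a per-request limit.

```python
from typing import Dict, Iterator, List, Sequence


def batch_payloads(descriptions: Sequence[str],
                   batch_size: int = 100) -> Iterator[Dict[str, List[Dict[str, str]]]]:
    """Yield /classify/batch request bodies containing at most batch_size items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(descriptions), batch_size):
        chunk = descriptions[start:start + batch_size]
        yield {"items": [{"description": text} for text in chunk]}


for payload in batch_payloads(["cotton t-shirts", "steel bolts M10", "wooden chairs"], batch_size=2):
    print(len(payload["items"]), "items")
```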

POST /suggest

Get top-K candidate codes with confidence scores.

curl -X POST http://localhost:8000/suggest \
  -H "Content-Type: application/json" \
  -d '{"description": "wooden furniture", "top_k": 5}'

GET /health

Health check endpoint.

curl http://localhost:8000/health
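
In deployment scripts the health endpoint is typically polled until the service is ready. The following is a minimal sketch of such a loop; the function names are hypothetical, and it assumes that a healthy service answers GET /health with HTTP status 200, which the README does not state explicitly.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str = "http://localhost:8000/health", timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 (assumed to mean healthy)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(probe: Callable[[], bool] = http_probe,
                       attempts: int = 10, delay: float = 1.0) -> bool:
    """Call probe up to `attempts` times, sleeping `delay` seconds between failures."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# With the server starting up in another process:
#   if not wait_until_healthy():
#       raise SystemExit("classifier API did not become healthy")
```

Passing the probe as a parameter lets the retry logic be exercised with a fake probe instead of a running server.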

Training

# Train with default configuration
make train

# Or run the training script directly
python -m classifier.training --config config/model_config.yaml --data data/training_data.csv

Training configuration is managed via config/model_config.yaml. See the file for all available hyperparameters.

Project Structure

hs-code-classifier/
|-- src/
|   |-- classifier/
|       |-- __init__.py
|       |-- model.py            # Hierarchical classifier
|       |-- features.py         # Feature extraction (TF-IDF, n-grams)
|       |-- preprocessor.py     # Text preprocessing
|       |-- hierarchy.py        # HS code hierarchy management
|       |-- training.py         # Training pipeline
|       |-- api.py              # FastAPI endpoints
|       |-- schemas.py          # Pydantic schemas
|-- tests/
|   |-- test_model.py
|   |-- test_preprocessor.py
|   |-- test_hierarchy.py
|   |-- test_features.py
|-- data/
|   |-- hs_chapters.json        # HS chapter codes
|-- config/
|   |-- model_config.yaml       # Model configuration
|-- models/                     # Trained model artifacts
|-- requirements.txt
|-- setup.py
|-- Makefile
|-- Dockerfile
|-- LICENSE

Development

# Run tests
make test

# Run linter
make lint

# Clean build artifacts
make clean

License

MIT License. See LICENSE for details.

Copyright (c) 2022 Shahin Hasanov
