lemon-manga-translator Models


Model weights for the lemon-manga-translator pipeline.

Two local inference models for the pipeline that detects text in Japanese manga pages, erases the original text, and renders Korean translations. Both are provided in PyTorch (safetensors) and ONNX formats.

Structure

```
pytorch/
├── comic-text-detector/
│   ├── config.json           # YOLOv5 architecture config
│   └── model.safetensors     # blk_det + text_seg weights
└── aot-inpainting/
    ├── config.json           # AOT-GAN hyperparameters
    └── model.safetensors     # AOT generator weights

onnx/
├── comic-text-detector/
│   └── ctd.onnx              # FP32, fixed 1024×1024, opset 17
└── aot-inpainting/
    └── aot_folded.onnx       # FP32, dynamic H/W, opset 17, ScaledWS folded
```

Model Details

Comic Text Detector (CTD)

Text detection model that outputs YOLO bboxes and UNet pixel masks simultaneously.

| Field | Value |
| --- | --- |
| Architecture | YOLOv5s backbone + UNet mask head |
| Input | 1024×1024 RGB (letterbox) |
| Output | bbox predictions (N×6) + mask (1×1×1024×1024) |
| PyTorch | 63 MB (safetensors) |
| ONNX | 77 MB |
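
CTD preprocessing can be sketched as follows: a minimal numpy letterbox, assuming top-left padding, gray (114) fill, [0, 1] float input, and a nearest-neighbor resize to stay dependency-free. The pipeline's actual letterbox (interpolation, pad placement, normalization) may differ.

```python
import numpy as np

def letterbox_1024(img: np.ndarray, size: int = 1024):
    """Resize an HxWx3 uint8 image to fit size x size, padding the remainder.

    Nearest-neighbor resize keeps this dependency-free; the real pipeline
    likely uses cv2/bilinear. Padding placement (top-left) is an assumption.
    """
    h, w = img.shape[:2]
    r = size / max(h, w)                      # scale so the long side == size
    nh, nw = round(h * r), round(w * r)
    ys = np.clip((np.arange(nh) / r).astype(int), 0, h - 1)
    xs = np.clip((np.arange(nw) / r).astype(int), 0, w - 1)
    resized = img[ys[:, None], xs[None, :]]
    canvas = np.full((size, size, 3), 114, dtype=img.dtype)  # gray padding
    canvas[:nh, :nw] = resized
    # HWC uint8 -> NCHW float32 in [0, 1], the usual YOLOv5-style layout
    x = canvas.transpose(2, 0, 1)[None].astype(np.float32) / 255.0
    return x, r, (nh, nw)
```

The returned ratio and content size are needed afterwards to map the 1024×1024 bbox/mask outputs back to original image coordinates.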

AOT Inpainting

Inpainting model that fills masked text regions with natural background.

| Field | Value |
| --- | --- |
| Architecture | AOT-GAN (Aggregated Contextual Transformations) |
| Input | image (1×3×H×W) + mask (1×1×H×W); H and W must be multiples of 8 |
| Output | inpainted (1×3×H×W), value range [-1, 1] |
| PyTorch | 22 MB (safetensors) |
| ONNX | 23 MB |
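
The multiple-of-8 constraint is typically handled by padding before inference and cropping afterwards. A minimal sketch, assuming edge padding, a /127.5 − 1 normalization to [−1, 1], and a binary mask; the project's actual pre/post-processing may differ:

```python
import numpy as np

def prepare_aot_inputs(img: np.ndarray, mask: np.ndarray):
    """img: HxWx3 uint8, mask: HxW (nonzero = inpaint region).

    Returns (image 1x3xH'xW' in [-1, 1], mask 1x1xH'xW' in {0, 1},
    original (H, W)) with H', W' padded up to multiples of 8.
    """
    h, w = img.shape[:2]
    ph, pw = (-h) % 8, (-w) % 8           # padding needed to reach a multiple of 8
    img_p = np.pad(img, ((0, ph), (0, pw), (0, 0)), mode="edge")
    mask_p = np.pad(mask, ((0, ph), (0, pw)), mode="edge")
    x = img_p.transpose(2, 0, 1)[None].astype(np.float32) / 127.5 - 1.0
    m = (mask_p[None, None] > 0).astype(np.float32)
    return x, m, (h, w)

def finish_aot_output(out: np.ndarray, hw):
    """out: 1x3xH'xW' in [-1, 1] -> HxWx3 uint8, cropped to the original size."""
    h, w = hw
    img = np.clip((out[0].transpose(1, 2, 0) + 1.0) * 127.5, 0.0, 255.0)
    return img.round().astype(np.uint8)[:h, :w]
```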

Recommended Configuration

| Use Case | CTD | AOT | Notes |
| --- | --- | --- | --- |
| Deployment (pip install) | onnx/ctd.onnx | onnx/aot_folded.onnx | No torch needed; onnxruntime only |
| Research / fine-tuning | pytorch/comic-text-detector/ | pytorch/aot-inpainting/ | Requires PyTorch |


License

All models in this repository are licensed under GPL-3.0.

| Model | Source | License |
| --- | --- | --- |
| Comic Text Detector | Original author: dmMaze/comic-text-detector · distributed via manga-image-translator beta-0.3 | GPL-3.0 |
| AOT Inpainting | Architecture: AOT-GAN (Zeng et al., 2021) · fine-tuned by manga-image-translator beta-0.3 · refactored by mayocream/aot-inpainting | GPL-3.0 |

ONNX Conversion Notes


ONNX conversion and optimization were performed in LemonDouble/lemon-manga-translator.

CTD (ctd.onnx)

  • Converted TextDetBase (YOLOv5 + UNet) via torch.onnx.export.
  • NMS is performed outside the model as post-processing (numpy implementation).
  • Compared to PyTorch on real images: bbox diff ≤ 1.9e-3 px, mask diff ≤ 1.5e-5.
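
The external NMS step can be implemented as a plain greedy IoU NMS in numpy. The sketch below is illustrative; the repository's box format (x1, y1, x2, y2 assumed here) and IoU threshold are assumptions:

```python
import numpy as np

def nms_numpy(boxes: np.ndarray, scores: np.ndarray, iou_thres: float = 0.45):
    """Greedy NMS. boxes: Nx4 as (x1, y1, x2, y2); returns kept indices."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]        # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # intersection of box i with all remaining boxes
        xx1 = np.maximum(x1[i], x1[rest])
        yy1 = np.maximum(y1[i], y1[rest])
        xx2 = np.minimum(x2[i], x2[rest])
        yy2 = np.minimum(y2[i], y2[rest])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[rest] - inter + 1e-9)
        order = rest[iou <= iou_thres]    # drop boxes that overlap too much
    return np.array(keep, dtype=np.int64)
```

Keeping NMS outside the graph avoids the ONNX NonMaxSuppression op and lets the exported model stay a pure tensor-in/tensor-out function.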

AOT (aot_folded.onnx)

  • Folds the ScaledWSConv2d weight standardization: the result of get_weight() (var_mean → rsqrt → scale) is computed once in eval mode and baked into the stored weights.
  • This removes hundreds of var_mean/rsqrt/mul/sub nodes from the ONNX graph: same file size, fewer ops at inference.
  • Max diff vs. the original model: 5.7e-4 (effectively identical).
  • Dynamic H/W axes are supported (H and W must be multiples of 8).
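
The fold amounts to precomputing the standardized weight once. Below is a numpy sketch using the NFNet-style ScaledStdConv formula, gain · (w − mean) / sqrt(var · fan_in + eps) per output channel; the exact formula and eps used by the project's get_weight() are assumptions here:

```python
import numpy as np

def fold_scaled_ws(weight: np.ndarray, gain: np.ndarray, eps: float = 1e-4):
    """Bake ScaledWS-style weight standardization into a conv weight.

    weight: (out_ch, in_ch, kh, kw); gain: (out_ch,) per-channel gain.
    After folding, the conv uses the returned weight directly, so the
    exported ONNX graph carries no standardization nodes at runtime.
    """
    out_ch = weight.shape[0]
    fan_in = weight[0].size                       # in_ch * kh * kw
    flat = weight.reshape(out_ch, -1)
    mean = flat.mean(axis=1, keepdims=True)
    var = flat.var(axis=1, keepdims=True)
    scale = gain.reshape(out_ch, 1) / np.sqrt(var * fan_in + eps)
    return ((flat - mean) * scale).reshape(weight.shape)
```

Because the standardization depends only on the weights, not the input, folding changes when the computation happens rather than what is computed.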

INT8 Quantization Attempts

INT8 dynamic quantization was attempted on both models but not adopted:

  • CTD: NMS detection counts diverged (type inference is incomplete around the LeakyReLU Mul pattern).
  • AOT: output quality collapsed to PSNR 16 dB (the multiplicative chain of GatedConv gating, sigmoid × 1.8, and five custom LayerNorms compounds quantization error layer by layer).

Parity Verification

| Comparison | Result |
| --- | --- |
| PyTorch vs ONNX FP32 (CTD mask) | 99.996% pixel match |
| PyTorch vs ONNX FP32 (AOT inpaint) | 99.934% pixel match, max diff 47/255 |
| PyTorch vs ONNX FP32 (final image) | 99.2% match (bubble-fill boundary noise) |
| torch NMS vs numpy NMS | 100% byte-identical |
| AOT original vs AOT folded | max diff 5.7e-4 |
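
Figures like "99.996% pixel match" and "max diff 47/255" can be reproduced with a comparison along these lines (illustrative: exact uint8 equality is assumed as the match criterion, which may differ from the project's actual parity script):

```python
import numpy as np

def parity_report(a: np.ndarray, b: np.ndarray):
    """Compare two uint8 images.

    Returns (% of exactly matching pixels, max absolute difference).
    """
    diff = np.abs(a.astype(np.int16) - b.astype(np.int16))
    match = float((diff == 0).mean()) * 100.0
    return match, int(diff.max())
```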

Benchmark (AMD Ryzen 9 3950X, CPU, 1397×1969 image)

| Metric | PyTorch | ONNX (final) | Change |
| --- | --- | --- | --- |
| CTD (detection) | 784 ms | 764 ms | −3% |
| AOT (inpainting) | 2456 ms | 2722 ms | +11% |
| Full pipeline | ~4015 ms | 3566 ms | −11% |
| Peak RSS | 1481 MB | 1065 MB | −29% |
| Module import RSS | ~530 MB | ~30 MB | −94% |
| Install size (runtime) | ~200 MB | ~20 MB | −90% |

Measurement conditions: median of 3 runs after 1 warmup run; enable_cpu_mem_arena=False (the CLI deployment default).


