prithivMLmods committed on
Commit
23d28f6
·
verified ·
1 Parent(s): cd3abc6

Update README.md

Files changed (1)
  1. README.md +162 -1
README.md CHANGED
@@ -2,4 +2,165 @@
2
  license: apache-2.0
3
  language:
4
  - en
5
- ---
5
+ datasets:
6
+ - mychen76/invoices-and-receipts_ocr_v1
7
+ - unsloth/LaTeX_OCR
8
+ - prithivMLmods/Latex-KIE
9
+ base_model:
10
+ - Qwen/Qwen2-VL-2B-Instruct
11
+ pipeline_tag: image-text-to-text
12
+ library_name: transformers
13
+ tags:
14
+ - text-generation-inference
15
+ - image-caption
16
+ - mini
17
+ - art explain
18
+ - visual report generation
19
+ - photo captions
20
+ - cutlines
21
+ - qwen2
22
+ - inscription subtitle
23
+ - representation
24
+ ---
25
+ ![0.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/Mu73LMUxQT6pkiRNaKFVj.png)
26
+
27
+ # **Imgscope-OCR-2B-0527**
28
+
29
+ > **Imgscope-OCR-2B-0527** is a fine-tuned version of *Qwen2-VL-2B-Instruct*, optimized for *messy handwriting recognition*, *document OCR*, *realistic handwritten OCR*, and *math problem solving with LaTeX formatting*. It is trained on custom document and handwriting OCR datasets and pairs a conversational interface with visual and textual understanding for multimodal applications.
30
+
31
+ ---
32
+
33
+ ### Key Enhancements
34
+
35
+ * **Understanding of Images at Various Resolutions and Aspect Ratios**
36
+ Inherits from Qwen2-VL its handling of images at arbitrary resolutions and aspect ratios; the Qwen2-VL family reports state-of-the-art results on visual understanding benchmarks such as MathVista, DocVQA, RealWorldQA, and MTVQA.
37
+
38
+ * **Enhanced Handwriting OCR**
39
+ Specifically optimized for recognizing and interpreting **realistic and messy handwriting** with high accuracy. Ideal for digitizing handwritten documents and notes.
40
+
41
+ * **Document OCR Fine-Tuning**
42
+ Fine-tuned with curated and realistic **document OCR datasets**, enabling accurate extraction of text from various structured and unstructured layouts.
43
+
44
+ * **Understanding Videos of 20+ Minutes**
45
+ Capable of processing long videos for **video-based question answering**, **transcription**, and **content generation**.
46
+
47
+ * **Device Control Agent**
48
+ Supports decision-making and control capabilities for integration with **mobile devices**, **robots**, and **automation systems** using visual-textual commands.
49
+
50
+ * **Multilingual OCR Support**
51
+ In addition to English and Chinese, the model supports **OCR in multiple languages** including European languages, Japanese, Korean, Arabic, and Vietnamese.
52
+
53
+ ---
54
+
55
+ ### How to Use
56
+
57
+ ```python
58
+ from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
59
+ from qwen_vl_utils import process_vision_info
60
+
61
+ # Load the model
62
+ model = Qwen2VLForConditionalGeneration.from_pretrained(
63
+     "prithivMLmods/Callisto-OCR3-2B-Instruct",  # replace with updated model ID if available
64
+     torch_dtype="auto",
65
+     device_map="auto",
66
+ )
67
+
68
+ # Optional: Flash Attention 2 for faster inference (also requires `import torch`)
69
+ # model = Qwen2VLForConditionalGeneration.from_pretrained(
70
+ #     "prithivMLmods/Callisto-OCR3-2B-Instruct",
71
+ #     torch_dtype=torch.bfloat16,
72
+ #     attn_implementation="flash_attention_2",
73
+ #     device_map="auto",
74
+ # )
75
+
76
+ # Load processor
77
+ processor = AutoProcessor.from_pretrained("prithivMLmods/Callisto-OCR3-2B-Instruct")
78
+
79
+ messages = [
80
+     {
81
+         "role": "user",
82
+         "content": [
83
+             {
84
+                 "type": "image",
85
+                 "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
86
+             },
87
+             {"type": "text", "text": "Recognize the handwriting in this image."},
88
+         ],
89
+     }
90
+ ]
91
+
92
+ # Prepare input
93
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
94
+ image_inputs, video_inputs = process_vision_info(messages)
95
+ inputs = processor(
96
+     text=[text],
97
+     images=image_inputs,
98
+     videos=video_inputs,
99
+     padding=True,
100
+     return_tensors="pt",
101
+ )
102
+ inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto"
103
+
104
+ # Generate output
105
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
106
+ generated_ids_trimmed = [
107
+     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
108
+ ]
109
+ output_text = processor.batch_decode(
110
+     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
111
+ )
112
+ print(output_text)
113
+ ```
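Because Qwen2-VL-based models process images at their native resolution, large scans can produce many visual tokens. The following is a minimal sketch of capping that cost. It assumes the `min_pixels` and `max_pixels` processor arguments and the 28x28-pixels-per-token convention described in the upstream Qwen2-VL documentation carry over unchanged to this fine-tune. `pixel_budget` and `load_processor` are hypothetical helper names, and the 256 to 1280 token range is only an illustrative default.

```python
# Assumption: one visual token covers a 28x28 pixel area, as documented upstream for Qwen2-VL.
PATCH_PIXELS = 28 * 28


def pixel_budget(min_tokens=256, max_tokens=1280):
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    if min_tokens <= 0 or min_tokens > max_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    return min_tokens * PATCH_PIXELS, max_tokens * PATCH_PIXELS


def load_processor(model_id, min_tokens=256, max_tokens=1280):
    """Hypothetical helper: load the processor with a bounded image resolution."""
    # Imported lazily so pixel_budget() works without transformers installed.
    from transformers import AutoProcessor

    min_pixels, max_pixels = pixel_budget(min_tokens, max_tokens)
    return AutoProcessor.from_pretrained(
        model_id, min_pixels=min_pixels, max_pixels=max_pixels
    )


if __name__ == "__main__":
    print(pixel_budget())  # pixel bounds for the illustrative 256-1280 token range
```

A lower `max_tokens` trades fine detail (small handwriting, dense tables) for memory and speed, so the bound is worth tuning per document type rather than fixing once.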
114
+
115
+ ---
116
+
117
+ ### Buffering Output (Streaming)
118
+
119
+ ```python
120
+ buffer = ""
121
+ for new_text in streamer:
122
+     buffer += new_text
123
+     buffer = buffer.replace("<|im_end|>", "")
124
+     yield buffer
125
+ ```
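The fragment above assumes an existing `streamer` and an enclosing generator function. The following is a minimal sketch of one way to supply both, assuming `TextIteratorStreamer` from `transformers` and a background thread running `generate`. `stream_buffer` and `start_streaming` are hypothetical helper names, not part of the model's API.

```python
from threading import Thread


def stream_buffer(chunks, stop_token="<|im_end|>"):
    """Yield the text accumulated so far, with the end-of-turn marker removed."""
    buffer = ""
    for new_text in chunks:
        buffer += new_text
        # Stripping on the whole buffer also catches a marker split across chunks.
        buffer = buffer.replace(stop_token, "")
        yield buffer


def start_streaming(model, processor, inputs, max_new_tokens=128):
    """Hypothetical helper: run generate() in a thread and return the streamer."""
    # Imported lazily so stream_buffer() can be used without transformers installed.
    from transformers import TextIteratorStreamer

    streamer = TextIteratorStreamer(
        processor.tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    kwargs = dict(**inputs, streamer=streamer, max_new_tokens=max_new_tokens)
    Thread(target=model.generate, kwargs=kwargs).start()
    return streamer


if __name__ == "__main__":
    # Stand-in chunks; with a loaded model, pass start_streaming(...) instead.
    for partial in stream_buffer(["Hello", ", wor", "ld<|im_end|>"]):
        print(partial)
```

Running `generate` in its own thread is what lets the caller iterate over the streamer while tokens are still being produced; yielding the full buffer each time suits UIs that redraw the whole message on every update.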
126
+
127
+ ---
128
+
129
+ ### Key Features
130
+
131
+ 1. **Realistic Messy Handwriting OCR**
132
+
133
+ * Fine-tuned for **complex and hard-to-read handwritten inputs** using real-world handwriting datasets.
134
+
135
+ 2. **Document OCR and Layout Understanding**
136
+
137
+ * Accurately extracts text from structured documents, including scanned pages, forms, and academic papers.
138
+
139
+ 3. **Image and Text Multi-modal Reasoning**
140
+
141
+ * Combines **vision-language capabilities** for tasks like captioning, answering image-based queries, and understanding image+text prompts.
142
+
143
+ 4. **Math Problem Solving and LaTeX Rendering**
144
+
145
+ * Converts mathematical expressions and problem-solving steps into **LaTeX** format.
146
+
147
+ 5. **Multi-turn Conversations**
148
+
149
+ * Supports **dialogue-based reasoning**, retaining context for follow-up questions.
150
+
151
+ 6. **Video + Image + Text-to-Text Generation**
152
+
153
+ * Accepts inputs from videos, images, or combined media with text, and generates relevant output accordingly.
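The multi-turn feature listed above can be sketched with the same `messages` structure used in the usage example: the model's reply is appended as an assistant turn before the next user question, and the full history is passed to `apply_chat_template` again. This is a minimal illustration, not the author's code. `user_turn`, `assistant_turn`, and `follow_up` are hypothetical helpers, and the image path and reply text are placeholders.

```python
def user_turn(text, image=None):
    """Build a user message, optionally with an image reference."""
    content = []
    if image is not None:
        content.append({"type": "image", "image": image})
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


def assistant_turn(text):
    """Build an assistant message holding a previous model reply."""
    return {"role": "assistant", "content": [{"type": "text", "text": text}]}


def follow_up(messages, reply, question):
    """Return a new history with the reply and the next question appended."""
    return messages + [assistant_turn(reply), user_turn(question)]


if __name__ == "__main__":
    # Placeholder image path and reply; substitute real values in practice.
    history = [user_turn("Recognize the handwriting in this image.",
                         image="file:///path/to/note.jpg")]
    history = follow_up(history, "<model reply goes here>",
                        "Rewrite any equations as LaTeX.")
    for message in history:
        print(message["role"], [part["type"] for part in message["content"]])
```

Returning a new list instead of mutating `messages` keeps earlier histories reusable, for example when asking several alternative follow-up questions about the same image.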
154
+
155
+ ---
156
+
157
+ ## **Intended Use**
158
+
159
+ **Imgscope-OCR-2B-0527** is intended for:
160
+
161
+ * Handwritten and printed document digitization
162
+ * OCR pipelines for educational institutions and businesses
163
+ * Academic and scientific content parsing, especially math-heavy documents
164
+ * Assistive tools for visually impaired users
165
+ * Robotic and mobile automation agents interpreting screen or camera data
166
+ * Multilingual OCR processing for document translation or archiving