tasdid25 committed
Commit ecbefb3 · verified · 1 Parent(s): d7ff29b

Upload folder using huggingface_hub

Files changed (5)
  1. README.md +117 -0
  2. config.json +67 -0
  3. model.py +245 -0
  4. requirements.txt +4 -0
  5. train_model.py +36 -0
README.md ADDED
@@ -0,0 +1,117 @@
+---
+license: mit
+tags:
+- recommendation-system
+- collaborative-filtering
+- matrix-factorization
+- movie-recommendations
+- movielens
+- machine-learning
+library_name: scikit-learn
+---
+
+# DataSynthis_ML_JobTask
+
+A movie recommendation system using item-based collaborative filtering and matrix factorization on the MovieLens 100k dataset.
+
+## Model Description
+
+This model provides personalized movie recommendations using two classic algorithms:
+
+- **Collaborative Filtering (CF)**: Item-based similarity using cosine similarity
+- **Matrix Factorization (SVD)**: Singular Value Decomposition for dimensionality reduction
+
+## Dataset
+
+- **MovieLens 100k**: 100,000 ratings from 943 users on 1,682 movies
+- **User ID Range**: 1-943
+- **Movie Count**: 1,682 unique movies
+- **Rating Scale**: 1-5 stars
+
+## Usage
+
+### Python
+
+```python
+from model import predict
+
+# Get recommendations using SVD (default)
+recommendations = predict(user_id=1, n_recommendations=10, method="svd")
+
+# Get recommendations using collaborative filtering
+recommendations = predict(user_id=1, n_recommendations=10, method="cf")
+
+print(recommendations)
+```
+
+### Parameters
+
+- **user_id** (int): User ID between 1-943 (required)
+- **n_recommendations** (int): Number of recommendations between 1-20 (default: 10)
+- **method** (str): "svd" for matrix factorization or "cf" for collaborative filtering (default: "svd")
+
+### Output
+
+Returns a list of dictionaries with movie recommendations:
+
+```json
+[
+  {
+    "movie_id": 50,
+    "title": "Star Wars (1977)",
+    "predicted_rating": 4.5
+  },
+  {
+    "movie_id": 181,
+    "title": "Return of the Jedi (1983)",
+    "predicted_rating": 4.3
+  }
+]
+```
+
+## Model Performance
+
+- **SVD Method**: Fast lookups from a precomputed rank-20 reconstruction (`TruncatedSVD` with 20 components)
+- **Collaborative Filtering**: More interpretable, based on item similarity
+- **Unknown Users**: Returns an error entry instead of raising when the user ID is not in the dataset
+
+## Technical Details
+
+- **Framework**: Scikit-learn
+- **Algorithms**: TruncatedSVD, Cosine Similarity
+- **Data Processing**: Pandas for the user-item matrix and prediction tables
+- **Memory**: Uses dense user-item and item-item matrices, which is workable at MovieLens 100k scale
+
+## Installation
+
+```bash
+pip install pandas numpy scikit-learn
+```
+
+## Training
+
+The model trains on the MovieLens 100k dataset the first time `predict` is called. To train and save it explicitly:
+
+```python
+from model import MovieRecommender
+
+model = MovieRecommender()
+model.load_data()
+model.train()
+model.save_model("movie_recommender.pkl")
+```
+
+## Citation
+
+```bibtex
+@misc{datasynthis_ml_jobtask,
+  title={DataSynthis ML JobTask: Movie Recommendation System},
+  author={tasdid25},
+  year={2025},
+  url={https://huggingface.co/tasdid25/DataSynthis_ML_JobTask}
+}
+```
+
+## License
+
+MIT License - see LICENSE file for details.
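
Review note: the README describes item-based collaborative filtering with cosine similarity. A minimal sketch of that idea on a toy matrix follows; the data, the `predict_item_cf` name, and the three-user setup are assumptions for illustration, not part of the commit. Each predicted rating is a similarity-weighted average of the ratings the user has already given.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-item matrix (rows: users, columns: movies); NaN marks "not rated".
ratings = pd.DataFrame(
    [[5.0, 4.0, np.nan], [4.0, np.nan, 2.0], [np.nan, 5.0, 1.0]],
    index=[1, 2, 3],
    columns=[10, 20, 30],
)


def predict_item_cf(ratings: pd.DataFrame, user_id: int) -> pd.Series:
    """Similarity-weighted average of the user's own ratings, per movie."""
    # Item-item cosine similarity, with missing ratings treated as 0.
    sim = pd.DataFrame(
        cosine_similarity(ratings.T.fillna(0)),
        index=ratings.columns,
        columns=ratings.columns,
    )
    user = ratings.loc[user_id]
    weighted_sum = sim.dot(user.fillna(0))
    # Normalize only by the similarities of items the user actually rated.
    sim_sum = sim.abs().dot(user.notna().astype(float))
    return weighted_sum / np.maximum(sim_sum, 1e-9)


preds = predict_item_cf(ratings, user_id=1)
```

User 1 has not rated movie 30, so its prediction is a blend of that user's ratings for movies 10 and 20, weighted by how similar each is to movie 30.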
config.json ADDED
@@ -0,0 +1,67 @@
+{
+  "model_type": "movie_recommendation",
+  "name": "DataSynthis_ML_JobTask",
+  "description": "Movie recommendation system using collaborative filtering and matrix factorization",
+  "version": "1.0.0",
+  "author": "tasdid25",
+  "license": "MIT",
+  "framework": "scikit-learn",
+  "algorithms": [
+    "collaborative_filtering",
+    "matrix_factorization_svd"
+  ],
+  "dataset": "movielens_100k",
+  "features": {
+    "user_id_range": [1, 943],
+    "movie_count": 1682,
+    "rating_count": 100000,
+    "recommendation_methods": ["svd", "cf"],
+    "max_recommendations": 20
+  },
+  "input_schema": {
+    "user_id": {
+      "type": "integer",
+      "description": "User ID (1-943)",
+      "required": true
+    },
+    "n_recommendations": {
+      "type": "integer",
+      "description": "Number of recommendations (1-20)",
+      "default": 10,
+      "required": false
+    },
+    "method": {
+      "type": "string",
+      "description": "Recommendation method",
+      "enum": ["svd", "cf"],
+      "default": "svd",
+      "required": false
+    }
+  },
+  "output_schema": {
+    "type": "array",
+    "items": {
+      "type": "object",
+      "properties": {
+        "movie_id": {
+          "type": "integer",
+          "description": "Movie ID"
+        },
+        "title": {
+          "type": "string",
+          "description": "Movie title"
+        },
+        "predicted_rating": {
+          "type": "number",
+          "description": "Predicted rating for the user"
+        }
+      }
+    }
+  },
+  "dependencies": [
+    "pandas>=2.0.0",
+    "numpy>=1.24.0",
+    "scikit-learn>=1.3.0"
+  ],
+  "inference_function": "predict"
+}
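
Review note: `config.json` declares limits (user IDs 1-943, 1-20 recommendations, method `"svd"` or `"cf"`), but nothing in this commit checks them before inference. One way a caller might enforce them is sketched below; `validate_request` and the module-level constants are hypothetical names, with the limits copied from the config above.

```python
from typing import Any, Dict

# Limits copied from config.json; the helper itself is hypothetical.
USER_ID_RANGE = (1, 943)
MAX_RECOMMENDATIONS = 20
METHODS = ("svd", "cf")


def validate_request(user_id: Any, n_recommendations: Any = 10,
                     method: Any = "svd") -> Dict[str, Any]:
    """Return the request as a dict, or raise ValueError on a schema violation."""
    for name, value in (("user_id", user_id),
                        ("n_recommendations", n_recommendations)):
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an integer, got {value!r}")
    low, high = USER_ID_RANGE
    if not low <= user_id <= high:
        raise ValueError(f"user_id must be in {low}-{high}, got {user_id}")
    if not 1 <= n_recommendations <= MAX_RECOMMENDATIONS:
        raise ValueError(
            f"n_recommendations must be in 1-{MAX_RECOMMENDATIONS}, "
            f"got {n_recommendations}"
        )
    if method not in METHODS:
        raise ValueError(f"method must be one of {METHODS}, got {method!r}")
    return {"user_id": user_id,
            "n_recommendations": n_recommendations,
            "method": method}
```

A wrapper could call `validate_request(...)` and pass the returned dict on as keyword arguments, so an unsupported method name fails loudly instead of falling through to a default branch.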
model.py ADDED
@@ -0,0 +1,245 @@
+"""
+DataSynthis_ML_JobTask - Movie Recommendation Model
+A movie recommendation system using collaborative filtering and matrix factorization.
+"""
+
+import pandas as pd
+import numpy as np
+from sklearn.metrics.pairwise import cosine_similarity
+from sklearn.decomposition import TruncatedSVD
+import os
+import urllib.request
+import zipfile
+import pickle
+from typing import List, Dict, Optional, Union
+
+
+class MovieRecommender:
+    """
+    Movie Recommendation Model using collaborative filtering and SVD.
+    """
+
+    def __init__(self):
+        self.ratings = None
+        self.movies = None
+        self.user_item_matrix = None
+        self.item_similarity = None
+        self.item_similarity_df = None
+        self.svd_model = None
+        self.pred_svd_df = None
+        self.is_trained = False
+
+    def load_data(self):
+        """Load MovieLens 100k dataset."""
+        dataset_url = "http://files.grouplens.org/datasets/movielens/ml-100k.zip"
+        dataset_path = "ml-100k"
+
+        if not os.path.exists(dataset_path):
+            if os.path.exists("ml-100k.zip"):
+                print("Extracting existing MovieLens 100k dataset...")
+                with zipfile.ZipFile("ml-100k.zip", "r") as zip_ref:
+                    zip_ref.extractall(".")
+                print("Extraction complete.")
+            else:
+                print("Downloading MovieLens 100k dataset...")
+                try:
+                    urllib.request.urlretrieve(dataset_url, "ml-100k.zip")
+                    with zipfile.ZipFile("ml-100k.zip", "r") as zip_ref:
+                        zip_ref.extractall(".")
+                    print("Download complete.")
+                except Exception as e:
+                    print(f"Download failed: {e}")
+                    raise RuntimeError("Could not download dataset") from e
+
+        # Load ratings
+        self.ratings = pd.read_csv(
+            "ml-100k/u.data",
+            sep="\t",
+            names=["user_id", "movie_id", "rating", "timestamp"]
+        )
+
+        # Load movies
+        self.movies = pd.read_csv(
+            "ml-100k/u.item",
+            sep="|",
+            encoding="ISO-8859-1",
+            names=["movie_id", "title", "release_date", "video_release_date", "IMDb_URL",
+                   "unknown", "Action", "Adventure", "Animation", "Children", "Comedy",
+                   "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
+                   "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"]
+        )
+
+        # Remove timestamp column
+        self.ratings.drop("timestamp", axis=1, inplace=True)
+
+        print(f"Loaded {len(self.ratings)} ratings from {self.ratings['user_id'].nunique()} users")
+        print(f"Loaded {len(self.movies)} movies")
+
+    def train(self):
+        """Train the recommendation models."""
+        if self.ratings is None:
+            self.load_data()
+
+        # Create user-item matrix (NaN where a user has not rated a movie)
+        self.user_item_matrix = self.ratings.pivot(
+            index='user_id', columns='movie_id', values='rating'
+        )
+
+        # Collaborative Filtering - Item-based similarity
+        self.item_similarity = cosine_similarity(self.user_item_matrix.T.fillna(0))
+        self.item_similarity_df = pd.DataFrame(
+            self.item_similarity,
+            index=self.user_item_matrix.columns,
+            columns=self.user_item_matrix.columns
+        )
+
+        # SVD - Matrix Factorization
+        R = self.user_item_matrix.fillna(0)
+        self.svd_model = TruncatedSVD(n_components=20, random_state=42)
+        # fit_transform returns U @ Sigma, so Sigma must not be applied a second time
+        U_sigma = self.svd_model.fit_transform(R)
+        Vt = self.svd_model.components_
+        pred_svd = np.dot(U_sigma, Vt)
+        self.pred_svd_df = pd.DataFrame(pred_svd, index=R.index, columns=R.columns)
+
+        self.is_trained = True
+        print("Model training completed!")
+
+    def predict_ratings_cf(self, user_id: int) -> pd.Series:
+        """Predict ratings using collaborative filtering."""
+        if not self.is_trained:
+            raise ValueError("Model must be trained first")
+
+        if user_id not in self.user_item_matrix.index:
+            raise ValueError(f"User {user_id} not found in dataset")
+
+        user_ratings = self.user_item_matrix.loc[user_id]
+        weighted_sum = self.item_similarity_df.dot(user_ratings.fillna(0))
+        sim_sum = np.abs(self.item_similarity_df).dot(user_ratings.notna().astype(int))
+        pred = weighted_sum / np.maximum(sim_sum, 1e-9)
+        return pred
+
+    def recommend_movies(self, user_id: int, n_recommendations: int = 10,
+                         method: str = "svd") -> List[Dict]:
+        """
+        Get movie recommendations for a user.
+
+        Args:
+            user_id: User ID to get recommendations for
+            n_recommendations: Number of recommendations to return
+            method: "svd" or "cf" (collaborative filtering)
+
+        Returns:
+            List of dictionaries with movie recommendations
+        """
+        if not self.is_trained:
+            self.train()
+
+        # Check if user exists
+        if user_id not in self.user_item_matrix.index:
+            available_users = sorted(self.user_item_matrix.index.tolist())
+            return [{
+                "error": f"User {user_id} not found",
+                "available_users": f"Available user IDs: {available_users[:10]}... (showing first 10)"
+            }]
+
+        # Get predictions
+        if method == "svd":
+            preds = self.pred_svd_df.loc[user_id]
+        else:  # collaborative filtering
+            preds = self.predict_ratings_cf(user_id)
+
+        # Remove already watched movies
+        watched = self.ratings[self.ratings.user_id == user_id].movie_id.values
+        preds = preds.drop(watched, errors='ignore')
+
+        # Get top recommendations, keeping them in best-first order
+        top_movies = preds.sort_values(ascending=False).head(n_recommendations).index
+        recommendations = self.movies.set_index("movie_id").loc[top_movies, ["title"]].reset_index()
+
+        # Convert to list of dictionaries
+        result = []
+        for _, row in recommendations.iterrows():
+            result.append({
+                "movie_id": int(row["movie_id"]),
+                "title": row["title"],
+                "predicted_rating": float(preds[row["movie_id"]])
+            })
+
+        return result
+
+    def get_user_stats(self, user_id: int) -> Dict:
+        """Get statistics for a user."""
+        if not self.is_trained:
+            self.train()
+
+        if user_id not in self.user_item_matrix.index:
+            return {"error": f"User {user_id} not found"}
+
+        user_ratings = self.ratings[self.ratings.user_id == user_id]
+
+        return {
+            "user_id": user_id,
+            "total_ratings": len(user_ratings),
+            "average_rating": float(user_ratings["rating"].mean()),
+            "rating_distribution": user_ratings["rating"].value_counts().to_dict()
+        }
+
+    def get_available_users(self) -> List[int]:
+        """Get list of available user IDs."""
+        if not self.is_trained:
+            self.train()
+        return sorted(self.user_item_matrix.index.tolist())
+
+    def save_model(self, path: str):
+        """Save the trained model."""
+        if not self.is_trained:
+            raise ValueError("Model must be trained first")
+
+        model_data = {
+            'ratings': self.ratings,
+            'movies': self.movies,
+            'user_item_matrix': self.user_item_matrix,
+            'item_similarity_df': self.item_similarity_df,
+            'svd_model': self.svd_model,
+            'pred_svd_df': self.pred_svd_df,
+            'is_trained': self.is_trained
+        }
+
+        with open(path, 'wb') as f:
+            pickle.dump(model_data, f)
+
+        print(f"Model saved to {path}")
+
+    def load_model(self, path: str):
+        """Load a trained model."""
+        with open(path, 'rb') as f:
+            model_data = pickle.load(f)
+
+        self.ratings = model_data['ratings']
+        self.movies = model_data['movies']
+        self.user_item_matrix = model_data['user_item_matrix']
+        self.item_similarity_df = model_data['item_similarity_df']
+        self.svd_model = model_data['svd_model']
+        self.pred_svd_df = model_data['pred_svd_df']
+        self.is_trained = model_data['is_trained']
+
+        print(f"Model loaded from {path}")
+
+
+# Create a global model instance for inference
+model = MovieRecommender()
+
+def predict(user_id: int, n_recommendations: int = 10, method: str = "svd") -> List[Dict]:
+    """
+    Inference function for Hugging Face model.
+
+    Args:
+        user_id: User ID to get recommendations for
+        n_recommendations: Number of recommendations (default: 10)
+        method: Recommendation method - "svd" or "cf" (default: "svd")
+
+    Returns:
+        List of movie recommendations
+    """
+    return model.recommend_movies(user_id, n_recommendations, method)
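
Review note: scikit-learn's `TruncatedSVD.fit_transform` returns the data already projected onto the components, that is, U·Σ rather than U alone. The low-rank reconstruction is therefore that output times `components_`; multiplying by the singular values again scales each component twice. A small check on random toy data (the matrix, its size, and the variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Toy "rating" matrix: 30 users x 12 movies, values 0-5 (0 stands in for unrated).
R = rng.integers(0, 6, size=(30, 12)).astype(float)

svd = TruncatedSVD(n_components=5, random_state=42)
U_sigma = svd.fit_transform(R)   # shape (30, 5): already scaled by singular values
Vt = svd.components_             # shape (5, 12)

# Rank-5 approximation of R.
recon = U_sigma @ Vt

# Applying the singular values a second time gives U @ Sigma^2 @ Vt instead.
double_scaled = U_sigma @ np.diag(svd.singular_values_) @ Vt

err_recon = np.linalg.norm(R - recon)
err_double = np.linalg.norm(R - double_scaled)
print(f"reconstruction error: {err_recon:.3f}, double-scaled error: {err_double:.3f}")
```

`svd.inverse_transform(U_sigma)` computes the same product as `U_sigma @ Vt` and may read more clearly in production code.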
requirements.txt ADDED
@@ -0,0 +1,4 @@
+pandas>=2.0.0
+numpy>=1.24.0
+scikit-learn>=1.3.0
+huggingface_hub>=0.20.0
train_model.py ADDED
@@ -0,0 +1,36 @@
+"""
+Training script for DataSynthis_ML_JobTask model.
+This script trains the model and saves it for deployment.
+"""
+
+from model import MovieRecommender
+import os
+
+def main():
+    """Train and save the movie recommendation model."""
+    print("Starting model training...")
+
+    # Initialize model
+    model = MovieRecommender()
+
+    # Train the model
+    model.train()
+
+    # Save the trained model
+    model.save_model("movie_recommender.pkl")
+
+    print("Model training completed and saved!")
+
+    # Test the model
+    print("\nTesting model with user ID 1...")
+    recommendations = model.recommend_movies(user_id=1, n_recommendations=5, method="svd")
+
+    print("Sample recommendations:")
+    for rec in recommendations:
+        print(f"- {rec['title']} (ID: {rec['movie_id']}, Rating: {rec['predicted_rating']:.2f})")
+
+    print(f"\nAvailable users: {len(model.get_available_users())}")
+    print(f"User ID range: {min(model.get_available_users())} - {max(model.get_available_users())}")
+
+if __name__ == "__main__":
+    main()
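
Review note: the smoke test prints recommendations in whatever order the result list has, so the order in which top-N movie IDs are joined to titles matters. Filtering a movie table with `isin` returns rows in table order, not rank order; reindexing by the ranked IDs keeps the best prediction first. A toy illustration (the four-movie table and scores are made up):

```python
import pandas as pd

movies = pd.DataFrame({"movie_id": [1, 2, 3, 4], "title": ["A", "B", "C", "D"]})
preds = pd.Series(
    [3.1, 4.8, 2.0, 4.2],
    index=pd.Index([1, 2, 3, 4], name="movie_id"),
)

# Top-3 movie IDs by predicted rating, best first.
top_ids = preds.sort_values(ascending=False).head(3).index

# isin keeps the table's row order, so the ranking is lost.
by_isin = movies[movies.movie_id.isin(top_ids)]["movie_id"].tolist()

# Reindexing by the ranked IDs keeps best-first order.
ranked = movies.set_index("movie_id").reindex(top_ids).reset_index()
by_rank = ranked["movie_id"].tolist()

print(by_isin, by_rank)
```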