Real Dataset Loader
Load real labeled datasets to train more accurate models
Auto-Discovered Local Datasets
| File | Platform | Adapter | Label Col | Action |
|---|---|---|---|---|
| instagram_real.csv | | generic | label | |
| final-v1.csv | Unknown | generic | is_fake | |
| tiktok_train.csv | Tiktok | generic | label | |
| reddit_train.csv | | generic | label | |
| linkedin_train.csv | | generic | label | |
| github_train.csv | Github | generic | label | |
| snapchat_train.csv | X | twibot20 | label | |
| youtube_train.csv | Youtube | generic | label | |
| youtube_training.csv | Youtube | generic | label | |
| instagram_train.csv | | generic | label | |
| discord_profiles.csv | Discord | generic | label | |
| facebook_train.csv | | generic | label | |
| x_train.csv | X | generic | label | |
| discord_train.csv | Discord | generic | label | |
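The Label Col value names the column that holds the ground-truth label in each file. As a rough illustration of what reading such a file involves, the sketch below uses only the standard library. The function name `read_labeled_rows` and the set of accepted label values are assumptions made for this example; the loader's own adapters may handle labels differently.

```python
import csv
import io

# Hypothetical mapping from raw label text to 0/1 (1 = fake).
# The real adapters may accept a different set of values.
_LABEL_VALUES = {"1": 1, "0": 0, "fake": 1, "real": 0}


def read_labeled_rows(handle, label_col="label"):
    """Yield (features, label) pairs from an open CSV file handle."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or label_col not in reader.fieldnames:
        raise ValueError(f"label column {label_col!r} not found")
    for row in reader:
        # Remove the label from the feature dict and normalise it.
        raw = row.pop(label_col).strip().lower()
        if raw not in _LABEL_VALUES:
            raise ValueError(f"unrecognised label value: {raw!r}")
        yield row, _LABEL_VALUES[raw]


if __name__ == "__main__":
    # In-memory stand-in for a file such as final-v1.csv (label column: is_fake).
    sample = io.StringIO("username,followers,is_fake\nalice,120,0\nbot_1,3,1\n")
    for features, label in read_labeled_rows(sample, label_col="is_fake"):
        print(features, label)
```

Passing `label_col` explicitly mirrors the table above, where most files use `label` but `final-v1.csv` uses `is_fake`.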
Load from URL or File Path
Known Public Datasets
These are well-known academic or community datasets. Some require manual download first.
Instagram Fake Profile Dataset (Local — 5,000 profiles)
Platform: instagram. Adapter: instagram_kaggle.
5,000 Instagram profiles (2,500 fake / 2,500 real) with pre-computed features. Already present in the Dataset/ folder.
TwiBot-22 (Twitter Bot Benchmark — 1M users, CC BY 4.0)
Platform: x. Adapter: twibot20.
Largest Twitter bot benchmark: 1,000,000 accounts across 8 domains. Direct download from Zenodo (no login required).
Cresci-2017 Twitter Social Spambots
Platform: x. Adapter: cresci17.
Classic benchmark: genuine accounts plus 3 types of social spambots (~14,000 labeled accounts). Free ZIP download.
BoDeGHa — GitHub Bot Detection Dataset (~5,700 accounts)
Platform: github. Adapter: generic.
Ground-truth labeled GitHub bots vs. humans. The CSV files are directly in the GitHub repo; no auth needed.
OSoMe Bot Repository (Indiana University)
Platform: x. Adapter: generic.
Aggregates 20+ labeled Twitter bot datasets. Many are direct ZIP downloads with no login.
Live Public Profile Scrapers
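Fetch real public profile data to build live training datasets. Most of the libraries below work without API keys; PRAW is the exception and needs Reddit app credentials (client_id and client_secret).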
Install with pip install -r requirements.txt.
instaloader
import instaloader
L = instaloader.Instaloader()
p = instaloader.Profile.from_username(L.context, "cristiano")
print(p.followers, p.is_verified)
ntscraper
from ntscraper import Nitter
s = Nitter()
u = s.get_profile_info("elonmusk")
print(u['stats']['followers'])  # counts are nested under the 'stats' key
praw
import praw
r = praw.Reddit(client_id="...", client_secret="...", user_agent="...")
u = r.redditor("spez")
print(u.comment_karma, u.created_utc)
PyGithub
from github import Github
g = Github()  # or Github("token")
u = g.get_user("torvalds")
print(u.followers, u.public_repos)
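To build a training dataset from scraped profiles, each profile has to be flattened into a row. The sketch below maps an instaloader `Profile` onto the field names used in the prediction request on this page (username, followers, following, posts, bio, is_verified). The function name `profile_to_row` and the optional `label` key are hypothetical, chosen for illustration; the demo uses a stand-in object so it runs without network access.

```python
from types import SimpleNamespace


def profile_to_row(profile, label=None):
    """Flatten an instaloader-style Profile into a plain dict.

    Reads the Profile attributes username, followers, followees,
    mediacount, biography and is_verified.
    """
    row = {
        "username": profile.username,
        "followers": profile.followers,
        "following": profile.followees,
        "posts": profile.mediacount,
        "bio": profile.biography,
        "is_verified": int(profile.is_verified),
    }
    if label is not None:
        # Hypothetical column name, matching the "label" column used by
        # most of the local CSV files.
        row["label"] = label
    return row


if __name__ == "__main__":
    # Stand-in for instaloader.Profile.from_username(L.context, "...").
    fake_profile = SimpleNamespace(
        username="bot_9283", followers=5, followees=4999,
        mediacount=0, biography="", is_verified=False,
    )
    print(profile_to_row(fake_profile, label=1))
```

Rows produced this way can be written out with `csv.DictWriter` once a label has been assigned by hand or from a trusted source.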
REST API — Programmatic Access
Integrate fake profile detection into your apps via JSON API.
Single Profile Prediction
POST /api/v1/predict/instagram
Content-Type: application/json

{
  "username": "cristiano",
  "followers": 650000000,
  "following": 500,
  "posts": 3800,
  "bio": "Official account",
  "is_verified": 1
}

→ { "prediction": "Legit", "confidence": 97.2, "is_fake": false }
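A minimal Python client for the single-profile endpoint might look like the sketch below. It uses only the standard library. The base URL is an assumption (the page does not state where the service is hosted), and `build_predict_request` and `predict` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: the service address is not given on this page; adjust as needed.
BASE_URL = "http://localhost:5000"


def build_predict_request(platform, profile, base_url=BASE_URL):
    """Build (but do not send) a POST request for one profile."""
    body = json.dumps(profile).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/predict/{platform}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(platform, profile, base_url=BASE_URL, timeout=10):
    """Send the request and return the decoded JSON response."""
    req = build_predict_request(platform, profile, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    req = build_predict_request(
        "instagram",
        {"username": "cristiano", "followers": 650000000, "is_verified": 1},
    )
    print(req.get_method(), req.full_url)
```

Splitting request construction from sending keeps the payload easy to inspect before any network call is made.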
Batch Prediction
POST /api/v1/predict/x
Content-Type: application/json

{
  "profiles": [
    { "username": "elonmusk", "followers": 200000000, "is_verified": 1 },
    { "username": "bot_9283", "followers": 5, "following": 4999 }
  ]
}

→ { "count": 2, "summary": { "Legit": 1, "Fake": 1 }, ... }
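For larger collections it can help to split the profiles into several batch requests. The sketch below only builds the JSON bodies in the `{"profiles": [...]}` shape shown above. The helper name `batch_payloads` and the default batch size of 100 are assumptions; the page does not state a maximum batch size.

```python
import json


def batch_payloads(profiles, batch_size=100):
    """Yield JSON request bodies, each wrapping up to batch_size profiles."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(profiles), batch_size):
        chunk = profiles[start:start + batch_size]
        yield json.dumps({"profiles": chunk})


if __name__ == "__main__":
    profiles = [
        {"username": "elonmusk", "followers": 200000000, "is_verified": 1},
        {"username": "bot_9283", "followers": 5, "following": 4999},
    ]
    for body in batch_payloads(profiles, batch_size=1):
        print(body)
```

Each yielded body can be sent as the payload of a `POST /api/v1/predict/x` request with `Content-Type: application/json`.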