Real Dataset Loader

Load real labeled datasets to train more accurate models

Auto-Discovered Local Datasets
File                   Platform    Adapter    Label Col
instagram_real.csv     Instagram   generic    label
final-v1.csv           Unknown     generic    is_fake
tiktok_train.csv       Tiktok      generic    label
reddit_train.csv       Reddit      generic    label
linkedin_train.csv     Linkedin    generic    label
github_train.csv       Github      generic    label
snapchat_train.csv     X           twibot20   label
youtube_train.csv      Youtube     generic    label
youtube_training.csv   Youtube     generic    label
instagram_train.csv    Instagram   generic    label
discord_profiles.csv   Discord     generic    label
facebook_train.csv     Facebook    generic    label
x_train.csv            X           generic    label
discord_train.csv      Discord     generic    label
Load from URL or File Path
Accepts HTTP/HTTPS URLs (CSV or ZIP) and absolute local file paths. The file must contain a label column (e.g. label, fake, is_bot).
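
The label-column requirement can be checked before training. The following is a minimal sketch, not the loader's actual code: the candidate list and the function names are assumptions, built from the example names above plus is_fake from the local dataset table.

```python
import csv
import io

# Assumed candidates: the examples named above plus is_fake from the table.
LABEL_CANDIDATES = ("label", "is_fake", "fake", "is_bot")


def find_label_column(fieldnames):
    """Return the first header matching a known label name (case-insensitive)."""
    lookup = {name.strip().lower(): name for name in fieldnames}
    for candidate in LABEL_CANDIDATES:
        if candidate in lookup:
            return lookup[candidate]
    raise ValueError(f"no label column found in {list(fieldnames)}")


def read_labeled_csv(text):
    """Parse CSV text and return (rows, label_column)."""
    reader = csv.DictReader(io.StringIO(text))
    label_col = find_label_column(reader.fieldnames or [])
    return list(reader), label_col


# Hypothetical sample data, for illustration only.
sample = "username,followers,Is_Bot\nalice,120,0\nbot_9283,5,1\n"
rows, label_col = read_labeled_csv(sample)
print(label_col, [row[label_col] for row in rows])
```

Failing early with a ValueError keeps a file without labels from reaching the training step.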
Known Public Datasets

These are well-known academic or community datasets. Some require manual download first.

Instagram Fake Profile Dataset (Local — 5,000 profiles)
Platform: instagram, Adapter: instagram_kaggle

5,000 Instagram profiles (2,500 fake / 2,500 real) with pre-computed features. Already present in the Dataset/ folder.

TwiBot-22 (Twitter Bot Benchmark — 1M users, CC BY 4.0)
Platform: x, Adapter: twibot20

Largest Twitter bot benchmark: 1,000,000 accounts across 8 domains. Direct download from Zenodo (no login required).

Manual download needed
Direct download at https://zenodo.org/record/6950806 (CC BY 4.0, no login needed). Extract label.csv + user.json.
Cresci-2017 Twitter Social Spambots
Platform: x, Adapter: cresci17

Classic benchmark: genuine accounts + 3 types of social spambots (~14,000 labeled accounts). Free ZIP download.

Manual download needed
Download ZIP from http://mib.projects.iit.cnr.it/dataset.html (no login). Extract genuine_accounts/ and social_spambots_*/ CSVs.
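
Once the ZIP is extracted, the folder name indicates the class. Below is a minimal sketch of deriving a label from a file's path. The 0/1 encoding, the function name, and the users.csv file name in the examples are assumptions for illustration, not the cresci17 adapter's actual behavior.

```python
from pathlib import PurePosixPath


def label_from_path(path):
    """Return 0 for genuine accounts, 1 for social spambots, None otherwise."""
    for part in PurePosixPath(path).parts:
        if part.startswith("genuine_accounts"):
            return 0  # assumed encoding: 0 = real
        if part.startswith("social_spambots_"):
            return 1  # assumed encoding: 1 = fake
    return None


# Hypothetical extracted paths, for illustration only.
for example in ("genuine_accounts/users.csv",
                "social_spambots_1/users.csv",
                "README.txt"):
    print(example, label_from_path(example))
```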
BoDeGHa — GitHub Bot Detection Dataset (~5,700 accounts)
Platform: github, Adapter: generic

Ground-truth labeled GitHub bots vs humans. CSV files are directly in the GitHub repo — no auth needed.

Manual download needed
Clone or download CSV from https://github.com/mehdigolzadeh/BoDeGHa — label column is 'type' (Bot/Human).
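
Because this dataset stores its label as text in the 'type' column, it has to be mapped to a binary value before training. The following is a minimal sketch assuming 1 means bot and 0 means human. The function name and the 'account' column in the sample are hypothetical.

```python
import csv
import io


def type_to_label(value):
    """Map the 'type' column (Bot/Human) to 1/0; reject anything else."""
    normalized = value.strip().lower()
    if normalized == "bot":
        return 1
    if normalized == "human":
        return 0
    raise ValueError(f"unexpected type value: {value!r}")


# Hypothetical sample rows, for illustration only.
sample = "account,type\ndependabot,Bot\noctocat,Human\n"
labels = [type_to_label(row["type"])
          for row in csv.DictReader(io.StringIO(sample))]
print(labels)  # [1, 0]
```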
OSoMe Bot Repository (Indiana University)
Platform: x, Adapter: generic

Aggregates 20+ labeled Twitter bot datasets. Many are direct ZIP downloads with no login.

Manual download needed
Browse individual datasets at https://botometer.osome.iu.edu/bot-repository/datasets.html
Live Public Profile Scrapers

Fetch real public profile data to build live training datasets. Most of the scrapers below need no API keys; Reddit requires a free app registration. Install with pip install -r requirements.txt.

Instagram — instaloader
Public profiles, no authentication needed
import instaloader

# Anonymous session: no login is performed.
L = instaloader.Instaloader()
p = instaloader.Profile.from_username(L.context, "cristiano")
print(p.followers, p.is_verified)  # follower count, verified flag
X/Twitter — ntscraper
Via Nitter — zero authentication
from ntscraper import Nitter

s = Nitter()
u = s.get_profile_info("elonmusk")
# Counts are nested under the 'stats' key of the returned dict.
print(u['stats']['followers'])
Reddit — praw
Free API — register app at reddit.com/prefs/apps
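import praw

# Fill in the credentials issued when registering the app.
r = praw.Reddit(client_id="...", client_secret="...", user_agent="...")
u = r.redditor("spez")  # lazy object; data is fetched on first attribute access
print(u.comment_karma, u.created_utc)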
GitHub — PyGithub
Unauthenticated: 60 req/hr. Token: 5000/hr
from github import Github

g = Github()  # or Github("token") for the higher rate limit
u = g.get_user("torvalds")
print(u.followers, u.public_repos)  # follower count, public repository count
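
The GitHub limits quoted above (60 requests/hour unauthenticated, 5,000 with a token) translate into a minimum pause between calls. The sketch below shows one way to pace a batch of lookups. The helper names and the injected fetch and sleep callables are hypothetical, not part of PyGithub.

```python
import time


def min_interval_seconds(requests_per_hour):
    """Smallest even spacing between calls that stays within the hourly limit."""
    return 3600.0 / requests_per_hour


def fetch_all(usernames, fetch, requests_per_hour=60, sleep=time.sleep):
    """Call fetch(username) for each name, pausing between consecutive calls."""
    interval = min_interval_seconds(requests_per_hour)
    results = []
    for index, name in enumerate(usernames):
        if index > 0:
            sleep(interval)  # no pause is needed before the first call
        results.append(fetch(name))
    return results


print(min_interval_seconds(60))    # 60.0 seconds without a token
print(min_interval_seconds(5000))  # 0.72 seconds with a token
```

In practice fetch could wrap g.get_user from the snippet above. Passing sleep as a parameter keeps the pacing logic testable without real delays.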
REST API — Programmatic Access

Integrate fake profile detection into your apps via a JSON API.

Single Profile Prediction
POST /api/v1/predict/instagram
Content-Type: application/json

{
  "username": "cristiano",
  "followers": 650000000,
  "following": 500,
  "posts": 3800,
  "bio": "Official account",
  "is_verified": 1
}

→ { "prediction": "Legit", "confidence": 97.2, "is_fake": false }
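
A client for the request above can be sketched with the standard library alone. The base URL below is an assumption (the page does not state a host or port), and the function names are hypothetical. Only the path, the Content-Type header, and the JSON body come from the example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumption: replace with the real host


def build_predict_request(platform, profile, base_url=BASE_URL):
    """Build the POST request for /api/v1/predict/<platform>."""
    return urllib.request.Request(
        url=f"{base_url}/api/v1/predict/{platform}",
        data=json.dumps(profile).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(platform, profile, base_url=BASE_URL):
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_predict_request(platform, profile, base_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


req = build_predict_request("instagram", {
    "username": "cristiano",
    "followers": 650000000,
    "following": 500,
    "posts": 3800,
    "bio": "Official account",
    "is_verified": 1,
})
print(req.get_method(), req.full_url)
```

Separating request construction from sending makes it possible to inspect the payload without a server running.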
Batch Prediction
POST /api/v1/predict/x
Content-Type: application/json

{
  "profiles": [
    { "username": "elonmusk", "followers": 200000000, "is_verified": 1 },
    { "username": "bot_9283", "followers": 5, "following": 4999 }
  ]
}

→ { "count": 2, "summary": { "Legit": 1, "Fake": 1 }, ... }
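
The batch body and its summary can be handled with two small helpers. This is a sketch under stated assumptions: the helper names are hypothetical, and only the "profiles" envelope and the "count" and "summary" fields shown above are relied on; the elided part of the response is not used.

```python
import json


def build_batch_body(profiles):
    """Wrap profile dicts in the {"profiles": [...]} envelope shown above."""
    profiles = list(profiles)
    if not profiles:
        raise ValueError("at least one profile is required")
    return json.dumps({"profiles": profiles})


def fake_share(response):
    """Fraction of profiles flagged Fake, from 'count' and 'summary' only."""
    count = response["count"]
    if count == 0:
        return 0.0
    return response["summary"].get("Fake", 0) / count


body = build_batch_body([
    {"username": "elonmusk", "followers": 200000000, "is_verified": 1},
    {"username": "bot_9283", "followers": 5, "following": 4999},
])
sample_response = {"count": 2, "summary": {"Legit": 1, "Fake": 1}}
print(fake_share(sample_response))  # 0.5
```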