Infrastructure: Change db from mariadb to postgres #711
As part of this PR, on production, I am going to put the DB back on the main node. Both the main node and the db node sit almost entirely idle.
Roughly speaking, this is the logic of the migration. Let's assume we merge the PR, and then do something like:

1. Block anyone from connecting during the migration.
2. Save a backup off to /tmp.
3. Update the config (and also remove DB_EXTERNAL), and bring everything up with the updates.
4. Create a temporary mariadb, into which we are going to load all the data.
5. Wait for the temp db to finish setting up.
6. Load the backup into the temp db.
7. Migrate the data into postgres. We intentionally ignore the comments/files tables (which contain old data and cause migration issues).
8. We should be good to go now, so we let people connect again and clean up.
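The steps above might look roughly like the following shell sketch. Every service name, path, and credential here is illustrative — these are assumptions for the sketch, not the PR's actual scripts or the dojo's real compose layout:

```sh
# 1. Block anyone from connecting during the migration (illustrative rule).
iptables -I INPUT -p tcp -m multiport --dports 80,443 -j DROP

# 2. Save a backup off to /tmp.
docker exec db mysqldump --all-databases > /tmp/db-backup.sql

# 3. After updating the config (and removing DB_EXTERNAL), bring everything up.
docker compose up -d --build

# 4. Create a temporary mariadb to load all the data into...
docker run -d --name tmp-mariadb -e MARIADB_ROOT_PASSWORD=secret mariadb

# 5. ...and wait for it to finish setting up.
until docker exec tmp-mariadb mariadb-admin -psecret ping --silent; do sleep 1; done

# 6. Load the backup into the temp db.
docker exec -i tmp-mariadb mariadb -psecret < /tmp/db-backup.sql

# 7. Migrate the data into postgres (the comments/files table exclusions
#    would go in the migration tool's config, e.g. a pgloader load file).
pgloader mysql://root:secret@tmp-mariadb/ctfd pgsql://ctfd@db/ctfd

# 8. Let people connect again and clean up.
iptables -D INPUT -p tcp -m multiport --dports 80,443 -j DROP
docker rm -f tmp-mariadb && rm /tmp/db-backup.sql
```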
Here's some profiling. This compares this PR on a mostly-idle box (postgres) against master on production (mariadb). It is unclear how much of the difference is due to the load conditions and how much is due to the database engines, but it should be noted that production was relatively quiet at the time of testing.

Dojo Stats

```python
import os

from CTFd.plugins.dojo_plugin.utils.stats import get_dojo_stats

os.environ["CACHE_WARMER"] = "true"

for dojo_id in ["computing-101", "welcome", "intro-to-cybersecurity", "cse365-s2025"]:
    print(dojo_id)
    dojo = Dojos.from_id(dojo_id).first()
    %timeit -n1 -r3 get_dojo_stats(dojo)
```

This PR on a mostly-idle box (postgres):

Master on production (mariadb):
Dojo Scoreboard

```python
def get_scoreboard_for(model, duration):
    duration_filter = (
        Solves.date >= datetime.datetime.utcnow() - datetime.timedelta(days=duration)
        if duration else True
    )
    solves = db.func.count().label("solves")
    rank = (
        db.func.row_number()
        .over(order_by=(solves.desc(), db.func.max(Solves.id)))
        .label("rank")
    )
    user_entities = [Solves.user_id, Users.name, Users.email]
    query = (
        model.solves()
        .filter(duration_filter)
        .group_by(*user_entities)
        .order_by(rank)
        .with_entities(rank, solves, *user_entities)
    )
    row_results = query.all()
    results = [{key: getattr(item, key) for key in item.keys()} for item in row_results]
    return results

for dojo_id in ["computing-101", "welcome", "intro-to-cybersecurity", "cse365-s2025"]:
    print(dojo_id)
    dojo = Dojos.from_id(dojo_id).first()
    %timeit -n1 -r3 get_scoreboard_for(dojo, None)
```

This PR on a mostly-idle box (postgres):

Master on production (mariadb):
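For intuition, the ordering that `row_number()` imposes in `get_scoreboard_for` can be approximated in plain Python: sort by solve count descending, then break ties by whoever reached that count with an earlier final solve (smaller max solve id). The data below is made up purely for illustration:

```python
# Hypothetical (user, solves, max_solve_id) rows; not real production data.
rows = [
    ("alice", 10, 105),
    ("bob",   12,  90),
    ("carol", 10,  99),
]

# Mirror of ORDER BY solves DESC, MAX(Solves.id): carol outranks alice
# because she reached 10 solves with an earlier final solve.
ranked = sorted(rows, key=lambda row: (-row[1], row[2]))
scoreboard = [
    {"rank": rank, "user": user, "solves": solves}
    for rank, (user, solves, _) in enumerate(ranked, start=1)
]
```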
Dojo Scores

```python
def scores_query(granularity, dojo_filter):
    solve_count = db.func.count(Solves.id).label("solve_count")
    last_solve_date = db.func.max(Solves.date).label("last_solve_date")
    fields = granularity + [Solves.user_id, solve_count, last_solve_date]
    grouping = granularity + [Solves.user_id]
    dsc_query = db.session.query(*fields).where(
        Dojos.dojo_id == DojoChallenges.dojo_id,
        DojoChallenges.challenge_id == Solves.challenge_id,
        dojo_filter,
    ).group_by(*grouping).order_by(Dojos.id, solve_count.desc(), last_solve_date)
    return dsc_query

def dojo_scores():
    dsc_query = scores_query([Dojos.id], or_(Dojos.data["type"].astext == "public", Dojos.official))
    user_ranks = {}
    user_solves = {}
    dojo_ranks = {}
    for dojo_id, user_id, solve_count, _ in dsc_query:
        dojo_ranks.setdefault(dojo_id, []).append(user_id)
        user_ranks.setdefault(user_id, {})[dojo_id] = len(dojo_ranks[dojo_id])
        user_solves.setdefault(user_id, {})[dojo_id] = solve_count
    return {
        "user_ranks": user_ranks,
        "user_solves": user_solves,
        "dojo_ranks": dojo_ranks,
    }

%timeit -n1 -r3 dojo_scores()
```

This PR on a mostly-idle box (postgres):

Master on production (mariadb):
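The `setdefault` bookkeeping in `dojo_scores()` relies on the query's sort order: rows arrive best-first within each dojo, so a user's rank is simply their position in that dojo's append-order list. A tiny sketch of that mechanism with made-up rows:

```python
# Hypothetical (dojo_id, user_id, solve_count) rows, already sorted
# best-first within the dojo, as the query's ORDER BY guarantees.
rows = [
    ("welcome", 1, 40),
    ("welcome", 2, 25),
    ("welcome", 3, 10),
]

dojo_ranks, user_ranks, user_solves = {}, {}, {}
for dojo_id, user_id, solve_count in rows:
    dojo_ranks.setdefault(dojo_id, []).append(user_id)
    # Rank = current length of the list the user was just appended to.
    user_ranks.setdefault(user_id, {})[dojo_id] = len(dojo_ranks[dojo_id])
    user_solves.setdefault(user_id, {})[dojo_id] = solve_count
```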
Module Scores

```python
def scores_query(granularity, dojo_filter):
    solve_count = db.func.count(Solves.id).label("solve_count")
    last_solve_date = db.func.max(Solves.date).label("last_solve_date")
    fields = granularity + [Solves.user_id, solve_count, last_solve_date]
    grouping = granularity + [Solves.user_id]
    dsc_query = db.session.query(*fields).where(
        Dojos.dojo_id == DojoChallenges.dojo_id,
        DojoChallenges.challenge_id == Solves.challenge_id,
        dojo_filter,
    ).group_by(*grouping).order_by(Dojos.id, solve_count.desc(), last_solve_date)
    return dsc_query

def module_scores():
    dsc_query = scores_query([Dojos.id, DojoChallenges.module_index], or_(Dojos.data["type"].astext == "public", Dojos.official))
    user_ranks = {}
    user_solves = {}
    module_ranks = {}
    for dojo_id, module_idx, user_id, solve_count, _ in dsc_query:
        module_ranks.setdefault(dojo_id, {}).setdefault(module_idx, []).append(user_id)
        user_ranks.setdefault(user_id, {}).setdefault(dojo_id, {})[module_idx] = len(module_ranks[dojo_id][module_idx])
        user_solves.setdefault(user_id, {}).setdefault(dojo_id, {})[module_idx] = solve_count
    return {
        "user_ranks": user_ranks,
        "user_solves": user_solves,
        "module_ranks": module_ranks,
    }

%timeit -n1 -r3 module_scores()
```

This PR on a mostly-idle box (postgres):

Master on production (mariadb):
This resolves #710.
TODO:

- `dojo db backup`
- `dojo db restore`
- PG-JSON TODOs (json incompatibility between sqlalchemy mariadb vs sqlalchemy postgres)
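One likely source of the PG-JSON incompatibility (an assumption — the PR does not spell it out) is how the two engines hand JSON scalars back: Postgres `->>` (SQLAlchemy's `.astext`, as used in `scores_query` above) yields unwrapped text, while MariaDB's `JSON_EXTRACT` yields the JSON-encoded scalar, quotes included. A stdlib-only sketch of that quoting mismatch, purely for illustration:

```python
import json

document = json.dumps({"type": "public"})  # what the JSON column stores

# MariaDB-style: JSON_EXTRACT(doc, '$.type') returns the JSON-encoded
# scalar, i.e. the string with its quotes still on: '"public"'
mariadb_value = json.dumps(json.loads(document)["type"])

# Postgres-style: doc->>'type' (.astext) returns the plain text: 'public'
postgres_value = json.loads(document)["type"]

print(mariadb_value)   # quoted
print(postgres_value)  # unquoted
```

Code comparing an extracted value to a bare string can therefore match on one engine and silently miss on the other.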