Support [rolling] upgrade of HDFS #362
So, looking at https://hadoop.apache.org/docs/r3.4.0/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html and testing locally, it looks like the sequence is roughly:
We need to detect whether to enter "upgrade mode"; we could do that by storing a […]. Steps 3-5 could be done by adding a check to the end of the STS apply. Step 7 would be simple enough; it happens by leaving "upgrade mode". Steps 1/2/6 are the big question marks: we could run them from the operator container, exec into an existing namenode Pod, or spawn a dedicated Job. Running them from the operator generally seems like a poor idea, both because we would need to bundle an HDFS client and JVM and because the operators don't have Kerberos identities (we still need to look into how JMX is affected by this too). Running them as a Job means that we don't rely on picking a single "admin namenode", but it creates another asynchronous lifecycle for us to manage.
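For reference, the dfsadmin side of this (steps 1, 2, and 6, assuming the step numbering of the linked HdfsRollingUpgrade document) would look roughly like the following; these commands need to run as an HDFS superuser, wherever we end up executing them:

```shell
# Step 1: prepare the rolling upgrade (creates the rollback fsimage)
hdfs dfsadmin -rollingUpgrade prepare

# Step 2: poll until the rollback image is ready; the output switches
# to "Proceed with rolling upgrade" once it is
hdfs dfsadmin -rollingUpgrade query

# Step 6: finalize once every component is running the new version
hdfs dfsadmin -rollingUpgrade finalize
```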
A more MVP-ish option would be to only add an override that performs steps 3-5, leaving the dfsadmin steps (1/2/6) to be run manually.
I was hoping it was as simple as an init container, but it looks like there is some choreography involved (with the "wait for" steps). I think there should be a "do it for me automatically" option, but if there is some clear risk to that, then it should be opt-in (e.g. demos can opt in, while customers might be more cautious). Could something in stackablectl help with this choreography, to make the manual steps less of a burden?
Ultimately, all database upgrades (which is what this is) are risky. I agree that it might make sense to have a safeguard, but we should probably think about that as a platform-wide decision then.
I don't think that would make much sense. Steps 3-5 come down to updating the StatefulSets in order, which is managed entirely by the operator, and steps 1/2/6 wouldn't be any easier for stackablectl to do than for the operator. Stackablectl also generally isn't responsible for modifying stacklets at the moment, and I'd be sad to see that change.
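Just to make that mapping concrete: steps 3-5 amount to something like the following ordered rolling restarts. The StatefulSet names and the role order here are illustrative assumptions, not the operator's actual implementation:

```shell
# Hypothetical sketch: roll each role's StatefulSet in turn, waiting
# for one to become ready before touching the next (names assumed).
for sts in simple-hdfs-journalnode-default \
           simple-hdfs-namenode-default \
           simple-hdfs-datanode-default; do
  kubectl rollout restart statefulset "$sts"
  kubectl rollout status statefulset "$sts" --timeout=10m
done
```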
Ah yeah, that makes it more clear.
Sure, but operational tasks can be codified (assuming there are checks at each step proving that it is safe to proceed with the next), and IMO this is what operators are for. The problem could probably be modeled sufficiently as a finite state machine. Maybe it is a tall order to codify operations like this, but it should be the ultimate platform-wide goal.
I mean, yeah. I agree that I'd like to have as much as possible managed by the operator. I'm just not sure HDFS is special enough to warrant its own rules for when upgrades should be allowed.
Do we have documentation for this? If so, please link it here; if not, why not? And can you please include a snippet that we can use for the release notes?
The docs are at https://docs.stackable.tech/home/nightly/hdfs/usage-guide/upgrading
I suppose, "The Stackable Operator for HDFS now supports upgrading existing HDFS installations", or something like that.
Is the functionality specific to 3.3 -> 3.4? |
No, the mechanism is generic. One caveat is that it currently takes the pessimistic approach of applying it to any upgrade, so 3.3.4 -> 3.3.6 would also trigger it.
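The pessimistic trigger described above can be sketched like this (a hypothetical illustration, with an assumed function name, not the operator's actual code): any difference at all between the deployed and requested versions enters "upgrade mode", so even a patch bump like 3.3.4 -> 3.3.6 qualifies.

```shell
# Hypothetical sketch of the pessimistic trigger: no semver parsing,
# any version change at all counts as an upgrade.
needs_upgrade_mode() {
  deployed="$1"
  requested="$2"
  [ "$deployed" != "$requested" ]
}

needs_upgrade_mode "3.3.4" "3.3.6" && echo "3.3.4 -> 3.3.6: entering upgrade mode"
needs_upgrade_mode "3.3.4" "3.3.4" || echo "3.3.4 -> 3.3.4: no-op"
```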
Yeah, that's a good point. Hm.
As of 23.4, when you upgrade your HDFS, e.g. 3.2.2 -> 3.3.4, you run into an error. Ideally we should start a rolling upgrade of all components. Currently you simply cannot upgrade your HDFS without hacking things (e.g. there are no cliOverrides to add `-upgrade` or similar).

Edit from the past: at least the upgrade 3.3.4 -> 3.3.6 worked.