Coding Test for parsing config files from first principles.
From the Brief provided:
- Specify multiple configuration files.
- Latter files override earlier ones (ie. file order must be maintained).
- Dot-separated path (`"keyX.keyY"`) look-ups, returning the value (a single value or a section).
- Expose a getter for the dotted notation, eg. `get("database.host")`.
- Avoid config parsing libraries!!
- Gracefully reject invalid JSON files.
- Extensible for non-JSON files.
- Dockerize the Library.
- To show off the type of developer I am:
- Solution design & planning.
- List assumptions & reasoning for my choices.
- Documentation (docs, code comments, commit messages).
- Workflow (CI, tests, concise commits).
- Solution Proposal - Pros/Cons/Assumptions.
- Overriding instead of Amalgamation, ie.:

  ```python
  file1 = {"a": [{"b": 1}, {"c": 1}]}
  file2 = {"a": [{"c": 2}, {"d": 2}]}

  # After consolidating files in order...
  expected = {"a": [{"c": 2}, {"d": 2}]}            # Override.
  dont_exp = {"a": [{"b": 1}, {"c": 2}, {"d": 2}]}  # Amalgamate.
  ```

- Order is currently what is passed into the File Reading function.
- What is the expected scale of config files to read/parse/consolidate on
average?
- <10 - Batch reading into memory and chaining functions will be fine.
- 100s+ - The design can be tweaked to scale horizontally (the stretch goal of creating a
  Docker container, plus the known push to move to Services, suggests vertical scaling is
  not worth considering even as a temporary solution). eg.
- Decouple functionality into modules/Services.
- Message queues/pools to farm out to Workers.
- Store configs in a DB cluster for decoupled reads and de-dupe from hash checks on write.
- etc.
- Support for `.` within a key?
  - The JSON Spec (RFC-7159) supports any unicode character (apart from a small set that must be escaped) in a string. Therefore, it is legal to have a period in a key (eg. `{"a.b": 1}`). The current requirements of the Problem don't state this, but if this was an ongoing Project, the options would be:
    - Added as a future requirement.
    - Eventually raised as a Customer bug.
    - Documented as a design/known issue.
  - Support can be added by using the escaped version of a Full Stop (`U+002E`) and passing the dotted path as a `bytes` object, to avoid Python encoding it back to a Full Stop. eg.:

    ```python
    "a.b".split(".")                         # ['a', 'b']
    "a\u002eb".split(".")                    # ['a', 'b']
    r"a\u002eb".encode("utf-8").split(b".")  # [b'a\\u002eb']
    ```
- If all supplied files are invalid JSON, what should be raised to the User?
  - The original requirement is to gracefully reject files with invalid JSON.
  - Returning an "Empty" value (eg. `None`, `""`, `{}`, `-1`, etc.) could be seen as a False-Positive result (un-configured/default/explicitly-configured) for the dotted-path looked up.
  - Logging is an option, but typically requires a prompt to get a User to investigate logs as part of Debugging.
  - I would probably push for bubbling up the exceptions to the User in some way (raw/wrapped exception, summary report, notification), so they have a prompt to action the problem on Their side (a rough sketch of one option follows after this list).
- File ordering - Sequential order of the list given to File Reader acceptable?
Or:
- Alphabetical?
- File creation/modified date-time?
- Arbitrary? - (eg. filename/content hash, filesize, other).
- Random? - (99% positive that in this case, a "Known"/Predictable order is required).
- Is the exercise to:
- Parse the file content to a format to allow direct dot notation look-ups?
- Parse the file content to a common, nested format and translate the dot
notation to do a lookup against that common format?
- This is my interpretation, based on the "Don't use parsing packages", "Extendable to parsing different file types" and "Don't parse content to an `object`" requirements.
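As a rough sketch of the "bubble up the exceptions" option mentioned above (hypothetical names, not the implemented API): collect the per-file parse errors and raise a single wrapping exception if nothing valid was supplied.

```python
class ConfigParseError(Exception):
    """Hypothetical wrapper summarising which files were rejected and why."""

    def __init__(self, failures: dict[str, Exception]) -> None:
        summary = "; ".join(f"{path}: {error}" for path, error in failures.items())
        super().__init__(f"No valid config files supplied ({summary})")
        self.failures = failures


# Usage sketch: if every supplied file failed to parse, give the User a prompt to act on.
# if not parsed_files:
#     raise ConfigParseError(failures)
```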
Some quick thoughts based on the requirements:
- Separate the concerns, to satisfy support for non-JSON config files:
- File reading (common).
- Parsing content (unique).
- Decide on a common format, to parse to, that supports nested Key-Value pairs.
- How to support different parsers?
  - Forked logic (`if/else`, `match/case` (3.10+), dict/enum look-ups, etc.) can be a messy way of adding a new option. The main problems are around enforcing a common interface, testing, and plumbing the fork logic throughout the code.
  - Interfaces are cleaner due to the enforcement of a contract boundary that all implementations will follow. Testing & downstream code are simplified when they work to the Interface contract, and new options are a drop-in change with minimal plumbing. However, it requires more upfront thought to design a good ("good enough") initial contract/definition. This satisfies the "Easily extensible to different file types" requirement (see the Parser sketch further down).
- Consolidating contents down (common).
  - `dict.update` satisfies the "Override values" requirement.
- Lookup (common).
  - Translate dot notation to lookup format:
    - String munging into a dict look-up format and using `eval()`, eg. `eval('parsed_dict.get("keyX", {}).get("keyY", {})')`. Quick to implement but risky due to: no direct validation of the evaluated string; `eval()` is generally avoided due to the ease of injecting malicious code (partially negated by the steps needed to get it into a dictionary lookup format); maintainability/"code-smell" concerns.
    - Recursion to dig into the dictionary key by key. Potential performance concerns compared to the Python internals when doing a nested lookup. Maintainability concerns; like async code, a complex recursive function sometimes needs a few passes to fully understand when seen with fresh eyes.
    - Looping to dig into the dictionary key by key. Same concerns as Recursion, but easier to maintain due to readability (see the lookup sketch in the Reader section below).
    - NOTE: Requirements explicitly prevent mapping the data to nested objects, which would allow direct look-ups with dot notation!
    - One for discussion, but I would also consider using `types.SimpleNamespace` (as done in StackOverflow: How to use dot notation for dict in python?) as a "Cheat", at an equivalent level of converting the JSON to nested `object`s, and then using the dot notation directly (a sketch follows after this list).
  - Do the lookup.
- Readability/Maintainability/Flexibility-to-change over unmeasured optimisation of CPU/Memory/Workflows.
- eg. Look-ups are blocked on: Reading/Parsing/Consolidation of all files, so potential increased memory use from batching logical steps (instead of using a generator/looping) feels like an acceptable trade-off.
- Potential Milestones:
  - File Read/Parse/Consolidate + lookup done in a single pass.
    - Requires re-Reading/Parsing/Consolidating of the files to do alternative look-ups.
  - File Read/Parse/Store + return ID(s), with a decoupled Lookup.
    - One-time file operations for known files.
    - App time is spent on: file look-ups, consolidation, querying.
  - Out-of-Scope: Measured decoupling and horizontal scaling.
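For the `types.SimpleNamespace` "Cheat" mentioned above, a rough sketch (purely illustrative; note it effectively converts the data to nested objects, which the brief rules out):

```python
from types import SimpleNamespace
from typing import Any


def to_namespace(value: Any) -> Any:
    """Recursively convert nested dicts into SimpleNamespace objects."""
    if isinstance(value, dict):
        return SimpleNamespace(**{key: to_namespace(val) for key, val in value.items()})
    if isinstance(value, list):
        return [to_namespace(item) for item in value]
    return value


config = to_namespace({"database": {"host": "localhost", "port": 5432}})
print(config.database.host)  # localhost -- direct dot notation, at the cost of nested objects.
```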
I realised that I got into my Own head over this coding exercise and made the following mistakes in my Original Solution Design & Implementation:
- Over Architecting ("Too Deep, Too Soon")
- Feature Creep.
- Early, unmeasured optimising & refactoring.
- Bulky "MVP" (Minimal Viable Product).
ie. too much wasted effort in the wrong direction.
I course corrected the above by retroactively doing a PoC:
- Designs: PoC Class Design, PoC Block Design, PoC Sequence Design.
- Code: `poc.py`, `test_poc.py`.
This is what I should have started with originally!
The PoC Block Design is broken out, but `poc.py` has each of these blocks as a function within one class, as seen in the PoC Class Design.
The PoC Sequence Design shows the call flow for an external User making a request with the `get <dotted_path> <file1> [<file2> ...]` entrypoint. Python entry-points are exposed in the `pyproject.toml` via the `[project.scripts]` section. See: PEP-621, setuptools: Entry Points.
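As a rough sketch of how such an entrypoint could be wired up (hypothetical names; the real `get` entrypoint is the one exposed via `[project.scripts]`):

```python
import argparse


def main() -> None:
    """Hypothetical CLI wiring for: get <dotted_path> <file1> [<file2> ...]."""
    arg_parser = argparse.ArgumentParser(prog="get")
    arg_parser.add_argument("dotted_path", help='eg. "database.host"')
    arg_parser.add_argument("files", nargs="+", help="Config files; later files override earlier ones.")
    args = arg_parser.parse_args()
    # Read -> Parse -> Consolidate -> Lookup using args.files and args.dotted_path
    # (see the module sketches below).
```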
PlantUML design to solve the above problem (See: Solution Class Design):
Breaking the Problem into 3 parts:
- File Reading.
- Parsing file contents to a common nested format.
- Transform dotted-path and do the lookup.
NOTE: My original draft had File Reading and Parsing combined (Solution (Fat Parser) Class Design) so that they could be called in a loop to reduce memory footprint. However, this tight coupling of different responsibilities prevents potential re-usability, complicates testing and makes it more troublesome to add additional functionality (eg. swapping content parser out, store/fetch/de-dupe configs with a DB, etc).
NOTE: The current code is at the "Fat Parser" state before parking and moving to the PoC!!
Following is some more detail about each module.
The File Reader handles the common task of:
- Loop through files and read contents.
- Store contents in file-order for later parsing.
NOTE:
- Using a `list` to maintain the ordering of the files for consolidation.
- Using `pathlib.Path` since it simplifies file handling compared to `os.path`. eg. it is Context Manager-aware (files are automatically closed on falling out of Context Manager scope), its read/write functions use a Context Manager in the background, it is OS-agnostic, and pytest uses it for its `tmp_path` fixture.
- Decoupling responsibilities for strong contract boundaries, at the expense of memory usage (compared to using a generator for Reading+Parsing).
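A minimal sketch of what the File Reader could look like (names are illustrative, not the final API):

```python
from pathlib import Path


def read_files(file_paths: list[str]) -> list[str]:
    """Read each file in the order supplied, returning the raw contents.

    A list preserves file order for later consolidation; Path.read_text
    opens and closes the file internally.
    """
    return [Path(file_path).read_text(encoding="utf-8") for file_path in file_paths]
```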
The Parser handles the conversion of the file content into the common format.
- Using `dict` as the common format, due to its support for:
  - Nested structures and look-ups.
  - `dict.update()`, which satisfies the overriding of a key's value by a latter file's contents.
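A rough sketch of the Parser contract and the consolidation step (illustrative names; the first-principles JSON parsing itself is omitted):

```python
from abc import ABC, abstractmethod
from typing import Any


class Parser(ABC):
    """Contract each content parser implements, keeping new file types a drop-in change."""

    @abstractmethod
    def parse(self, content: str) -> dict[str, Any]:
        """Convert raw file content into the common nested dict format."""


class JsonParser(Parser):
    def parse(self, content: str) -> dict[str, Any]:
        # Hand-rolled JSON parsing lives here (omitted); invalid JSON should
        # raise, so the caller can reject the file gracefully.
        raise NotImplementedError


def consolidate(parsed_contents: list[dict[str, Any]]) -> dict[str, Any]:
    """Merge parsed files in order; dict.update gives whole-value override."""
    merged: dict[str, Any] = {}
    for parsed in parsed_contents:
        merged.update(parsed)
    return merged
```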
The Reader handles the common tasks of:
- Translating a dotted-path to a `dict` lookup format.
- Doing the lookup against the consolidated, parsed dictionary.
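A minimal sketch of the loop-based lookup described earlier (function and parameter names are illustrative):

```python
from typing import Any

_MISSING = object()


def lookup(consolidated: dict[str, Any], dotted_path: str, default: Any = _MISSING) -> Any:
    """Walk the consolidated dict one key at a time, eg. "database.host"."""
    node: Any = consolidated
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            if default is _MISSING:
                raise KeyError(dotted_path)
            return default
        node = node[key]
    return node


# lookup({"database": {"host": "localhost"}}, "database.host")  ->  "localhost"
# lookup({"database": {"host": "localhost"}}, "database")       ->  {"host": "localhost"} (a section)
```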
The following steps can also be executed via Taskfile commands, shown in brackets:
- Suggest configuring your Python virtual environment of choice (`go-task create-venv`, `go-task activate-venv`). eg.:

  ```shell
  python -m venv .venv
  source .venv/bin/activate
  ```

- Install the package: `pip install .` (`go-task install-deps`).
- [OPTIONAL] Install optional dependencies: `pip install .[<group>]`, where `<group>` is one of:
  - `lint` (`go-task install-lint-deps`).
  - `test` (`go-task install-test-deps`).
  - `build` (`go-task install-build-deps`).
For the PoC:
- Code is under: `poc.py`.
- Tests are under: `test_poc.py`.
- To access the `get` entrypoint from the CLI:
  - Build the Python package (`go-task build`) and install it into a virtual env. `get` will be added to the `PATH`.
  - Help: `get -h`.
  - Run: `get <dotted_path> <file1> <file2> ...`.
For the original Solution:
- Code entrypoint is: `main.py`.
- Tests are under: `tests/`.
- To access the `solution_get` entrypoint from the CLI:
  - Build the Python package (`go-task build`) and install it into a virtual env. `solution_get` will be added to the `PATH`.
  - Help: `solution_get -h`.
  - Run: `solution_get <dotted_path> <file1> <file2> ...`.
- From the repo root directory: `pytest` (`go-task test`).
- Run all of the following commands at once with Taskfile: `go-task lint`.
- pycodestyle: `pycodestyle .`
- flake8: `flake8 .`
- black: `black . --check --diff`
- mypy: `mypy .`
- GitHub Build action:
- Matrix job against all supported Python versions (Currently explicitly 3.11 during development, to reduce wasted CI time/costs).
- Taskfile integration (Development/CI consistency).
- Steps: create venv, install dependencies, lint, test, report.
- JUnit Test Report integration.
Why add a Taskfile?
Using Taskfile to abstract tasks/actions/steps that may be complex into a simple command that can be used both locally and in CI, ie. consistent environments, consistent execution.
NOTE: I'm using Taskfile in this Project, but make, cmake, tox, `package.json` scripts, etc. are all fine choices for language-specific/agnostic, cross-platform alternatives.
Code: `Taskfile.yml`.
The way I like to run a retrospective is as a "cathartic detox/venting" session. ie. Get things off your chest, but aim for constructive, actionable criticism, so that you don't repeat the same issues.
Obviously, I am aware that putting the following publicly can be taken as reasons to avoid hiring. However, I hope it is viewed as someone who has the awareness to act on their own continuous development.
- Work on my discipline to produce a lean MVP and POC:
- (Excuse) I've got a long history of feature development on long-running Projects, where most decisions are made to aid development of known future work. Outside of Tools/Scripts creation, it's not often I've had a need to spin-up a clean, short-term Project that is requirements complete.
- I had the awareness to: stop, reassess and change-direction, to produce the PoC. However, I am annoyed at myself for not switching mental tracks from the start.
- ACTION: a feedback loop of design refinement to break a Problem + Solution Design down to its leanest point (this is the PoC that can be created to educate others), then plot a timeline of milestones back up to the feature-rich design (which may or may not be the original fat Solution Design from the beginning of refining).
- Schedule time for requirements refinement:
  - Why I didn't do this:
    - Requirement assumptions/issues sometimes take time to become apparent after mulling the Problem over / designing / implementing for a while, which can leave it too late to go back for an informed discussion in a week-long task.
    - Requirements refinement is more efficient face-to-face for an active discussion; however, my experience so far is that it is rare to gain that time as an Applicant in an Interviewer's busy schedule.
    - Email requirements refinement can be a slow process if there is back & forth, or if blocked waiting on a response.
  - Instead, I used past experience to make educated assumptions.
  - ACTION: Treat coding tests the same as daily work, ie. collate questions quickly and reach out anyway.
- Succumbing to Feature Creep:
- Good: Highlighting Risk/Holes in both known and assumed requirements is part of refinement & exploratory implementation/testing, which can benefit the end Product.
- Bad: taking a "Just Do It!" attitude that detracts from the goal of the exercise/work is a loss of focus, despite implementing something that may be beneficial.
- In this case I think it would have been better to have taken any "code itches" I had (eg. supporting escaped Full Stops in keys) and left them as a code snippet in the Requirement Queries section instead of implementing them.
- For me; the time of lost focus was a trigger that helped me step back and
re-evaluate the over development of my original Solution. ie. back to the
first point on: "producing a lean MVP".
- ACTION: Discipline on stepping back from highlighting side issues vs. implementing.
- Not maintaining my coding skills:
- I've enjoyed the time away from coding with the family, but coming back I did feel rustier than I expected.
- ACTION: Working on a small, self-contained Project to dust-off the skills after a long period/holiday from coding.
- Time Management:
- Affected by knock-on effects of the initial poor direction and lapsed discipline above. I should have spent more time on the additional features that show me off as an Engineer, such as:
- Multi-target Docker container for building/running the library.