Feature: Import Data #217

Merged
merged 125 commits into from
Apr 11, 2022
Collaborator

@benloh benloh commented Mar 3, 2022

Branch: dev-bl/import

Overview

This adds an Import Data feature to Net.Create.

This is a pretty significant feature that ended up touching ALL aspects of the app. A very thorough QA cycle will be needed.

To Test

To test, first export nodes and edges from an existing graph. (NOTE: Please create new exports. Do not use exported csv files created previously from the dev-bl/export branch, because there were errors in the previous export routine.)

  1. git fetch && git checkout dev-bl/import
  2. ./nc.js --dataset=yourdataset
  3. Select "More..."
  4. Select "Import/Export"
  5. Click "Export Nodes"
  6. Click "Export Edges"

Next, try modifying the labels of the exported nodes, and reimport them.

  1. Open the exported nodes csv.
  2. Change the label to something else.
  3. Select "More..."
  4. Select "Import/Export"
  5. Click "Choose File" for nodes.
  6. The "Import" button should appear if the files are valid.
  7. Click "Import"

Next, create a new graph and try importing the nodes and edges.

  1. Open the exported nodes csv file.
  2. Change all the "ids" to "new"
  3. ctrl-c to quit Net.Create.
  4. ./nc.js --dataset=empty
  5. Select "More..."
  6. Select "Import/Export"
  7. Click "Choose File" for nodes.
  8. The "Import" button should appear if the files are valid.
  9. Click "Import"
  10. Your original dataset should be recreated

Other things to test:

  • Try adding new nodes/edges to an existing graph. Any matching ids will be replaced, and any ids marked "new" will be newly created.
  • Try adding ONLY nodes, or ONLY edges. You can add them in any order one at a time.
  • Open Net.Create remotely. You should not be able to import data.
  • Open Net.Create remotely with ?admin=true. You should be able to import data.
  • Edit your existing template: add allowLoggedInUserToImport and set it to true. Open Net.Create remotely without admin privileges. Import is disabled. Log in. Import is now enabled.

Test error checking:

  • Try importing nodes with bad ids
  • Try importing edges with bad ids
  • Try importing edges that link to bad node ids
  • Try importing nodes/edges that have a missing header field
  • Try importing nodes/edges that have missing fields in the rows
  • Try importing nodes/edges that have linefeed characters inside of text
  • Try importing nodes/edges that have other unexpected characters -- please document any bugs that emerge.

Import Feature

Net.Create can import nodes and edges via separate .csv files.

The easiest way to set up a CSV file for import is to first export a few nodes/edges from your existing project. The key is to set up the Template first with the appropriate headers. Then you can export and modify the csv files, and reimport them.

For both nodes and edges, you should be able to:

  • Add new nodes/edges
  • Partially replace existing nodes/edges
  • Partially replace existing nodes/edges AND add new nodes/edges in the same file

Replacing Existing Nodes and Edges

During an import, Net.Create uses node and edge ids to match imported data to existing data.

If the id does not match an existing node or edge id, the app will output an error message listing the problem id and the row of the id. "row" refers to the line number in the csv file. Line 1 is the header. Line 2 would be the first data row.

Adding New Nodes/Edges

To add a new node or edge, you need to use an id of "new". For example, to add "Tacitus" and "Granicus", you would define an import csv file with two rows where the ID is set to "new".

import_node.csv

ID,Label,Type,Notes,Info,Degrees,Created,Last Updated
new,"Tacitus",
new,"Granicus"

You can mix "new" and replacements, e.g. this will replace existing node 1 with "Claudius", and add "Tacitus" and "Granicus".

ID,Label,Type,Notes,Info,Degrees,Created,Last Updated
1,"Claudius",
new,"Tacitus",
new,"Granicus"

The "new" keyword is case-insensitive. e.g. "NEW", "New", "new", and "nEw" all work.
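As a rough illustration of the id matching and the "new" keyword described above, here is a minimal sketch. It is not the actual Net.Create importer: the function name `classifyImportRows`, the row object shape, and the assumption that ids are integers are all hypothetical.

```javascript
// Hypothetical helper, not Net.Create's real code.
// rows: objects parsed from the csv data rows (header line excluded).
// existingIds: Set of numeric ids already present in the graph (assumed integers).
function classifyImportRows(rows, existingIds) {
  const result = { toAdd: [], toReplace: [], errors: [] };
  rows.forEach((row, index) => {
    // Line 1 of the csv is the header, so the first data row is line 2.
    const lineNumber = index + 2;
    const rawId = String(row.id).trim();
    // The "new" keyword is case-insensitive.
    if (rawId.toLowerCase() === "new") {
      result.toAdd.push(row);
      return;
    }
    const id = Number(rawId);
    if (rawId !== "" && Number.isInteger(id) && existingIds.has(id)) {
      result.toReplace.push({ ...row, id });
    } else {
      result.errors.push(`Row ${lineNumber}: id "${rawId}" does not match an existing id`);
    }
  });
  return result;
}

const existingIds = new Set([1, 2, 3]);
const report = classifyImportRows(
  [
    { id: "1", label: "Claudius" },   // replaces existing node 1
    { id: "NEW", label: "Tacitus" },  // added as a new node
    { id: "99", label: "Granicus" },  // reported as an error on row 4
  ],
  existingIds
);
console.log(report.toReplace.length, report.toAdd.length, report.errors);
```

Collecting errors with their csv line numbers, rather than stopping at the first bad id, matches the idea of listing every problem id and its row so the file can be fixed in one pass.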

Defining Import Fields

The fields required by the graph for import are defined by the template. Currently any fields that are defined in the template and NOT HIDDEN will be required when you import. E.g. if you define 6 fields in the template, then the import file MUST have 6 fields with headers that match the exportLabel fields defined in the template, or the importer will complain about missing fields.

Any fields that are marked in the template as hidden will not be exported, nor will they be required for import.

While all non-hidden headers are required in the csv file, you can skip fields in the node/edge data rows. For example, when importing a node, you can just specify "id" and "label" and skip the other fields, e.g.:

ID,Label,Type,Notes,Info,Degrees,Created,Last Updated
1,"Claudius",
new,"Tacitus",,"He's the man!"

It's possible we might want to relax this and require only the main fields (e.g. id and label for nodes; source, target, and id for edges). This would give you more flexibility when importing. On the other hand, you can already leave fields empty, so long as you have the right headers in the csv.

The fields required in the import/export headers are defined via the exportLabel property in templates. The exportLabel will map the headers to built-in Net.Create fields.

You can rename exportLabel fields to match the fields required/used by your external graph application. For example, if your graph application expects type to be labeled as OPTIONS, you can set the node type exportLabel to OPTIONS. The labels are case sensitive.
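To make the header rule concrete, here is a minimal sketch of a header check. The function name `findMissingHeaders` and the field shape (`exportLabel` plus a boolean `hidden`) are assumptions for illustration only; the real template schema may differ.

```javascript
// Hypothetical sketch, not the actual Net.Create validation code.
// templateFields: assumed shape { exportLabel: string, hidden: boolean }.
// csvHeaders: header names read from line 1 of the csv file.
function findMissingHeaders(templateFields, csvHeaders) {
  // Only non-hidden fields are required for import.
  const required = templateFields
    .filter((field) => !field.hidden)
    .map((field) => field.exportLabel);
  const present = new Set(csvHeaders.map((header) => header.trim()));
  // Set lookup is case sensitive: "OPTIONS" and "options" are different headers.
  return required.filter((label) => !present.has(label));
}

const nodeFields = [
  { exportLabel: "ID", hidden: false },
  { exportLabel: "Label", hidden: false },
  { exportLabel: "OPTIONS", hidden: false }, // node type renamed for an external app
  { exportLabel: "Notes", hidden: true },    // hidden, so not required
];
const missing = findMissingHeaders(nodeFields, ["ID", "Label", "options"]);
console.log(missing); // [ 'OPTIONS' ]
```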

Encoding / Special Characters

Since we're using .csv as the export/import data format, there are a few considerations for encoding:

  • Carriage returns are allowed inside of quotes.
  • Double Quotes need to be encoded. If you need to use double quotes, use two of them next to each other. Excel should automatically encode quotes as a double quote when exporting to a CSV file. NOTE: An extraneous double quote will probably generate a bad header error during import. Here's an example of a valid and invalid use of quotes:
ID,Label,Type,Notes,Info,Degrees,Created,Last Updated

// valid -- note the correct use of two "" around Egyptians
7,"Alexandria","Place","Alexandria is one of two places where ""Egyptians"" introduced the plague.","","",""

// invalid -- the lone " on each side of Egyptians will cause import to fail, usually with a bad header message
7,"Alexandria","Place","Alexandria is one of two places where "Egyptians" introduced the plague.","","",""

There are probably other exceptions that we'll need to add validation for, especially control characters.
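If you are generating import files from a script rather than Excel, the quote-doubling rule can be sketched as follows. The helper names `encodeCsvField` and `encodeCsvRow` are hypothetical and are not part of Net.Create.

```javascript
// Hypothetical helpers for producing csv rows with properly encoded quotes.
function encodeCsvField(value) {
  // Leave numbers (e.g. ids) unquoted.
  if (typeof value === "number") return String(value);
  const text = value == null ? "" : String(value);
  // Double every embedded double quote, then wrap the field in quotes.
  return `"${text.replace(/"/g, '""')}"`;
}

function encodeCsvRow(values) {
  return values.map(encodeCsvField).join(",");
}

const row = encodeCsvRow([7, "Alexandria", 'where "Egyptians" introduced the plague.']);
console.log(row);
// 7,"Alexandria","where ""Egyptians"" introduced the plague."
```

Quoting every text field also keeps commas and carriage returns inside the field from being read as separators.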

Error Checking

The app provides two levels of validation during import. When you select a file for import, the system will:

  1. Check the file's headers to make sure they match the expected headers as defined in the template. If they do not, an error message is displayed.
  2. Read the data and run simple validation on ids. If it encounters an error, it will display the errors. You can then fix and/or select a new file to import.

Errors Caught:

  • Headers in import csv do not match headers defined in template. (Hide or remove the field in the template to make it not required for import)
  • Node or edge uses an invalid id
  • Edge refers to nonexistent source or target node ids
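
The check for edges that refer to nonexistent nodes could look roughly like the sketch below. This is an illustration under assumptions, not the shipped validator: `validateEdgeRows`, the row shape, and numeric node ids are all hypothetical.

```javascript
// Hypothetical sketch of the edge source/target check.
// knownNodeIds: Set of numeric node ids that edges are allowed to reference.
function validateEdgeRows(edgeRows, knownNodeIds) {
  const errors = [];
  edgeRows.forEach((edge, index) => {
    const lineNumber = index + 2; // line 1 of the csv is the header
    for (const end of ["source", "target"]) {
      const raw = String(edge[end] ?? "").trim();
      // An empty field can never match a node id.
      const nodeId = raw === "" ? NaN : Number(raw);
      if (!knownNodeIds.has(nodeId)) {
        errors.push(`Row ${lineNumber}: ${end} node id "${raw}" not found`);
      }
    }
  });
  return errors;
}

const knownNodeIds = new Set([1, 2]);
const edgeErrors = validateEdgeRows(
  [
    { id: "1", source: "1", target: "2" },
    { id: "2", source: "1", target: "7" }, // node 7 does not exist
  ],
  knownNodeIds
);
console.log(edgeErrors); // [ 'Row 3: target node id "7" not found' ]
```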

Errors Not Caught:

  • If your data values do not match up to the header values, the system will blindly import the values and you might end up with data matched to the wrong header/field.
  • The row number that an error is reported on can be thrown off if there are carriage returns in quoted text.

Troubleshooting

Oftentimes, bad encoding errors (e.g. mismatched double quotes, extraneous commas, extraneous carriage returns) will result in a Header error. If you see an error about mismatched headers and you know your headers are...

  1. well defined
  2. a match for the non-hidden headers in the template

...then you might try the following to troubleshoot:

  • Open the csv in a text editor or Excel to make sure there aren't stray characters.
  • Open the csv in Excel to make sure each record gets its own row. If not, then you might have some stray carriage returns (usually appearing outside of quoted text).
  • Try deleting everything but the header and one line of data and see if that imports. If that imports, then the culprit is somewhere within your data encoding.

Import Report

After importing data, the app will display a list of nodes/edges that have been replaced as well as a count of all the nodes/edges that were added or replaced.

Import Permissions

Roles: Admins vs Non-admins

Admins are always allowed to import data.

Non-admin users are normally not allowed to import data. This pull request adds a new option to the Templates to allow logged in users to import data. If the template setting allowLoggedInUserToImport is set to true, then logged in users will be able to import data. By default allowLoggedInUserToImport is false.
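
The role rules above boil down to a small decision, sketched here. The function name `canImport` and the `user` object shape are hypothetical; only the `allowLoggedInUserToImport` setting comes from this pull request, and the sketch deliberately ignores edit locks and standalone mode.

```javascript
// Hypothetical sketch of the role check only.
// user: assumed shape { isAdmin: boolean, isLoggedIn: boolean }.
// template: object that may carry allowLoggedInUserToImport.
function canImport(user, template) {
  // Admins are always allowed to import.
  if (user.isAdmin) return true;
  // The template setting defaults to false when it is absent.
  const allowLoggedIn = template.allowLoggedInUserToImport === true;
  return allowLoggedIn && Boolean(user.isLoggedIn);
}

console.log(canImport({ isAdmin: true, isLoggedIn: false }, {}));
console.log(canImport({ isAdmin: false, isLoggedIn: true }, {}));
console.log(canImport({ isAdmin: false, isLoggedIn: true }, { allowLoggedInUserToImport: true }));
```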

Edit States

Since importing modifies the database, during an import, editing the Template and editing individual Nodes and Edges is locked out. Conversely, if someone is editing a Node, Edge, or Template, Importing is locked out. This prevents accidental overwriting of data.

If you navigate away from the Import panel, the import is cancelled. This might be a little surprising and awkward so we might want to revisit this. But there wasn't another clear way to "Cancel" the import lockout.

Standalone Mode

Importing is also disabled in standalone mode, since you are not allowed to modify the database.

Import Backups

Every time you click the "Import" button to import data (nodes or edges), the Net.Create server will make a backup of the current database into the runtime/backups folder before executing the import. The backup file will be named the same as the open database file with a timestamp appended.

If you are running this on a server or using nc-multiplex you'll want to periodically monitor the runtime/backups folder to make sure it does not grow too large. You'll want to periodically clear out the backup files.

If you need to restore a backup, you can either copy it to runtime/ and open it directly via ./nc.js --dataset=xxx, or rename the backup file back to the original database name, copy it over the original in runtime/, and open that via ./nc.js.

Admin Tools

"Force Unlock All"

Template editing, importing data, and node and edge editing are all mutually exclusive actions: If someone on the network is doing one of those activities, others are prevented from doing the same. (The one exception is that if you are editing a node or edge, others are only prevented from editing the same node or edge and editing the template or importing, but they can still edit other nodes and edges.) When the edit/import is complete the lock on editing should be released.
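
The exclusion rules can be summarized in a short sketch. This is a simplified model for discussion, not the locking code in this branch: the `canStart` name, the `locks` shape, and the action types are assumptions.

```javascript
// Hypothetical model of the mutual-exclusion rules.
// locks: assumed shape { templateEditing, importing, nodeIds: Set, edgeIds: Set }.
// action: { type: "template" | "import" | "node" | "edge", id?: number }.
function canStart(action, locks) {
  // An active template edit or import blocks everything else.
  if (locks.templateEditing || locks.importing) return false;
  const anyItemLocked = locks.nodeIds.size > 0 || locks.edgeIds.size > 0;
  switch (action.type) {
    case "template":
    case "import":
      // Blocked while anyone is editing any node or edge.
      return !anyItemLocked;
    case "node":
      // Other nodes stay editable; only the same node is blocked.
      return !locks.nodeIds.has(action.id);
    case "edge":
      return !locks.edgeIds.has(action.id);
    default:
      return false;
  }
}

const locks = { templateEditing: false, importing: false, nodeIds: new Set([5]), edgeIds: new Set() };
console.log(canStart({ type: "node", id: 5 }, locks)); // false, node 5 is being edited
console.log(canStart({ type: "node", id: 6 }, locks)); // true
console.log(canStart({ type: "import" }, locks));      // false
```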

Every once in a while, the release message is lost and the edit lock remains in place. If this happens, an administrator can go to the More > Import / Export panel and click the "Force Unlock All" button to release the edit lock and re-enable template editing, importing, and node and edge editing.

WARNING: Use this with utmost caution! If someone is actively editing or importing, you can delete their work, or even worse, corrupt the database!


Other Changes

revision update bug

There was a bug where the revision field was either not updated at all, or was being updated only once per session (after a reload), instead of being updated with every database update. It now properly updates every time you edit and save a node or edge.

TOML template file update

The default toml template and schema have been updated. You'll want to review or update your existing template files.

"weight"-ready

While the UI (EdgeEditor) does not currently support it, the app logic now calculates edge line sizes by summing up edge weight values, e.g. if two nodes are connected by three edges with weights of 1, 2, and 4, the size of the edge will be 7.

weight defaults to 1. It can support values smaller than 1.

weight is not currently settable via the EdgeEditor UI. EdgeTable does not yet display weight either. It is also not saved in the database.
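
A minimal sketch of the summing logic is below. It is not the code in this branch: `edgeSizesByPair` is a hypothetical name, and treating A-to-B and B-to-A as the same pair is an assumption.

```javascript
// Hypothetical sketch of summing weights per connected node pair.
// edges: assumed shape { source: number, target: number, weight?: number }.
function edgeSizesByPair(edges) {
  const sizes = new Map();
  for (const edge of edges) {
    // Assumption: direction is ignored, so 1->2 and 2->1 share one line.
    const key = [edge.source, edge.target].sort((a, b) => a - b).join("-");
    // weight defaults to 1 when it is not set.
    const weight = edge.weight === undefined ? 1 : Number(edge.weight);
    sizes.set(key, (sizes.get(key) || 0) + weight);
  }
  return sizes;
}

const sizes = edgeSizesByPair([
  { source: 1, target: 2, weight: 1 },
  { source: 2, target: 1, weight: 0.5 }, // values smaller than 1 are fine
  { source: 1, target: 2 },              // no weight, counts as 1
  { source: 2, target: 3, weight: 2 },
]);
console.log(sizes.get("1-2")); // 2.5
console.log(sizes.get("2-3")); // 2
```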

Optimization

We've done a little bit of optimizing the d3 render loop -- node sizes are now calculated before the render and done more efficiently.

Data Model Refinement

This is mostly under the hood stuff, but we have now more formally separated the raw network data from the rendered d3 data. This should make it easier to do future updates.

Force Updates

You might notice graphs look very different. With the refinement of the data model, we updated the way d3 is rendering forces. Hopefully this is an improvement, but we will probably have to do some exploration to make sure all graphs look better.

Node Table ID Sorting

In debug mode, NodeTables show IDs. You can now sort by the ID.

benloh added 30 commits March 2, 2022 09:18
…d3-processed edge objects where `source` and `target` have been transformed from `id` to node objects.
…rm and distinguish it from data directly used and modified by D3.
@benloh
Collaborator Author

benloh commented Mar 25, 2022

@jdanish @kalanicraig I've completely rewritten the import UI and validation. Now as soon as you select a file, we do both header checking and id validation and report the results immediately. This has a number of implications:

  • If you are trying to import edges that reference novel node ids in a node import file, you will first have to select the node file to get it to load and validate, then the edges file. If you load the edges file first, the system will not find the novel node ids.
  • If you define new node ids using the "new" keyword, you will not be able to define edges that link them -- since you have no ids to define the links. I suppose in a future version we can allow linking by source/target labels, but that will add another layer of complexity to an already very complex import system.

Please give it a whirl! Hopefully it's an improvement.

@jdanish
Collaborator

jdanish commented Mar 25, 2022

Initial reaction is "sweet!!"

Will bang on it, but looks good to me so far!!

… to edges.source.label and edge.target.label.
@benloh
Collaborator Author

benloh commented Mar 25, 2022

EdgeTable sorting was broken due to NCDATA changes. Sorting on all fields should now work again.

@benloh
Collaborator Author

benloh commented Mar 25, 2022

To Do

  • Nodes are importing with bad dates
  • Reconcile _default.template.toml with template-schema.js -- auto-generate a template-schema.template.toml file on build? => Default to default.template.toml, fall back to template-schema.js #227
  • Add help description to hidden checkbox when template editing
  • Do not export parameters that are hidden
  • Require explicit new ID to generate a new node or edge
  • updated and revision are not quite working as expected for both node and edges -- disconnect between FILTEREDD3DATA and NCDATA?
  • Allow requiring created, updated, and revision data for import and export.
  • Add requiredForImportExport flag => Moved to requiredForImport netcreate-itest#32
  • Backup db before import
  • Allow importing new ids
  • Update highest ID after import.
  • Report highest ID on import/export tab.
  • Node hover info is missing date info
  • Edge table source/target shows IDs, not labels!
  • Doc: Monitor backup/runtime folder! Esp when running on a server/nc-multiplex. Each import triggers a backup.
  • EdgeFilter by source/target is broken because it's using id not label
  • Verify nc-multiplex still works
  • Verify standalone mode still works
  • Update wiki when merging

…s when comparing filter values. (Filtering by source/target did not work because they were using 'id' not the 'label')
@benloh
Collaborator Author

benloh commented Mar 25, 2022

Edge filtering functionality is now restored. NCDATA changes resulted in filters operating on ids rather than labels.

benloh added 2 commits March 26, 2022 10:31
… highlight vs unhighlighted is more pronounced. Otherwise it was never clear which lines were considered unhighlighted, especially for thicker lines.
@benloh benloh marked this pull request as ready for review March 29, 2022 15:21
@benloh
Collaborator Author

benloh commented Mar 29, 2022

QA

  • standalone mode works
  • nc-multiplex works
  • requiredForImportExport will be implemented next round -- implementing it properly gets very complex as we have to deal with pre-existing fields, field removal vs hiding, and arbitrary field import.

@jdanish
Collaborator

jdanish commented Mar 29, 2022

Awesome. Just to be really clear, the comment about "requiredForImportExport" means that in the current version (until we get more funding), if you hide a field via template, it will not export or import. So, anything you want in the graph should be listed as visible for that process.

I think that's fine and mostly users treat them as identical for now, just want to make sure we know. Thanks!

@benloh
Collaborator Author

benloh commented Mar 29, 2022

Yeah. I had started to implement it last night, but as I sketched things out, I realized it was WAY too complicated if we want to properly handle all the cases. See netcreateorg#32

Think of "hidden" as a way to temporarily turn off fields that you might want to restore later. If you don't need a field at all, you can just leave it out of the template.

@benloh benloh mentioned this pull request Mar 31, 2022
@benloh
Collaborator Author

benloh commented Apr 11, 2022

Kalani wrote: I’ve now tested import/export, filtering, standalone and template changes on 4-5 different networks, including a Japanese-language network and a network with markup in the notes, and not had problems with any of them. I’d guess there are still some bugs floating around, but it’s probably time to merge into dev and also maybe to release nc-multiplex.

@benloh benloh merged commit 5d2359d into dev Apr 11, 2022
@benloh benloh deleted the dev-bl/import branch May 28, 2023 00:11
3 participants