= Data Migration System

This package implements a graph-based data migration system.

== Algorithm

* Read and prepare the files in the provided directory.
A file's name represents a module handle, so make sure the two match.
* Import system users and take note of their references.
System users are needed by every migration, so they are imported before any other data.
* Initialize the graph based on the provided files and their corresponding modules.
The system can automatically determine dependencies from the modules' fields.
* Remove any cycles from the graph by splicing one of the nodes in each cycle.
A spliced node can only import its own data; it cannot manage its dependencies, as they might not yet be known.
Its dependencies are updated when its parent node is processed.
* Determine the "leaf" nodes, that is, the nodes with no dependencies.
These can be imported immediately.
* Import each leaf node with satisfied dependencies.
Since leaf nodes have no remaining dependencies, they can be imported in parallel (@todo).
* When a node finishes importing, provide its mapping information to each of its parents, as they will need it.
Also add any parent nodes whose dependencies are now satisfied to the list of leaf nodes.
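
The leaf-first processing described above can be illustrated with a minimal Python sketch.
It is not the package's implementation: the function name `import_order` and the module names `Account`, `Contact`, and `Case` are hypothetical, and the sketch assumes that cycles have already been removed.

[source,python]
----
from collections import deque


def import_order(deps):
    """Return nodes ordered so that each node comes after its dependencies.

    `deps` maps a node name to the set of node names it depends on.
    Cycles are assumed to be removed already (for example, by splicing).
    """
    nodes = set(deps)
    for wanted in deps.values():
        nodes.update(wanted)

    remaining = {n: set(deps.get(n, ())) for n in nodes}
    parents = {n: set() for n in nodes}
    for node, wanted in remaining.items():
        for dep in wanted:
            parents[dep].add(node)

    # Leaf nodes have no dependencies and can be imported immediately.
    leaves = deque(sorted(n for n in nodes if not remaining[n]))
    order = []
    while leaves:
        node = leaves.popleft()
        order.append(node)  # a real importer would import the node here
        # Tell each parent that this dependency is now satisfied.
        for parent in sorted(parents[node]):
            remaining[parent].discard(node)
            if not remaining[parent]:
                leaves.append(parent)

    if len(order) != len(nodes):
        raise ValueError("unresolved cycle: splice a node before importing")
    return order


# Hypothetical modules: Contact needs Account, Case needs both.
order = import_order({
    "Account": set(),
    "Contact": {"Account"},
    "Case": {"Account", "Contact"},
})
print(order)
----

With these inputs the sketch prints `['Account', 'Contact', 'Case']`: `Account` is the only initial leaf, and each parent becomes a leaf once its last dependency has been imported.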

== Migration Mapping

A big part of this system is its support for migration maps, which define which field from the original source maps into which module and under which field.

====
Currently, only simple conditions such as `type=specialType` are supported.
====

=== Algorithm

* Unmarshal the given `.map.json`.
* For each entry of the given source:
** determine which map to use, based on the provided `where` field and the row's content,
** based on the provided `map` entries, update or create buffers.
* Flush the data.
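
As a rough illustration of these steps, the following Python sketch applies a list of map entries to a single source row.
The helper names `matches` and `apply_maps` are hypothetical, and choosing the first entry whose `where` condition matches is an assumption of this sketch rather than documented behaviour.

[source,python]
----
def matches(where, row):
    """Evaluate a simple `field=value` condition; an empty condition matches."""
    if not where:
        return True
    field, _, expected = where.partition("=")
    return row.get(field.strip()) == expected.strip()


def apply_maps(maps, row):
    """Return {module: {field: value}} buffers for one source row."""
    buffers = {}
    for entry in maps:
        if not matches(entry.get("where", ""), row):
            continue
        for m in entry["map"]:
            # "splice_1.original" -> module "splice_1", field "original"
            module, _, field = m["to"].partition(".")
            buffers.setdefault(module, {})[field] = row.get(m["from"])
        break  # assumption: only the first matching entry is used
    return buffers


maps = [{
    "where": "type=type1",
    "map": [
        {"from": "id", "to": "splice_1.original"},
        {"from": "id", "to": "splice.id"},
        {"from": "field1", "to": "splice.customName"},
        {"from": "field2", "to": "splice_1.customName"},
    ],
}]
row = {"type": "type1", "id": "1", "field1": "a", "field2": "b"}
buffers = apply_maps(maps, row)
print(buffers)
----

Each target module ends up with its own buffer, so one source row can feed several modules at once.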

=== Example

.source.map.json
[source,json]
----
[
  {
    "where": "type=type1",

    "map": [
      {
        "from": "id",
        "to": "splice_1.original"
      },
      {
        "from": "id",
        "to": "splice_2.original"
      },
      {
        "from": "id",
        "to": "splice.id"
      },

      {
        "from": "field1",
        "to": "splice.customName"
      },
      {
        "from": "field2",
        "to": "splice_1.customName"
      },
      {
        "from": "field3",
        "to": "splice_2.customName"
      }
    ]
  }
]
----

== Joining Migration Sources

An important feature is the system's ability to construct a migration map from multiple migration sources.
For example, we might want to populate a `User` module with data from both `User.csv` and `SysUser.csv`.

=== Algorithm

* Unmarshal the given `.join.json`.
* For each migration node that defines a `.join.json`:
** determine all "joined" migration nodes that are used in this join operation,
** create a `{ field: { id: [ value, ... ] } }` object for each base migration node, based on the joined nodes,
** when processing the migration node, respect the above-mentioned object and include the specified data.


=== Example

`.join.json` files define how multiple migration nodes should be joined into a single module.

The example below instructs that the current module should be constructed from itself and `subMod`, based on the relation between `SubModRef` and `subMod.Id`.
When creating a `.map.json` file, values from the join operation are available under the specified alias (`...->alias`).

.source.join.json
[source,json]
----
{
  "SubModRef->smod": "subMod.Id"
}
----

.source.map.json
[source,json]
----
[
  {
    "map": [
      {
        "from": "Id",
        "to": "baseMod.Id"
      },

      {
        "from": "baseField1",
        "to": "baseMod.baseField1"
      },

      {
        "from": "smod.field1",
        "to": "baseMod.SubModField1"
      }
    ]
  }
]
----

It is also possible to define a join operation on multiple fields at the same time, which is useful when a unique PK is not available and must be constructed.
The following example uses the `CreatedDate` and `CreatedById` fields as an index.

[source,json]
----
{
  "[CreatedDate,CreatedById]->smod": "subMod.[CreatedDate,CreatedById]"
}
----
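
To make the join syntax more concrete, here is a minimal Python sketch that parses a join definition (single or multi-field), indexes the joined rows, and exposes them under the alias.
All function names are hypothetical. The index here is keyed by a tuple of join values, which is a simplification of the `{ field: { id: [ value, ... ] } }` object mentioned earlier, and taking the first matching joined row is an assumption.

[source,python]
----
def parse_fields(expr):
    """Turn `Id` or `[A,B]` into a tuple of field names."""
    expr = expr.strip()
    if expr.startswith("[") and expr.endswith("]"):
        return tuple(f.strip() for f in expr[1:-1].split(","))
    return (expr,)


def parse_join(key, target):
    """Split `baseFields->alias` and `module.joinedFields` into their parts."""
    base_expr, _, alias = key.partition("->")
    module, _, joined_expr = target.partition(".")
    return (parse_fields(base_expr), alias.strip(),
            module.strip(), parse_fields(joined_expr))


def build_index(rows, fields):
    """Group the joined rows by the values of the join fields."""
    index = {}
    for row in rows:
        index.setdefault(tuple(row[f] for f in fields), []).append(row)
    return index


def join_row(base_row, base_fields, alias, index):
    """Copy the first matching joined row into the result as `alias.field`."""
    out = dict(base_row)
    found = index.get(tuple(base_row[f] for f in base_fields), [])
    if found:  # assumption: only the first match is used
        for field, value in found[0].items():
            out[f"{alias}.{field}"] = value
    return out


base_fields, alias, module, joined_fields = parse_join(
    "SubModRef->smod", "subMod.Id")
index = build_index([{"Id": "7", "field1": "x"}], joined_fields)
joined = join_row({"Id": "1", "SubModRef": "7"}, base_fields, alias, index)
print(joined)
----

After the join, `smod.field1` is available on the row, which matches how the `.map.json` example refers to joined values.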

== Value Mapping

The system allows us to map a specific value from the provided `.csv` file into a value used by the system.
For example, we can map `In Progress` to `in_progress`.
The mapping also supports a default value, specified with the `*` wildcard.

=== Algorithm

* Unmarshal the given `.value.json`.
* Before applying a value for the given field, attempt to map the value:
** if the mapping is successful, use the mapped value,
** else if a default value exists, use the default value,
** else use the original value.

=== Example

The following value mapping maps each `sys_status` value on the left to the value on the right, with a default of `"new"` (`"*": "new"`).

.source.values.json
[source,json]
----
{
  "sys_status": {
    "In Progress": "in_progress",
    "Send to QA": "qa_pending",
    "Submit Job": "qa_approved",
    "*": "new"
  }
}
----
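
The lookup order used for value mapping (mapped value, then default, then original) can be sketched in Python as follows.
The function name `map_value` is hypothetical and the sketch only illustrates the fallback order, not the package's actual code.

[source,python]
----
def map_value(value_maps, field, value):
    """Map `value` for `field`, falling back to the `*` default, then to `value`."""
    mapping = value_maps.get(field, {})
    if value in mapping:
        return mapping[value]  # explicit mapping wins
    if "*" in mapping:
        return mapping["*"]    # wildcard default
    return value               # no mapping at all: keep the original


value_maps = {
    "sys_status": {
        "In Progress": "in_progress",
        "Send to QA": "qa_pending",
        "Submit Job": "qa_approved",
        "*": "new",
    }
}

print(map_value(value_maps, "sys_status", "In Progress"))
print(map_value(value_maps, "sys_status", "Archived"))
print(map_value(value_maps, "sys_owner", "alice"))
----

The three calls print `in_progress`, `new`, and `alice`: the first hits an explicit mapping, the second falls back to the wildcard, and the third belongs to a field with no mapping.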

The system also supports arbitrary mathematical expressions.
To evaluate an expression, prefix the mapped value with `=EVL=`; for example, `=EVL=numFmt(cell, \"%.0f\")`.

Variables:

* `cell`: the current cell.

The following example formats every `sys_rating` value in the given source with no decimal places.

[source,json]
----
{
  "sys_rating": {
    "*": "=EVL=numFmt(cell, \"%.0f\")"
  }
}
----
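
The sketch below shows how an `=EVL=` value might be recognized and what a printf-style `%.0f` format does to a numeric cell.
Both helpers are hypothetical: `split_expression` only strips the prefix, and `num_fmt` is a Python stand-in that assumes `numFmt` applies printf-style formatting, which this document does not confirm.

[source,python]
----
EVL_PREFIX = "=EVL="  # prefix taken from the text above


def split_expression(mapped):
    """Return the expression text if `mapped` is an expression, else None."""
    if mapped.startswith(EVL_PREFIX):
        return mapped[len(EVL_PREFIX):]
    return None


def num_fmt(cell, fmt):
    """Hypothetical stand-in for numFmt: printf-style formatting of a cell."""
    return fmt % float(cell)


print(split_expression('=EVL=numFmt(cell, "%.0f")'))
print(split_expression("new"))
print(num_fmt("4.20", "%.0f"))
print(num_fmt("4.60", "%.0f"))
----

Under this assumption, `%.0f` rounds to the nearest whole number rather than truncating, so `4.20` becomes `4` and `4.60` becomes `5`.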