Warning: The approach described here is unnecessarily complex because it distributes one task over several classes. Simply use the TableReader and then import by iterating over the DataRow array in a POST service. This creates one linear piece of code that is easier to write and easier to read. Subject identification can be controlled with PHP functions.
API reference
This section describes how to import data organized in columns and rows. Topincs currently supports CSV, Excel and Microsoft Access as source formats. We use the term table for the data held in a CSV or Excel file. If you export an Excel sheet into a CSV file, both hold the same table.
Hint: It is possible to create invalid instance data in the topic map. It is the responsibility of the administrator to be aware of this and take care of it.
We consider a table to be composed of named columns and rows holding statements about any number of subjects. We do not make any assumptions about how the statements about one subject are distributed over the table(s). In most cases the statements about one subject are within one row, but sometimes they are distributed over more than one row. When the data comes from a relational database system, statements about one subject are usually distributed over more than one table (otherwise there would be no need for foreign keys).
Experience has shown that the most common use case is a table held in an Excel sheet. We make the following helpful distinction:
Primary subjects are generally easier to identify than secondary subjects. They often come with a usable identifier. In end-user-administered data, secondary subjects are often referred to by their name in natural language. In data that stems from other computational systems, most subjects have some sort of strong identification. Primary subjects resemble movement data (high frequency, little to no referrals), while secondary subjects are similar to master data (low frequency, many referrals).
In order to actually import a table you need to do the following:
At this point we have uploaded a data file into the store and we want to read the data. In our simple example we import the employment information of a few people. Our topic types are Person, Company, and Employment. Our schema in this example connects the Employment to exactly one person (employee) and one company (employer). Companies and people can be associated with any number of employments. Our CSV import looks like this:
"First name", "Date of birth", "Company", "Position", "Entry date", "Exit date"
"Joel", "1980-01-01", "Cola Coca, Inc", "Manager", "2000-01-01", "2003-05-31"
"Mary", "1990-01-01", "Orange, Inc", "President", "2002-01-01", "2009-01-31"
"Huey", "1990-03-01", "Orange, Inc", "Vice-President", "2005-01-01", "2012-01-31"
"Mary", "1990-01-01", "Nova, Inc", "Vice-President", "1999-05-01", "2001-12-31"
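To make the notion of a table concrete, here is a minimal sketch that turns CSV text like the sample above into rows keyed by column name. It uses PHP's built-in str_getcsv instead of the Topincs TableReader. The helper name parse_table, the lower-casing of column names, and the sample written without spaces after the commas are assumptions made for this illustration.

```php
<?php
// Illustrative only: plain PHP, not the Topincs TableReader.
// Hypothetical helper that parses CSV text into an array of rows,
// each row being an associative array keyed by lower-cased column name.
function parse_table(string $csv): array {
    $lines = preg_split('/\R/', trim($csv));
    // The first line holds the column names.
    $header = array_map(
        fn(string $name): string => strtolower(trim($name)),
        str_getcsv(array_shift($lines), ',', '"', '\\')
    );
    $rows = [];
    foreach ($lines as $line) {
        $cells = array_map('trim', str_getcsv($line, ',', '"', '\\'));
        $rows[] = array_combine($header, $cells);
    }
    return $rows;
}

$csv = <<<'CSV'
"First name","Date of birth","Company","Position","Entry date","Exit date"
"Joel","1980-01-01","Cola Coca, Inc","Manager","2000-01-01","2003-05-31"
"Mary","1990-01-01","Orange, Inc","President","2002-01-01","2009-01-31"
CSV;

// Two rows of six string values each.
$table = parse_table($csv);
```

Every cell stays a string at this stage; turning "1980-01-01" into a date is left to a later conversion step.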
Best practice is to create a domain class for the data file topic type and do everything in a method import that returns the report tobject. Persisting the import results in the database has proven very useful, in particular if users are performing imports independently. The computational component of the service is then as simple as the following code example.
<?php
require_once("domain/EmploymentFile.php");
$file = $p->get("file");
$report = $file->import();
redirect($report->href());
The code in the domain class is still quite generic. It reads the data, extracts the subjects and then imports them. The main work is done in the Extractors described in the next section.
<?php
require_once("api/import/TableReader.php");
require_once("api/import/Importer.php");
require_once("EmploymentExtractor.php");
require_once("CompanyExtractor.php");
require_once("PersonExtractor.php");
class EmploymentFile extends Tobject {
    private function read() {
        $options = ["rename" => ["Date of birth" => "dob"]];
        return TableReader::read_csv($this->file()->path, $options);
    }
    function import() {
        $start = new DateTime();
        $results = [];
        $errors = [];
        $table = $this->read();
        // $table holds an array of PHP objects of type DataRow.
        // The data is conceptually an m x n matrix of strings.
        // m is the number of rows, n the number of columns.
        // Extracting is first. Order might be significant.
        $companies = CompanyExtractor::extract($table, $errors);
        $people = PersonExtractor::extract($table, $errors);
        $employments = EmploymentExtractor::extract($table, $errors);
        // Importing is second. Order is significant!
        Importer::import($companies, $results, $errors, ["duplicates" => "ignore"]);
        Importer::import($people, $results, $errors, ["duplicates" => "ignore"]);
        Importer::import($employments, $results, $errors, ["duplicates" => "ignore"]);
        // This is optional, but recommended when users import.
        // Very useful for the admin in case of problems.
        // You need to model the report topic type yourself.
        return Tobject::make("id:9284")
            ->set_start($start)
            ->set_end(new DateTime())
            ->set_import_file($this)
            ->add_all_errors($errors)
            ->add_all_results($results);
    }
}
Tobject::register("EmploymentFile", "id:2288");
API reference
The extractors play the biggest role in the Topincs import framework. Their main purpose is to identify subjects and collect the statements that should be added to the topic. Additionally, they clean up strings and convert them to match the datatype of the occurrence type. They are the guardians of form, which is indispensable if you want to perform computations on the data.
An extractor implements two methods: si and convert. Both are given a PHP object of type DataRow.
si computes a subject identifier for the import subject. If it starts with a ~, it will not be persisted. If si returns null, no subject of this type will be created for this data row. Implementing si is optional, but necessary if an import is repeated with a different data set.
convert transfers information from the data row to the import subject.
<?php
require_once("api/import/Extractor.php");
abstract class PersonExtractor extends Extractor {
    // The serialization name of the topic type.
    const TYPE = "person";
    // Identifies a person by first name and date of birth, case-insensitively.
    static function si(DataRow $row) : ?string {
        $key = sprintf("%s|%s", $row->mand("first name"), $row->mand("dob"));
        return sha1(strtolower($key));
    }
    static function convert(DataRow $row, ImportSubject $subject) : void {
        $subject
            ->set("name", $row->mand("first name"))
            ->set("date-of-birth", $row->mand("dob", "PersonExtractor::date"));
    }
    // Converts a string such as "1980-01-01" into a DateTime.
    static function date($value) {
        return DateTime::createFromFormat("Y-m-d", $value);
    }
}
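The effect of si can be tried out without the framework. The following sketch mirrors the key construction from PersonExtractor::si on plain arrays. The function person_si and the array-based rows are stand-ins invented for this example. How mand reacts to a missing value is not shown here, so returning null for incomplete rows is an assumption.

```php
<?php
// Illustrative stand-in for an extractor's si method, working on plain arrays.
// Returns a stable identifier for the person described by a row,
// or null when the row lacks the values needed to identify a person.
function person_si(array $row): ?string {
    $name = trim($row["first name"] ?? "");
    $dob = trim($row["dob"] ?? "");
    if ($name === "" || $dob === "") {
        return null; // Assumption: no person is created for this row.
    }
    // Lower-casing makes "Mary" and "MARY" identify the same person.
    return sha1(strtolower(sprintf("%s|%s", $name, $dob)));
}

$rows = [
    ["first name" => "Mary", "dob" => "1990-01-01", "company" => "Orange, Inc"],
    ["first name" => "MARY ", "dob" => "1990-01-01", "company" => "Nova, Inc"],
    ["first name" => "Huey", "dob" => "1990-03-01", "company" => "Orange, Inc"],
    ["first name" => "", "dob" => "1990-03-01", "company" => "Orange, Inc"],
];

// Collect the distinct people; rows yielding null are skipped.
$people = [];
foreach ($rows as $row) {
    $si = person_si($row);
    if ($si !== null) {
        $people[$si] = trim($row["first name"]);
    }
}
```

Because both rows for Mary produce the same identifier, a repeated import has a way to recognise the person instead of creating a second topic.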
<?php
require_once("api/import/Extractor.php");
require_once("PersonExtractor.php");
abstract class EmploymentExtractor extends Extractor {
    const TYPE = "employment";
    static function convert(DataRow $row, ImportSubject $subject) : void {
        $subject
            ->set("employee", $row->get_import_subject("person"))
            ->set("employer", $row->get_import_subject("company"))
            ->set("position", $row->mand("position"))
            ->set("start", $row->mand("entry date", "PersonExtractor::date"))
            // The exit date is optional; "end" is assumed to be the name of its occurrence type.
            ->set("end", $row->opt("exit date", "PersonExtractor::date"));
    }
}
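A note on the date conversion used above: DateTime::createFromFormat is lenient. It returns false for unparsable input, rolls over out-of-range days (2003-02-31 becomes 2003-03-03), and fills unspecified time fields with the current time unless the format starts with "!". Where stricter conversion is wanted, a converter along the following lines could be used. The function strict_date is a hypothetical helper, not part of the Topincs API, and whether a converter may return null to signal an invalid value is an assumption.

```php
<?php
// Hypothetical stricter alternative to a date converter such as
// PersonExtractor::date. Returns null instead of a silently adjusted date.
function strict_date(string $value): ?DateTimeImmutable {
    $value = trim($value);
    // "!" resets all fields not named in the format, so the time is 00:00:00.
    $date = DateTimeImmutable::createFromFormat("!Y-m-d", $value);
    if ($date === false) {
        return null; // Not in YYYY-MM-DD form at all.
    }
    // Round trip: a rolled-over date such as 2003-02-31 no longer matches.
    return $date->format("Y-m-d") === $value ? $date : null;
}

$valid = strict_date("1980-01-01");
$rolled = strict_date("2003-02-31");
```

The round-trip comparison is a simple way to reject values that PHP would otherwise accept and shift to a different day.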
After extracting we have all data assembled in a way that is in line with our schema. So far it is only in memory. Now we need to persist it in the store. For this we need to hand over the import subjects to the importer. Before the import subjects are actually imported, some checks are performed:
Errors related to one import subject affect the import of other subjects only if there is an association between them.
Generally it is good practice to use one or more subject identifiers per topic. If a mapping of import subjects onto the persisted set of subject identifiers is not possible, using a Matcher avoids duplication. Duplication means creating a new topic for an import subject that is already represented in the store.
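This rule can be illustrated with a few lines of plain PHP. In this sketch each import subject is an array with an id, a list of referenced ids in refs, and an error flag. The structure and the function importable are invented for this example and say nothing about how the Topincs importer is implemented. Only direct associations are considered.

```php
<?php
// Illustrative only: decides which subjects can still be imported when
// some subjects have errors. A subject is blocked if it has an error itself
// or if it directly references a subject that has one.
function importable(array $subjects): array {
    $failed = [];
    foreach ($subjects as $subject) {
        if ($subject["error"]) {
            $failed[$subject["id"]] = true;
        }
    }
    $ok = [];
    foreach ($subjects as $subject) {
        if (isset($failed[$subject["id"]])) {
            continue;
        }
        foreach ($subject["refs"] as $ref) {
            if (isset($failed[$ref])) {
                continue 2; // An associated subject failed.
            }
        }
        $ok[] = $subject["id"];
    }
    return $ok;
}

$subjects = [
    // Joel's record has an error, e.g. an invalid date of birth.
    ["id" => "person:joel", "refs" => [], "error" => true],
    ["id" => "person:mary", "refs" => [], "error" => false],
    ["id" => "company:orange", "refs" => [], "error" => false],
    ["id" => "employment:1", "refs" => ["person:joel", "company:orange"], "error" => false],
    ["id" => "employment:2", "refs" => ["person:mary", "company:orange"], "error" => false],
];

$ids = importable($subjects);
```

Here the error on Joel blocks only his own employment. Mary, the company, and Mary's employment are unaffected.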
A matcher performs the task of aligning existing information in the store with the intermediate representation of import subjects in memory.
hash computes a string value from the relevant properties.
tobject fetches the relevant information from all existing topics of the respective topic type.
API reference
<?php
require_once("api/import/Matcher.php");
class PersonMatcher extends Matcher {
    // The serialization name of the topic type.
    const TYPE = "person";
    // Builds the comparison key; the array keys must match those returned by tobject.
    static function hash(array $data) : string {
        return strtolower($data["name"] . $data["date-of-birth"]->format("Ymd"));
    }
    static function tobject(Tobject $tobject) : array {
        return [
            "name" => $tobject->get_name(),
            "date-of-birth" => $tobject->get_date_of_birth()
        ];
    }
}
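How the two methods might work together can be sketched without the framework. Below, existing topics and import subjects are plain arrays, person_hash plays the role of hash, and the $existing array stands in for what tobject provides for every stored person. All names and the matching loop are assumptions for illustration, not the Topincs implementation.

```php
<?php
// Illustrative stand-in for Matcher::hash: a string built from the
// properties that decide whether two records describe the same person.
function person_hash(array $data): string {
    return strtolower($data["name"] . $data["date-of-birth"]->format("Ymd"));
}

// Data as it might be fetched from existing topics in the store.
$existing = [
    "topic-1" => ["name" => "Mary", "date-of-birth" => new DateTime("1990-01-01")],
    "topic-2" => ["name" => "Huey", "date-of-birth" => new DateTime("1990-03-01")],
];

// Index the existing topics by hash once.
$index = [];
foreach ($existing as $id => $data) {
    $index[person_hash($data)] = $id;
}

// Import subjects held in memory after extraction.
$incoming = [
    ["name" => "MARY", "date-of-birth" => new DateTime("1990-01-01")],
    ["name" => "Joel", "date-of-birth" => new DateTime("1980-01-01")],
];

// A hit means the subject is already represented; a miss means a new topic.
$matches = [];
foreach ($incoming as $data) {
    $matches[$data["name"]] = $index[person_hash($data)] ?? null;
}
```

In this run "MARY" is aligned with the stored topic for Mary, while Joel has no counterpart and would become a new topic.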