Models

The HBIM is set up so that new data can be added to it and its capabilities extended. In this section, we will go through two similar exercises that improve the HBIM and its querying capabilities.

Example I: Transcripts of the documents

In this example, we will directly retrieve the data from the database and modify it.

The HBIM database contains documents in various formats. However, the documents cannot be searched because their text is not machine-readable. Capable OCR (optical character recognition) programs can read such documents with a high degree of accuracy.

In this example, we will use Tesseract for OCR. It is an open-source program whose recognition engine is based on an LSTM neural network. Its core library, libtesseract, is written in C++, but front-ends exist for many programming languages, such as Python. We may use any tooling or programming language to interact with the database, irrespective of what the API is written in.

Setting up the database

Looking at the database, we can see that the documents relation is defined as:

CREATE TABLE documents(
        id                SERIAL PRIMARY KEY,
        description       TEXT,
        filetype          CHAR(8),
        file              CHAR(20),
        comments          TEXT,
        top_left          POINT,
        bottom_right      POINT
);

There is no place to store the transcripts, so the schema must be extended. If a document is several pages long, it makes sense to store the transcript page by page. A single long text for the entire document may be difficult to query, particularly when the OCR output contains mistakes.

Here we will create a new relation to store the transcript of the documents.

CREATE TABLE transcripts (
	id        SERIAL PRIMARY KEY,
	texts     TEXT,
	doc_id    INT REFERENCES documents(id),
	page      INT
);

Here, texts contains the transcribed text, doc_id is a foreign key to the documents relation, and page contains the page number within the document.

The database is now ready to accept the new content, which we add after setting up Tesseract.
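
To illustrate why the page-by-page layout helps with querying, the following is a minimal sketch of a text search that reports the page on which each hit occurs. It rests on several assumptions: SQLite (from the Python standard library) stands in for the real database server so that the sketch runs anywhere, the schema is simplified accordingly, the sample rows are made up, and search_transcripts is a hypothetical helper name.

```python
import sqlite3

# SQLite stands in for the real database server (an assumption made so the
# sketch is self-contained). SERIAL becomes INTEGER PRIMARY KEY and the
# columns of documents that are not needed here are left out.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (
        id          INTEGER PRIMARY KEY,
        description TEXT
    );
    CREATE TABLE transcripts (
        id     INTEGER PRIMARY KEY,
        texts  TEXT,
        doc_id INT REFERENCES documents(id),
        page   INT
    );
""")

# Made-up sample rows, purely for illustration.
conn.execute("INSERT INTO documents (id, description) VALUES (1, 'Survey report')")
conn.executemany(
    "INSERT INTO transcripts (texts, doc_id, page) VALUES (?, ?, ?)",
    [
        ("General description of the building", 1, 1),
        ("The roof was repaired with new tiles", 1, 2),
    ],
)


def search_transcripts(conn, term):
    """Return (document id, description, page) for each page containing term."""
    rows = conn.execute(
        """
        SELECT d.id, d.description, t.page
        FROM transcripts AS t
        JOIN documents AS d ON d.id = t.doc_id
        WHERE t.texts LIKE ?
        ORDER BY d.id, t.page
        """,
        (f"%{term}%",),
    )
    return rows.fetchall()


hits = search_transcripts(conn, "roof")
print(hits)
```

Because each row holds a single page, a hit points straight to the page that needs checking, and a faulty OCR result can be corrected by updating one row instead of rewriting the transcript of the whole document.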

Setting up Tesseract

Tesseract can be used standalone. The package is available for most Linux distributions and for macOS. It can be installed with the package manager, for example with Homebrew on macOS:

brew install tesseract
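
Once installed, Tesseract can be called from a script. The following is a minimal sketch that wraps the command-line program using only the Python standard library. It relies on the command-line form tesseract IMAGE stdout -l LANG, where the output base stdout makes Tesseract print the recognised text instead of writing a file. The helper names build_tesseract_command and ocr_image, and the file name page-1.png, are hypothetical and chosen for illustration only.

```python
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build the argument list for one Tesseract run on a single image.

    Using "stdout" as the output base makes Tesseract print the
    recognised text instead of writing it to a file.
    """
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on image_path and return the recognised text.

    Requires the tesseract executable to be on the PATH.
    """
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract exits with an error
    )
    return result.stdout


# "page-1.png" is a placeholder file name, not a file from the HBIM.
print(build_tesseract_command("page-1.png"))
```

Calling ocr_image("page-1.png") would then return the text of that page as a string. Python front-ends to Tesseract could be used instead; calling the program directly simply avoids an extra dependency in this sketch.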

Populating the database with transcripts
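
One way this step could look is sketched below: each page of a document is passed through OCR and the result is stored as one row in the transcripts relation. Several things are assumptions rather than part of the HBIM: how page images are obtained from the file attribute is not specified here, so the sketch simply takes a list of image paths in page order; ocr is any function that maps an image path to text; store_transcripts is a hypothetical helper name; and the dry run uses an in-memory SQLite database with a fake OCR function so that the example runs without a database server or Tesseract.

```python
import sqlite3


def store_transcripts(conn, doc_id, page_images, ocr, placeholder="%s"):
    """OCR each page image and insert one transcripts row per page.

    conn        -- any DB-API 2.0 connection
    doc_id      -- id of the row in documents that the pages belong to
    page_images -- image paths in page order (hypothetical input)
    ocr         -- callable mapping an image path to its text
    placeholder -- parameter marker of the driver in use
                   ("%s" for psycopg2, "?" for sqlite3)
    """
    sql = (
        "INSERT INTO transcripts (texts, doc_id, page) "
        f"VALUES ({placeholder}, {placeholder}, {placeholder})"
    )
    cur = conn.cursor()
    stored = 0
    for page, image in enumerate(page_images, start=1):
        cur.execute(sql, (ocr(image), doc_id, page))
        stored += 1
    conn.commit()
    return stored


# Dry run: SQLite stand-in and a fake OCR function (both assumptions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transcripts ("
    "id INTEGER PRIMARY KEY, texts TEXT, doc_id INT, page INT)"
)


def fake_ocr(path):
    """Pretend to recognise text; returns a predictable string."""
    return f"text of {path}"


count = store_transcripts(conn, 1, ["p1.png", "p2.png"], fake_ocr, placeholder="?")
rows = conn.execute(
    "SELECT doc_id, page, texts FROM transcripts ORDER BY page"
).fetchall()
print(count, rows)
```

Against the real database, the same function would be given a connection from the appropriate driver and a real OCR function in place of fake_ocr. The values are passed as query parameters rather than pasted into the SQL string, so quotation marks in the OCR output cannot break the statement.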

Advantages

There is less overhead since the data is accessed directly.
Any programming language can be used, as long as it can interface with the database management server.

Disadvantages

Knowledge of the internal workings of the database is required.
If the database is changed significantly, the API may need to be updated.