Hierarchy

RAMADDA User Guide

15.2 Hierarchy

The data model is hierarchical and follows a composite pattern, i.e., everything is an entry and all entries (except the top most one) have a parent group entry. (e.g., see: How to manipulate hierarchical information in flat relational database tables).

We want to be able to search within the hierarchy, e.g., show me all of the radar files under the "ccs2" group. To do this we encode the hierarchical structure within the identifier of the groups. The relational model looks like:

       name                         id                 parent_group_id
group: Top                          0                  --- 
group:     Comet                    0:0                0  
group:          ccs1                0:0:0              0:0
entry:               radar1         A                  0:0:0
group:          ccs2                0:0:1              0:0
entry:               radar2         B                  0:0:1
entry:               radar3         C                  0:0:1
group:          ccs3                D                  0:0

Non-Group Entrys have a globally unique identifier generated by RAMADDA. However, Groups have a special identifier that allow for searching under sub-groups. A Group identifier is defined as:

<parent Group identifier>:<sibling number>

For example, the top most group has identifier "0". Its children have identifiers:

0:0
0:1
0:2
...

The children groups of the 0:2 Group have identifiers:

0:2:0
0:2:1
0:2:2
...

By doing this we can make a search of the form:

    select * from entries where parent_group_id like '0:0:1%'

The problem with this approach is that the identifier of a group is based on its position within the hierarchy. When a group is moved to a new parent group we need to change its identifier. Kind of messy. If external applications have a reference to one of these groups based on its id then those references are bad.

Another problem with this approach is that the "like" search entails a full table scan.

One of the alternative approaches to encoding hierarchy in a relational model is to have an ancestor table that defines the complete hierarchy. e.g.,:

id                ancestor
0                 ---
1                 0
2 (parent=1)      1
2                 0
3 (parent=2)      2
3                 1
3                 0

This brings scalability issues. The size of this table would be average tree depth*number of entries. So for example, we have ~3 million level3 radar files with the hierarchy:

Top
	IDD
		Radar
			Level3
				Station
					Product
						Radar data1
						Radar data2

We would end up having 6*3 million entries in the ancestor table. If this was a relatively static data set then we could handle that but the data set is dynamic and we get 5 files/second coming into the repository and 5/second being removed.

One approach to dealing with the identifier issue is to have all entries/groups have a unique and permanent id along with a parent_id. To encode the hierarchy we then have another field, a hierarchy_id, which takes the form that we currently have. e.g.:

       name                         id    parent_group_id    hierarchy_id
group: Top                          A     ---                0
group:     Comet                    B     A                  0:0
group:          ccs1                C     B                  0:0:0
entry:               radar1         D     C                  0:0:0:0
group:          ccs2                E     B                  0:0:1
entry:               radar2         F     E                  0:0:1:0
entry:               radar3         G     E                  0:0:1:1
group:          ccs3                H     B                  0:0:2

We still have bookkeeping costs when things are added and moved in that the hierarchy_id needs to change. We still have search costs in that we do:

    select * from entries where hierarchy_id like '0:0:1%'

But, each entity has a unique and permanent id.