odaesa

Personal tools

The data schema

From W

Jump to: navigation, search

global links:

  1. new NPD admin system (test version)
  2. new NPD user system (test version)
  3. Reports - weekly reports. setup and test reports, and others
  4. Meetings - information on past and future meetings
  5. Project plan - the original, and current updates
  6. the completed dissertation
  7. Original NPD - info and links
  8. Web services links


UPDATE - the new database schema with content is available as a backup:

npd.sql


Contents

Analysis

Of the data schema and suggested/required changes:

mytsummeta

Fields: GeneId, GeneName, GeneNameOthers, TaxnNames, Keywords, TempID

Used in: gene_name reads GeneName, TaxnNames

GeneId is not auto-generated on the database, but should be auto-created by the system.

TempID appears to be superfluous, given the existence of GeneID.

GeneNameOthers also appears to be superfluous given that there is a genenametaxn table.

Keywords is complicated. See the later comments on the keywords issue.


mytsumgenenametaxn

Fields: GeneID, GeneName, Preferred, TaxnNames

Used in: gene_name reads GeneName, TaxnNames

GeneID should be added to this table when auto created at least once, as at least one record should exist per gene, plus more for each alt. name.


mytsumgenemaploctn

Fields: GeneID, GeneMapLoctn, TaxnName

Used in: gene_name reads GeneMapLoctn, TaxnName


mytsumprotseqannotn

Fields: GeneID, ProtSeqID, PrimaryKey, TaxnName, MwPredict, PIPredict, MwActual, PIActual, AANo, Domains, Accession, Links

Used in: protein_info reads ProtSeqID, TaxnName, MwPredict, PIPredict, MwActual, PIActual, AANo, Links

Domains may be superfluous.

Accession may be superfluous, see later comments on Links.

PrimaryKey is built out of ProtSeqID, which is a rolling number. PrimaryKey may be superfluous.

TaxnName may be superfluous.


mytsumprotannotn

Fields: GeneID, SeqID, Start, End, Domain, Annot_Desc

Used in: protein_info reads Annot_Desc, Start, End, Domain

SeqID is not auto-created by the database, but should be auto-created by the system


mytsumprotsubnlocal

Fields: GeneID, TableID, RecordID, CellStageParent, CellStageName, CellStageDesc, CellStageDetail, PrimaryKey, Links

Used in: localisation reads CellStageName, CellStageDesc, CellStageDetail, Links

TableID and PrimaryKey may be superfluous. See later comments on Links.

CellStageParent is sometimes blank, sometimes not. Could be available from Gene Ontology. But may be superfluous anyway.


mytsumgenebioproces

Fields: GeneID, TableID, GOID, GOTerm, Links

Used in: function reads GOTerm, Links

TableID may be superfluous. See later comments on Links.

GOID can be retrieved from Gene Ontology.


mytsumgenemolfnctn

Fields: GeneID, TableID, GOID, GOTerm, Links

Used in: function reads GOTerm, Links

TableID may be superfluous. See later comments on Links.

GOID can be retrieved from Gene Ontology.


mytsumgenefnctn

Fields: GeneID, TableID, GeneFnctn, Links

Used in: function reads GeneFnctn, Links

TableID may be superfluous. See later comments on Links.


mytsumgeneexpr

Fields: GeneID, TableID, ExptName, GeneExpr, PrimaryKey, Links

Used in: expression reads GeneExpr, ExptName, Links

TableID and PrimaryKey may be superfluous.

See later comments on Links.


mytsumgenerelated

Fields: GeneID, GeneName, TaxnName, PrimaryKey, Links

Pages: related_genes reads TaxnName, GeneName, Links

PrimaryKey may be superfluous.

See later comments on Links.


mytsumgenekeyword

Fields: ID, GeneID, Keyword

Pages: keywords reads Keyword

See later comments on Keywords.


mytsumhistory

Fields: GeneID, CreatedAt, ChangedAt

Tracks the created and altered dates for each GeneID. Is this superfluous? Could it be combined with mytsummeta?


mytsumdnaannotn

Fields: GeneID, SeqID, Start, End, Domain, Annot_Desc

This table is no longer relevant.


mytsumdnaseqannotn

Fields: GeneID, DNASeqID, PrimaryKey, Domains, Accession, Links, TaxnName

This table is no longer relevant.


mytsumgenecomp

Fields: GeneID, Keyword

Provides compartment location information. But this data comes from subcellular localisation.

So this table is probably superfluous.


mytsumxdblnkrecord

Fields: XDBID, XDBAcc, TableID, RecordID, Published, EvidenceCode, XDBDesc, Link, PrimaryKey, GOLink

A table of all links used elsewhere in the database. May be superfluous. See later comments re. Links.


Links comments

The following 7 tables all contain links:

mytsumprotseqannotn (pubmed, entrez, uniprot, other - plus Accession)

mytsumprotsubnlocal (pubmed)

mytsumgenebioproces (pubmed - plus Evidence Code)

mytsumgenemolfnctn (pubmed - plus Evidence Code)

mytsumgenefnctn (pubmed - plus Evidence Code)

mytsumgeneexpr (pubmed)

mytsumgenerelated (pubmed, entrez, uniprot, other)

These links are also all stored in mytsumxdblnkrecord, although this table does not actually appear to be used at any point by the current live NPD portion of the system. It may have had use in the original admin system, although it does not form part of the original specification either.

The mytsumxdblnkrecord table lists a link per record but also a GO link - though it is unclear when these differ. This must be investigated further.

It is because this table exists that the other tables above have TableID tags which may otherwise not be relevant - they are used to tag links to table records.

The overall concept of storing links separately would be beneficial. It would enable clearer listing of link type and multiple links, instead of having them as a list in one field as is the current situation. However, the current mytsumxdblnkrecord table would not be ideal for this either. It would require some changes.

Any such changes to any of the tables would entail an alteration to the user system. So it must be decided if these alterations can be supported and completed in time. Otherwise, it would probably be most simple to just drop the mytsumxdblnkrecord table, and the superfluous TableID fields from the other tables.

Either way, the links should not exist in both places - they should either be in a separate Links table or on individual tables as fields.


Keywords comments

The mytsummeta table includes a Keywords field. In it are the keywords that would have been attached to the Gene ID by the administrator on the old system.

However, there is also a mytsumgenekeyword table. This table holds the above mentioned keywords, but also contains a keyword record of many of the discrete fields of data stored against every Gene ID. This does offer an easy method for searching across a broad spectrum of data, but represents a huge percentage increase in the size of the database. Whilst this may not be problematic at present, it would be if the database grew significantly. Also, it is not clear if this is a great benefit, as searches across multiple tables can easily be accomplished at little computational cost.

It will be necessary to split the Keywords content into separate subsections - those that are represented elsewhere e.g. GeneIDs, Link IDs, etc, and those that are unique - e.g. those represented in the Keywords field of mytsummeta. These unique keywords should then exist either in a keywords table, or in the Keywords field of mytsummeta, but not both. If the keywords table is dismantled, this will represent some loss to search functionality, so it will be necessary to recreate the search functionality that is of value within the search mechanism - or to just dump it if it was never of any use.

A further comment on keywords and searching - when searching by compartment, those keywords would currently be available in three locations - the keyword table, the genecomp table, and the protsubnlocal table. This is not necessary. as long as the search function takes account of where to look for this data, it seems superfluous to store it in 3 separate locations.


Changes

There are some changes that are suggested based on the above comments. These changes would require further changes to the user system too, but would offer benefits to the way data is stored, and limit unnecessary redundancy.

Some changes are required (identified as such below) in order to allow the new admin system to function. Hopefully they would not require changes to the user system. This would enable use of the new system without further change to the old user system, though it would limit the improvements that could be made.

mytsummeta

drop Keywords

drop TempID

drop genenameothers

transfer in CreatedAt and AlteredAt from mytsumhistory

mytsumgenenametaxn

add autonum RecordID (required)

mytsumgenemaploctn

add autonum RecordID (required)

mytsumprotseqannotn

drop PrimaryKey

drop Links

drop Domains

alter Accession content to be a pointer to the relevant Link in the new Links table

mytsumprotannotn

add autonum RecordID (required)

mytsumprotsubnlocal

drop PrimaryKey

drop CellStageParent

Apply PRIMARY and auto_increment to RecordID (required)

mytsumgenebioproces

add autonum RecordID (required)

drop TableID

drop Links

mytsumgenemolfnctn

add autonum RecordID (required)

drop TableID

drop Links

mytsumgenefnctn

add autonum RecordID (required)

drop TableID

drop Links

mytsumgeneexpr

add autonum RecordID (required)

drop Table ID

drop PrimaryKey

drop Links

mytsumgenerelated

drop Links

drop PrimaryKey

add autonum RecordID (required)

mytsumgenekeyword

edit content so it reflects only the unique keywords

(if this could not be done, due to sticking to the old user system, add a "unique" field, identifying those keywords that are not copies from elsewhere in the db)

mytsumhistory

transfer CreatedAt and ChangedAt to mytsummeta

drop table

mytsumdnaannotn

drop table

mytsumdnaseqannotn

drop table

mytsumgenecomp

drop table

mytsumxdblnkrecord

drop PrimaryKey

perhaps drop GO Link, unless it can be shown to be needed over and above Link

what are XDBID, XDBAcc, XDBDesc? drop if unnecessary

change RecordID to RefID

change TableID to RefTable and change content to actual table names

add autonum RecordID

SQL functions

Attach here some files that contain the SQL commands to execute the above changes to the database