The data schema
From W
global links:
- new NPD admin system (test version)
- new NPD user system (test version)
- Reports - weekly reports. setup and test reports, and others
- Meetings - information on past and future meetings
- Project plan - the original, and current updates
- the completed dissertation
- Original NPD - info and links
- Web services links
UPDATE - the new database schema with content is available as a backup:
Contents |
Analysis
Of the data schema and suggested/required changes:
mytsummeta
Fields: GeneId, GeneName, GeneNameOthers, TaxnNames, Keywords, TempID
Used in: gene_name reads GeneName, TaxnNames
GeneId is not auto-generated on the database, but should be auto-created by the system.
TempID appears to be superfluous, given the existence of GeneID.
GeneNameOthers also appears to be superfluous given that there is a genenametaxn table.
Keywords is complicated. See the later comments on the keywords issue.
mytsumgenenametaxn
Fields: GeneID, GeneName, Preferred, TaxnNames
Used in: gene_name reads GeneName, TaxnNames
GeneID should be added to this table when auto created at least once, as at least one record should exist per gene, plus more for each alt. name.
mytsumgenemaploctn
Fields: GeneID, GeneMapLoctn, TaxnName
Used in: gene_name reads GeneMapLoctn, TaxnName
mytsumprotseqannotn
Fields: GeneID, ProtSeqID, PrimaryKey, TaxnName, MwPredict, PIPredict, MwActual, PIActual, AANo, Domains, Accession, Links
Used in: protein_info reads ProtSeqID, TaxnName, MwPredict, PIPredict, MwActual, PIActual, AANo, Links
Domains may be superfluous.
Accession may be superfluous, see later comments on Links.
PrimaryKey is built out of ProtSeqID, which is a rolling number. PrimaryKey may be superfluous.
TaxnName may be superfluous.
mytsumprotannotn
Fields: GeneID, SeqID, Start, End, Domain, Annot_Desc
Used in: protein_info reads Annot_Desc, Start, End, Domain
SeqID is not auto-created by the database, but should be auto-created by the system
mytsumprotsubnlocal
Fields: GeneID, TableID, RecordID, CellStageParent, CellStageName, CellStageDesc, CellStageDetail, PrimaryKey, Links
Used in: localisation reads CellStageName, CellStageDesc, CellStageDetail, Links
TableID and PrimaryKey may be superfluous. See later comments on Links.
CellStageParent is sometimes blank, sometimes not. Could be available from Gene Ontology. But may be superfluous anyway.
mytsumgenebioproces
Fields: GeneID, TableID, GOID, GOTerm, Links
Used in: function reads GOTerm, Links
TableID may be superfluous. See later comments on Links.
GOID can be retrieved from Gene Ontology.
mytsumgenemolfnctn
Fields: GeneID, TableID, GOID, GOTerm, Links
Used in: function reads GOTerm, Links
TableID may be superfluous. See later comments on Links.
GOID can be retrieved from Gene Ontology.
mytsumgenefnctn
Fields: GeneID, TableID, GeneFnctn, Links
Used in: function reads GeneFnctn, Links
TableID may be superfluous. See later comments on Links.
mytsumgeneexpr
Fields: GeneID, TableID, ExptName, GeneExpr, PrimaryKey, Links
Used in: expression reads GeneExpr, ExptName, Links
TableID and PrimaryKey may be superfluous.
See later comments on Links.
mytsumgenerelated
Fields: GeneID, GeneName, TaxnName, PrimaryKey, Links
Pages: related_genes reads TaxnName, GeneName, Links
PrimaryKey may be superfluous.
See later comments on Links.
mytsumgenekeyword
Fields: ID, GeneID, Keyword
Pages: keywords reads Keyword
See later comments on Keywords.
mytsumhistory
Fields: GeneID, CreatedAt, ChangedAt
Tracks the created and altered dates for each GeneID. Is this superfluous? Could it be combined with mytsummeta?
mytsumdnaannotn
Fields: GeneID, SeqID, Start, End, Domain, Annot_Desc
This table is no longer relevant.
mytsumdnaseqannotn
Fields: GeneID, DNASeqID, PrimaryKey, Domains, Accession, Links, TaxnName
This table is no longer relevant.
mytsumgenecomp
Fields: GeneID, Keyword
Provides compartment location information. But this data comes from subcellular localisation.
So this table is probably superfluous.
mytsumxdblnkrecord
Fields: XDBID, XDBAcc, TableID, RecordID, Published, EvidenceCode, XDBDesc, Link, PrimaryKey, GOLink
A table of all links used elsewhere in the database. May be superfluous. See later comments re. Links.
Links comments
The following 7 tables all contain links:
mytsumprotseqannotn (pubmed, entrez, uniprot, other - plus Accession)
mytsumprotsubnlocal (pubmed)
mytsumgenebioproces (pubmed - plus Evidence Code)
mytsumgenemolfnctn (pubmed - plus Evidence Code)
mytsumgenefnctn (pubmed - plus Evidence Code)
mytsumgeneexpr (pubmed)
mytsumgenerelated (pubmed, entrez, uniprot, other)
These links are also all stored in mytsumxdblnkrecord, although this table does not actually appear to be used at any point by the current live NPD portion of the system. It may have had use in the original admin system, although it does not form part of the original specification either.
The mytsumxdblnkrecord table lists a link per record but also a GO link - though it is unclear when these differ. This must be investigated further.
It is because this table exists that the other tables above have TableID tags which may otherwise not be relevant - they are used to tag links to table records.
The overall concept of storing links separately would be beneficial. It would enable clearer listing of link type and multiple links, instead of having them as a list in one field as is the current situation. However, the current mytsumxdblnkrecord table would not be ideal for this either. It would require some changes.
Any such changes to any of the tables would entail an alteration to the user system. So it must be decided if these alterations can be supported and completed in time. Otherwise, it would probably be most simple to just drop the mytsumxdblnkrecord table, and the superfluous TableID fields from the other tables.
Either way, the links should not exist in both places - they should either be in a separate Links table or on individual tables as fields.
Keywords comments
The mytsummeta table includes a Keywords field. In it are the keywords that would have been attached to the Gene ID by the administrator on the old system.
However, there is also a mytsumgenekeyword table. This table holds the above mentioned keywords, but also contains a keyword record of many of the discrete fields of data stored against every Gene ID. This does offer an easy method for searching across a broad spectrum of data, but represents a huge percentage increase in the size of the database. Whilst this may not be problematic at present, it would be if the database grew significantly. Also, it is not clear if this is a great benefit, as searches across multiple tables can easily be accomplished at little computational cost.
It will be necessary to split the Keywords content into separate subsections - those that are represented elsewhere e.g. GeneIDs, Link IDs, etc, and those that are unique - e.g. those represented in the Keywords field of mytsummeta. These unique keywords should then exist either in a keywords table, or in the Keywords field of mytsummeta, but not both. If the keywords table is dismantled, this will represent some loss to search functionality, so it will be necessary to recreate the search functionality that is of value within the search mechanism - or to just dump it if it was never of any use.
A further comment on keywords and searching - when searching by compartment, those keywords would currently be available in three locations - the keyword table, the genecomp table, and the protsubnlocal table. This is not necessary. as long as the search function takes account of where to look for this data, it seems superfluous to store it in 3 separate locations.
Changes
There are some changes that are suggested based on the above comments. These changes would require further changes to the user system too, but would offer benefits to the way data is stored, and limit unnecessary redundancy.
Some changes are required (identified as such below) in order to allow the new admin system to function. Hopefully they would not require changes to the user system. This would enable use of the new system without further change to the old user system, though it would limit the improvements that could be made.
mytsummeta
drop Keywords
drop TempID
drop genenameothers
transfer in CreatedAt and AlteredAt from mytsumhistory
mytsumgenenametaxn
add autonum RecordID (required)
mytsumgenemaploctn
add autonum RecordID (required)
mytsumprotseqannotn
drop PrimaryKey
drop Links
drop Domains
alter Accession content to be a pointer to the relevant Link in the new Links table
mytsumprotannotn
add autonum RecordID (required)
mytsumprotsubnlocal
drop PrimaryKey
drop CellStageParent
Apply PRIMARY and auto_increment to RecordID (required)
mytsumgenebioproces
add autonum RecordID (required)
drop TableID
drop Links
mytsumgenemolfnctn
add autonum RecordID (required)
drop TableID
drop Links
mytsumgenefnctn
add autonum RecordID (required)
drop TableID
drop Links
mytsumgeneexpr
add autonum RecordID (required)
drop Table ID
drop PrimaryKey
drop Links
mytsumgenerelated
drop Links
drop PrimaryKey
add autonum RecordID (required)
mytsumgenekeyword
edit content so it reflects only the unique keywords
(if this could not be done, due to sticking to the old user system, add a "unique" field, identifying those keywords that are not copies from elsewhere in the db)
mytsumhistory
transfer CreatedAt and ChangedAt to mytsummeta
drop table
mytsumdnaannotn
drop table
mytsumdnaseqannotn
drop table
mytsumgenecomp
drop table
mytsumxdblnkrecord
drop PrimaryKey
perhaps drop GO Link, unless it can be shown to be needed over and above Link
what are XDBID, XDBAcc, XDBDesc? drop if unnecessary
change RecordID to RefID
change TableID to RefTable and change content to actual table names
add autonum RecordID
SQL functions
Attach here some files that contain the SQL commands to execute the above changes to the database
