Multimedia:Meeting in Paris/Notes/Metadata
- Saturday, November 7, 2009 - room 2
- Moderator: Duesentrieb
- Note-taker: Brion
Structured data!
- metadata about pages/revisions
- metadata about media files
- from MediaWiki [categories etc]
- embedded
- inherent: file size, page count etc
- add'l data [EXIF, IPTC, XMP] (see the sketch after this list)
- supplied on description page
- external data
- For batch uploads... would these come into the page?
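Not from the meeting itself, but a minimal sketch of what reading the embedded data could look like, assuming Python with Pillow (MediaWiki's own EXIF extraction is in PHP):

    from PIL import Image, ExifTags

    def read_embedded_exif(path):
        """Return embedded EXIF tags as a {tag name: value} dict (empty if none)."""
        with Image.open(path) as img:
            exif = img.getexif()
            return {ExifTags.TAGS.get(tag_id, tag_id): value
                    for tag_id, value in exif.items()}

    # read_embedded_exif("Hallo.jpg") might give
    # {"ImageDescription": "Created with The Gimp", ...}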
What's the authoritative source?
- Wikitext as primary source? [editable]
- -> cache in index [searchable, queryable]
- -> can override other sources [EXIF, file-inherent data] (precedence sketch below)
- External (read-only) authority databases?
- (Do we need/want such or just import things per-source when doing batch uploads?)
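Nothing was decided here, but a minimal sketch of what "wikitext overrides the other sources" could look like (the field and source names are made up for this sketch):

    # Assumed precedence: description-page wikitext > embedded (EXIF/IPTC/XMP) > file-inherent.
    SOURCE_PRIORITY = ["wikitext", "embedded", "inherent"]

    def resolve_field(field, values_by_source):
        """values_by_source maps a source name to its {field: value} dict."""
        for source in SOURCE_PRIORITY:
            values = values_by_source.get(source, {})
            if field in values:
                return values[field], source
        return None, None

    # resolve_field("description", {
    #     "embedded": {"description": "Created with The Gimp"},
    #     "wikitext": {"description": "hello"},
    # })
    # -> ("hello", "wikitext")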
Structured data use cases
- media file metadata export for re-use
- api, embedded
- [Modify EXIF or XMP data to embed extra stuff?] [Recommend pushing that to later -brion]
- orig artists, repo, institution...
- Including info from authority files
- using structured data about things+files for search and research
- api, embedded
Let's try to keep this minimal?
- Real use cases?
- search
- license, keywords etc
- reuse
- author(s)
- license
- -> Generate byline (sketch after this list)
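As an illustration of the "generate byline" idea (the exact format was not spelled out in the meeting; this one is invented here):

    def byline(authors, license_name, source_url=None):
        """Build a re-use credit line from author(s) and license."""
        credit = ", ".join(authors) + " / " + license_name
        if source_url:
            credit += " (" + source_url + ")"
        return credit

    # byline(["Example Author"], "CC BY-SA 3.0")  ->  'Example Author / CC BY-SA 3.0'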
- Main fields... (many of these can be tied to Dublin Core fields, so import/export for academic collections can hit common denominators; mapping sketch after this list)
- Fixed data formats
- license
- license features
- dc:rights, creative commons extensions
- geo coordinates
- file format
- jpeg vs tiff vs ogg
- media type
- drawing vs photo vs map
- filesize / resolution
- content language
- dc:language
- source collection + identifier
- source URL
- dc:source
- creation date
- dc:date
- Freetext [can have multilingual variants]
- author
- dc:creator
- description
- dc:description
- source
- dc:source
- "other fields" from foreign source
- keywords
- (problems with consistency? controlled vocabularies with some sources, but don't match up reliably)
- people shown in picture
- etc
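To illustrate the Dublin Core common-denominator idea, a sketch of a field-to-DC mapping used at export time (the internal field names are hypothetical):

    FIELD_TO_DC = {
        "author": "dc:creator",
        "description": "dc:description",
        "source": "dc:source",
        "creation_date": "dc:date",
        "content_language": "dc:language",
        "license": "dc:rights",
    }

    def to_dublin_core(record):
        """Keep only the fields that map onto a Dublin Core term."""
        return {FIELD_TO_DC[k]: v for k, v in record.items() if k in FIELD_TO_DC}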
How to source/edit the stuff...
- source fields from file inherent/EXIF
- scary templates
- Nesting is required for extracting triples sanely...
- {{Information|desc={{en|hello}}{{de|hallo}}}}
- We _think_ we can break this out (rough sketch after this list)
- "template" "Information@desc/en" "hello"
- "template" "Information@desc/de" "hallo"
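Not from the notes: a rough proof-of-concept of breaking that nested example out. It is regex-based and only handles this exact shape; a real extractor would hook into the MediaWiki preprocessor instead.

    import re

    INNER_TEMPLATE = re.compile(r"\{\{(\w+)\|([^{}]*)\}\}")

    def extract_desc_triples(wikitext):
        """Flatten {{Information|desc={{en|...}}{{de|...}}}} into (source, key, value) triples."""
        match = re.search(r"\{\{Information\|desc=(.*)\}\}", wikitext, re.S)
        if not match:
            return []
        return [("template", "Information@desc/" + lang, text)
                for lang, text in INNER_TEMPLATE.findall(match.group(1))]

    # extract_desc_triples("{{Information|desc={{en|hello}}{{de|hallo}}}}")
    # -> [('template', 'Information@desc/en', 'hello'),
    #     ('template', 'Information@desc/de', 'hallo')]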
Normalization of property specs/extraction most important
Normalization of property values can be left for later
Metadata extraction spec
Possibility... (sketch of running such a spec below)
- Define for <dc:description> sourcing from:
- "template" "Information@desc/<lang>"
- "exif" "Description"
- Gives us these triples:
- "File:Hallo.jpg" "dc:description@lang=de" "hallo"
- "File:Hallo.jpg" "dc:description@lang=en" "hello"
- "File:Hallo.jpg" "dc:description" "Created with The Gimp"
- also other sources of info available:
- revision data / uploader / editor
- external authority data [rdf triples given from bulk uploader?]
Possibly spec as a special page or a mediawiki:horrid or config file. ;)
Whichever we choose, expect the spec to change; we need to be able to re-run metadata extraction in bulk.
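Not a decision from the meeting, just a sketch of what such a spec and a run over already-extracted raw values could look like (the data layout is invented for this sketch):

    # One target property, listing the raw sources that feed it.
    EXTRACTION_SPEC = {
        "dc:description": [
            ("template", "Information@desc/<lang>"),
            ("exif", "Description"),
        ],
    }

    def run_spec(page_title, raw_values):
        """raw_values maps (source kind, path) pairs to extracted strings."""
        triples = []
        for prop, sources in EXTRACTION_SPEC.items():
            for kind, path in sources:
                for (raw_kind, raw_path), value in raw_values.items():
                    if raw_kind != kind:
                        continue
                    if path.endswith("/<lang>") and raw_path.startswith(path[:-len("<lang>")]):
                        lang = raw_path.rsplit("/", 1)[1]
                        triples.append((page_title, prop + "@lang=" + lang, value))
                    elif raw_path == path:
                        triples.append((page_title, prop, value))
        return triples

    # run_spec("File:Hallo.jpg", {
    #     ("template", "Information@desc/de"): "hallo",
    #     ("template", "Information@desc/en"): "hello",
    #     ("exif", "Description"): "Created with The Gimp",
    # })
    # -> [('File:Hallo.jpg', 'dc:description@lang=de', 'hallo'),
    #     ('File:Hallo.jpg', 'dc:description@lang=en', 'hello'),
    #     ('File:Hallo.jpg', 'dc:description', 'Created with The Gimp')]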
Backend storage & search...
- MySQL? (page/field/keyword triples ... or searchindex? sketch below)
- Lucene? (for each page: field/keyword pairs)
- specialized triple store? (SPARQL)
- potentially really cool searches possible!
- recommend leaving this for later; third parties like DBpedia and Freebase cover this area well
- Look at what Semantic MediaWiki does for its storage
- Minor note: with RDF triples, subject can be tricky (page vs image vs subject of image)
- SMW -- talk page's subject is the page, so you can put data about a page there ;)
- This seems simpler
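To make the "triples in MySQL" option concrete, a sketch using SQLite in place of MySQL (the table and column names are invented here, not MediaWiki's schema):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE metadata_triples (
            page_title TEXT NOT NULL,   -- e.g. 'File:Hallo.jpg'
            field      TEXT NOT NULL,   -- e.g. 'dc:description@lang=de'
            value      TEXT NOT NULL    -- e.g. 'hallo'
        )
    """)
    conn.executemany(
        "INSERT INTO metadata_triples VALUES (?, ?, ?)",
        [("File:Hallo.jpg", "dc:description@lang=de", "hallo"),
         ("File:Hallo.jpg", "dc:description@lang=en", "hello")],
    )

    # All pages with a German description:
    rows = conn.execute(
        "SELECT page_title, value FROM metadata_triples WHERE field = ?",
        ("dc:description@lang=de",),
    ).fetchall()
    print(rows)  # [('File:Hallo.jpg', 'hallo')]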
Horrible questions: licensing?
- metadata licensing is a scary world out there...
- Let's pretend the problem doesn't exist for now ;)
- [but database protection law in Europe may need to be looked at... though we think that mostly falls on our side]
Points of action!
- Good news: EXIF extraction is already here
- Work: template extraction sounds feasible and not too insane
- [brion: see about a proof of concept to make sure this works like we think :D]
- Work: need to specify an extraction spec language
- Work: core DB storage + query/dump export
- Work: need to add interface to add the fields to search backends
- Lucene <- key for live
- MySQL fulltext <- good for third-parties
Other considerations
- sometimes metadata is about *subject of image* instead of *image*. Ughhhhh?
- at minimum, need to be able to spec for the subject... maybe
- Location: for paintings -- the location of the painting itself, or of what the painting shows? (example triples below)
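One illustrative way to keep the two apart (these triples and the "depicts" property are invented here, not something agreed in the meeting): give the file and the depicted thing separate subjects and link them.

    "File:Mona_Lisa.jpg"    "dc:creator"   "(the photographer)"
    "File:Mona_Lisa.jpg"    "depicts"      "Mona Lisa (painting)"
    "Mona Lisa (painting)"  "dc:creator"   "Leonardo da Vinci"
    "Mona Lisa (painting)"  "location"     "Louvre, Paris"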
Export methods
- raw RDF via API (sketch at the end of these notes)
- XML RDF in special:export and dump XML
- ^if included in dumpBackup and OAI export, our Lucene indexer can easily ingest that!
- RDFa embedded in HTML pages
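Not from the notes: a minimal sketch of the "raw RDF" output for one file, assuming Python with rdflib (the URIs are placeholders):

    from rdflib import Graph, Literal, Namespace, URIRef

    DC = Namespace("http://purl.org/dc/elements/1.1/")
    FILE = URIRef("https://commons.wikimedia.org/wiki/File:Hallo.jpg")

    g = Graph()
    g.bind("dc", DC)
    g.add((FILE, DC.description, Literal("hallo", lang="de")))
    g.add((FILE, DC.description, Literal("hello", lang="en")))
    g.add((FILE, DC.creator, Literal("Example Author")))

    print(g.serialize(format="xml"))  # RDF/XML, as for Special:Export or a dump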