Wednesday, September 24, 2014

Week 5 Reading: Uniquely Identify This

So, metadata.  It's quite a buzzword here at the iSchool and, until this week's reading, I had only a vague notion of what people meant by "data about data."  Thankfully, this set of articles really helped.

From its humble roots in geospatial data management systems, metadata has taken the information world by storm.  Unlike earlier methods of categorization and organization, which existed mainly to support storage and use, metadata describes not only the content of an object but also its behavior.  That is, well-developed metadata is more than an isolated description of contents and provenance; it documents an object's use, the history of that use, its storage and management across changing media landscapes, and its relationship to the contents and uses of other data across diverse fields.

Different types of metadata can be used to track and organize information at many levels, as well.  If a librarian needs to create a finding aid, she'll use descriptive metadata, as opposed to an archivist looking to document a recent conservation project (he'd manipulate preservation metadata).  The internet, meanwhile, generates its own metadata at an alarming rate.  Each of these situations calls for its own classification and encoding systems, which makes it increasingly difficult to achieve the goal of generating metadata in the first place: to create a richer and more accurate body of information, situated in its complex context, as it interacts with the changing landscape of knowledge.

To add to all of this, technology changes at such a rate that networked information needs to migrate, so metadata "has to exist independently of the system that is currently being used to store and retrieve them" (Gilliland, 2008).  This requires a high level of technical expertise, which helps explain the rising sense of panic I've been reading about in my other classes.

Different fields of study value different kinds of information, however, and there is no consistent way to track their contents across disciplines; enter Dublin Core!

Have you ever read The Hitchhiker's Guide to the Galaxy?  The Dublin Core Metadata Initiative (DCMI) seems to be trying to build a Babel Fish, which in Hitchhiker's Guide is a little fish you can put in your ear that instantly translates any language you hear; you can understand anyone in their first language, and you can be universally understood.  From what I can tell, the DCMI is essentially trying to create a universal translator.  Working within the Resource Description Framework (RDF), Dublin Core identifies the specific markup language in use and "speaks" in that language.  That is, it pulls up the context-correct dictionary for the data in question, points to a specific definition, and then uses it in the query, what Eric J. Miller calls a "modular semantic [vocabulary]".  For instance, if you wanted to know about famous hospitals in the 1800s, the DCMI would do the heavy lifting of specifying field-specific classification schemes, and you would get results from systems that use LC, DDC, Medical Subject Headings, and maybe AAT, too.  So, generally, the goal of DCMI is to act as a translator for well-established data models in order to allow for a more flexible interdisciplinary discovery system.  Inter! Opera! Bility!
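
To make the idea concrete for myself, here's a minimal sketch of what a single Dublin Core description might look like if I wrote it out as a Python dictionary (the record itself, including the hospital title, author, and identifier, is my own made-up example):

    # A toy Dublin Core record expressed as a plain Python dictionary.
    # The element names (dc:title, dc:creator, etc.) come from the Dublin Core
    # element set; the values are hypothetical.
    record = {
        "dc:title": "Famous Hospitals of the Nineteenth Century",
        "dc:creator": "An Example Author",
        "dc:subject": "Hospitals -- History",   # could point to LCSH, MeSH, DDC, or AAT terms
        "dc:date": "1899",
        "dc:type": "Text",
        "dc:format": "text/html",
        "dc:identifier": "http://example.org/records/123",
    }

    for element, value in record.items():
        print(f"{element}: {value}")

Because every system can agree on what "dc:creator" means, two databases built on completely different classification schemes can still trade records.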


Week 4: Muddiest Point (09/23/14)

To be perfectly honest, this week's class was more illuminating than question-raising for me.  The semantic and logical nature of building data models is somehow satisfying and interesting to me, and I'm looking forward to Assignment #2.

I would like to know what kind of work is done with data stored in an "analytic" database (data warehouse?) as opposed to a retrieval database like the kinds librarians use, but I bet I can go find out about that myself.

Oh! How do you save a database with its attendant query history in Access?  I saved the query we did in lab yesterday, and I saved the database, but they don't seem to have saved the same information.

Thank you!


Thursday, September 18, 2014

Week 4 Reading: Database, My Database, and Nothing is Normal(ized)

My mother-in-law worked for the government in the 1970s.  She has told me tales of the giant drums that were the databases of the era.  They required a lot of physical manipulation and waiting.  Is this a navigational kind of database, wherein the user had to wade through combinations of inflexible, predefined paths of data?  Applications had to sort through linked data sets within a larger network; this required inefficient amounts of time, training, and funding.

Enter the relational database, which is much more flexible in that it uses an interrelated series of tables that reference one another on a per-query basis.  If I were assigned a "key", let's say my first name, all subsequent data about me in other tables could be accessed using that key.  If all my personal information (address, phone number, birth date, social security number, favorite flavor of ice cream, etc.) were listed in separate tables as defined by those categories, a relational database would be able to call up only the query-relevant data, and only when it is needed.  A standardized search term like this improves the capacity and fluidity of the database.
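
Here's a little sketch of that idea in Python, using the built-in sqlite3 module; the table and column names (and the pistachio) are my own invention, so treat it as an illustration rather than anything from the readings:

    import sqlite3

    # A toy relational database: a patron's details live in separate tables,
    # and the primary key ties them together only when a query asks for them.
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE patrons   (patron_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE addresses (patron_id INTEGER REFERENCES patrons(patron_id), address TEXT);
        CREATE TABLE flavors   (patron_id INTEGER REFERENCES patrons(patron_id), ice_cream TEXT);

        INSERT INTO patrons   VALUES (1, 'Mary');
        INSERT INTO addresses VALUES (1, '123 Oak St');
        INSERT INTO flavors   VALUES (1, 'pistachio');
    """)

    # The tables stay separate; the relationships are assembled only at query time.
    row = con.execute("""
        SELECT p.name, a.address, f.ice_cream
        FROM patrons p
        JOIN addresses a ON a.patron_id = p.patron_id
        JOIN flavors   f ON f.patron_id = p.patron_id
    """).fetchone()
    print(row)  # ('Mary', '123 Oak St', 'pistachio')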

Subsequent evolution of database structures seems to be based on the relational approach (it seems that Moore's Law was the downfall of the "integrated approach").  The question then became one of developing relevant and effective query languages, as well as refining schemata for ever-larger and more interrelated data.  It seems that SQL remains at the top of the pile in terms of query languages, even though it's been through various permutations for different uses.

With a vast increase in end users in the 1980s (desktop computers! Welcome, hoi polloi!), users were left the task of manipulating data, and DBMSs would quietly go about their business of decompressing, reading, and recompressing files.  With the advent of end users, data became more tied to a "user", rather than a user being tied to disparate data concerning her: the beginning of the "profile".

These days, there are many types of databases to serve different data needs.  For instance, a database that manages ambulance dispatch would probably be an in-memory database because, as the Wikipedia article states, "response time is critical".  Many people now use cloud databases to have access to their data and information from any physical point that has internet connectivity.  Libraries are increasingly serving as data warehouses.  As scholarly publishing and discourse increasingly happen in digital spaces, the warehouses allow for mining and managing data "for further use", which I think means the development of metadata, very much a buzzword in the LIS field today.  In addition, current digital journals or collections thereof, like JSTOR, are hypertext/hypermedia databases, allowing papers to be linked to research data and referenced works, for instance.  Federated databases are relevant to library work in the creation of, say, the European Digital Library, where disparate institutions share their collections and information.

Designing databases starts with understanding exactly what data is going to be organized for storage and retrieval.  This conceptual modelling informs the actual data structure; the data structure is dependent on the database technology and its attendant DBMS, i.e., the logical data model is not conceptual at all, but is instead informed by the requirements of the database management system.  There are lots of these models and, to be frank, I almost understood them, but not really.  Not really.

Data modelling's first step is normalization; from what I can deduce, normalization is the paring down of raw data until it's lean, mean, and internally consistent enough to be placed into a database (which you are building according to the demands of the DBMS).  Eliminate useless redundancies and similarities (which brings us to the best term of the reading: atomicity) and assign each reduced data set a primary key (see below).  From here, the cached webpage became very difficult for me to understand, as it was so referential to the diagrams, which unfortunately were missing.  From what I can tell, the next step in normalization requires that each primary key (I'm having trouble with the concept of "concatenated keys") is distinct in itself and does not rely on association with any other key; if association occurs, creating a new table with all associated keys is in order.  The last step in normalization is identifying non-key attributes as they relate to the key attributes being used in a table.  I am also unclear on one-to-many, many-to-one, and many-to-many relationships.  Visuals would definitely help.
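
Since the diagrams were missing, I tried to make my own tiny example.  This is just my sketch of the idea (the patron, phone numbers, and address are all made up), not anything from the cached page:

    # Before normalization: one row mixes patron facts with a repeating group of
    # phone numbers, so "phones" isn't atomic, and the address would be repeated
    # on every row that mentions this patron.
    unnormalized = [
        {"patron_id": 1, "name": "Mary", "address": "123 Oak St",
         "phones": "412-555-0101; 412-555-0102"},
    ]

    # After normalization: atomic values only, one fact per row, and every
    # non-key attribute depends on its own table's key alone.
    patrons = {1: {"name": "Mary", "address": "123 Oak St"}}  # primary key: patron_id
    phones = [                              # concatenated key: (patron_id, phone)
        (1, "412-555-0101"),
        (1, "412-555-0102"),
    ]   # one patron, many phones: a one-to-many relationship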

Peter Chen's data modeling system begins with a conceptual mapping of the structure; this involves dividing the data into the tables that will house it and creating a cartography of the relationships between those tables.  After the conceptual framework is set up, tables are populated with data in a way that conforms with the dictates of the DBMS.  In order to define how pieces of data relate to one another, each entity is treated as a noun, and "relationships" are expressed as the connections between those nouns.  The unique attributes assigned to an entity create its "primary key".  The semantics of the entity-relationship model are in grammatical terms, which really helps me keep a handle on the concept.  The types of connection between entities and relationships, known as "cardinality constraints", can be illustrated visually in many ways.
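
Here's my attempt at rendering that grammar in code, purely as a study aid; the Patron/Book example and the layout are mine, not Chen's actual notation:

    from dataclasses import dataclass

    @dataclass
    class Entity:
        """A 'noun' in the model: its attributes, plus the one serving as primary key."""
        name: str
        attributes: list
        primary_key: str

    @dataclass
    class Relationship:
        """The 'verb' connecting two entities, plus a cardinality constraint."""
        name: str
        left: Entity
        right: Entity
        cardinality: str  # "1:1", "1:N", or "M:N"

    patron = Entity("Patron", ["patron_id", "name", "ice_cream"], primary_key="patron_id")
    book = Entity("Book", ["isbn", "title"], primary_key="isbn")

    # Many patrons can borrow many books over time: a many-to-many relationship.
    borrows = Relationship("borrows", patron, book, cardinality="M:N")
    print(f"{borrows.left.name} {borrows.name} {borrows.right.name} ({borrows.cardinality})")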

Meanwhile, in order to keep a database healthy and happy, it seems that building in redundancy and the ability to restore older versions of a database (what if a migration goes wrong?) are key.  Back up your data, folks; that's always been rule number one.

This is a lot of information, and I look forward to seeing some visual aids!

Wednesday, September 17, 2014

Muddiest Points: Week 3 (09/16/14)

Greetings, and welcome to my Muddiest Points for LIS 2600, Week 3!

I was wondering if you could integrate RLE within an image that requires more complex compression techniques.  That is, can there be more than one method of compression within a single file?  Or is this illogical?

Also, I am still trying to get my head around binary.  I thought I understood it, but then there was that Foxtrot comic and I got confused.  I made a little chart, and am wondering if it's correct?

So, if there are four bits, there is the potential for 16 values.  I tried to make a chart and would like to know if I got it right.  I know it isn't necessarily relevant to what we were learning in terms of compression, but I didn't understand the explanation very well; now I want to know if I understand the concept.  So, here's a chart of 4 bits (four powers of two, right?) and their decimal equivalents:

2^3    2^2    2^1    2^0    Decimal equivalent
 0      0      0      0     0
 0      0      0      1     1
 0      0      1      0     2
 0      0      1      1     3
 0      1      0      0     4
 0      1      0      1     5
 0      1      1      0     6
 0      1      1      1     7
 1      0      0      0     8
 1      0      0      1     9
 1      0      1      0     10
 1      0      1      1     11
 1      1      0      0     12
 1      1      0      1     13
 1      1      1      0     14
 1      1      1      1     15
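
Here's a quick Python check I wrote against the chart (my own sanity test, not part of the lab); it interprets each four-bit string as powers of two and confirms the decimal column:

    # Interpret a string of bits as a sum of powers of two: the leftmost of
    # four bits is worth 2**3, the rightmost 2**0.
    def to_decimal(bits):
        return sum(int(b) * 2 ** power
                   for power, b in zip(range(len(bits) - 1, -1, -1), bits))

    for n in range(16):
        bits = format(n, "04b")          # e.g. 11 -> "1011"
        assert to_decimal(bits) == n     # matches the chart above
        print(bits, "=", to_decimal(bits))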

Thanks!
-Mary


Thursday, September 11, 2014

Week 3 Readings: Data Compression Almost De-Mystified But Not Quite, Historic Pittsburgh is the Best Pittsburgh, and Duh, We All Use YouTube

Run Length Encoding (RLE) and Lempel-Ziv (LZ) methods of compression are fashioned to compress repetitive texts; RLE deals better with sequences of identical values, such as AAAAHHH NNOOO!!!, but not with something like "Data compression is a conceptual challenge for Mary Jean".  Therefore, RLE is better for compressing low-contrast images; super-long sequences (many pixels of the same color, for instance) can be compressed by sorting by channel.
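
To convince myself of that, I sketched a bare-bones run-length encoder in Python (my own toy version, not the exact scheme from the readings):

    # Collapse each run of identical characters into a (character, count) pair.
    def rle_encode(text):
        runs = []
        for ch in text:
            if runs and runs[-1][0] == ch:
                runs[-1][1] += 1        # extend the current run
            else:
                runs.append([ch, 1])    # start a new run
        return runs

    print(rle_encode("AAAAHHH NNOOO!!!"))
    # [['A', 4], ['H', 3], [' ', 1], ['N', 2], ['O', 3], ['!', 3]]  (nice and short)
    print(rle_encode("Data compression is a conceptual challenge"))
    # almost every run has length 1, so the "compressed" version is actually longer
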
In the case of multiple patterns of different lengths, LZ handles the information in a more efficient manner; it uses self-referential data gained from its previous iterations.  The idea that we compress speech and text in daily life, as well, was particularly helpful for me in understanding how LZ works.

See?  I did it just there.
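
For the curious, here's a toy LZ78-style encoder I put together while reading (a simplification of the real Lempel-Ziv family, and entirely my own sketch): repeated phrases are replaced with references to entries in a dictionary that grows as the text is read.

    # Encode text as (dictionary index, next character) pairs; index 0 means
    # "no previously seen phrase". Repetition shows up as reuse of earlier entries.
    def lz78_encode(text):
        dictionary = {}   # phrase -> index
        output = []
        phrase = ""
        for ch in text:
            candidate = phrase + ch
            if candidate in dictionary:
                phrase = candidate              # keep extending a known phrase
            else:
                output.append((dictionary.get(phrase, 0), ch))
                dictionary[candidate] = len(dictionary) + 1
                phrase = ""
        if phrase:
            output.append((dictionary[phrase], ""))
        return output

    print(lz78_encode("blah blah blah blah"))
    # later pairs point back at earlier dictionary entries, so the repeats get cheap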

Both RLE and LZ are strategies that produce lossless compression, resulting in a perfect reproduction of the uncompressed form and allowing the process to be reversed.  Other lossless compression methods include entropy coding (so that's what encoding means!), which assigns a value to the data inversely based on its statistical probability of occurring in a given set.  In this way, smaller assigned values for more frequently occurring data ensure a smaller file size.  There was a lot of math here, and it frightened me.  However, lossless compression is integral for compressing programs, as a single "misfire" of information will send the whole finely wrought program down the drain.  The DVD-HQ article also mentions that lossless compression is ideal during "intermediate production stages"; so, I guess lossless is necessary for things that absolutely have to be preserved as they originally appeared.  For example, if you were editing speech tracks for a movie you're making, you'd want to use lossless compression because, even though it takes up more space, you need an exact copy of the original recording to work with.
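
Python's standard library has a handy way to see "lossless" in action; zlib's DEFLATE format combines an LZ-style dictionary stage with entropy coding, so this little round trip (my own demo, not from the readings) shows both ideas and the perfect reproduction:

    import zlib

    # Repetitive input compresses well, and decompression restores it exactly.
    original = b"AAAAHHH NNOOO!!! " * 100
    compressed = zlib.compress(original)
    restored = zlib.decompress(compressed)

    assert restored == original   # bit-for-bit identical: that's what "lossless" means
    print(len(original), "bytes ->", len(compressed), "bytes")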

Lossy compression deals more with the difference between information and data; during compression, certain pieces of data can be sloughed off ("lost", right?) without losing the essential rendering, as opposed to an exact digital copy, of the end product.  Lossy compression is more efficient for audio and video information, which makes sense because that information is much more complex than a static image composed of shades of color; instead, these are dynamic pieces of information that the "end processor" (a person, for example), well, processes in a more recondite way.  A particularly helpful example for me from Wikipedia was ripping a CD; lossy compression shrinks the file size by eliminating, through the fabulous term "psychoacoustics", less audible or irrelevant sounds.  The result is an inexact copy of the original data, but a form of consumable information nonetheless.
It's interesting to note that most of the information we receive is lossy.  That is, streaming video, cable, DVDs, mp3s, satellite radio, etc.  Makes me think about what we're missing.
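
As a crude illustration of the "throwing detail away" part, here's a tiny quantization sketch (entirely my own toy example, and far simpler than what MP3 encoders or psychoacoustic models actually do):

    # Re-quantize 8-bit samples down to fewer bits: the values get coarser,
    # the data becomes more repetitive (and thus more compressible), but the
    # discarded detail can never be recovered.
    def quantize(samples, bits=4):
        step = 2 ** (8 - bits)
        return [(s // step) * step for s in samples]

    samples = [12, 13, 14, 200, 201, 203, 90, 91]
    print(quantize(samples))   # [0, 0, 0, 192, 192, 192, 80, 80]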

Both Imagining Pittsburgh and YouTube and Libraries focus on the end-user benefits of integrating information that's been compressed and then decompressed into library services; I imagine they used some form of lossless compression, as the images are static and largely grayscale.  The Imagining Pittsburgh article delineated the plan for creating the really wonderful Historic Pittsburgh database.  I am a frequent user of this service in both my professional and personal lives.  As an aside, it feels like it hasn't been refined since the site launched ten years ago; searching is difficult and clunky.
The article, meanwhile, highlights the processes three major Pittsburgh institutions went through to create a collaborative space that tells the story of this city through images, maps, and population data.  It's proof positive of why things like data compression matter; it provides an example of practical applications of the skills we're developing in this course.  The capacity for and output of cross-disciplinary, inter-organizational collaboration is so greatly increased on a digital platform.  The article also gives a step-by-step breakdown of how information technology integrates with the goals of organizations.  Imagining Pittsburgh is a fine example of a professional document meant to demonstrate accountability and expertise to funders and professional organizations.  Meanwhile, none of it would have been possible without data compression.

While the YouTube and Libraries article demonstrated an implementation of lossy compression, it nevertheless seemed, to me, a little hokey; by this point everyone knows the democratizing value of YouTube.  I initially wondered why the author didn't suggest embedding the helpful videos she proposes ("how to find the reference desk") on the library's own website, but I do understand that the wide popularity of the juggernaut that is YouTube would probably garner more views than the library's site itself.  Overall, though, the article seemed to me to reflect the latent fear of new technology, born from the fear of obsolescence, that is a passing trend among librarians as a new generation steps into the field.

Tuesday, September 9, 2014

Muddiest Points: Week 3 (09/09/14)

Today was a lot of information to take in.  I didn't quite finish the whole Lab worksheet, but I think that's okay because of time constraints from discussing Assignment 1.

I thought I knew a little about computer basics, but there was so much information that I couldn't sort through what, exactly, were the main points.  Lots of slides, as Dr. Oh said.  It muddled my comprehension.  I felt woefully under-prepared to complete the Lab worksheet; for instance, from the worksheet we were (thankfully!) provided in Lab, in the "Hardware-Memory and Storage" row, what's a Card Reader?  Where do we find graphics hardware?  I couldn't seem to find any information.  Did we go over this in lecture?  I might have missed it because of the above-mentioned rapidity of slides.  Also, what are "Inputs and Controls" (in the "Power and Expansion" row)?

I would also like clarification on Moore's Law.  Is it important to understand because technology changes so quickly, and it's good to have a schedule to know when we should "update" our working knowledge?

I was also wondering if it's feasible or even reasonable to replace my hard drive with a solid state.  All my IT friends are telling me to do this, but I'm not sure it's necessary.

Thanks, and I'll see you next week!

Tuesday, September 2, 2014

Week 2 Readings: The Fast Lane, Google Might Be Evil, and PC Deployment Schedules Are a Drag

This week's reading focused mainly on the necessity of bringing librarianship in line with the changing nature of information, the attendant stumbling blocks, both technological and cultural, associated with doing so, and the murky landscape in which it all takes place.  A practical solution is also put forward.

Charles Edward Smith's piece, "A Few Thoughts on the Google Books Library Project", addresses the escalating sense of panic among libraries that arose because Google got to the forefront of digitization before the supposed information professionals got on board.  He reassures us, however, that putting literature and knowledge online provides a wider platform for access.  Instead of rendering academic libraries obsolete, the Google Books Library Project serves as a model for the "successful transfer of knowledge".  Building online information resources will actually liberate scholarship and research from traditional barriers of access like physical proximity, scarcity of resources, interlibrary loans, etc.  He stresses that material that is not digitized will no longer count; people rely almost solely on digital resources, and anything that remains only analog will fall by the wayside.  In fact, analog-only material can be said not to exist in the developing world of information technology.  Smith is excited by the possibilities digitization presents, and encourages professionals not to be afraid of it but to embrace it wholly.  The problem I see with our brave new world of information, however, is that it can limit the scope of one's research.  The seemingly infinite amount of information available online is governed by search terms defined by the user and his/her preconceptions about the material, and as such can lead to the researcher wearing "blinders", as it were, by tailoring his/her search so specifically that it excludes "extraneous" results.  This specificity can inhibit exposure to material the researcher might not have thought of as relevant, but which may be nonetheless.

Meanwhile, in Europe, librarians fear the America-centric nature of Google.  In trying to build the European Digital Library to provide wide access to out-of-print and old texts, libraries have struggled with alliances with the internet giant.  It seems that public funding is sparse or non-existent, and the project has turned toward private funders to get the Library online.  This, of course, involves Google, because it has already dealt with publishers and booksellers during its quest to put books online.  Doreen Carvajal's article "European libraries face problems in digitalizing" brings the question of Google's apparent American bias into focus.  I'm not surprised that, as an American, I had absolutely no idea that this was the case.  Google's tendency toward American information would seem to contradict the mission of the European Digital Library; it seems, however, that the company needs to be on board to make anything happen.  The tension between providing a free, public, online library and operating in a world where private money seems to control what goes on remains unresolved.

While both of the above-mentioned articles display a little theoretical hand-wringing over what a 21st-century library will look like, the University of Nevada, Las Vegas got its hands dirty by building one of its own.  Jason Vaughn's article "Lied Library @ 4 Years: Technology Never Stands Still" breaks down the practical steps necessary to create a connected, efficient, and relevant library space.  From actual PC replacement, building staff competencies in both hardware and software, managing physical space, keeping systems healthy, and preventing loss, to developing an ongoing plan to maintain relevance in a rapidly changing information environment, it seems that UNLV's librarians have been awfully busy.  Some of the technological changes they made back at the beginning of the millennium are now laughably and thankfully obsolete: the prevalence of a "community use" policy makes me think that public libraries weren't as connected as they are today; widespread wireless connectivity was still a long way away, so they needed to install lots of "hot jacks"; Deepfreeze was a new technology; open source software was almost unthinkable.  However, this article shows how an honest recognition of the field's evolving nature is the baseline necessary to implement effective change.  It takes a lot of effort, and a lot of money, but it is less expensive than refusing to change at all.