Guidelines for Data Preparation
So, you are creating your own data, or you found a (slightly messy) external data source that you'd like to transform for analysis. Don't worry, just ...
Keep your data
- Simple
- Readable
- Compact
- Consistent
Good analysis and good visualization start with good data. Data doesn't need to be perfect or large, but it needs to be well defined so you can make sense of it. The attention and care you put into how you measure, collect, store, manage, and prepare your data will pay off tremendously. The following guidelines can help you get your data into better shape.
Tables
One table describes one entity, record by record. A good example is a table of Presidents of the United States, sketched after the list below.
- Each data table presents a single entity, such as people, publications, incidents, etc.
- Ideally, each record should have a uniquely identifying id. Without one, you cannot create links between records.
- Avoid aggregating records. Each record should describe a specific item or observation.
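As a minimal sketch, such a table could look like the snippet below: one record per row, each with a unique id. The column names and the choice of rows are illustrative.

```
id, Name,              Party,      TookOffice
1,  George Washington, None,       1789
16, Abraham Lincoln,   Republican, 1861
44, Barack Obama,      Democratic, 2009
```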
Using Google Sheets as Databases
Google Sheets is a great solution for small to moderately sized datasets (a few thousand records, tens of attributes).
- Make sure the first row includes the column (attribute) names.
- You can "freeze/fix" the topmost and leftmost columns, so that when you scroll in the spreadsheet, you always see the attribute name on the row top and the unique record identifiers on the left column (if you place your identifiers on left-most columns, which is highly suggested.
- Read Design Better Tables by Andrew Coyle. And, come back to Keshif to sort, visualize, explore, and analyze your data 😎.
Numeric Attributes
Just the number. No abbreviations. Same scale for all records.
- Remove unit names from the values. If you list prices of cellphones, do not encode the price as $755, or the height as 7″. Keep only the numeric part. Then, set the unit name [TODO: Link to help].
- Make sure all values in a column are encoded with the same unit. Setting up for exploration is harder when part of your data is in milligrams and the rest is in grams.
- Do not include abbreviations like "3M" or "3 million". Write the number using digits, to whatever precision you have available.
- Ideally, encode the number in its simplest unit. For example, use "meter" measurements instead of "kilometer". This helps avoid cases like "1k" kilometers, and allows the interface to show "1M" meters. A cleanup sketch follows this list.
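As a minimal sketch of such cleanup (the field names Price and Length_km are hypothetical), you can strip unit text and normalize units in one preprocessing pass before loading the data:

```js
// Strip non-numeric characters and normalize units before loading into Keshif.
records.forEach(function(d){
  d.Price  = +String(d.Price).replace(/[^0-9.\-]/g, ""); // "$755" -> 755
  d.Length = d.Length_km * 1000;                         // kilometers -> meters
});
```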
Date/Time Attributes
When you load data using CSV or JSON files, date attributes are loaded into Keshif as strings, and thus need to be parsed into Date objects before they can be displayed with date/time semantics. Keshif does not parse date strings internally, and it does not impose any standard or "accepted" formats. Working with date objects is left to data loading, or to specifying attribute summaries. Thus, you will most probably want to use the JS API to create a browser that shows temporal features in your data. If you implement a customized JS callback to load your data, you may also need to parse any date-related strings in your input appropriately.
You can use the built-in JS Date object to represent your time attributes. For reference, see: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date
If you summarize an attribute of data type Date, Keshif automatically shows a line-chart with an appropriate level of aggregation.
You can use d3's time formatting functions to parse or display date strings. For details on the formats d3 recognizes, please refer to https://github.com/mbostock/d3/wiki/Time-Formatting. To use d3's formatting capabilities, you need to create a dateFormat object first, ideally before you create a new kshf.Browser():
```js
var dateFormat = d3.time.format("%m/%d/%Y");
```
One option is to convert the type of the data to create a single summary. Assuming the "DateString" field holds values such as "1/25/2011", and using the dateFormat defined above, you may define the value for your summary as:
```js
value: function(){ return dateFormat.parse(this.DateString); }
```
Another option is to update the data directly after it is loaded. This will give you more control over re-using the same Date() object to extract day/month/year/etc, or to re-format it for display.
```js
loadedCb: function(){
  var dateFormat = d3.time.format("%m/%d/%Y");
  // Replace each date string with a parsed Date object.
  kshf.dt.Data.forEach(function(d){
    d.data.Date = dateFormat.parse(d.data.Date);
  });
},
```
You may also use the momentjs (http://momentjs.com/) library for more flexibility (dealing with timezones, more advanced formatting and parsing options, etc.). If you choose to convert the attribute to a moment object, you can extract various time/date properties using functions in the moment library. An example which extracts hour, date, month, year, and also the full JS Date (for a linechart) is available at:
https://github.com/adilyalcin/Keshif/blob/master/demo/bookdownloads.html
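As a minimal sketch of that moment-based approach (the format string is an assumption for values like "1/25/2011"):

```js
// Parse once, then extract the parts you want to summarize separately.
var m = moment("1/25/2011", "M/D/YYYY");
m.hour();   // 0-23
m.date();   // day of month
m.month();  // 0-11
m.year();   // 2011
m.toDate(); // full JS Date object, usable for the linechart
```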
To display the month/date/year/etc. on a linechart given a JS Date object, Keshif retrieves the UTC-based values from date objects. If time-zone accuracy is important, please convert your time objects into the UTC timezone, as in the sketch below.
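One way to do this, as a hedged sketch: build a new Date whose UTC fields match the local fields of the parsed date, so the values Keshif reads back are the ones you intended. The helper name is hypothetical.

```js
// Shift a locally-parsed Date so its UTC components match the original
// local components.
function toUTCEquivalent(d){
  return new Date(Date.UTC(d.getFullYear(), d.getMonth(), d.getDate(),
                           d.getHours(), d.getMinutes(), d.getSeconds()));
}
```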
- In the future, Keshif may try to auto-detect certain date formats. However, at the moment, it does not impose a specific required format, and offers flexible ways to work with Date fields using JS.
- Make sure all representations are consistent. If one value is "May 2015", do not have values like "2014", "2015-07", "07-2015", etc.
- If a date value cannot be parsed, or is invalid, the date value will be left empty for that record.
Categorical Attributes
- You may use standard codes in place of categorical values, such as ISO 3166 country codes in place of country names, or FIPS codes in place of county names. Both examples carry geographical context, but the same practice can apply to other data points too.
- If the values of categories can be further described/identified, for example when "authors" in a paper collection have data related to them (affiliation, gender, location, etc.), use the unique identifier of each item to describe the categories of the current table.
Multi-Valued Categorical Attributes
Example of a multi-valued cell: `Multi,Values,Merged,With,Commas`
- Store multiple values of a categorical attribute using a simple separator within a single cell. For example, an article may have multiple tags, or multiple authors, with the goal of organizing the list of articles in a searchable and indexed format. You may use "+" as a separator. Just be aware that any valid category text containing the separator will be parsed incorrectly. Note that this is counter to database design, which would ask you to ID each tag, ID each record, and keep a separate table that encodes the relations between tags and records; but that violates simplicity and compactness. If all you care about is the list of tags per record, store it that way (see the splitting sketch after this list).
- If each value has its own attributes, you'll need multiple tables to describe the relations. For example, assume you have a list of employees, each employee can be a member of multiple organizations, and each membership has a starting date. To store this information, separate the tables. Visualization will be trickier.
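As a minimal sketch of reading a multi-valued cell (the field name Tags and the comma separator are assumptions), a summary value function could split the cell into an array of categories:

```js
// Split "data, visualization, biking" into ["data","visualization","biking"].
value: function(){
  return this.Tags.split(",").map(function(t){ return t.trim(); });
}
```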
Note: See keshif:AgreeSet for set-based visualization details.
Data File
Google Sheets
JSON
If you have JSON files already, keep them that way. You can parse these files easily in Keshif using JavaScript.
- Ideally, the file should be [{},{}, ..., {}], where each {} is an object with key-value pairs, such as {"id": 10, "name": "Jon", "age": 21, "likes": "data, visualization, biking"}. The keys should ideally appear for all data items (a missing key means a missing attribute for that record). If you follow this structure, Keshif can detect the attributes in the record collection easily.
- If you do not want to implement custom data loading or value() functions for aggregates, do not use a hierarchical data representation per record. That is, if you have a record block {}, none of the attributes should be another block, such as { name: {first: 'John', last: 'Doe'} }. Sometimes the reason is semantic grouping; sometimes the reason is merging multiple tables. Remember the simple tabular structure, and flatten your JSON files if you want to keep it simple (see the sketch below).
- Socrata's JSON formats are a good place to start if you want to do some custom data coding.
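As a minimal flattening sketch, using the hypothetical nested name field from above:

```js
// Flatten one level of nesting so each record is a simple key-value object.
data.forEach(function(d){
  if (d.name) {
    d.name_first = d.name.first; // from { name: {first: 'John', last: 'Doe'} }
    d.name_last  = d.name.last;
    delete d.name;
  }
});
```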
Be careful with missing values
- To encode a missing value, leave the field empty. If you need to differentiate the "meaning" of missing from incomplete, use simple, consistent coding, such as "NA", to mark those fields.
Make it compact
- If you can generate one attribute from another, remove the longer attribute. For example, if each record has an integer id (43) and a URL (www.example.com/record/43), remove the URL; you can generate it from the existing column easily using a simple custom formula, as in the sketch below.
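As a hedged sketch, a value function can derive the URL from the id on the fly (the URL scheme is the hypothetical one above):

```js
// Derive the record URL from the id instead of storing a redundant column.
value: function(){ return "http://www.example.com/record/" + this.id; }
```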
Scaling up: More records, more attributes, larger files
Visualizing 1k records is trivial, 10k is fine, and with 100k, things may start to slow down, at least in web browsers. Also remember that all of the data, not a subset of it, is transmitted over the web to create a browser.
The guidelines below are to deal with files that may be a little too large to be displayed and filtered on a web browser. These are not general guidelines.
- Remove large chunks that are not "exploratory". Make these resources available on-demand per record.
Does your data include big chunks of text, descriptions, or maybe even embedded binary images, etc., that result in large files? If so, remove those columns, and make the resources available on demand per record. If you have an image, it can be retrieved from www.example.com/images/1.png. Or, if you have a text description, you can return it with a simple API call on demand (when the record is visible or more record details are requested), as in the sketch below.
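As a hedged sketch of such on-demand loading (the URL scheme and helper name are assumptions), using d3's text loader:

```js
// Fetch a long description only when a record's details are requested.
function loadDescription(recordID, done){
  d3.text("http://www.example.com/api/description/" + recordID,
    function(error, text){ if (!error) done(text); });
}
```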
- Filter to recent time.
Is your data collected over time? If so, can you start by focusing on the most recent records for exploration? Then you are lucky: you can preprocess your data to use only the most recent records, like the last month, the last year, or the most recent 5 years, as in the sketch below.
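As a minimal preprocessing sketch, assuming each record has a parsed Date field named Date:

```js
// Keep only the records from the most recent 5 years.
var cutoff = new Date();
cutoff.setFullYear(cutoff.getFullYear() - 5);
records = records.filter(function(d){ return d.Date >= cutoff; });
```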
- "Code" categories.
If an attribute is a categorical variable (string), create a codebook and map long strings to 1-2 letter codes. For example, "Architectural Planning" can become "AP". You can convert these codes back to their original text while visualizing/exploring the data, as in the sketch below.
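As a minimal codebook sketch (the categories and helper name are illustrative):

```js
// Store short codes in the data; decode them for display.
var codebook = { "AP": "Architectural Planning", "CS": "Computer Science" };
function decode(code){ return codebook[code] || code; }
```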
- Aggregate records, and store the number of records per aggregation.
This is counter to one of the first guidelines: storing each record individually. However, there are cases where the data may contain millions of items. A good way to bring down the data size is for each row to represent multiple records in the original data. Use this method only when absolutely necessary. Note that each record can still appear in only one row in this form of aggregation. A counting sketch follows below.
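As a minimal aggregation sketch (the State and Year fields are hypothetical), counting how many original records fall into each aggregated row:

```js
// One output row per (State, Year) pair, with the number of source records.
var counts = {};
records.forEach(function(d){
  var key = d.State + "|" + d.Year;
  counts[key] = (counts[key] || 0) + 1;
});
```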
Other Good Guidelines We Like
Sample Datasets
Below are some sample datasets that mostly follow these guidelines. These datasets may have some formatting issues too; they are not tailored for absolute conformance. However, their structure allows for effective visual analytics, and they are easy to read. They are chosen from the Keshif datasets on Google Sheets for convenience.