Land, Labor, & Capital


Musings on economic development, economic history, political economy, & big data techniques for social scientists.

Quick Note: Upgrading to Microsoft R Open in Linux

Below is a guide to installing Microsoft R Open for those using Ubuntu Linux and RStudio. Microsoft R Open (MRO) is a pretty powerful distribution of R (and yes, completely open source). It currently builds off R version 3.3.0, adding some darn useful enhancements, due in part to its use of Intel’s Math Kernel Library (MKL).

What is R Open + MKL?

What does this mean? Basically, Microsoft R Open is an optimized version of standard R and can help tremendously if you’re doing, say, many intense vector or matrix-based operations. Importantly, much of this is done for you: instead of having to fiddle around with multi-core support in your code, R Open optimizes things automatically.
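If you want a rough sense of the difference on your own machine, one crude check is to time a large matrix operation before and after the switch; the speedup (if any) depends on your hardware and the size of the problem.

# A crude benchmark: run this under plain R, then again under Microsoft R Open.
# Results will vary by machine; MKL mostly helps large matrix/vector operations.
m <- matrix( rnorm( 2000 * 2000 ) , nrow = 2000 )
system.time( m %*% m )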

Before stumbling across Revolution R, the predecessor to Microsoft R Open, I didn’t know there were other distributions of R floating around. Switching to Microsoft R Open simply means you’re installing an enhanced version of R over the default version. Everything works the same. There is nothing special you have to do (other than reinstalling your previous packages, which is covered below). If you’re using RStudio, like most people out there, RStudio will automatically use the new, enhanced R.

Upgrading to R Open

Before installing Microsoft R Open, you want to keep track of all the packages you’ve installed. Because we’re installing a new version of R underneath RStudio, we won’t have access to our old packages.

The following R code saves a list of packages that are currently installed in R. A .Rda file containing the installed package list will be saved in your default working directory. (If you need a reminder of your current working directory type getwd() into R.)

# Save a list of packages installed in R.
temp <- installed.packages()
installedpackages <- as.vector( temp[ is.na( temp[ , "Priority" ] ) , 1 ] )
save( installedpackages , file = "oldpackages.rda" )

The above code is essentially the same as that from Data Science Riot.

Once we set up R Open, we’ll invoke this .Rda file to automatically (re-)install the old packages.

Next, we download and install Microsoft R Open: https://mran.revolutionanalytics.com/download/.

If, like me, you juggle multiple systems and can’t keep track of which version of Ubuntu you have, type the following command into the terminal:

From your terminal:
lsb_release -a

Then download the appropriate version of R Open and the MKL libraries.

After downloading, first install the Microsoft R Open Debian package:

sudo dpkg -i ./Downloads/MRO-3.3.0-Ubuntu-15.4.x86_64.deb
sudo apt-get install -f

Next, install the MKL libraries by extracting the tarball and then running the installation script located in the /RevoMath directory:

tar xvfz ./Downloads/RevoMath-3.3.0.tar.gz
cd RevoMath
./RevoMath.sh

If everything has gone smoothly, the extra math libraries will have now been installed onto your system.

Now, re-open RStudio. Magically (and hopefully), RStudio should recognize the new R Open installation and automatically set up the multi-core support. I have installed R Open on all my systems and have never had issues with RStudio recognizing the new flavour of R. If things have gone well, you should see the following message in the RStudio console:

Microsoft R Open 3.3.0
Default CRAN mirror snapshot taken on 2016-06-01
The enhanced R distribution from Microsoft
Visit https://mran.microsoft.com/ for information
about additional features.

Multithreaded BLAS/LAPACK libraries detected. Using 4 cores for math algorithms.

The message confirms that the combination of Microsoft R Open and the associated libraries has blessed you with multi-core support, which saves you from having to set up multi-core processing by hand.
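If you ever want to inspect or change how many threads MKL is using, the RevoUtilsMath package that (in my experience) ships with MRO's MKL install provides small helpers for this; treat the snippet below as a sketch in case your install differs.

# Check and adjust MKL threads (assumes the RevoUtilsMath package came with your MRO install).
library(RevoUtilsMath)
getMKLthreads()      # How many cores is MKL currently using?
setMKLthreads( 2 )   # For example, dial it down to leave cores free for other work.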

Now that RStudio “sees” Microsoft R Open as our default version of R, we return to our .Rda file and use it to automatically install all the old packages:

load( "oldpackages.rda" )
temp <- installed.packages()
installedpackages.new <- as.vector( temp[ is.na( temp[ , "Priority" ] ) , 1 ] )
missing <- setdiff( installedpackages , installedpackages.new )
install.packages( missing )
update.packages()

If you’re like me, you may have to sit back for a while and watch R hypnotically re-install the list of packages onto your distribution.

Read More

Concrete Development & Path Dependent Capital in Afghanistan

It’s hard to imagine, but in the 1950s and 1960s Afghanistan was in the throes of mid-century developmental modernism. Western and Communist Bloc aid flowed in. Western-style structures rose around Kabul. West German machinery filled new government printing plants, and Czechoslovakia invested in a number of heavy industrial projects, including a 5 million dollar concrete factory located outside of Kabul.

Negotiated in 1955 by a new Minister of Finance, (General) Abdul Malik, who was inspired by the developmental planning of Turkey, the Czech-built plant opened in 1957 in the district of Jabal Saraj, near Kabul.

Shut down by the Taliban in the mid-1990s, the plant reopened in spring 2016 after a 20-year hibernation. Foreign Affairs has run a featured photo essay on the vintage, state-owned plant, whose Soviet-era equipment (with major updates) supplies a thirsty domestic demand for material inputs:

Remarkably, the outdated machinery was still functional, able to grind limestone into dust and churn out 100 tons of cement a day. That is no small feat for a Cold War-era plant, even if its output is dwarfed by the millions that Pakistan and Iran produce. Still, its value is worth more than its immediate profits. It has provided 150 new jobs within the factory walls and 5,000 more, indirectly. The domestic demand for cement, according to the Afghanistan's Ministry of Mines, is 'huge,' not to mention the six million tons it currently imports and hopes one day to replace with a more homemade variety.

A lot of development economics in one paragraph.

Photos: Ahmad Masood in “Afghanistan’s Concrete Plans.” Foreign Affairs. 17 July 2016. Web. 17 July 2016.

Read More

Mapping Developments in the Middle East

Mapping the Manbij Offensive, summer 2016. The (currently ongoing) operation by the Syrian Democratic Forces seeks to wrest control of Manbij from ISIS:

<img src="/assets/manbij-offensive-f.gif" width="700px"/>

These beautiful visualizations of the Syrian/Middle East conflict are from the portfolio of Imago Pyrenaei/LCarabinier, whom I’ve seen cropping up on Reddit and GIS Twitter (@LCarabinier).

<img src="/assets/5m-syria-iraq-kurdistan.png" width="700px"/>

Read More

One Helluva Property Right

Bloomberg published a long-form piece on the sale of a “Texas Ranch That’s the Size of a Small Nation”. In fact, the Waggoner Ranch property is the largest U.S. ranch within a single fence:

Waggoner Ranch

The rapid expansion of the estate, from the late 1880s through the turn of the century, corresponds to the period covered by Dr. Rick Hornbeck’s great paper, “Barbed Wire: Property Rights and Agricultural Development” in the Quarterly Journal of Economics (2010).

Read More

Input-Output Tables & the Leontief Inverse in R - Part I.

<img src="/assets/surahammarssweden.jpg" width="700px"/> Surahammars Ironworks/Surahammars Järnbruk, Sweden, 1919.
From Sweden’s Tekniska Museet photo collection.

In input-output economics, the Leontief inverse (i.e. [I-A]^-1) is ubiquitous. Named after the father of input-output economics, Wassily Leontief, the matrix is a compact representation of the ripple effects in an economy where industries are interconnected. A lone matrix coefficient conveys all direct and indirect effects on output in one sector required by a unit of output from another sector.

Below is Part 1 of a two-part tutorial on deriving the Leontief inverse using R. This first part is a “toy” example to motivate the pieces of input-output analysis and the workflow in R. Part 2 describes how to calculate the Leontief inverse from a full-scale input-output table.

A Toy Input-Output Model.

Consider a baby example. I’ll use Table 1 as a guide to calculating a simple Leontief inverse using R. The table represents the essential ingredients of common input-output tables using only two sectors.

Table 1. A Small Input-Output Table

From / To    Intermediates: Good 1    Intermediates: Good 2    Final Goods    Total Output
Good 1                 150                      500                 350             1000
Good 2                 200                      100                1700             2000

The above example borrows from the canonical examples in chapter 2 of Miller and Blair’s Input-Output Analysis (1985) as well as chapter 2 of Leontief’s Input-Output Economics (1986).

The heart of the table is a two-by-two matrix representing the intermediate good flows between the two sectors: sector 1 and sector 2. A row represents the value of output sent from a goods sector for productive use in a column sector. Above, a row sector sends goods to itself and sector 2.

After the two columns of intermediate good sales, the “Final Goods” column shows the value of a row’s output used as final products–output not used in production. If we add up a row’s output used as intermediate goods and as final products, we get the last column: total output.

Calculating the Matrix

The Leontief inverse is calculated in the following way. We start with an IO table like the one above. Using this basic IO table, we generate a “technical coefficient matrix,” which we then use to solve for the Leontief inverse matrix, L.

First we’ll build the input-output table in Table 1 using R. We generate the two-by-two flow of interindustry sales (flowtable) and then create the vector of finaldemand.

# Intermediate flow matrix.
flowtable <- rbind( c( 150 , 500 ), c( 200 , 100 ) )
# Final demand.
finaldemand <- rbind( c( 350 ), c( 1700 ) )

We combine these pieces into a data.frame object. Once combined, we sum across the intermediate input columns and the final demand column to produce a new variable: total output. The result is a data.frame version of Table 1.

# Bind into input-output table.
inputoutputtable <- cbind( flowtable , finaldemand )

# Convert object to data.frame.
inputoutputtable <- as.data.frame( inputoutputtable )

# Name columns of table (dataframe)
names( inputoutputtable ) <- c("x1" , "x2" , "finaldemand")

# Calculate total output, add final demand and intermediate columns:
inputoutputtable$totaloutput <- inputoutputtable$x1 +
                                inputoutputtable$x2 +
                                inputoutputtable$finaldemand

# Show the small IO table.
inputoutputtable
>   x1  x2 finaldemand totaloutput
>1 150 500         350        1000
>2 200 100         1700       2000

# Save total output vector as a separate object. Use later.
totaloutput <- inputoutputtable$totaloutput

Now we can derive a technical coefficient matrix, also called matrix A. A column of this matrix represents an industrial recipe used to produce a single industry good.

Matrix A is calculated by dividing intersectoral flows by the total output of each column’s sector. Specifically, sector 1 ships 500 dollars of good 1 to sector 2, which produces 2000 dollars of total output. Thus, 25 cents of good 1 is absorbed to produce each dollar of sector 2’s output.

To calculate matrix A in R: first take the element-wise reciprocal of the total output vector and multiply it by an identity matrix. The resulting diagonal matrix, z, is then multiplied by the flowtable matrix.

## Calculate coefficient matrix:
z <- ( totaloutput )^-1 * diag( 2 )
A <- flowtable %*% z

# Show A
A
     [,1] [,2]
[1,] 0.15 0.25
[2,] 0.20 0.05

Alternatively, we can use R’s sweep() function to calculate A directly from the flowtable and the totaloutput vector. sweep() takes the input matrix and divides each column by the corresponding entry of the vector. The argument MARGIN = 2 tells R we’re “sweeping” over the columns of the input matrix, as opposed to rows (for row-wise calculations, MARGIN = 1).

# Using "Sweep"
A.alternative <- sweep( flowtable ,
                        MARGIN = 2 ,
                        totaloutput ,
                        '/' )
A.alternative
     [,1] [,2]
[1,] 0.15 0.25
[2,] 0.20 0.05

Finally, the Leontief matrix is calculated in the following way.

Using R, we first calculate I - A, subtracting the technical coefficient matrix from the identity matrix. We then invert the I - A matrix using the solve() function. The result, L, is the Leontief inverse matrix.

# Identity matrix minus technical coefficient matrix.
IminusA <- diag( 2 ) - A

## Calculate inverse.
L <- solve( IminusA ) 

# Show Leontief matrix.
L
          [,1]     [,2]
[1,] 1.2541254 0.330033
[2,] 0.2640264 1.122112

Substantively, the matrix L summarizes the network effects generated when final demand changes. A single coefficient of matrix L, surprisingly, summarizes all direct and indirect output created in sector i to supply a single unit of final demand for sector j.
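As a quick sanity check (my addition, not part of the original toy table), multiplying L by the final demand vector should reproduce the total output vector from Table 1:

# Total output implied by final demand: L %*% finaldemand should equal totaloutput.
L %*% finaldemand
#      [,1]
# [1,] 1000
# [2,] 2000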

Read More

Forty Years of the S (& R) Programming Language

O.g. data scientist and S developer Richard Becker gave a shamelessly wonky keynote address on the “Forty Years of S” (and thus, the deep history of the R language). Great insights into the wildly innovative collaborations (and mythologies) that grew out of Bell Labs, and dense with data science history – including the birth of R’s "<-" arrow operator. Gold.

Read More

The Weird History of Microfilm

From the University of Haifa's Younes and Soraya Nazarian Library, via Atlas Obscura.

The wonderful and underrated online publication, Tedium, has an intriguing (and hilarious) longform piece on the history of microfilm: Honey, I Shrunk The Page.

"Microfilm’s first innovation: Improving carrier pigeon efficiency" - Ernie Smith, Tedium
Read More

A Hunt for the Oldest Government Computer

Last year Popular Mechanics and WOODtv reported that a single Amiga has been running non-stop for three decades, controlling the heating and cooling system of nineteen Michigan public schools.

In the spirit of vintage computing, the prolific researchers at MuckRock.com have a really great ongoing project: the hunt for the government’s oldest computer.

Allan Lasser of MuckRock, in his original blog post for the project:

As the saying goes, “If it ain’t broke, don’t fix it.” If a machine is doing its job, reliably and without error, then common sense dictates that you just shouldn’t mess with it. This is doubly true for computers and quadruply true for government computers. This lends itself to an obvious question: what’s the government computer most in need of an upgrade?
Read More

Stalinist Russia from the Trondheim Archives

Delving into the fantastic collection of the Norwegian Municipal Archives of Trondheim, which has some eerie photographs from journalist Eirik Sundvor’s visit to the Stalinist-era Soviet Union (among many other beautiful gems).

Read More

Rock Hudson's Telegram to Nancy Reagan Before His Death From AIDS

The Ronald Reagan Presidential Library care of Chris Geidner of BuzzFeed News

A great long read from BuzzFeed News’s Chris Geidner with a snippet of Rock Hudson’s telegram to longtime friend, Nancy Reagan. A few weeks before his death, the Hollywood star appealed to Reagan for assistance in transferring to a French military hospital for AIDS treatment. Quoting Hudson’s publicist at the time, Dale Olson:

“ ‘Only one hospital in the world can offer necessary medical treatment to save life of Rock Hudson or at least alleviate his illness,’ Olson wrote. Although the commanding officer had denied Hudson admission to the French military hospital initially, Olson wrote that they believed ‘a request from the White House … would change his mind.’

First Lady Nancy Reagan turned down the request.”

Read More

Readings in the History of Asian State Capacity

A clip from the “Royaume d’ Annan comprenant les Royaumes de Tumkin et de la Cocinchine / Designé par les Peres de la Compagnie de Iesus; I. Somer sculp. - 1600-1699”, from the Bibliothèque nationale de France.

The following is a reading list related to a long undertaking with my two co-authors, Melissa Dell (Harvard) and Pablo Querubin (NYU), in which we explore the historic roots of economic divergence in Asia. Our paper, “State Capacity, Local Governance, and Economic Development in Vietnam,” examines the role of historic state institutions in shaping the different developmental trajectories of Northeast and Southeast Asia.

We take Vietnam as a parable for the larger Asian experience: after independence from China, Vietnam (Dai Viet), much like Korea, consciously adopted the model of the modern Chinese state, complete with centralized administration and a competitive bureaucracy. On the other hand, large parts of southern Vietnam were controlled by Cambodia (the Khmer Kingdom), which, like many Southeast Asian states, was far less centralized than its Sinicized neighbors. Using historic boundaries as a divide between these two institutional worlds, we use a regression discontinuity design to examine how the different patterns of early state formation influenced long-run village-level development.

Understandably, the history of Asian political development isn’t known to many economists. However, the history of European state capacity has influenced a wave of research in the political economy of development. Classic works by eminent historians like Charles Tilly or Perry Anderson have become commonplace, informing the way economists explore the role of state history in economic development.

Our paper on state capacity appeals to a rich literature on Asia – one that may be less well known to economists. Since my co-authors and I spent much time delving into these works, much of which can’t fully be discussed within the scope of an empirical economics paper, I wanted to share some of these readings on state formation in Asia, emphasizing Vietnam.

Asian State Formation in a Comparative Perspective.

A great starting point for understanding the patterns of long-run Asian state formation is Victor Lieberman’s wonderful two-volume history, Strange Parallels. Lieberman synthesizes a massive literature on political development, comparing Asian state history to that of Europe and Eurasia.

  • Lieberman, V. Strange Parallels: Southeast Asia in Global Context, C. 800–1830. Volume I: Integration on the Mainland. Cambridge University Press, Cambridge (2003).
  • Lieberman, V. Strange Parallels: Southeast Asia in Global Context, c. 800-1830. Volume II: Mainland Mirrors, Europe, China, South Asia, and the Islands. Cambridge University Press, Cambridge (2010).


The Chinese State and Historical Congruity in East Asian Political Development

The Chinese state was surprisingly modern relative to its contemporaries and its institutions were copied across Asia. Lost Modernities by Woodside is a great synopsis of the common bureaucratic state features that distinguished the experience of East Asian states from elsewhere in Southeast Asia.

  • Woodside, Alexander. Lost Modernities: China, Vietnam, Korea, and the hazards of world history. Harvard University Press, 2009.
  • Kang, D. C. (2013). East Asia Before the West: Five Centuries of Trade and Tribute. Columbia University Press.
  • Whitmore, J. K. (1979). Merit subjects: China, Korea, Vietnam. Ming Studies, 1979(1), 42–51.
  • Woodside, Alexander (1998). Territorial order and collective-identity tensions in Confucian Asia: China, Vietnam, Korea. Daedalus 127.3, 191-220.


The “Indianized” States of Southeast Asia

Building on the scholarship of George Cœdès, a rich literature has explored the shared characteristics of early Southeast Asian polities and the influence of Indian civilization on the region. O.W. Wolters’s work is a fantastic tour of Southeast Asian pre-colonial states, motivating the concept of mandala kingdoms to describe the organization of these discordant polities.

  • Wolters, O. W. (1999). History, culture, and region in Southeast Asian perspectives. SEAP Publications.
  • Cœdès, G. (1966). The Making of South East Asia. University of California Press.
  • Day, T. (2002). Fluid iron: state formation in Southeast Asia. University of Hawaii Press.
  • Kulke, H. (1986). The early and the imperial kingdom in Southeast Asian history. Southeast Asia in the 9th to 14th Centuries, 1–22.
  • Tambiah, S. J. (1977). The Galactic Polity: The Structure of Traditional Kingdoms in Southeast Asia. Annals of the New York Academy of Sciences, 293(1), 69–97.


Chinese State Influence in Vietnam.

Beginning with Alexander Woodside’s famous 1971 thesis, a number of works have documented the ways in which early Vietnamese leaders adapted Chinese state institutions, emphasizing the role of elite bureaucratic norms.

  • Woodside, A. (1971). Vietnam and the Chinese Model: A Comparative Study of Vietnamese and Chinese Government in the First Half of the Nineteenth Century. Harvard Univ Asia Center.
  • Cooke, N. (1994). Nineteenth-Century Vietnamese Confucianization in Historical Perspective: Evidence from the Palace Examinations (1463-1883). Journal of Southeast Asian Studies, 25(2), 270–312.
  • Smith, R. B. (1973). The cycle of Confucianization in Vietnam. Aspects of Vietnamese History, (8).
  • Whitmore, J. K. (1969). The development of Le government in fifteenth century Vietnam.


The Vietnamese Village as a Site of State Capacity.

The Vietnamese village has long been viewed as the fundamental unit of administration. Our paper homes in on how the central Dai Viet state projected power down to the village level, shaping the scope of village development for centuries.

  • Nguyen The Anh. (2003). Village versus State: The Evolution of State-Local Relations in Vietnam until 1945. South East Asian Studies, 41(1), 101–123.
  • Popkin, S. L. (1976). Corporatism and Colonialism: The Political Economy of Rural Change in Vietnam. Comparative Politics, 431–464.
  • Yu, I. (2001). The Changing Nature of the Red River Delta Villages during the Lê Period (1428-1788). Journal of Southeast Asian Studies, 32(2), 151–172.


The Nguyen & the Vietnamization of Mainland Southeast Asia

Another strand of scholarship has studied the way in which–with the help of a modern state apparatus–the Vietnamese expanded their territorial administration southward into the Mekong.

  • Cotter, M. G. (1968). Towards a Social History of the Vietnamese Southward Movement. Journal of Southeast Asian History, 9(1), 12–24.
  • Cooke, N. (1998). Regionalism and the Nature of Nguyen Rule in Seventeenth-Century Dang Trong (Cochinchina). Journal of Southeast Asian Studies, 29(1), 122–161.
  • Li, T. (1998). Nguyen Cochinchina: Southern Vietnam in the Seventeenth and Eighteenth Centuries. SEAP Publications.
  • Taylor, K. (1993). Nguyen Hoang and the Beginning of Vietnam’s Southward Expansion. Southeast Asia in the Early Modern Era, 42–65.


Many thanks for the helpful discussion and input from the Vietnamese Studies Group listserv and my co-authors.

Read More

Quick Note: Using 'ff' to quickly save giant data sets in R

Using a really powerful library to do something simple.

For a current project, I am having to repeatedly manipulate and save a hundred datasets, each with about 4 million observations. While R tools like fread(), part of the data.table library, make it trivial to load massive data sets into memory, writing these data sets--and doing so repeatedly--is another story.

When trying to save big data sets, many folks first recommend the write.table() function, which gives some people performance gains over the default write.csv() method. For my project, write.table() wasn’t cutting it and, instead, often crashed my system.

Enter the ff library, an incredibly handy tool for working with big data. In fact, ff nimbly deals with some of R’s serious memory issues (see a more technical guide here.). Instead of doing anything fancy with ff, however, I realized one could harness the library to simply save problematic data sets. The performance gains were striking.

A simple alternative to write.table():

Below I take a data.table object I was manipulating (perhaps in a loop), convert it into an “ff dataframe” (ffdf), which I can then save using ff’s speedy .csv file writing function.

library(magrittr) # For use of piping %>%.
library(data.table) # I use data.table to manipulate large datasets.
library(ff) # And the key package we'll use to save.

  # (Let's say I perform a bunch of data.table manipulations here)

  # Start with the data.table object,
  mygiant_datatable %>%

    # ... transform it into an ff dataframe,
    as.ffdf( . ) %>%

    # Write the ffdf object using ff's csv writing function.
    write.csv.ffdf( . , file = "/my/file/path/myfile.csv")

Note: the type of data table (mygiant_datatable) I was working with was quite simple, composed of only a few numeric columns. Thus, coercing the data into an ffdf object was no problemo.

Of course, in the cheap workaround above, the as.ffdf() conversion adds a costly step (time-wise). However, it was well worth the benefit of utilizing the write.csv.ffdf() function … and worth not crashing ad nauseam.

Read More

Quick Note: A New Python IDE for Data Work (& RStudio Lovers)

Just a quick heads-up: I’ve been using Yhat’s new Python IDE, Rodeo, which is pretty awesome for data-centric work.

I’ve been looking for a Python-based environment that feels like good ol’ RStudio, an IDE for R that I keep returning to even over advanced text editors. Like RStudio, Rodeo crucially lets you preview active objects in your environment and view plots while editing away:

Read More

Digitizing & (Re-) Visualizing the Apollo 11 Landing

The Planetary DatArt blog digitized and evaluated the Apollo 11 lunar landing from a collection of government diagrams, detailing the project in an interesting blog post.

Planetary Data.

Originally from the Apollo 11 Preliminary Science Report volume. With some custom perspective for the American audience added here.

<img src="/assets/a11_bbl.jpg" alt="Planetary Data." width="70%" height="70%" />

Paired with another multi-dimensional perspective given by Georgia Tech’s Yanni Loukissas:

Source: from Yanni’s Visual.ly page.

Read More

Tutorial: Big GIS Data in R & Functional Programming


Caption: "Thermal Map of North America. Delineating the Isothermal Zodiac, the Isothermal Axis of Identity, and its expansions up and down the 'Plateau' " From William Gilpin’s "Mission of the North American People (1873)." Via the "Making Maps: DIY Cartography" blog.

The Question

How do I take over 100 NetCDF files, each containing thousands of layers of hourly raster data, and translate them into a single usable dataset?

I'll flesh out the problem a bit more:

  • NetCDF files. We have over 100 NetCDF datasets: 1850weather.nc, 1851weather.nc, ... , 1971weather.nc, etc.

  • Raster layers. Each NetCDF file is a stack of hourly raster layers: 1850weather.nc contains layer.hour1, layer.hour2, ... , layer.hour2000, etc.

  • Shapefiles. We also have a shapefile containing country boundaries.

  • Our goal. Extract a giant panel dataset, containing average hourly weather data for each country.

In other words, we need a way to cycle through each NetCDF file and, for every layer within a file, retrieve the average raster statistics corresponding to each country in our shapefile. This adds up.

This problem is interesting because normal approaches to processing piles of raster data (using nested loops) take forever. A more "functional" approach to the problem can be amazingly more efficient.

In this post I consider the ways in which we can tackle this problem (Part 1) and then motivate a solution (Part 2) for comfortably extracting statistics from gigs of GIS weather data.

Part 1. A Comparison of Two Approaches.

Consider two approaches to our problem, a conventional loop-based approach and an optimized approach.

Conventional Dead Ends with Loops.

Most of us would attack the problem using nested loops. The first loop reads and prepares the NetCDF file; a second deals with the raster layers within each file.

for(i in 1:number_of_netcdffiles) {

   # Load file[i]
   # Do preliminary stuff to file[i]

	for(l in 1:number_of_rasterlayers) {

     # Extract geographic statistics from layer[l]
     # Add statistics from layer[l] to a data set

	}

	 # Save giant file for each NetCDF file

}

We're tempted to iterate over the raster layers ( layer[l] ) and use standard R GIS libraries to, say, extract average values from the rasters over a country boundary shapefile (e.g. hourly weather readings over the borders of Finland). Above all, we're tempted to "harvest" the geographic statistics, taking the values we extract and collating them in a giant file.

Straight up, this is a bad idea.

Loops can be abysmal for these tasks. When iterating through objects with a for() loop, we're actually calling many tiny functions ... over and over again. Not only is for() a function, but so is ":", and so are the brackets "[ ]".

To make matters worse, when we manipulate a vector or data.frame with a for-loop, we're also making many internal copies of our objects. Unbeknownst to us, mundane data transformations can quickly fill our memory with repeated copies.

Moreover, embedded in our first attempt is the agony of loop-based "data harvesting." Unless we're careful, using a loop to incrementally "grow" a dataset will bring your computer to its knees.

Consider an Alternative Approach.

Instead of loops, the following template - and specifically the full program in Part 2 - considers a solution more suited to R.

By combining functional style with the use of the raster() library, we eliminate the need for nested loops and boil the problem down to a streamlined "apply+function" structure.

generate_statistics_from_netcdf <- function( input_netcdffile ) {
	# Turn input_netcdffile into a "raster brick" ...
	# Get statistics from "raster brick" ...
	# Save statistics for input_netcdffile ...
}
lapply( list_of_netcdffiles , generate_statistics_from_netcdf )

Instead of using a for-loop to iterate over NetCDF weather files, we take a list of files list_of_netcdffiles and "apply" a big function, generate_statistics_from_netcdf.

We eliminate the entire inner loop with the help of RasterBrick manipulations. That is, instead of looping over the individual raster layers within a NetCDF file, we transform the NetCDF into a RasterBrick object and manipulate the collection of layers as a single object.

Appealing to a RasterBrick instead of cycling through individual layers feels a lot like the practice of "vectorization", where instead of iterating over individual members of a vector one-by-one, we work directly with the vector. Stylistically, this is a common line of attack for writing more efficient code.

Part 2. The Program.

Let's expand the alternative template above into a full program.

The first part of the program defines our functions: the main generate_datatable_from_rasterbricks function and a set of small sub-functions used within it.

The generate_datatable_from_rasterbricks function eats a raw NetCDF file and, using a team of smaller functions, reads it as a RasterBrick, aligns it with our country shapefile, extracts the country-level weather statistics, and then saves the output file.

The second part contains the core code. Here we define a list of NetCDF raster files (raster_file_list) and the country shapefile (countryshape) used in our calculations. An *apply function feeds NetCDF files through the generate_datatable_from_rasterbricks function.

Instead of using lapply I use mclapply: the latter is a multiprocessor version of list-apply provided by the parallel() package. Conveniently, mclapply utilizes the power of our multi-core processor (if we have one).

The third part of the program takes our saved files and assembles them into a giant file via the amazingly useful functions provided by the data.table() package. With lapply(), our list of .csv files is opened with the speedy fread() function. The big list of opened .csv files is then fed through rbindlist(), which combines them into a single massive data.table.

# ---- X. Header.

# The libraries we use.
library(rgdal)
library(raster)
library(ncdf4)
library(RNetCDF)
library(sp)
library(parallel)
library(magrittr)
library(data.table)

# Detect cores automatically, I usually free one up.
cores <- detectCores() - 1


# Define your file paths here.
weatherraster_path <- "/path/to/weatherrasterfiles"
countryshape_path <- "/path/to/countryshapefile"
output_path <- "/path/to/outputfiles"

# ---- 1. Define Functions.

# -- 1.A. Define Small Subfunctions.

# Small function 1) Reads filename & explicitly opens it as a NetCDF file.
open_netcdf_as_rasterbrick <- function( ncdf_filename_input ) {

  ncdf_filename_input %>%
  file.path( weatherraster_path , . ) %>%
  nc_open( . )  %>%  # Open path as NetCDF file.
  ncvar_get( . ) %>%  # Extract the variable's values as an array.


  # Transform NetCDF into raster brick.

  # NOTE: Your dimensions and CRS will differ,
  # so these should be replaced.

  brick( . ,ymn = -0, ymx = 360, xmn = -90, xmx = 90,
             crs = "the string for your CRS" ) %>%
  return( . )
}

# Small function 2) Transforms the raster brick to our country shapefile.
match_rainbrick_to_countryshape <- function( brick_input ) {

  # NOTE: Depending on your setting and the nature of the shapefile
  # and NetCDF raster files you're using, you may have to do many more
  # manipulations to make sure the raster layers align with the shapefile.

  brick_input %>%
  # Reproject raster brick to the shapefile's coordinate system.
  projectRaster( . , crs = proj4string( countryshape ),
                   method = "ngb" ) %>%

  # Crop to match the size of my country shapefile.
  raster::crop( . , extent( countryshape ) ) %>%
  return( . )
}

# Small function 3) Extract data from a raster brick.
generate_data_from_rasterandshape <- function( brick_input ) {

	brick_input %>%
	# Take means according to the countryshape.
	# Make sure df = TRUE , so that output is a dataframe.
    raster::extract( . , countryshape ,
                     df = TRUE , fun = mean , na.rm = TRUE ) %>%
    return( . )
}

# Small function 4) Grab 4-digit year from input filename.
grab_year_from_inputfile <- function( ncdf_filename_input ) {

  ncdf_filename_input %>%
  regexpr( "[0-9]{4}" , . ) %>%  # Match the 4-digit year.
  regmatches( ncdf_filename_input , . ) %>%  # Get matched REGEX from input string.
  return( . )
}

# --- 1.B. Define "BIG" Function That Extracts Dataset From a NetCDF File.

generate_datatable_from_rasterbricks <- function( ncdf_filename_input ) {

	# Note: the only argument is a NetCDF filename.

	# Start with file argument and process with the sub-functions above.
	ncdf_filename_input %>%

	open_netcdf_as_rasterbrick( . ) %>%
	match_rainbrick_to_countryshape( . ) %>%
	generate_data_from_rasterandshape( . ) -> country_means_dataframe


	# Go back to the input filename, build an output filename from the year, and save.
	ncdf_filename_input %>%

	grab_year_from_inputfile( . ) %>%
	paste0( . , ".csv" ) %>%
	file.path( output_path , . ) %>%
	write.csv( country_means_dataframe , file = . )
}

# ---- 2. Main Code: Setup Environment to Run Big Function.

# Start with your name of the country shapefile we're referencing.
"country_shapefile_name.shp" %>%
file.path( countryshape_path , ) %>%
readODG( den = .  , layer = "countries" ) -> countryshape


# Generate list of NetCDF files automatically from our directory.
# Match all files ending in ".nc"
raster_file_list <- list.files( path = weatherraster_path ,
                                pattern = "\\.nc$" , all.files = FALSE , full.names = FALSE )

# Run our big function on the list of NetCDF files.
mclapply( raster_file_list , generate_datatable_from_rasterbricks , mc.cores = cores )

# ---- 3. Assemble .CSV Files using Data.Table and Lapply.

# Fetch all files ending in .CSV in out output path.
csv_file_list <- list.files( path = output_path ,
                             pattern = ".csv",
                             all.files = FALSE,
                             full.names = TRUE,
                             recursive = FALSE )

# Take the list of saved files & "fast read" them into R.
lapply( csv_file_list , fread , sep = "," ) %>%

# Transform the list of read files into a data.table:
rbindlist( . ) -> big_datatable


# Note: Before reassembling the data, or after, you may want
# to manipulate the data so that it is in a more usable format.

# Note: You may want to setkeys() for data.table here.

# Save the big file.
write.csv( big_datatable, file = file.path( output_path , "big_file_name.csv") )

One thing to try.

Compiling functions with R's byte-compiler

Writing our own functions also allows us to easily use R's compiler() on chunks of our code. The cmpfun() function allows us to generate byte-compiled versions of our own functions:

c.myfunction <- compiler::cmpfun(myfunction)

Sometimes byte-compiled functions can give our programs another performance boost, often with minimal upfront costs, and can yield surprising gains in big data projects without having to turn to Rcpp/C++.

Read More

Visualization: Oslo, Minnesota

Norwegian population settlement in the United States (and Canada) at the turn of the century.

Norwegian Population Settlement in the U.S.

From the "Locating Lutherans" project. Originally from (I believe), Olaf Morgan Norlie's "History of the Norwegian People in America Paperback" (1925).

On the subject of historic Scandinavian migration and its effect in Scandinavia, I highly recommend a recent job market paper by my colleagues Mounir Karadja and Erik Prawitz, "Exit, Voice and Political Change: Evidence from Swedish Mass Migration to the United States".

Read More

A Christmas with the Gipper & Nancy, 1983.

Merry Christmas with the Gipper. Nancy and President Reagan, Christmas Eve, 1983. From the Ronald Reagan Presidential Library, National Archives and Records Administration, via the Retro Campaigns blog. Read More

Tutorial: R Code Style for Empirical Economists

Make your code understandable. IBM's data center in the 1960s, Toronto. Source:ibm-1401.info/

Heuristics, Hunches, & Why the Heck We Care.

We hear many horror stories about big names having their results overturned because of problems in their code. The struggle is real.

With this in mind, even simple rules of thumb used in programming can have large payoffs for social scientists, especially since best practices are seldom discussed in graduate coursework. A few good norms can go a long way in making data-driven research reproducible, sharable, and readable by collaborators.

In this post I cover some coding norms used by clean coding gurus and R developers. While I am talking to R users, much of this is generalizable to Stata folks and beyond.

Now, I’m not going to get advanced here and discuss unit testing or object-oriented specifics. Most of us are social scientists and aren’t developing applications. While “code-driven” research can learn a lot from the craft of programming, our needs are a bit different. Gentzkow and Shapiro make a huge point in their recent work on big data practices for economists: if professionals are paid to do something, it is likely important. However, I am not sure how much of it is practical for the social scientist.

I’ll go out on a limb: researchers probably ought to emphasize readability and reproducibility over writing slick code. The programming background of collaborators varies wildly, so understandability is a must. Seldom are we working with industrial scale projects. In fact, most big data people would probably laugh at what we consider “big.”

The Broad Stuff: humble “bang for your buck” rules.

A lot of this will seem like plain common sense, of course. Then again, many of us never think to do it.

First things first, consider the clean code theorem, from "The Art of Readable Code."

A clean code theorem : "Code should be written to minimize the time it would take for someone else to understand it."

Consistency is key.

Consistency goes a long way in making code readable. This applies to naming rules, syntax, capitalization, white space, how we indent, etc. This type of rigidity makes our work more navigable. And this consistency minimizes the “WTFs per minute” we face when staring into the black hole of code we wrote a year ago.

Comment & document like you’re at risk for a head injury.

It goes without saying, commenting and documentation are huge. Many people who started off as research assistants have been admonished for not commenting enough. However, advice often stops there. Any good programming book or style guide spends a lot of time thinking about the way we document our work.

Comment often, but be brutally concise and to the point. Elucidate complex tasks and consider your audience. Since their backgrounds vary, seemingly simple tasks may have to be spelled out.

More is not always better, however. Comments easily become outdated. In fact, clean code practices in other domains can reduce the need to explain everything to the reader. As in the case of naming, code can speak for itself.

More generally, document your work. Make documentation a consistent feature of your script layouts, keeping updated descriptions in file headers.

Use meaningful, informative names.

Informative names make code infinitely more readable. In fact, smart naming forces us to think deeper about our code and reduces the possibility of errors.

Use concrete, descriptive words and avoid ambiguity. Nouns are used for variables—as well as for classes and attributes — and describe what they contain. Similarly, use verbs to describe functions and the (hopefully singular) action they perform. The names of script files should explain what they do and end in .R.
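For instance, a hypothetical before-and-after (all names below are invented for illustration):

# Vague: what is tmp2, and what does fx() do?
# tmp2 <- fx( df1 )

# Informative: a verb for the function, nouns for the objects it eats and returns.
clean_census_data <- function( raw_data ) {
  subset( raw_data , !is.na( population ) )
}
raw_census_data   <- data.frame( county = c( "A" , "B" ) , population = c( 100 , NA ) )
county_population <- clean_census_data( raw_census_data )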

You would be surprised how much clean coding texts emphasize this. Consider an apt quote from the late computer scientist Phil Karlton:

“There are only two hard things in Computer Science: cache invalidation and naming things.”

Try longer names; they hold more information and spare us from mysterious abbreviations. Contemporary code guides and R convention are moving toward longer names. After all, solid IDEs — and new versions of RStudio — autofill variable names, reducing the cost of typing long names into our files.

Note: By variables I mean the objects in R/Python/etc., not to be confused by variable names used in the final output of cleaned data.

Layouts Matter.

Structure your script in a coherent, organized way. A lot of time spent thinking about the structure of code — as well as writing documentation — can save heartache.

Consider Google’s R style guide suggestion for layouts, much of which can be applied to Stata and other languages (a sketch of such a header follows the list):


Copyright statement comment
Author comment
File description comment, inputs, and outputs
source() and library() statements
Function definitions
Executed statements, if applicable (e.g., print, plot)
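A minimal sketch of what such a header might look like in practice (the file names, author, and data below are made up for illustration):

# Copyright statement / license note goes here.
# Author: A. Researcher
# CleanCensusData.R: reads raw_census.csv, drops counties with missing
#   population, and writes census_clean.csv.

library(data.table)
source("HelperFunctions.R")  # Function definitions live in their own file.

# Executed statements follow.
raw_census   <- fread( "raw_census.csv" )
clean_census <- raw_census[ !is.na( population ) ]
write.csv( clean_census , file = "census_clean.csv" )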

Write D.R.Y. Code (Don’t Repeat Yourself).

Avoid repetition and duplicated code. The habit of pasting giant chunks of code is ubiquitous in economics. However, this practice is a cardinal sin among developers. Errors propagate and multiply. Fixing errors becomes complicated.

Consider a bastardization of the famous "rule of three", from Martin Fowler's seminal book on refactoring: first, we write code to get the job done. Second, we shudder and duplicate what we did. The third time, we think a little more deeply about how to rework (in coding parlance, "refactor") and streamline our code. In other words, ask yourself: can I generalize what I'm doing in a concise way?
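As a toy illustration (data and names invented), the same cleaning step pasted for every year can be folded into one small function:

# A tiny fake dataset to keep the example self-contained.
rawdata <- data.frame( year = c( 1990 , 1990 , 1991 , 1991 ) ,
                       gdp  = c( 100 , 200 , 110 , 210 ) ,
                       pop  = c( 10 , 20 , 10 , 20 ) )

# Repetitive version: copy-pasted once per year, easy to let drift out of sync.
gdp_1990 <- subset( rawdata , year == 1990 )
gdp_1990$gdp_pc <- gdp_1990$gdp / gdp_1990$pop
gdp_1991 <- subset( rawdata , year == 1991 )
gdp_1991$gdp_pc <- gdp_1991$gdp / gdp_1991$pop

# D.R.Y. version: one small function, applied to any year.
make_gdp_pc <- function( dataset , chosen_year ) {
  out <- subset( dataset , year == chosen_year )
  out$gdp_pc <- out$gdp / out$pop
  return( out )
}
gdp_1990 <- make_gdp_pc( rawdata , 1990 )
gdp_1991 <- make_gdp_pc( rawdata , 1991 )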

Modularize.

Breaking code into re-usable, independent chunks will make code easier to read and debug.

Functions play a key role in modularization. Use them often, keeping them short and specific to a task. (Note: I recommend Cosma Shalizi’s notes on writing good R functions and the Clean Code github’s function tutorial ).

Limit the length of your script files, splitting them into multiple files if necessary. At minimum, you should divide analysis and data preparation. Jonathan Nagler of NYU political science explains why:

"Separating data-manipulation and data-analysis is an example of modularity. ... The logic for this is simple. Lots of things can go wrong. You want to be able to isolate what went wrong. You also want to be able to isolate what went right."

Refine and Refactor.

Code should improve through time. Clean code gurus repeat a code of conduct adapted from the U.S. Boy Scout dictum:

“Leave the campground (and your code) cleaner than you found it”

-Bob Martin’s "Clean Code: A Handbook of Agile Software Craftsmanship".

Read More

1972 Thailand & Burma

Copyright the Nick DeWolf Foundation from their fantastic Nick DeWolf Archive Flickr page here.

I stumbled upon the Nick DeWolf Archive, a project from the Nick DeWolf Foundation, which houses a magnificent collection of photo dumps from his travels across Southeast Asia in the 1970s.

Read More

Wozniakian Destruction

"Wozniakian Destruction" - The History of Phone Hacking & its Influence on 1970s Silicon Valley

A bundle of blue boxes from phreaking.wikia.com

 
FiveThirtyEight's Signals series had a great small documentary piece on the history of phone hacking from the 1960s to the 1980s, told through the lens of Steve Wozniak and Steve Jobs's pre-Apple bedroom-based manufacturing of "blue boxes," devices used for manipulating the nation's telecommunication infrastructure. The tidy punchline: "if we hadn't made those little blue boxes, there might never have been an Apple computer." Importantly, the piece pays homage to the classic Esquire piece on the rise of analog phone hacking, the predecessor to PC-based hacking culture.
 

From FiveThirtyEight Science:

Read More

Quick Notes - Coding Stata do-files with Sublime in Unix/Linux

I am used to writing code in notepad programs, such as N++ and the fantastic Sublime Text 3. Here's a quick note on connecting a powerful coding notepad in Linux to Stata.

 

Sublime, like many of these programming-oriented editing notepads, has massively powerful tools that crush Stata's default editor. Moreover, since many people are simultaneously juggling Python, R, and Stata (and more) scripts for a single project, the ability to work from one programming-oriented environment is nice.

 

While it is straightforward to run Stata do-files from Sublime Text in Mac OS and Windows, using packages like Sublime Stata Enhanced, it wasn't obvious how to do so in Linux. The following is a little integration guide, which is indebted to this Github howto here.

Sublime, Stata & Unix Walk Into a Bar:

 

First, from your terminal, create symbolic links for the xstata and stata commands. The gist of creating a link in the terminal is the following:

ln -s [target-filename] [symbolic-filename]
sudo ln -s /usr/local/stata14/xstata /usr/local/bin/xstata && sudo ln -s /usr/local/stata14/stata /usr/local/bin/stata
#[sudo will prompt you for your password]

Of course you can edit this to match the version of Stata (and flavor) you are using.
 
The following Stata package definitely works in Linux, so we'll use it! Download it from https://github.com/rpowers/sublime_stata .
 
Within the ZIP file is a /Stata directory--find it and place it in the Sublime /Packages directory on your Linux system. If you're new to Linux, this folder is likely /[your user name]/.config/sublime-text-3/Packages. Note that these files are sometimes hidden in the terminal, so they may be hard to find. Confirm that the files appear by typing ls -ld .?* in the command line.
 
Last, open the Stata.sublime-build file located in /.config/sublime-text-3/Packages/Stata/ directory. Replace all the text with the following,

 { "cmd": ["xstata do $file"], "file_regex": "^(...?):([0-9]):?([0-9]*)", "selector": "source.stata", "shell": true, } 

Seriously--just copy and paste over the stuff in the original text file. Save, restart Sublime Text for safe keeping, and you're good to go.
 
Now when you use Sublime Text, simply typing ctrl+b executes Stata externally and runs the do-file you're currently editing.
 
Note: for some reason I have run across some issues running do-files in batch mode from the Unix terminal. I found adding an extra space at the end of my code, or a superfluous log close, does the trick.
 
 

References:

  • Sublime Text 3: http://www.sublimetext.com/3.
  • Rhocon's github article for a similar approach: https://github.com/rhoconlinux/Stata-12-in-Sublime-3-under-Ubuntu.
  • Stata Enhanced for Sublime from rpowers (used on Linux systems): https://github.com/rpowers/sublime_stata
  • Symbolic links in Unix: http://faculty.salina.k-state.edu/tim/unix_sg/advanced/links.html
  • Sublime+Stata usage in Window and OSX: https://github.com/andrewheiss/SublimeStataEnhanced
Read More

    Data Janitors & Data Carpentry.

    Data Janitors and Data Carpentry: value in the nitty gritty?

    "Report on the investigation of engineer and janitor service, Board of education, city of Chicago" (1913)

    The fantastic machine learning-oriented podcast, Talking Machines, had an interview with computer scientist David Mimno (also: his course syllabus on text mining for historians is awesome). They spent some time discussing a recent essay by Mimno that riffed off the New York Times "data janitorial" piece, arguing that data wrangling, data munging, or data janitorial work is not trivial grunt labor, but rather integral to the craft of research--especially in fields utilizing machine learning. In particular, the intensive process of creating usable data sets is much less janitorial work and more akin to carpentry (a term already rolling through the data science lexicon):

    From "Data Carpentry,"
    Every data set has its idiosyncrasies. You can streamline the process, but you can’t avoid it. To draw out the analogy a bit more: sure, there’s Ikea, but the best furniture is still made by Amish carpenters.

    More broadly, on Talking Machines Mimno argues that knowing the intimate minutiae of data---in the same manner that humanists know an obscure corpus of work or the intricacies of administrative Dutch---has broader benefits to scholars. This intimacy can inform the questions we ask.

    I couldn't help thinking this view has application in economics, especially since I have been reading work from the great Zvi Griliches, who was a champion of the insights to be gleaned from the process of data collection. His emphasis on the value of data in economics is reflected in his 1994 presidential address to the American Economic Association:

    "We ourselves do not put enough emphasis on the value of data and data collection in our training of graduate students and in the reward structure of our profession." --Zvi Griliches

    With that in mind, it's hard to imagine many of Griliches's insights, not least his work on productivity and technological adoption, without a granular appreciation gained from wading through the muck of data. You certainly get a sense of this from his interview with Alan Krueger.

    Read More

    Visualizing Interlocking Directorates, 1913

    From the St. Louis Fed's FRASER site.
     

    "Exhibit 243: Diagram Showing Affiliations of J.P. Morgan & Co., National City Bank, First National Bank, Guaranty Trust Co. and Bankers Trust Co. of New York City with Large Corporations of the United States" from the FED economic history blog. From the Money Trust Investigation : Investigation of Financial and Monetary Conditions in the United States Under House Resolutions Nos.429 and 504, Before a Subcommittee of the Committee on Banking and Currency

    Read More

    Geospatial Data in R

    Taking Geospatial Data to R (& how to ditch ArcGIS)

    For R users it's very straightforward to ditch ArcGIS (for most tasks) in favor of doing everything through an R script. There are many reasons to do this:

    • First, you can do GIS work on your Linux system or Mac without having to run things through a lame emulator.
    • Second, you can cut yourself loose from dealing with the clunky ArcGIS licensing system.
    • Third, the GIS/R user community is pretty dang big, with a growing collection of resources and libraries.
    • Fourth, you can escape the mysterious, temperamental nature of ArcGIS and have full control over data outputs. Most quant folks I know try to minimize their time processing things in ArcGIS, outsourcing data as soon as possible to Stata or R. Working entirely in R lets you skip the murky black box of ArcGIS.

    All this means there are many reasons to dump ArcGIS--something I should have done before my pal called me out on twitter.

    Damnnnnnn

    Here are just some aspects of working with raster and vector data in R for those wanting to migrate from ArcGIS, plus some tools that helped me write scripts to manipulate "large" data sets--say, a couple gigs of raster data.

    To get started working with GIS data, a couple of R packages cover most ArcGIS tasks. I'd install the sp, raster, rgeos, maptools, and rgdal packages, which cover a surprising number of bases (also: this is helpful to note if you're a Linux user).

    Starting with Raster Data

    Let's consider working with raster files first. You can think of loading GIS-based data just like you would any other object, such as a .csv file. Specifically, library( raster ) is enough to load raster-based images directly into R.

    library(raster)
    weatherfile <- "/home/user/weather_raster.tif"
    rasterweather <- raster( weatherfile )
    popfile <- "/home/user/population_raster.tif"
    rasterpop <- raster( popfile ) 

    Crop one raster to the extent of another; the extent() function helps with this.

    rasterpop <- crop( rasterpop , extent( rasterweather ) )

    Above, I used the extent() function to automatically use the dimensions of another file in our memory. Since we're in R, we can easily save an extent to an object and reuse it.

    One thing that ArcGIS has over R, however, is that it is based thoroughly on a graphical user interface and allows you to see multiple layers seamlessly. Nonetheless, the raster package (as well as staples such as ggplot2) allows you to eyeball and visualize GIS tasks. To plot an individual raster layer, plot( rasterpop ):

    Similarly, it is fairly easy to plot multiple layers. Of course there are all sorts of wacky things you can do to visualize GIS objects, but this is pretty much what you need to graphically verify nothing wacky is going on. Hence, manipulating GIS data programmatically in R doesn't mean flying blind.

    # Superimpose a rasters and vectors by using "add=TRUE"
    plot(rasterpop)
    plot(countryshape, add=TRUE)

    Moreover, it's pretty straightforward to perform common manipulations of raster data, such as changing the CRS. One can change the projection by using the projectRaster() function followed by the resample() function.

    # Say we have another raster with a different coordinate system.
    # We can save this coordinate system using the proj4string() function.
    target_raster <- raster( "/home/user/target_raster.tif" ) 
    target_crs <- proj4string( target_raster )
    
    # Reproject using the projectRaster() function and the target_crs.
    re_rasterpopfile <- projectRaster( rasterpop , crs = target_crs , method = "bilinear" )
    
    # Resample with resample() so the grid aligns with the target raster.
    re_rasterpopfile <- resample( re_rasterpopfile , target_raster , method = "bilinear" )

    The first manipulation changes the coordinate system of the current raster, changing it to match the target coordinate system; the second function is necessary so that the grid of the starting raster matches the grid of the target raster.

    Alternatively, you can easily specify nearest neighbor methods if you are working with categorical raster data. Now, both the target and starting raster layers should have the same resolution.

    While resample() allows us to align the grids of the two files, if the target raster is much more coarse--at a much lower resolution--we should use the aggregate() function, which lets us aggregate the cells of the fine raster to the larger raster; disaggregate() does just the opposite.
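    For example, a quick sketch reusing the objects above (and assuming the population raster is much finer than the target grid):

    # Average blocks of 10 x 10 fine cells into one coarse cell.
    coarse_pop <- aggregate( re_rasterpopfile , fact = 10 , fun = mean )
    # disaggregate() goes the other way, splitting each coarse cell into 10 x 10 sub-cells.
    fine_pop <- disaggregate( coarse_pop , fact = 10 )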

    Shapefiles & Vector Manipulations

    The rgdal package is fantastic for reading vectorized data and shapefiles into R. The package's readOGR() function is fantastic for loading shapefiles directly into R.

    Besides liberating yourself from ArcGIS wackiness, you can manipulate shapefile objects similar to the way you manipulate dataframes. This is because points, lines, and polygon shapes can be recognized as special SpatialPointsDataFrames, SpatialLinesDataFrames, or SpatialPolygonsDataFrames classes. Each type, or class, of layer contains an attributes table. An advantage of this is that you can use these attributes to select parts of the shapefile as you would select a subset of a dataset.

    The rgdal package comfortably handles ESRI shapefiles. Here we load a standard shapefile of country polygons and subset it to create a layer for Sweden:

    library( rgdal ) 
    # Read in with readOGR since it preserves CRS projections.
    globeshape <- readOGR( dsn= "/home/user/countries.shp" , layer = "countries" ) 
    plot( globeshape )
    # Subset to Sweden using the attribute table.
    swedeshape <- globeshape[ globeshape$COUNTRY == "Sweden", ]
    plot( swedeshape )

    Ingredients to Manipulating GIS Data En Masse

    Sure, R is a free, programmatic solution for working with spatial data. Sure, it also gives you transparent control over transforming spatial data. However, a big benefit of using R is being able to manipulate giant chunks of geographic data--and to do so in a way that is reproducible via a script.

    Work with Brick and Stack Objects

    RasterBricks and RasterStacks are your friends when working with big datasets. For instance, weather data often comes in the compact NetCDF format, where a common NetCDF file may contain hundreds of layers of daily weather data, each layer representing geocoded raster data for a single day. RasterBricks are useful in this case, and store a multi-layered raster file in a single object that can be manipulated.

    With the ncdf4 library, you can load a 365 layer NetCDF raster file directly into R. Together with the brick() function (from the raster package), you can work with large, multi-dimensional raster files as if they were one single raster file. In other words, you can apply raster manipulations to a block of raster data directly by defining it as a brick (or stack() as well, though raster bricks and raster stacks are treated a bit differently in memory). This is handy when you want to resize, crop, or transpose an entire set of raster layers all at once instead of looping through each individual raster layer.

    For instance, using the ncdf4 library I can load a giant NetCDF file directly; together with the raster library's brick() function, you can load an entire NetCDF file and treat it as a RasterBrick with minimal fuss:

    library( ncdf4 )
    library( raster )
    # brick() reads the NetCDF file directly; ncdf4 handles the format behind the scenes.
    dailyweather <- brick( "/home/user/rain.2015.nc" )

    And we can chop hundreds of layers down to an appropriate size in one go:

    cropsize <- extent( eushape )
    cropped_dailyweather <- crop( dailyweather , cropsize )

    We can also multiply a multi-dimensional brick object by a single raster layer, effectively multiplying the hundreds of rasters contained in the brick with the singleton layer. I find this extremely useful for calculating population-based weights.

    weatherXpop <- overlay( cropped_dailyweather , population_raster ,
                            fun = function( x , y ){ x * y } )

    Parallelization, foreach/plyr, apply, and data.table

    If you're trying to programmatically manipulate many files at once, you can speed things up tremendously with a number of R libraries and features, especially those that support parallelization.

    For instance, the omnipresent plyr package and the handy foreach package allow you to parallelize time-consuming manipulations of GIS data, especially tasks you would normally loop over, such as repetitive zonal statistics calculations (see the sketch below). I'm not sure which geoprocessing tools are supported by ArcGIS's own parallel processing environment, but R certainly lets you flexibly use the power of your multi-core processors for intensive tasks.
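    As an illustration, here is a hedged sketch of that pattern, assuming a vector of raster file paths (rasterfiles) and a municipality polygon layer (municipalities)--both hypothetical names:

    library( raster )
    library( foreach )
    library( doParallel )

    # Register a parallel backend; adjust cores to your machine.
    registerDoParallel( cores = 4 )

    # Compute municipality-level mean values for each raster in parallel.
    zonal_means <- foreach( f = rasterfiles , .packages = "raster" ) %dopar% {
        r <- raster( f )
        extract( r , municipalities , fun = mean , na.rm = TRUE )
    }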

    Moreover, if you are manipulating GIS data and assembling the results into a dataset--e.g., `growing' a panel dataset of annual mean temperature readings across municipalities--the data.table package can be very helpful in speeding along processes that are usually quite inefficient in R.
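    For example, a minimal sketch, assuming a hypothetical list of per-year data.frames (yearly_results), each with municipality and year columns:

    library( data.table )
    # Stack the yearly results into one panel and key it for fast merges.
    panel <- rbindlist( yearly_results )
    setkey( panel , municipality , year )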

    Read More

    Soekarno and Khrushchev, 1960

    Indonesian President, Soekarno, and USSR's Khrushchev, 1960. Life Magazine.

    Read More

    Tutorial: Growing Datasets in R

    From the Field Museum collection

    Figure: From the Field Museum archives, 1920, Photographer Herbert P. Burtch, Oriental Institute. "Men moving Totem Pole outside Field Museum by train."

    Note: This was originally some notes to RAs but I figured it may be useful for other people out there.

    I've had some discussions with econ folks and RAs who are working with giant datasets in R for the first time--in particular, those having to "harvest" or "grow" unwieldy datasets. R is notoriously slow when it comes to expanding datasets, such as when you want to incrementally append rows of results from a scraping API, or combine a giant stack of raw text files from a text mining project.

    The usual "good" method for concatination uses a do.call function with the rbind function. This method essentially takes a list of stuff and passes them as arguments all at once to rbind. In other words, you can take a list of data.frame names and bind the rows together in one motion: do.call("rbind", <<<A list of data.frame names>>>).

    A Usual Approach

    A common task I encounter is grabbing a chunk of files from a directory and combining them into a dataset. Such a task requires three steps. First, generate a list of files from a directory that match a pattern (e.g. all the .csv files in a directory) using the list.files() function. Next, loop over this list of files and load them into R with lapply, applying the read.csv() function to each file. Then, finally, use do.call() with rbind to stack all the loaded .csv files into a single dataset.

    # Grab the list of files in the directory "/home/user/foo" that end in ".csv"
    csvlist <- list.files( path = "/home/user/foo",
    				pattern = "\\.csv$",
    				all.files = FALSE,
    				full.names = TRUE,
    				recursive = FALSE )
    
    # "Apply" the read.csv function to the list of csv files.
    csvloaded <- lapply( csvlist, read.csv )
    
    # Stack the loaded .csv files into a single data.frame.
    dataset1 <- do.call( "rbind" , csvloaded )

    This is all great, but it can still take a ton of time. Note that below I condense the lapply call and the do.call+rbind line into one:

    ptm <- proc.time()
    dataset1 <- do.call( "rbind", lapply( csvlist, read.csv) )
    proc.time() - ptm
    > user system elapsed
    > 48.840 0.148 50.241

    If you're doing more complicated tasks or working with large sets of data, processing time can balloon.

    A faster method.

    Using the data.table package can speed things along if we're trying to get big data into R efficiently (I highly recommend checking out the project's github). The rbindlist function included in the package is incredibly fast and written in C. In addition, the fread function is built to read data into R efficiently.

    Below I replace the normal read.csv function with fread(), and replace do.call()+rbind() with rbindlist().

    library(data.table)
    ptm <- proc.time()
    dataset2 <- rbindlist( lapply( csvlist, fread ) )
    proc.time() - ptm
    > user system elapsed
    > 4.044 0.084 4.144

    Both methods deliver identical datasets but there are some real efficiency gains when using fread and rbindlist from the super useful data.table package.

    identical( dataset1, dataset2 )
    > TRUE

    This can have pretty amazing payoffs when trying to load massive datasets into R for processing.

    Read More

    Deng Xiaoping in the U.S., 1979

    Deng Xiaoping in the US

    Deng Xiaoping speaks in D.C., the year relations between the People's Republic of China and the U.S. normalized.

    [Citation: via the Asia Society blog]

    Read More

    Hanoi, 1989

    David A Harvey of Magnum Photos

    Workers commute: Hanoi, 1989.

    [Photo: David Alan Harvey of Magnum Photos, 1989]

    Read More

    Useful new R programs from Stockholm.

    There's a lot of useful R programming that comes out of the Stockholm University economics community, like Mahmood Arai's code for estimating clustered standard errors--small programs that go a long way in making R more comfortable for Stata-minded econometrics folks.

    Whelp, my pal Sirus Dehdari, a metrics guy and fellow Ph.D. candidate in economics, has some fresh code for producing both regression tables (with spatially correlated errors and other useful stuff) and regression discontinuity plots--in the vein of Outreg and Binscatter in Stata, respectively. Check out rddplot.R and rdd.R.

    Whew--the first post after four months of trauma following my computer crash.

    Read More

    Suicides & Churches in Early 20th Century Seattle

    From the Making Maps blog. Fantastic plots from old sociology research, originally published in Calvin F. Schmid “Notes on Two Multiple-Variable Spot Maps” Social Forces, Vol. 6, No. 3. (Mar., 1928), pp. 378-382.

    Read More

    Review: Ezra Vogel's The Four Little Dragons - The Spread of Industrialization in East Asia

    Singapore, 1967.  Location: "North Bridge road just after Capitol theatre." Copyright David Ayer.

     

    Sometimes a book's jacket cover is so dated that it obscures contents that still hold up in 2014. The Four Little Dragons was published in 1991, but the narrative still matters. Ezra Vogel's The Four Little Dragons is a comparative primer (lecture) that distills key narratives and lessons from four "Asian miracle" economies. Vogel skillfully narrates the rapid growth episodes of Hong Kong, Singapore, South Korea, and Taiwan in a slim volume. Remarkably, he synthesizes the varied experiences of these late developers, drawing key insights from a rich comparative setting. The book is a solid starting place for those wanting to understand "what happened."

    While concise, and written in an almost aphoristic style, there is no shortage of ideas or bibliographic material. Western scholars can view the Asian growth miracle as one big blob of statist industrial strategies. But paces, policies, and timelines vary, and Vogel's brevity amplifies the differences among the four dragons. For instance, his juxtaposition of Singapore and Hong Kong will frustrate grand theorizers of the post-war Asian experience and those looking for crisp models of structural change.

    Vogel wrangles insights from these experiences, but you won't find a simple, grand explanation here. His final theoretical chapter is written in a chunky, pragmatic style, much like the preceding case studies, and highlights a number of key takeaways. Common patterns have bite: the post-war era found societal hierarchies reshuffled, and old elites were often supplanted. Each regime presided over industrialization with a sense of urgency; trauma undergirded the rise of the region's KMTs or PAPs, and binding political threats loomed. Importantly, new political elites had a template for industrialization: Meiji-era Japan. Finally, Vogel, himself a sociologist, forcefully highlights the importance of the Confucian past in allowing the four dragons to produce competitive bureaucracies.

    In sum, The Four Little Dragons is a great introduction to "what happened" across post-war Asia. Instead of a giant comparative tome or slick theories, Vogel delivers the key issues any scholar of industrialization has to confront. For this reason, The Four Little Dragons makes great reading for a popular economics audience or for a college/graduate course syllabus.

    The Four Little Dragons
    Published: 1991, Harvard University Press
    Author: Ezra Vogel

     

    Read More

    Micro Maps of Conflict & the Weak States of Williamsburg

    NYTimes infographics ahead of the times: a 1970s geography of Brooklyn street gangs.

    From this New York history blog and a spin-off piece from 1981.nyc & upcoming documentary, A Most Violent Year.

    Read More

    From Iceland with Love

    Read More

    Tutorial: Training an OCR Engine

    In a previous tutorial I covered the basics of digitizing old stats with ABBYY FineReader (& alternative digitization tools). Now, I dig into some important digitization nitty gritty: training optical character recognition software to properly read historical content.
     
    Most historical digitization projects will entail training. Old statistical documents often use long-gone proprietary typefaces. While modern OCR software can easily read Arials and Times New Romans, it needs help with more exotic typography; this is where training comes in. If you have ever worked with machine learning-type projects and/or text analysis, training software to properly classify stuff is a familiar concept. And luckily, training ABBYY FineReader's engine is pretty easy.
     


    Gotta train 'em.

     
    Say you're working with an old scanned document and you wish to extract its tables. If you OCR the document using the default pattern recognition settings you will likely be disappointed. Below is an example of a historic document OCRd using FineReader's default recognition schemes; many numbers have been replaced by letters or strange characters.

    FineReader gone wrong.
     
    Training is used to tell FineReader, "Hey, don't do that. That British pound sign is really a number. Same with that W." Anyone who has ever had to get their hands dirty with machine learning will immediately recognize the intuition. And if not, don't worry.
     
    Using custom User Patterns, we train ABBYY FineReader's OCRing algorithm to correctly classify characters in our historic data. Old documents will have many peculiarities that will be missed by FineReader unless we point the software in the right direction.
     

    Note: Somewhere buried in the sparse FineReader User Guide is a line reading, "oh no, you seldom have to train FineReader," etc. This left us scratching our heads. I have never worked with historical data that did not entail some degree of training.

    1. Setup.

    It is a good idea to select a handful (or more) of representative pages from the document you wish to analyze. We will use these pages to train ABBYY's recognition capabilities.
     
    Save your "training pages" as a distinct file and open them in FineReader.
     
    Pre-process these test pages, but do not "Read" them yet. Instead, click the Options button on the main toolbar (or Tools > Options), and then click the Read tab.
     

     
    From the Training section, select either Use built-in and user patterns or Use only user patterns. I typically go with the built-in and user patterns options. You can (and should) experiment to see which is best for your job.
     
    In the Training section, click the Pattern Editor button. The Pattern Editor box will open. From the Pattern Editor box, click New. The Create Pattern box will open; enter a name for the User Pattern. Click OK.
     
    Last, click the Read with training box. This will put ABBYY FineReader in a type of "training mode." Click OK.
     

    2. Reading in Pattern Training mode.

    ABBYY is now in a zombie-like training mode!
     
    Click the main Read button.
     
    Instead of reading the document per usual, a Pattern Training mode box appears and ABBYY FineReader asks your advice about characters it is unsure about.
     
    Adjust the green frame around each character FineReader selects, making sure the box completely encompasses each character. The program is usually good about selecting the full character, but sometimes it needs some help.

     

    In many old documents, printed letters may be misread using FineReader's default OCRing pattern. In my document (seen in the image above), the letter D is always poorly printed. Without direction, the OCR engine systematically reads these letters as Us or Os.
     
    Confirm or correct the highlighted character using the Enter the character enclosed by the frame box and click Train. FineReader saves the corrected character and moves to the next one.
     
    Repeat the Pattern Training process for your training pages, until there are no more characters left to train.
     
    Important: If working with old data, FineReader often mistakes 1s for Is, zeros for the letter O, etc. Hence, it is worth making sure that FineReader has been extensively trained to recognize your document's digits. These are potentially tedious errors to correct in the raw output, so it is worth making sure that FineReader really differentiates these characters.
     
    People argue about the optimal number of pages to train ABBYY on, but it will probably take at least two pages to train FineReader to read a document adequately. However, since we're dealing with a messy environment, we may wish to train on many additional pages. It's really an iterative process.
     
    There may be gross characters you don't want ABBYY FineReader to store. For instance, in the image below the word "TABLE" is cut off. Feeding ABBYY bad examples can diminish recognition. If letters/numbers are cut off, use the Skip button.
     

     
    Err on the side of caution and skip weird stuff. For example, if you are uncertain what a letter/number is, also Skip it. It is better to skip a letter you are unsure about than to train FineReader incorrectly.
     
    Once you're done, Save all your hard work. Click the Options button and select the Read tab; then click Save to File under the Training and Language section.
     
    When you're finally done with training (and have checked your pattern; see Section 3), Load the document you wish to analyze. Now when you press Read, ABBYY will use your User Pattern in its OCRing process. Recognition should improve!
     

    3. Editing Patterns, Rooting Out Errors.

    Just like ABBYY, we make mistakes. Before we implement the pattern we trained, make sure your User Pattern is correct.
     
    Return to the Options window ( Options button or Tools > Options) and click the Read tab. Then click the Pattern Editor... button.
     
    The User Pattern box will open. During the training process we may have accidentally mis-characterized letters. Browse the patterns recorded during training and root out typos. These typos are important to delete! We don't want ABBYY FineReader's OCR engine to interpret characters incorrectly.
     
    If you find an incorrect character (for example, the L that has been mis-characterized as E below), click the letter image and press Delete. Be sure to really check for these types of errors, or any matches that seem iffy.
     
    Press OK when you are done, then save the corrected User Pattern.

    Read More

    The VC & ARVN, re-enacted.

    

    We're sifting through historic Vietnamese data and Pablo, my co-author, sent this along the way.

    Read More

    A new model of the labor market?

    Link: Watch a stampede of idiots endlessly run straight into a spinning metal thing.

    "Artist Dave Fothergill—who’s done visual effects for a bunch of National Geographic documentary films—has created a simple 3D animation of a crowd stampeding straight into a rotating metal thing. Watch as they all get threshed into a chaotic, flailing pile. It’s mesmerizing..."

    From Death and Taxes Mag.

    Read More

    Tutorial: Manipulating PDFs in Python (to Scrape Them).

    When digitizing old data, we often start with a pile of scanned documents we must reorganize. Much time is spent manually trudging through scans, deducing what variables exist, and selecting the tables we eventually wish to turn into machine-readable data. When you have hundreds of multi-page PDFs, this can be a painful experience. However, automating PDF manipulation with Python can save major time.

    The Problem

    We start with scans of old, provincial statistical yearbooks for a Southeast Asian country.

    Each yearbook page corresponds to a variable we want: page 1 has land statistics, page 2 has rice statistics, page 3 irrigation, etc..

    We want to reorganize this stack of yearbooks into something that is easier to digitize and organized by variable, not province. In essence (to the data scientist) we need to "reshape" our scanned PDF data.

    To restate the problem:

    • Have: Province x Variable PDFs

      Most old historical data comes in the following format: hard copy volumes organized by province, state, region, etc., each with the same set of tables (cough variables).

    • Need: Variable x Province PDFs

      And most of the time we want to pull certain pages from each geographic volume and create a new document for each variable.

    The Code

    The following is written for Python 3 and uses the PyPDF2 package (a fork of the original pyPdf package), as well as the os module for directory manipulation.

    The code starts with a directory containing our multi-page PDFs and creates a sub-folder to store individual pages, /splits.

    #The only two modules you need:
    import os
    from PyPDF2 import PdfFileReader, PdfFileWriter, PdfFileMerger
    
    #The directory of your (multipage) PDF files.
    start_dir = "/our/working/path" # Main working directory with PDFs to chop/clean
    
    #Make the following dir if it doesn't exist.
    splits = os.path.join(start_dir, "splits")
    if not os.path.exists(splits): os.makedirs(splits)

    Second, our stack-o-PDFs are read, chopped, and their pages are placed in (page) numbered folders.

    The following code chunk begins at start_dir, the directory containing our original PDF files. We read each scan and then loop over its pages with the line for i in range(in_file_pdf.numPages). Each page i is saved to a folder corresponding to its page number (PyPDF2 counts pages from zero): first pages are saved in start_dir/splits/0; second pages in start_dir/splits/1; and so on.

    for filename in os.listdir(start_dir):
        #Run the following on PDFs only.
        if filename.endswith('.pdf'):
            #Show current multi-page PDF.
            print("Splitting "+filename)
    
            #Define input files, paths.
            in_file = os.path.join(start_dir,filename)
            in_file_pdf = PdfFileReader(open(in_file, "rb")) #(be explicit about binary mode)
    
            for i in range(in_file_pdf.numPages):
                output = PdfFileWriter()
    
                #Make subfolder for each page, but only once.
                num_path = os.path.join(splits,str(i))
                if not os.path.exists(num_path): os.makedirs(num_path)
    
                #Add page i to output, define output path, save, close outputstream.
                output.addPage(in_file_pdf.getPage(i))
                out_file_pdf = os.path.splitext(filename)[0]+str(i)+".pdf" #Add page number to new name.
                out_file = os.path.join(splits,str(i),out_file_pdf)
                print("Saving "+out_file)
                outputStream = open(out_file, "wb")
                output.write(outputStream)
                outputStream.close()

    Third, after chopping and saving, we combine the separated pages into variable-based PDFs.

    The following code loops over each page folder (/splits/0, /splits/1, ...). Using PyPDF2's PdfFileMerger class, we combine the pages within each folder into a single PDF file.

    Hence, the first page of each provincial yearbook is combined into a new file (i.e. the pages in /splits/0 become 0.pdf), which we can then scrape/digitize/pre-process/whatever.

    for root, dirs, filenames in os.walk(splits):
        for dir in dirs:
            merger = PdfFileMerger()
            dirname = os.path.join(splits, dir)
            print(dirname)
            for filename in os.listdir(dirname):
                print(filename)
                in_file_pdf = os.path.join(splits, dir, filename)
                print(in_file_pdf)
                merger.append(PdfFileReader(open(in_file_pdf, "rb")))
    
            out_file_pdf = str(dir)+".pdf"
            out_file = os.path.join(splits, out_file_pdf)
            outputStream = open(out_file, "wb")
            merger.write(outputStream)
            outputStream.close()

    Importantly, your project will probably look much different from this, but combining the os module with the PyPDF2 package in Python can make many splitting/merging tasks trivial. Digitizing old data often entails mind-numbing file manipulation, so a little Python can go a long way.

    Read More

    On developmentalism, planning, & early big data in Allende's Chile.

    "In Allende’s Chile, a futuristic op room was to bring socialism into the computer age."

    Link: From the New Yorker - `The Planning Machine' By Evgeny Morozov

    ``The consultant, Stafford Beer, had been brought in by Chile’s top planners to help guide the country down what Salvador Allende, its democratically elected Marxist leader, was calling `the Chilean road to socialism.' Beer was a leading theorist of cybernetics—a discipline born of midcentury efforts to understand the role of communication in controlling social, biological, and technical systems. Chile’s government had a lot to control: Allende, who took office in November of 1970, had swiftly nationalized the country’s key industries, and he promised “worker participation” in the planning process. Beer’s mission was to deliver a hypermodern information system that would make this possible, and so bring socialism into the computer age. The system he devised had a gleaming, sci-fi name: Project Cybersyn.''

     

    Read More

    Tutorial: A Beginner's Guide to Scraping Historic Table Data

    This is a simple introduction to scraping tables from historic (scanned) documents. It is by no means definitive. Instead, this is a broad overview aimed at researchers with minimal programming experience tackling smaller digitization projects--say, nothing more than 200 pages. I focus on ABBYY FineReader, a popular commercial OCR program with a relatively gentle learning curve and, importantly, straightforward table functionality.

     

    For those more comfortable with the command line and programming, or for open source advocates, I suggest free programmatic alternatives for each tutorial step. Larger complex digitization projects often entail more technical elbow grease and advanced use of such tools.

     

    Oh, the enraging heritage of old data. (From the Mad Men Mondays "Data's First Class Economy Set" Repository: Hartman Center, Rubenstein Library, Duke University.)

     

    First, Some OCRing Tips.

    • Try to work with scans that are at least 300 DPI, saved in TIFF format. PDFs are often unavoidable, but use less "lossy" formats when possible.
    • The older the text, the harder OCRing will be. This is especially true for text from the early 1900s and prior.
    • Cleaning images improves text recognition. Pre-process scans to remove stains and borders; fix page orientation; deskew; and normalize page illumination.
    • The straighter your page, the more likely programs are to recognize tabular content.
    • Experiment with different color settings for your project. People debate the efficacy of binarized (black-white) versus color formats; explore what works best for your material--the benefits are high.
    • OCR quality suffers at very low and very high resolutions. If you scan at a higher resolution, drop the resolution before OCRing.
    • OCR software is poor at reading small text.
    • OCR software needs your help, especially for weird typefaces: invest in training the software prior to OCRing.

    1. Convert Scans: split PDFs & convert to TIFF.

    Most digitization projects don't start with clean TIFF files. They start with nasty, multi-page PDFs produced from clumsy library scanning sessions. Before OCRing our scans, we must pre-process (batch clean) our images. But even before we can clean our documents, we must break apart our multi-page PDFs and convert them to TIFF format, the format used by our pre-processing software as well as by other OCRing tools:

    Using Adobe Acrobat Pro.

    If you have the luxury of Adobe Acrobat Pro, you can easily convert multi-page scans or a set of combined PDF images into TIFFS. (File: Save As > Image > TIFFs). Better yet, you can use their GUI to extract and export subsets of the PDF document to TIFF format.

    Splitting and converting a multi-page PDF to TIFF in Acrobat Pro.

    Open source alternatives:

    If you don't have access to Acrobat Pro, you're in luck. There are oodles of open source tools for breaking apart PDFs and/or batch conversion. ImageMagick is the workhorse open source tool for command line-based image manipulation and is incorporated into many digitization projects; it is cross-platform and can be integrated into most major programming languages. For OSX, Pdf-Splitter by ProPublica's Jeff Larson is a simple command line tool that utilizes native OSX libraries. Born from the frustration of dealing with PDFs, there are also many Python packages for deconstructing documents. For instance, the pdfserenity package converts multi-page PDFs into TIFFs.
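    If you prefer to stay in R, the magick package wraps ImageMagick and can handle the split-and-convert step in a few lines. A hedged sketch, assuming a hypothetical scan at /home/user/yearbook.pdf and an existing /home/user/tiffs folder:

    library( magick )
    # Rasterize each PDF page at 300 DPI, then write one TIFF per page.
    pages <- image_read_pdf( "/home/user/yearbook.pdf" , density = 300 )
    for( i in seq_along( pages ) ){
        image_write( pages[ i ] , path = paste0( "/home/user/tiffs/page_" , i , ".tif" ) , format = "tiff" )
    }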

    The pyPDF2 package in Python is especially useful at manipulating PDFs, which I cover in this post here. Also, it's pretty darn fast!

    2. Pre-process TIFFs with ScanTailor.

    Pre-processing images--straightening documents, splitting pages, removing distortions, de-staining pages, etc.--can make or break character recognition. In the preceding step we converted our files so that we could pre-process our images using ScanTailor, an awesome open source tool specifically for batch cleaning scanned documents. It only accepts TIFF and does a limited number of basic automated tasks, but it does them well. Hence, it has become a staple of the digitization and hacker/text scraping community.

    While many OCRing suites, like ABBYY FineReader, also have solid pre-processing tools, ScanTailor has key advantages. I have found it is much better at straightening pages than most tools. Again, straight pages are crucial for recognizing table structure. ScanTailor also comes with a command line version that can be used to script larger tasks.

    Batch pre-processing images in ScanTailor's GUI-based version.

    Other open source and programmatic tools for pre-processing.

    While most people use ImageMagick for basic image conversion tasks, many utilize its powerful features for batch document cleaning scripts. GIMP, the popular open source graphics suite, also has promising batch pre-processing capabilities: people have had success with Nuvola tools for cleaning up greyscale scans.

    3. OCRing with ABBYY FineReader 12.

    Now that we have a pile of cleaned TIFFs, we load the files into ABBYY FineReader for further 1) pre-processing, 2) training, 3) table analysis/OCRing, and 4) error verification:

    First, use ABBYY's pre-processing tools to further clean and select the optimal OCR resolution.

    Once loaded into FineReader, you will likely want to further clean the scans using the built-in pre-processing tools ("Edit Image"). One tool that is particularly useful is FineReader's optimal resolution tool, which scales the resolution of the image to maximize recognition.

    But before you OCR, train.

    Training is the next crucial step. With historic data, you will likely get poor results if you neglect to train the OCR software and jump straight into OCRing. The ability to fully train OCR engines distinguishes professional software from less sophisticated OCRing tools.

    In general, training improves the ability of OCR algorithms to correctly classify characters by "tuning" the algorithm on a sample of your document. In FineReader, training is a trivial task, where you walk the program through recognizing a sample set of characters from your document. You can easily append and save these training files, called "User Patterns."

    Training ABBYY FineReader's OCR engine on a sample document.

    "Analyzing" & "Reading" - Table recognition & OCRing in FineReader.

    In FineReader, layout recognition and OCR are known as "analyzing" and "reading," respectively. Unlike straightforward digitization of textual material, we want to make sure FineReader recognizes our table layouts, and we want to correct mishaps before it reads the content of individual table cells. First, the Analyze Selected Pages command detects the content of our pages (i.e. finds our tables). We then confirm that FineReader has recognized tables and table cells correctly, adjusting mistakes "by hand" with the built-in table editing tools. Second, we OCR the table contents with the Read Selected Pages command.

    A properly recognized table in ABBYY FineReader with OCRd content.

    Check for mistakes, tweak, & repeat.

    OCRing is never perfect. Once FineReader has "read" your document, you will want to check for errors. FineReader gives a rough error rate for each page, reporting the number of characters it is uncertain about. The "Verify Text" tool lets us easily check and correct each uncertain character.

    After the first OCR session it is best to get a sense of how successful character recognition was and the types of errors that occur. Often, additional pre-processing and training can improve text recognition.

    4. Post-processing.

    Ultimately, FineReader will spit out .csv or .xlsx files, but newly digitized content still needs to be tidied up.

    Especially if you're working with old documents, OCRing produces some junk output. Dust, scratches, and page discoloration can get picked up as weird symbols: *, ^, \, etc. You can easily correct these blemishes using regular expressions in your preferred scripting language (sub/gsub in R, re.sub in Python), as in the sketch below. Better yet, OpenRefine provides some extremely flexible tools for wrangling OCRd output, making most cleaning tasks trivial while also supporting advanced regular expression use.
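    To make that concrete, here is a hedged R sketch, assuming a hypothetical data.frame ocr_output with a character column value holding the raw OCRd cells:

    # Drop stray *, ^, and \ characters picked up from page speckle.
    ocr_output$value <- gsub( "[*^\\\\]" , "" , ocr_output$value )
    # Fix zeros that were misread as the letter O in numeric fields.
    ocr_output$value <- gsub( "O" , "0" , ocr_output$value )
    # Trim leftover whitespace.
    ocr_output$value <- trimws( ocr_output$value )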

    Clean up weird OCR output using OpenRefine and regular expressions.

     

    Ending Note. A bit more on OCR software and getting advanced.

    Why ABBYY FineReader? First, the learning curve is lower than that of the open source options. The closest open source competition comes from Google's tesseract OCR program, which is powerful and useful for those comfortable with the command line or OCRing from a preferred programming language. For advanced projects, tesseract offers unmatched flexibility and customization. However, in my experience, tesseract is frustrating to train and has poor table recognition ability.

    Although it is relatively easy to use, ABBYY FineReader has downsides. For instance, the Mac version isn't as functional as the complete "Professional" PC version. Moreover, multi-core support is limited in both the Professional and basic Corporate versions, making large projects slow and unwieldy (in my experience).

    While FineReader provides tools for table area recognition, at other times we have to pursue more programmatic methods of extracting tables. Common approaches can be seen in Dan Nguyen's ProPublica guide and Dr. Rex Douglass' (UCSD Polisci) method, both of which use computer vision techniques to "cut up" tables, OCRing individual cells before reassembling the table. I recommend taking a peek at both to understand alternative workflows for table scraping.

    Some users have opted to detect tables after OCRing: first recognizing text in PDF files, then stripping the OCRd content using PDF table extraction tools like Tabula. These methods hint at the growing hacker community interested in scraping PDF content. The recent PDF Liberation Hackathon website features some great tools to this end.

    Feel free to shoot me any feedback or share your experiences with digitizing historic data: nathaniel.lane@iies.su.se.

    Read More

    Text Analysis, (Non-) Experts, and Turning "Stuff" into Social Science Data

    Social scientists often use experts to "code" datasets; experts read some stuff and code that stuff using a coding rule (a right-left political spectrum, etc.). But this is a slow, painful process for constructing a dataset.

    Below is an awesome, succinct Kickstarter seminar on how to categorize (political) text using crowd-sourcing from quant-y political scientist, Drew Conway (http://drewconway.com/):

    The great lil' seminar is related to his recent working paper, "Methods for Collecting Large-scale Non-expert Text Coding":

    Abstract

    The task of coding text for discrete categories or quantifiable scales is a classic problem in political science. Traditionally, this task is executed by qualified ``experts''. While productive, this method is time consuming, resource intensive, and introduces bias. In the following paper I present the findings from a series of experiments developed to assess the viability of using crowd-sourcing platforms for political text coding, and how variations in the collection mechanism affects the quality of output...

    Read More

    A Deep Learning Bibliography

    A fantastic and extensive bibliography plus github cataloging deep learning resources/code/libraries, etc. from http://deeplearning.university. An amazing time vortex.

     

    Read More

    Political Economy and Rock n' Roll in Pre-Pol Pot Cambodia

    (A respite from data.) In the 1960s and 70s, Cambodia was home to an emergent rock scene, inspired by Western rock that was rolling its way into the region. The rise of the Khmer Rouge put a swift end to a subversive subculture. As the clips below show, youth culture isn't quite compatible with authoritarianism...much less a utopian regime attempting a hard reboot of society to "year zero."

    A new documentary, "Don't Think I've Forgotten: Cambodia's Lost Rock and Roll", has captured the full story:

    A nod to the archivist and digitization effort--and how (literal) records survived a regime that wiped out property and artists alike:

     

    Credits:

    Photo from: KI Media blog.
    Project, one of many: Don't Think I've Forgotten - Cambodia's Lost Rock & Roll

    Read More

    A great primer on cleaning OCRd data with Python & Regular Expressions.


    Link: Cleaning OCR’d text with Regular Expressions

    Often the pain of optical character recognition isn't the OCRing procedure itself; it is cleaning the tiny, little inconsistencies that plague OCRd content. This is especially true when we OCR historical material: even high quality scans can have a speckle or two that gets recognized as gibberish.

    Adept use of Regular Expressions (regex) coupled with simple Python (or Ruby scripts--or heck, even Notepad++) can be a powerful means of removing nasty errors from OCRd text/CSV files.

    Here's an awesome little primer from Laura O'hare at The Programming Historian on using Python to clean nasty OCRd content with regexes. It offers great sample (verbose) code for helping turn mush into data. Importantly, it breaks down a lot of the regex components, which is helpful for those getting started with this brand of data cleaning.

    Of course, regex+Python won't be perfect. While there are preferred ways of using regex to wrangle text, Python gives most of us a quick, programmatic means of cleaning nasty OCRd spreadsheets and the like. Most importantly, however, errors in OCRd content are seldom systematic, which makes completely automating OCRd data cleaning tricky. There will be hand polishing involved. But as the primer notes, the point isn't perfection; it's to let regex+Python "do the heavy lifting."

    Read More

    An Investigative Journalist's Guide to Geolocating Media


    Link: An Investigative Journalist's Guide to Geolocating Media

    bellingcat, a crowd-funded start-up of Middle East wonks, investigative journalists, and researchers, has made some waves for demonstrating how to geolocate an Iraqi ISIS training camp.

    Here are a couple of fantastic guides to some of the techniques they use to geolocate media, combining picture/video data alongside common geographic tools. One of the more interesting techniques used lately consists of extracting and mapping metadata from photographs using tools like Panoramio.

    Read More

    From the Historical Times

    historicaltimes: 19 inch color TV draws a crowd in Shantou, China, 1983.

    Read More

    usnatarchivesexhibits:

    Letter from Senor Don Enrique Dupuy de Lôme to Senor Don Jose Canelejas, 1898.

    Item From: General Records of the Department of State. (09/1789-)

    This letter, written by the Spanish Ambassador to the United States, criticized President McKinley by calling him weak. The publication of the letter caused the public to support a war with Spain over the independence of the colony of Cuba.

    Source: http://go.usa.gov/DEA9

    Read More

    natgeofound:

    Two men struggle to free their scooter from a barbed-wire barricade in Saigon, South Vietnam, 1965. Photograph by W. E. Garrett, National Geographic Creative.

    Read More

    Guides: DissertationReviews.org's Archives Reviews

    The Fresh from the Archives Guide - from Dissertationreviews.org

    [Photo: Poor students gazing into microfilm from the Special Collections Department, ISU Library]

    Read More

    From the US NARA tumblr.

    usnatarchivesexhibits:

    Berlin Wall Reinforced. Under The Watchful Eye of Communist Police, East German Workers Near The Brandenburg Gate Reinforce The Wall Dividing The City, 10/1961.

    Item From: Records of the U.S. Information Agency. (1982-1994).

    A signature of an era, the Berlin Wall symbolized the descending of the Iron Curtain. Construction on the Wall started on this date in 1961.

    Source: http://go.usa.gov/D5n5

    Read More

    theabsolutemag:

    Bikini Kill's Kathleen Hanna reads The Riot Grrrl Manifesto.

    Read More

    The data on white anxiety over Hispanic immigration

    Read More

    Historic Aggregate Data for Korea, 1910-1945 -- and beyond.

    Read More

    A bibliography on land, peasants, and politics for Malaysia, Indonesia and the Philippines

    Read More

    Dataset: Philippine Municipalities Created by Executive Order

    Oh boy. Doing panel econometrics and economic history in developing countries is awful. One problem, ALWAYS, is when we want to match historical data to contemporary data. Doing so entails historic shapefiles or datasets that allow us to track administrative boundary changes; these things are seldom available for places like the Philippines. 

    So here is a little dataset that lists new municipalities created by executive order (only) from 1936-1965:

    Philippine Executive Order Municipalities, 1936-1965 - Table on Mode.

    Note this does not list the “mother” municipalities. 

    Read More

    A great little bit on the practice/work style of my friend, Erin Riley, a textile artist. (http://www.emptykingdom.com/featured/one-piece-of-art-erin-riley/)

    How do you start your day? Is there something you do every morning?

    I usually just make sure I have a decent amount of coffee and water within reach and make sure I have all the colors I need to work for a while. I turn on all the lights, fans, take my vitamins, check my email and then get to weaving.

    Read More

    Non-parametric Econometrics and Quantile Regressions Online.

    On the tail of some cool new econometric papers are a couple of cool new Stata programs. 

    http://froelich.vwl.uni-mannheim.de/1357.0.html

    Read More

    The Bastards Book of Regular Expressions by Dan Nguyen

    Read More