I gave dpxdt, my perceptual diff tool, some much-needed love today. Feels good to be gaining momentum again. After my final commit of the night I searched around for "perceptual diffs" as I do. I came across a similar tool called Diffux that was released by Causes back in February. Somehow I totally missed it! In their announcement post they wrote this:
Before deciding to build Diffux, we scanned the open source market for some alternatives. Dpxdt looked promising, so we gave it a spin. It got the job done, but the project looked abandoned (6 month old PRs hadn’t received any attention, last commit was in August 2013) and we couldn’t get the test suite to run locally. Plus, Dpxdt is written in Python, and we are no Python experts. So there was a bit of a hurdle in debugging and adding functionality.
This is the kind of thing that bums me out. I wish they had sent me an email or something.
For a year I wrote a lot of code. I added features and fixed bugs. I merged contributions from others. I made it easy to deploy to production. And it debuted at Velocity. But what Causes wrote is true. I didn't enhance the project for 6 months after last summer. What can I say? I've been busy. I'd like to blame GitHub for never sending me notification emails. But that's lame.
The truth is I am completely responsible for not making forward progress. I can't be mad. I just wish I had done a better job of maintaining the project.
This is one of the frustrating parts of open source. It's hard to team up with others across perceived boundaries. Yet another example I saw recently is Chef vs. SaltStack. What's the difference? They both do automation. Sure, they have different customers and different architectures. But the obvious difference is Chef is for Ruby people and SaltStack is for Python people. That's all there is to it sometimes.
Anyways, I'm happy to see more perceptual diff tools out there! I look forward to when we all take it for granted.
This week I helped a friend of a friend understand the reality of managing a team for the first time. Offhand, I mentioned a few things about productivity that they found useful. Reproduced here is how I handle the onslaught of incoming email:
1. All communication must be on mailing lists to create a body of searchable knowledge and overcome the bus factor*.
2. Never reply to an email if anyone else on the thread also knows the answer.
3. Always reply when you have information that nobody else does.
4. If something is important they will email you repeatedly, IM, call, show up in person, etc.
5. Worst-case: Wait a day (or week, or month) and finally reply to an email yourself.
If I didn't do these things I would never find time to design, review, or write code.
* Direct emails for sensitive things are fine, but that's the only exception.
Over the years I've seen attempts to solve what is called the "data fusion" problem. What is that? You have one useful dataset. You have another useful dataset. The goal is to somehow merge them together to create one larger, unified, and more powerful dataset. Sounds awesome! The problem is the two datasets are disjoint and thus have no overlapping sources. There is no simple key with which to join them together.
So can it be done? Let me show you with a simple example.
Imagine you have 3 variables and you want to measure their correlation: X, Y, and Z. For this example let's say X is owning a car, Y is playing golf, and Z is traveling every week for work. Our hypothesis is these have a high correlation.
Say you send 3 separate surveys, A, B, and C, to different groups of random people to measure these variables.
Survey A is like this (X and Y):
Q1. Do you drive a car? [Yes | No]
Q2. Do you play golf? [Yes | No]
Survey B is like this (Y and Z):
Q1. Do you play golf? [Yes | No]
Q2. Do you travel weekly for work? [Yes | No]
Survey C is like this (X and Z):
Q1. Do you drive a car? [Yes | No]
Q2. Do you travel weekly for work? [Yes | No]
Survey A gives you correlation of Car and Golf, variables X and Y. Survey B gives you correlation of Golf and Travel, variables Y and Z. Survey C gives you correlation of Car and Travel, variables X and Z. That leads to this question:
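Each survey yields a 2x2 table of Yes/No counts, from which the pairwise correlation falls right out. A minimal sketch of that step, using the phi coefficient for two binary variables (the survey responses here are made up for illustration):

```python
from math import sqrt

def phi(pairs):
    """Phi (Pearson) correlation for two Yes/No variables, given a
    list of (answer1, answer2) response pairs from one survey."""
    n11 = sum(1 for a, b in pairs if a and b)          # Yes/Yes
    n10 = sum(1 for a, b in pairs if a and not b)      # Yes/No
    n01 = sum(1 for a, b in pairs if not a and b)      # No/Yes
    n00 = sum(1 for a, b in pairs if not a and not b)  # No/No
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

# Hypothetical Survey A responses: (drives a car, plays golf)
survey_a = [(True, True), (True, True), (True, False),
            (False, False), (False, False)]
print(phi(survey_a))  # ~0.667: car owners in this sample tend to golf
```

Run the same calculation on Surveys B and C and you have all three pairwise correlations in hand, which is exactly where the data fusion problem starts.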
With datasets for the correlation of XY, the correlation of YZ, and the correlation of XZ, can you calculate the correlation of XYZ? This is exactly the data fusion problem. The answer is:
No, you can't. Here's why:
You haven't measured XYZ, so how could you calculate it? How can you even put boundaries on its size? There are actually 8 set memberships you're trying to determine -- one for each Yes/No combination of X, Y, and Z (Yes/Yes/Yes, Yes/Yes/No, and so on down to No/No/No).
You know none of these.
You could assume a uniform distribution of Z within the set XY. If Z (Traveling) splits Yes/No as 40/60 in the general population, you might assume it also splits 40/60 within the Car & Golf population (the set XY). That sounds reasonable, but there is no way to calculate an error bound on that assumption. You have no idea what the interior of XYZ looks like. It could be a "rogue wave" of correlation, where the distribution of Z (Traveling) in the set XY (Car & Golf) is perfect and the correlation of XYZ is 100%. It could just as easily be the opposite, where the correlation of XYZ is 0%. You have no way of knowing. None of the measurements you have collected can reveal any piece of the XYZ interior.
Thus, you must assume the error bound on XYZ is 100%. There's no way to calculate otherwise. If you want to know XYZ, you must measure XYZ. No modeling or bias correction can compensate for this. There are only two outcomes in data fusion: either you measure, so you can calculate the error bars, or you make a wild guess.
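The impossibility is easy to demonstrate concretely. Here is a minimal sketch in pure Python (the distributions are made up): two different joint distributions over three Yes/No variables that would produce identical results on all three pairwise surveys, yet completely disagree about the triple overlap XYZ.

```python
from itertools import product

def pairwise_marginals(p):
    """Return the three pairwise marginal tables (XY, YZ, XZ) of a
    joint distribution p over binary (x, y, z)."""
    xy = {(x, y): sum(p[(x, y, z)] for z in (0, 1)) for x, y in product((0, 1), repeat=2)}
    yz = {(y, z): sum(p[(x, y, z)] for x in (0, 1)) for y, z in product((0, 1), repeat=2)}
    xz = {(x, z): sum(p[(x, y, z)] for y in (0, 1)) for x, z in product((0, 1), repeat=2)}
    return xy, yz, xz

# Start from the uniform joint (1/8 per cell) and add/subtract a parity
# perturbation. Summing over any one variable cancels the perturbation,
# so every pairwise marginal is untouched.
eps = 0.0625  # exactly representable in binary floating point
p_plus  = {(x, y, z): 0.125 + eps * (-1) ** (x + y + z)
           for x, y, z in product((0, 1), repeat=3)}
p_minus = {(x, y, z): 0.125 - eps * (-1) ** (x + y + z)
           for x, y, z in product((0, 1), repeat=3)}

# Every survey sees the exact same 2x2 tables...
assert pairwise_marginals(p_plus) == pairwise_marginals(p_minus)
# ...yet the Yes/Yes/Yes cell -- the XYZ overlap -- differs by a factor of 3.
print(p_plus[(1, 1, 1)], p_minus[(1, 1, 1)])  # 0.0625 vs 0.1875
```

Both worlds answer Surveys A, B, and C identically, so no amount of pairwise data can distinguish them. That's the whole argument in eight dictionary cells.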
PourOver is a library for simple, fast filtering and sorting of large collections -- think 100,000s of items -- in the browser. It allows you to build data-exploration apps and archives that run at 60fps, that don't have to wait for a database call to render query results.
Tip: If you aren't getting GitHub notifications properly, switch your primary email account to something else and then back to the original again. This kind of "solution" does not inspire confidence in a platform.
Numba is a just-in-time specializing compiler which compiles annotated Python and NumPy code to LLVM (through decorators). Its goal is to seamlessly integrate with the Python scientific software stack and produce optimized native code, as well as integrate with native foreign languages.
Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Theano features: tight integration with NumPy, transparent use of a GPU...
Recently I was at one of my favorite pizza places on Earth, an old family style establishment in Lower Manhattan.
There were a few slices left over, and when our gracious host returned with them wrapped in foil, he dropped this knowledge on me:
When the pizza arrives, it's hot: On the top you have molten cheese; below that is thin sauce; then you have soft warm bread followed by that nice crunch of the crust. But as it sits there and time goes by, the cheese firms up, the sauce thickens down into the bread making it soggy, and the crust softens.
When you get home you need to reverse the process.
Get a nice skillet and heat it up. Put the pizza down. You can use butter or oil, whatever. Now think about it: As it heats up from below the crust will get crunchy again; the bread will heat up and soften; the sauce will rise out of the bread and thin; the cheese will heat up and melt. You'll be back where you started.
I followed this man's directions and there's no doubt he's right. This was the best reheated pizza I've ever had. How have I gone a lifetime without knowing this method?
(PS: Yes, I'm paraphrasing his words. I wish I could have recorded the original)
Streamtools looks like an interesting visualization of data pipelines. But I don't think graphical programming for this makes sense. So much of data analysis is "cleaning" the data to make sure you're counting the right things. That's usually the ugliest type of code you can write.