Control Your Content
Tumblr’s stability issues are the perfect example of why it’s important to take control of your data.
We, like most, use a variety of web apps ranging from twitter to tumblr for self-promotion, sharing and networking. We rely on these third parties not only for their services but also for their expertise in areas where we aren’t experts. From day one of creating this website, we knew we wanted to share as much of our web activity as possible, but what happens when one of these services is offline, temporarily or permanently? Internally, we follow a simple rule when it comes to the information we put online: control.
Custom Apps
There are two easy answers to why we don’t build our own custom clients for the web apps we use. The first reason is time. Building clients takes time: hours learning APIs, hours deconstructing features, hours writing import scripts and so on. Even with unlimited time, our second reason is the kicker: the apps are good at what they do. We’re the first to admit that we can’t do everything and that we still have lots to learn, which is why we yield to the experts and use their products. Most apps are thoughtfully developed, with countless hours going into even the smallest details, and for good reason: it is typically their sole business.
Open Source to Gain Control
Lucky for us, it seems that in order to be considered a mature web app, an API is almost required. APIs allow us to read and sometimes write data to these services with relative ease. Even with open APIs, there is still plenty of time required to learn the best practices for each service. Whether it is rate limits or data structures, mastering a single API can take months or even years. Thankfully, open source software can step in and handle most of the heavy lifting for us.
Bringing Home the Data
For this very site, we are using around five open source projects solely for the purpose of gathering all of the data we enter into twitter, tumblr, flickr, delicious, and youtube and storing it in our own private database.
Django-syncr, a library for ‘synchronizing Django with the web’ from Jesse Legg, gave us the perfect starting point for gathering most of our data. Out of the box, it supports importing from roughly 10 popular web apps. Django-syncr uses flickrapi and python-twitter, and because of twitter’s new OAuth requirement, we had to add python-oauth2 to the mix. We’ll get into the details of adding OAuth to django-syncr later, but the process was relatively simple.
Although django-syncr was getting the job done, we wanted the flexibility of using tumblr so we also employed djumblr. Djumblr’s only dependency is python-tumblr, and although it does not include comments, likes or reposts out of the box, we plan on contributing and adding that functionality.
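We won’t reproduce django-syncr or djumblr here, but the general idea behind a periodic import can be sketched in a few lines. The sketch below is not how either project works internally; it uses Python’s built-in sqlite3 module rather than Django models, and the table layout, the open_archive and store_items names, and the shape of each item (a dict with ‘id’ and ‘body’) are all our own assumptions for illustration. The point it shows is that an import which runs over and over should be keyed on the remote service’s own identifier, so repeat runs update existing rows instead of duplicating them.

```python
import sqlite3


def open_archive(path=":memory:"):
    """Open (or create) a local archive; the schema is a hypothetical example."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS items (
            service   TEXT NOT NULL,
            remote_id TEXT NOT NULL,
            body      TEXT NOT NULL,
            PRIMARY KEY (service, remote_id)
        )
        """
    )
    return conn


def store_items(conn, service, items):
    """Insert new items and overwrite ones we already hold.

    `items` is assumed to be a list of dicts with 'id' and 'body' keys,
    standing in for whatever a real API client would return.
    Returns the number of rows now archived for `service`.
    """
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO items (service, remote_id, body) "
            "VALUES (?, ?, ?)",
            [(service, item["id"], item["body"]) for item in items],
        )
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM items WHERE service = ?", (service,)
    ).fetchone()
    return count
```

Calling store_items twice with overlapping items leaves one row per remote id, which is what makes it safe to run the same import several times an hour.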
For both django-syncr and djumblr, we’ve added a simple cron job to execute their respective import scripts. Running crontab -e in a terminal lets us add our jobs.
6,26,46 * * * * PATH/djumblr/scripts.py populate_models > /PATH/djumblr_cron.log 2>&1
8,38 * * * * PATH/syncr_updater.py > /PATH/syncr_cron.log 2>&1
If you’re unfamiliar with writing cron jobs, here is a quick breakdown of what the code above actually does. The first group of numbers in both of our jobs represents the minutes at which each should be executed. So the syncr_updater.py script is executed on the 8th and 38th minute of each hour. The asterisks each represent a single field (hour, day of month, month, day of week). By using an asterisk for each, our script will run every hour, every day of the month, every month, and every day of the week. If we wanted our script to run only on Tuesdays at 5pm, our syntax would look like:
0 17 * * 2 PATH/script_name function_to_execute > PATH/log_file_name.log 2>&1
Pretty simple really. As you can see in our use and the example above, you can pass the job a specific function within a script or run the entire script and all of the functions within it. By default, cron emails any output a job produces to the owner of the crontab (or to the address set in MAILTO). Since we all get more than enough email as it is, we opted to have the job write its output, including any errors, to a log file. We’ve appended ‘> PATH/log_file_name.log 2>&1’ to our job as you can see above. If you prefer receiving emails then just remove it, or if you don’t want any output kept after a job is executed then change it to ‘> /dev/null 2>&1’. We periodically check our logs, so for us it was the least obtrusive but still viable option.
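If the five schedule fields still feel abstract, here is a rough Python sketch of how a schedule can be checked against a given moment. It is not how cron itself is implemented, and the parse_field and cron_matches names are ours. It only understands asterisks, single numbers and comma-separated lists (no ranges or step values), and it assumes at most one of the day-of-month and day-of-week fields is restricted, which is true of every schedule in this post.

```python
from datetime import datetime


def parse_field(field, low, high):
    """Expand one cron field into the set of integers it allows.

    Supports only '*', a single number, or a comma-separated list.
    """
    if field == "*":
        return set(range(low, high + 1))
    values = {int(part) for part in field.split(",")}
    for value in values:
        if not low <= value <= high:
            raise ValueError(f"{value} is outside {low}-{high}")
    return values


def cron_matches(schedule, when):
    """Return True if the five-field schedule would fire at `when`."""
    minute, hour, day, month, weekday = schedule.split()
    # cron counts Sunday as 0; Python's isoweekday() counts Sunday as 7.
    cron_weekday = when.isoweekday() % 7
    return (
        when.minute in parse_field(minute, 0, 59)
        and when.hour in parse_field(hour, 0, 23)
        and when.day in parse_field(day, 1, 31)
        and when.month in parse_field(month, 1, 12)
        and cron_weekday in parse_field(weekday, 0, 6)
    )
```

For example, cron_matches("8,38 * * * *", datetime(2010, 12, 7, 9, 38)) is true, while the same call one minute later is false.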
Own Your Data and Live Forever
It would be easy to use the ‘widgets’ that most web services provide, or even a feed parser, to simply display the data we input into these services, but there are obvious limitations to doing so. If a service is offline, as tumblr has been for the past 18 hours today, then your data will simply not be accessible. Or maybe something like the recent GitHub outage hits your favorite service, and you are then relying on their backup practices in hopes that your data isn’t lost forever. Do yourself a favor and pull all of your data into a single place over which you have total control. One day you will be thankful.
Comments are closed for this entry.
We rarely close comments, but for this entry we have decided to. Instead, feel free to email us if you have something pressing to say.
