TumblTwo - An Improved Fork of TumblOne, a Tumblr Downloader

Introduction

Note: New users should directly check out TumblThree.

Over the last week I reverse engineered TumblOne by Helena Carver. The project is in the public domain and thus free from copyright. Since no source code was available, and since I had always wanted to see how easily one can decompile a .NET assembly and wanted to add new features, I thought I would give it a try and reflect it. Bonus: I had never touched C# before, so this way I could learn a new language on top.

Other people on the project's discussion page had suggested similar features, and since development seemed stalled, I thought about releasing the code and the binary with my changes, probably also under the public domain.
I don't want to take over the original project, infringe any copyright, or claim full authorship. So if the original author wants to continue her project, I'd be happy to help and see my changes committed. In the meantime, I settled on a fork for the changes and a new project name.

TumblTwo, a TumblOne Fork

TumblTwo is an image downloader (crawler) for the blog host Tumblr.com, based on TumblOne. After you supply a url, the application searches for and downloads all types of images in a given resolution. It can download only tagged images, download from multiple blogs simultaneously, and enqueue others.

Screenshots:

Main UI, showing the list of blogs at the top and the current queue status in the middle. On the right side are the controls for managing the blogs and the crawl process:
TumblTwo Main UI

Program Usage

  • To use the application, simply copy the url of any tumblr.com blog you want to download the pictures from into the textbox at the top. Afterwards, click on 'Add Blog' on the right.
  • To start the crawl process, click on 'Crawl' on the right. The application will regularly check for (new) blogs in the queue and start processing them, until you stop the application by pressing 'Stop'. So, you can either add blogs to the queue via 'Add to Queue' first and then click 'Crawl', or you start the crawl process first and add blogs to the queue afterwards.
  • You can set up more than one parallel download in the 'Settings' on the right side. Also, it is possible to change the download location and the sizes of the pictures to download there.

Tags

  • You can also download only tagged images by entering tags as a comma-separated list in the tag column of the blog list at the top. For example, great big car,bears searches for images that are tagged with either great big car or bears, or both.
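
The either/or matching described above can be sketched in a few lines. This is only an illustration in Python, not the C# code TumblTwo ships; the function names are made up, and ignoring case and surrounding spaces is an assumption on my part.

```python
from __future__ import annotations


def parse_tags(tag_field: str) -> set[str]:
    """Split the comma-separated tag cell into a set of normalized tags."""
    # Assumption: tags are compared case-insensitively and without
    # surrounding whitespace; empty entries are dropped.
    return {t.strip().lower() for t in tag_field.split(",") if t.strip()}


def matches_tags(post_tags: list[str], wanted: set[str]) -> bool:
    """Return True if no filter is set or the post has at least one wanted tag."""
    if not wanted:
        # An empty tag column means "crawl every image".
        return True
    return any(t.strip().lower() in wanted for t in post_tags)
```

With the example from the list, parse_tags("great big car,bears") yields the two tags "great big car" and "bears", and a post tagged with either one (or both) is accepted.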

New Features

New Features (over TumblOne):

  • multiple simultaneous picture downloads of a single blog, customizable in the settings. As an alternative, each picture is downloaded successively.
  • multiple simultaneous downloads of different blogs, customizable in the settings.
  • possible to download tumblr.com hosted videos.
  • it is possible to download images from blogs only for specific tags.
  • a clipboard monitor that detects tumblr.com blog urls in the clipboard (copy and paste) and automatically adds the blog to the bloglist.
  • a download queue for blogs.
  • a detection if the blog is still online or the owner has changed.
  • the blogview is now sortable and shows more information, e.g. date added, last time finished and the progress.
  • a settings panel (change download location, turn picture preview off/on, define number of simultaneous downloads, set the imagesize of downloaded pictures).
  • Somewhat overhauled user interface which is resizable, faster and saves and restores its settings.
  • Source code on GitHub (written in C# and WinForms).

Changelog

2016-06-10:

  • Support for tumblr.com hosted videos. Check the settings window to enable video download (default: off).
  • This is probably going to be the last release.

2016-05-24:

  • New icons, as requested by the author of TumblOne.
  • I am not the author nor owner of the website tumblone.com and not responsible for the content of this particular site.

2016-04-09:

  • Support for urls starting with https: instead of http:
  • The image preview now applies its visibility settings upon startup.

2016-04-04: Code Refactoring

  • Started my complete code rewrite in C# using WPF and MVVM pattern. Most things are already done and set up but not debugged yet. Some converters for the UI are still missing. New features will be:
    1. Better and modular code!
    2. Internationalization support
    3. A blog rating system
    4. Save and restore, clear queuelist
    5. Movable items in queuelist
    6. Taskbar buttons and progress indicator
  • Maybe it's possible to add support for new websites now, and it's certainly possible to add video support for tumblr.com hosted videos without a big hassle. CLI support is also planned, and at some point I'm planning a Mono GTK# UI for Linux support. A screenshot showing the current state:
    TumblThree - Core rewrite featuring C# and WPF with MVVM pattern.

2016-03-11:

  • Since we have to pre-crawl all image urls for the parallel image downloading, we now use their count for better progress indication instead of the total blog post count, which might contain duplicate posts of the same image (seems to happen a lot), text, videos, etc.

2016-02-28:

  • Version bump: Version 1.0.7.
  • Some images were downloaded again even though they had already been downloaded, because we saved the full url and used it to check for duplicates. If the file was hosted on a different mirror, however, the application would redownload the same file and increase the counter for downloaded images.
  • Finally, the program should work quite nicely for everyone now, I hope.
  • Half finished mono release for Linux. It just runs and downloads, .. [GPG sig]
  • Next Steps:
    1. I am going to upload a "mono" version for Linux in a few hours/days without the clipboard monitor, as it relies on Win32 APIs, which seem to break the application. All the other stuff seems to work now that the path handling in the code has been sanitized for \ and / on Windows and Linux, respectively.
    2. Get rid of the progress indicators in the button as they are too troublesome for multiple blog downloads and provide similar or better information in the blog list at the top.
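
The mirror problem from this entry can be illustrated with a short sketch. It is written in Python rather than the project's C#, and it assumes (the entry does not spell this out) that a duplicate is recognized by the file name at the end of the url instead of by the full url. The host names and function names are made up for the example.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url: str) -> str:
    """Return the last path segment of a url, ignoring host and query string."""
    return posixpath.basename(urlsplit(url).path)


def is_new_image(url: str, downloaded_names: set) -> bool:
    """True if no file with the same name has been downloaded before.

    Comparing file names instead of full urls means the same image served
    from a different mirror host is not fetched a second time.
    """
    return filename_from_url(url) not in downloaded_names
```

The trade-off is that two different images that happen to share a file name would be treated as one, so this only works if the host assigns unique names.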

2016-02-27:

  • Set a maximum degree of parallel downloads to prevent connection timeouts and connection closures from tumblr.com, which appeared on my side after around 6,000-10,000 images. The crawl seemed stalled after a while, then finished before downloading all images. I've set the value to 20 (previously 50), and it is divided by the number of parallel blog downloads in the settings. When you crawl multiple blogs at once, you might have to adjust this value in the settings as it depends on your bandwidth. Thanks for the email regarding this issue!
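
As a rough sketch of how such a cap can be shared between blogs (Python for illustration only, not the actual C# code): the global maximum is divided by the number of blogs crawled in parallel, and each blog gets a worker pool of that size. The function names, the integer division, and the floor of one worker are my assumptions; fetch stands for whatever function downloads a single url.

```python
from concurrent.futures import ThreadPoolExecutor


def per_blog_limit(max_connections: int, parallel_blogs: int) -> int:
    """Split the global connection cap evenly across the blogs being crawled."""
    # Assumption: integer division, and never fewer than one worker per blog.
    return max(1, max_connections // max(1, parallel_blogs))


def download_all(urls, fetch, max_connections=20, parallel_blogs=1):
    """Download the urls of one blog with a bounded number of worker threads."""
    workers = per_blog_limit(max_connections, parallel_blogs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the results in the same order as the input urls.
        return list(pool.map(fetch, urls))
```

With a cap of 20 and four blogs crawled at once, each blog would use at most five simultaneous connections.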

2016-02-26:

  • Further integrated the versions. The parallel crawl is now the default and integrated in the main TumblTwo.exe. You can switch to the old, serial download method in the Settings.

2016-02-25: stable releases

  • Integrated the beta (tags) version into the main version, so no more "beta" right now. You add tags in the main blog list as a comma-separated list, e.g. great big car, bears would search for images that are tagged with either great big car or bears, or both. Tags are saved and reloaded once the blog has been crawled for them. Just clear the tags column to search for all images again.
  • Clicking the picture preview in the bottom right corner opens a fullscreen preview. Upon clicking it, the normal view returns.

2016-02-24: stable releases

  • I did a two-day code cleanup to enhance and someday remove the wonky user interface. The new versions aren't compatible with the previous ones yet.
  • the blog data is now mostly updated automatically instead of me messing around doing so manually. This should greatly improve accuracy, reduce the number of errors and remove lag.
  • the blogview now saves the column order, width and so forth. The columns can be reordered.
  • the blogview progress is now underlaid with a progressbar.
  • Probably more that I already forgot. I am going to merge the tags (beta) version into this one and will someday come up with a new interface after reorganizing the code further. I just thought this might be a good intermediate version/step (for newcomers) as the UI should be more stable now, and the old versions are still here for download (since the data files are not compatible yet).

2016-02-22: all releases

2016-02-19: all releases

  • It's now possible to download photosets.
  • Added a detection of whether the blog is still alive and/or whether it's still the same blog. For this we use the HTML title and the blog description. I wasn't sure if the title would be enough, since many blog titles are simply equal to the url, which might not change if the owner of the blog changes. Thus, I'm also taking the description into account, but I'm not sure how frequently descriptions change. So, I'm happy about any input in the comments or by mail if we're generating too many false positives.
  • A more parallelized version for single/few blog downloads can be found here: Windows Application (.exe) (~248 kb) [GPG sig] - Windows Application (.exe) (~248 kb) - Beta [GPG sig]. I haven't had much time to test it yet, but maybe it's worth a try if you don't download multiple blogs at once. The picture preview might lag or show nothing, and "stop/pause" won't take effect at once since we "batch" download up to 50 images, but otherwise it should work.
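
The title-plus-description check from this entry could look roughly like the following sketch. It is Python for illustration, not the C# code in TumblTwo; storing a hash instead of the raw strings and normalizing case and whitespace are assumptions on my part, and the function names are made up.

```python
import hashlib


def blog_fingerprint(title: str, description: str) -> str:
    """Combine HTML title and blog description into one comparable value."""
    # Assumption: collapse whitespace and ignore case so that cosmetic
    # edits do not make the blog look like it changed owners.
    normalized = "\n".join(
        " ".join(part.split()).lower() for part in (title, description)
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def same_blog(stored_fingerprint: str, title: str, description: str) -> bool:
    """Compare the fingerprint saved at the last crawl with the current page."""
    return stored_fingerprint == blog_fingerprint(title, description)
```

Any real edit to the description changes the fingerprint, which is exactly the false-positive risk mentioned above.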

2015-11-23: all releases

  • Allows Column Sorting
  • Added a progress percentage column in the Blogview (no fancy progressbars yet).
  • "Delete Blog" now deletes only the index file and removes the blog from the view, but does not delete any downloaded images.

2015-09-08: all releases

  • It's now possible to import TumblOne blogs by simply adding/moving the proper .tumblr files from the Index folder of TumblOne (which is located inside the \Blogs\ folder that holds your downloaded pictures, right next to where TumblOne.exe is located) into the Index folder of the download location set in the 'Settings' window in TumblTwo. The blogs will be added but will show "not yet crawled!". That's okay, because we use a different counting mechanism. After the first crawl is started, the index will be adjusted.
  • For an update on video / larger image support, see here

2015-09-01: stable and beta release

  • Removing a blog is now always possible and does not result in a reload of the whole library (not sure why this was implemented in the first place).
  • Some fixes for the progressbar.
  • Some minor UI code changes and cleanup.

2015-08-28: stable and beta release

  • Added a Clipboard Monitor. Enabled by default, can be turned off in the mainwindow on the right side panel. Once turned on, if you ctrl-c or copy any text which contains one or more Tumblr blog urls, the blogs will be automatically added if they don't exist.
  • Disabled useless startup splashscreen.
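
Finding the blog urls in copied text boils down to a pattern match. The sketch below is Python for illustration, not the code in TumblTwo; the pattern, the function name, skipping "www", and normalizing every hit to http:// so the same blog is not listed twice are all assumptions on my part. Reading the clipboard itself is platform-specific and left out.

```python
from __future__ import annotations

import re

# Assumption: a blog url is a single subdomain directly under tumblr.com.
BLOG_URL = re.compile(r"https?://([a-z0-9-]+)\.tumblr\.com", re.IGNORECASE)


def extract_blog_urls(text: str) -> list[str]:
    """Return the distinct blog urls found in the text, in order of appearance."""
    found = []
    for match in BLOG_URL.finditer(text):
        name = match.group(1).lower()
        if name == "www":
            # The main site is not a blog.
            continue
        url = f"http://{name}.tumblr.com/"
        if url not in found:
            found.append(url)
    return found
```

Post links and the blog's front page collapse to the same entry, so copying several posts of one blog adds it only once.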

2015-08-27:

  • Beta release (not really well tested yet). Crawl only specifically tagged images by specifying the tags in the Queue Window as a comma-separated list, e.g. Aston Martin,ferrari,Porsche. The blog is then crawled for any image that matches one of the given tags. To do so, add the desired blog to the queue without starting the crawl process. Now click in the cell next to the blog under the column header Tags for crawling. Enter your tags separated by commas and finish with enter. Start the crawl. If you don't care about tags, simply leave the cell empty to crawl the whole blog.

2015-08-26:

  • Fixed threading wonkiness. Sometimes, the queue still got depleted from idling tasks after pressing 'stop'.
  • The number of simultaneous downloads can now be adjusted without restarting the application, provided the number of threads is not smaller than it was before and the crawl process is not currently running.
  • Make sure the download location is always correct (trailing backslash) to fix the "The current blog cannot be saved to disk"-bug.
  • Now saving the windowsize and its position.

2015-08-25:

  • Large speedup for startup times and resuming of blogs, since we now catalog all downloaded filenames together with their urls in a small single index file instead of checking every downloaded image file in the download folder at startup. The index is now also used for the duplicate check. This should improve speed drastically. Also, it's now possible to safely remove images from \Blogs\"MyDownloadedTumblrBlogFolder"\ without them being downloaded again, as long as you keep the .tumblr (index) file in the Index folder. This opens the way to a backup function.
  • Showing the number of posts in each blog. This might be equivalent to the number of pictures if the blog only contains pictures.
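
The idea of the index file can be sketched as follows: one small file per blog maps each downloaded filename to its url, and a crawl only fetches urls that are not in it yet. This is a Python illustration, not the real thing; the actual .tumblr file format is not described here, so the JSON layout and the function names are assumptions on my part.

```python
import json
from pathlib import Path


def load_index(path: Path) -> dict:
    """Load the filename -> url map, or start empty if the blog is new."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def save_index(path: Path, index: dict) -> None:
    """Write the whole index back as one small file."""
    path.write_text(json.dumps(index, indent=2), encoding="utf-8")


def urls_to_download(candidate_urls, index: dict) -> list:
    """Keep only urls that are not recorded in the index yet.

    The image files themselves are never inspected, so deleting pictures
    from the download folder does not trigger a redownload.
    """
    known = set(index.values())
    return [url for url in candidate_urls if url not in known]
```

Because only the index is consulted, backing up a blog's state comes down to copying that one file.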

2015-06-04:

  • Added Multiselection in the Blog and Queue View.
    To add multiple blogs at once to the queue, select the blogs with the ctrl-key or shift-key pressed, then hit "Add to Queue". Same for removing, just in the "Queue"-view and hit "Remove Queue" (Thanks to Torn for suggesting this!).

2015-04-08:

  • multiple simultaneous downloads
  • a download queue
  • a settings panel (change download location, turn picture preview off/on, define number of simultaneous downloads, set imagesize of downloaded pictures)
  • the tumblrlist now features columns for 'Date added' and if and when the blog was completely crawled
  • saves and restores settings
  • resizable UI

Possible next Features (ToDo-List):

  • prevent downloading "Image has been removed" / same images
  • add an 'expiration date' to crawl only newer images in a specific blog -> Partly done: You can simply recrawl all blogs. As long as you keep the Index (.tumblr) files, only newer images will be downloaded, since all images (the download url and the filename) are cataloged in the index file. No redownload occurs.
  • option to automatically remove blogs when crawling is complete.
  • batch input of tumblr blog urls from text file -> We check the Clipboard for URLs now. Simply ctrl-c your text file.
  • import blog index files from TumblOne.
  • 'backup function' for blog indexes -> Check your Downloadlocation\Index\ folder and save the appropriate .tumblr file for your specific blog.
  • Download only specifically tagged files.
  • proxy setting
  • allow downloading video files.
  • Download photosets
  • Download inline images from other types than pictures posts (for example Question and Answers)

Bugs:

I'm completely new to C# and (safe)-threading programming and if anyone wants to help, feel free to commit. So, beware of the code ;). I'll add source code annotations over the next few days and the first git commit is the pure reverse engineered TumblOne code without any modifications from my side.

Download

Comments

John (not verified)
Fri, 25/03/2016 - 06:04

I am getting the exact same behavior. Nothing crawls on the current version when using index files from the earlier one.

Lucently (not verified)
Wed, 30/03/2016 - 04:17

There are too many identical pictures downloaded from different blogs.

John Albrecht (not verified)
Thu, 31/03/2016 - 15:59

I use DupeGuru and Visipics to deal with this issue. Dupeguru is nice because you can set folder to be the reference folder, and others to be the "normal" folders. That way anything in the reference folders will be kept no matter what, and any duplicates found will be removed from the normal folders. Or you can just do all as normal and dupes will be knocked out. There is a picture edition too for duplicates that have different sizes and resolutions. Visipics is great too, but it lacks some of the reference functionality. I use both, since Visipics has the ability to determine which version of a picture is better.

Ultimately though, I'm looking for a tool that can replace both, since there is some functionality missing that I want. Nothing that I have found can recognize that an image in the "reference" folder is poorer quality than what's in the "normal" folder and replace the reference picture with the superior quality one from the "normal" folder.

zab
Mon, 04/04/2016 - 13:42

Thanks for all the comments!
It is certainly possible to add that function. Are the filenames the same?

We could hash (create a unique sum) from every file we download. If the sum from a downloaded image already exists, then we simply don't save the file again. But that would still mean we have to fetch the data (file) beforehand.

If the filenames are the same, on the other hand, it should be possible to simply check whether any blog index file contains the filename and, if so, skip it. That's basically what the "calculating new image urls .." step does, but only within the same tumblr blog, not over the whole library.

John Albrecht (not verified)
Fri, 08/04/2016 - 18:58

Just to clarify, I still can't use the newest version with the 1.06 index files I have. It just sits there forever.

Dave (not verified)
Fri, 01/04/2016 - 22:54

First off, thanks for picking this project up! It's well on its way!

I am using version 1.07 on an x64 Windows 10 Pro system.

There are a couple of bugs that are repeatable on my system.

1. If I copy a link, the program does add it to my list, but it doesn't show it automatically. I have to refresh the list by clicking on the date added column twice. Once to sort it reverse, and the second time to sort it back. Once I do that, the new blog appears.

2. In my settings, I have preview disabled. When I first start the program, preview is enabled and it will display pictures if I start updating the blogs. If I click the settings icon, the preview immediately disappears without me having to do anything else. The setting is unchecked to show preview.

There are a couple of features I think would be helpful. I like to categorize my blogs by the type of pictures I am downloading. Cars, Planes, Jet Skis, etc. It would be great to have some kind of feature that lets you put the blogs in a Folder in the main display and then you can click the folder, add to queue, and have it crawl all of the blogs in that folder.

Every now and then, I copy a link, and it's https instead of http. I usually realize it after I don't see it show up in the list. So I manually paste it in to the bar, remove the S from https and then add to queue. It would be great if the program would strip that off automatically.

I have also been getting some JIT errors, but it lets me continue. I will paste the next one in here when it happens.

Thanks again for doing such a great job on this! I wish I knew programming enough to help you work on it!
Dave

zab
Mon, 04/04/2016 - 13:46

Thanks for your comment!

I'll update the "old" binary to fix the picture preview and https/http issue as both shouldn't take more than 10 minutes to fix.
For the other things I'll have to look into it more closely and maybe only add the features after I've rewritten the program in a few weeks.

Thanks for the suggestions, I'll certainly add those to the list of nice things to have!

Dave (not verified)
Fri, 01/04/2016 - 23:09

Hello. Some Tumblr blogs require you to logon before you can view the site. Are there plans in the future to be able to logon through the app? I tried to crawl a blog and it shows as "offline". When I open it internet explorer it says that you must logon to Tumblr to see the site. Once I logon in my browser, I go back to TumblTwo and try again, but it doesn't recognize that it's logged on.

Thanks,
Dave

Jerome (not verified)
Tue, 05/04/2016 - 21:49

Hey....tumblthree looks really slick, as you've found a much better way to optimize layout, very NICE JOB.

I wish I knew how to code, cause I'd def love to help out with this project, and the direction you're headed.

I've been checking daily for any responses to my previous posts, and any updates for the program...so Im anxiously awaiting the next release. Also, check your email please (from me), thanks.

John (not verified)
Mon, 11/04/2016 - 15:47

Currently, Neither of the available downloads can even read the index files I have from 1.06("TumbleTwo_beta"). The current "TumbleTwo.exe" and "TumbleTwo_old.exe" just load up with lots of blank entries for each index file I have, with no text and a date of "01/01/0001 12:00:00"

As mentioned in earlier posts, the previously available version of 1.07 "TumblTwoOne" will load the index files, but fails to do a new snatch. It will get stuck at "Calculating new image URLs..." regardless of settings (Check if new images were previously downloaded from a different mirror)

In order to keep my blogs/downloads up to date, I am still using a version of 1.06 called "TumblTwo_beta.exe" that I downloaded on 2/22/2016.

Honestly, I really love the work you have done on all this, your program is awesome. But I would recommend iterating the version number in some way with each release. The different versions all sharing the same "1.06" and "1.07" is fairly confusing. I am also greatly looking forward to the ability to use my current index files on a 1.07 or newer version. I've donated in the past and I'll be happy to do so again in the future. Thank you for all your hard work.

John (not verified)
Mon, 11/04/2016 - 20:11

Okay, so the 4/11 1.07 "TumblTwo_old.exe" is working with my 1.06 index files! This is great news. It both sees the index files, and successfully downloads new images.....for the most part.

I am encountering a new problem that I have not yet experienced before. Some blogs are getting stuck. I don't know if they are stuck as they finish downloading the blog or what. One(and only one it seems) of the ones that is stuck updated the "Last Complete Crawl" to today, but others didn't. Others that completed without issue also successfully updated the "Last Complete Crawl column". But for many blogs, the "Current Process" column will eventually hit a "Downloading http://whatever " and not move again. Whatever file it is referencing in that column is already downloaded. This is also clogging up the queue. I bumped it up to 5 concurrent, but at this point, 4 are stuck and only one is doing anything. Actually, no, now all 5 are stuck, and nothing is progressing or downloading.

Sorry to keep bothering you with issues! Thanks again for the progress!

MSWallack (not verified)
Tue, 12/04/2016 - 15:42

I completely agree with the suggestion that each download have an updated version number (even if it's 1.0.7.2, for example). I'd also suggest adding that versioning next to the download link on the website.

Mat (not verified)
Thu, 21/04/2016 - 18:27

Hello! You are the greatest developer of the Universe ))) Thank You for the best software for downloading!

Thumblr keeped all video for inside it servers for 3-4 month ago and users day by day prefer to upload video, not use external links!

Please add video OnLy checkbox in your brilliant software! In the past year i saw in the your to-do list this feature, but in the past year all links were to externel sites!

Please add video crawler, now most of the video hosts in the tumblr.com.

Sorry for my english.

Skapes (not verified)
Tue, 28/06/2016 - 22:39

if you're experiencing issues or some abnormal behaviors while you're testing this (awesome) software, don't forget to get a look at the task manager... it occurs that the application is open in multiple copies, kill them and restart one instance only.

@zab: thanks for the full-screen preview ! this feature delights me, although with the latest version this goes so fast that it becomes psychedelic. 8)

Joe (not verified)
Fri, 29/07/2016 - 01:19

Hi, first of all thank you for a great software.
Second, I have a sugestion/request ;) I have a huge list of tumblr posts urls and it would be great to import xml list of these urls of posts and let tumbltwo download the content of the post (images and videos).

example links:
e.g. (http://fullthrottleauto.tumblr.com/post/148095330854, http://fullthrottleauto.tumblr.com/post/148118945754, etc)

And third: Again thank you for a great work.

Atrax (not verified)
Sun, 20/11/2016 - 05:13

1.Can't Work
after my ssd crashed, I reset the SSD, and the TumblTwo can't work. Before the Crash, it worked ok.
TumblTwo 1.06,1.07, TumblThree 1.0.2.4 all can't work, no download, reinstall the windows 8.1, still can't work.

2. I more like TumblTwo 1.06 toolbar icon,
Where can I download the 1.06 source code with the icon?

how can I contact you? twitter or facebook or some app else or just here or Email?

zab
Sun, 20/11/2016 - 08:42

As for the Icons and Source code: https://github.com/johanneszab/TumblTwo/tree/6e75cec1da8d3417e14c69d7363...
Look under TumblOne\Resources\. I had to change them since the TumblOne author wasn't too happy about TumblTwo and me using them.

The best would be email. I rarely use any social media. Please attach the error message or explain what "can't work" means. I cannot help you at all if you just say "doesn't work" since I've no clue what that means and won't be able to help you in any way.

Before you write, try the latest binaries and delete the settings under C:\Users\"YourUserName"\AppData\Local\jzab for TumblTwo or C:\Users\"YourUserName"\AppData\Local\TumblThree for TumblThree before you start them.

Abdullah (not verified)
Wed, 15/03/2017 - 07:47

Hi Jaz,
Thanks for your effort, as I was working on a similar application to scrape images also, and was stuck for Tumblr website, as it keep scrolling !!, and way I found your work , but actually ( I don't know why ) the source code did not go well with me, but on the other hand I successfully de compiled Helena Application, and it went well for me and now I'm adding new feature like what you do.

Thanks

zab
Wed, 15/03/2017 - 09:18

How did it not work? Should be straight forward loading in Visual Studio as there isn't even one external library/nuget used ..

The first commit in TumblTwo should be the most pure TumblOne source code depending on your used decompiler. I've only modified it the way that it successfully compiles and checked its functions.

If you use the api for scraping anything, it seems like they've rate limited its access now. You can check the TumblThree issues as we've recently run into this issue. In that case you might be better off scraping the archive page as I've implemented it in this TumblThree branch.

Abdullah (not verified)
Wed, 15/03/2017 - 12:55

Hi Zab,
thanks for your reply, sorry for not clearing how it did not work for me !!
it was loaded successfully on VS , but when I run the application it runs normally , and when I add tumbler blog , it gives me tumble blog " offline" and no crawling action was done, I actually I was about to debug what is going on , but an idea pops in my mind to go and try Reflector on TumbleOne !!!!!!
and waaw , it was simple and straight forward, although it took from me three days !!! to reconstruct the code as what Helena did !!
and it was simpler than all tons of libraries you added to add more features, so now working with Helena Model is easier, and I started with my own version to add my features as well, I tried to reach Helena to thank her by email , but I could not find her contact, if you want to share ideas I love too, and if you have her contact I appreciate if you could share it.

Abdullah (not verified)
Wed, 15/03/2017 - 13:03

my original problem was hot scrape these pages that has auto scroll !!! ( I'm still new to this ), that's why I found the solution in Helena code..

thanks and again

zab
Thu, 06/04/2017 - 07:48

It hits the rate limit for connections to the api, which did not exist during the TumblTwo development. It basically opens too many connections within a specific, unknown time range, and these do not get a response but are immediately closed again.

see here for more: https://github.com/johanneszab/TumblThree/issues/26

If you use the serial option in the settings, this won't happen. But it will only download one file after another.
