
October 24, 2012


Pulling the Mixergy archive locally

by e1ven


Mixergy.com is a great resource for interviews with various entrepreneurs, both famous and up-and-coming.

I’ve always enjoyed the interviews, and since I’ve recently begun trying to exercise more, I’ve been listening to the podcast while I walk the trails.

Unfortunately, the podcasts are only available for a limited period of time, before they become premium only.

I don’t mind paying in order to download the back archives, but it’s not a straight fee-for-product transaction. At least when I last looked at it, you could only download individual interview files if you found them, one page at a time.

That’s great if you’re trying to download a single interview and listen to it, but if you want them all on your iPod to choose from while walking or driving, it’s less than ideal.

Being a command-line guy, I realized this should be a simple problem to solve with some Bash scripting. I’m sure I could have done it in Python just as easily, but since I’m in the terminal anyway, Bash is a great go-to solution 😉

Logging in

Mixergy uses cookies for authentication, storing a login token, and then checking for it when you try to download a file.

This makes a lot of sense, and is straightforward to work with.


I logged in using Firefox, then used a Firefox plugin to export my mixergy.com cookies to a file named cookies.
I could then pass this file to curl for the next set of requests.
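A quick sanity check on the exported file can save a wasted crawl. The sketch below assumes the plugin wrote the Netscape cookies.txt format (tab-separated, seven fields per line), which curl can read when `--cookie` is given a file name. The helper name `has_cookie_for` and the sample cookie are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if a Netscape-format cookie file
# contains at least one cookie for the given domain.
has_cookie_for() {
    local file=$1 domain=$2
    awk -F'\t' -v d="$domain" '
        NF == 7 {
            host = $1
            sub(/^#HttpOnly_/, "", host)   # HttpOnly cookies carry this prefix
            sub(/^\./, "", host)           # ".mixergy.com" also covers mixergy.com
            if (host == d) found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$file"
}

# Made-up sample data, for illustration only.
sample=$(mktemp)
printf '# Netscape HTTP Cookie File\n' > "$sample"
printf '.mixergy.com\tTRUE\t/\tFALSE\t0\tlogin_token\tabc123\n' >> "$sample"

if has_cookie_for "$sample" mixergy.com; then
    echo "cookie file looks usable"
else
    echo "no mixergy.com cookies found"
fi
rm -f "$sample"
```

Against the real export this would be called as `has_cookie_for cookies mixergy.com`. It only checks that an entry exists, not that the login token is still valid.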

Acquire List of Interviews

Since I couldn’t find a list of all the interviews on one page, I had to crawl backward on the news blog, harvesting each link.

I noticed that Mixergy always linked to each interview’s own page, with “Read more” as the anchor text.

I tested pulling these links in, and it seemed to work reliably.

# Generate list of interviews

curl -L --cookie cookies http://mixergy.com/interviews/page/1/ | grep "Read more" | awk -F\" '{print $4}'

This seemed to give me the list of interview-specific pages pretty well, so I iterated this out to the rest of the pages in the archive, moving through each page of the search results.
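As an offline illustration of why field 4 holds the URL: splitting a line on double quotes puts each quoted attribute value in an even-numbered field. The markup below is an assumption (one quoted attribute before `href`), not a copy of Mixergy's actual HTML, and `extract_read_more_links` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read HTML on stdin and print the link
# found on each line that contains "Read more".
extract_read_more_links() {
    grep "Read more" | awk -F'"' '{print $4}'
}

# Assumed markup: one quoted attribute precedes href, so splitting on
# double quotes gives $2 = class value and $4 = URL.
sample='<p>Intro text</p>
<a class="more-link" href="http://mixergy.com/example-interview/">Read more</a>'

printf '%s\n' "$sample" | extract_read_more_links
# prints: http://mixergy.com/example-interview/
```

If the site ever adds or removes an attribute before `href`, the field number shifts, which is the main fragility of this approach.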

I added a sleep in between requests to avoid hammering the server.

# Pull ALL the Interview URLS.

for i in $(seq 1 62); do curl -L --cookie cookies "http://mixergy.com/interviews/page/$i/" | grep "Read more" | awk -F\" '{print $4}' >> interviewpages; sleep 1; done

I then used very similar logic to extract the specific pages for the classes.

#Do the same for the classes.

for i in $(seq 1 6); do curl -L --cookie cookies "http://mixergy.com/premium/page/$i/" | grep "<h2><span>" | awk -F"a href=" '{print $2}' | awk -F\" '{print $2}' >> classes; sleep 1; done
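The class pages use two awk passes rather than one: the first cuts the line at `a href=`, and the second takes the first quoted string from what remains. Here is a small offline sketch of that idea. The sample heading is assumed markup, and `extract_class_links` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read HTML on stdin and print the link found
# in each "<h2><span>" heading line.
extract_class_links() {
    grep "<h2><span>" |
        awk -F'a href=' '{print $2}' |   # keep everything after "a href="
        awk -F'"' '{print $2}'           # first quoted string is the URL
}

# Assumed markup, for illustration only.
sample='<h2><span><a href="http://mixergy.com/example-class/">Example class</a></span></h2>'

printf '%s\n' "$sample" | extract_class_links
# prints: http://mixergy.com/example-class/
```

Cutting at `a href=` first means the result no longer depends on how many quoted attributes appear earlier in the line.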

Looking through these, I now had a list of URLs, each of which contained the text of the interview, and a link to the MP3 version.

Acquire Each Audio File



At this point, I just had to extract the links to the MP3s.
I tested with a single page:

#Generate list of MP3s

curl -L --cookie cookies http://mixergy.com/eddy-lu-grubwithus-interview/ | grep mp3 | awk -F "a href" '{print $2}' | awk -F\" '{print $2}'

This seemed to work: it gave me a URL to a single MP3.

I then ran this over each of the single-interview pages I had collected earlier, to find the URLs of all the MP3s.

#Get the MP3 list

for i in $(cat interviewpages); do sleep 1; curl -L --cookie cookies "$i" | grep mp3 | awk -F "a href" '{print $2}' | awk -F\" '{print $2}' >> interviewmp3; done
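One possible variation, not what I ran at the time: `grep -o` can print just the matching URL, which avoids depending on where `a href` falls in the line, and `sort -u` drops duplicates when a page links the same file twice. The helper name `extract_mp3_urls` and the example.com URLs are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read HTML on stdin and print each distinct
# http(s) URL ending in .mp3, one per line.
extract_mp3_urls() {
    grep -Eo 'https?://[^"<>[:space:]]+\.mp3' | sort -u
}

# Made-up sample page: the same file is linked twice.
sample='<a href="http://example.com/audio/episode-1.mp3">Download</a> <a href="http://example.com/audio/episode-1.mp3">Play</a>
<a href="http://example.com/about/">About</a>'

printf '%s\n' "$sample" | extract_mp3_urls
# prints: http://example.com/audio/episode-1.mp3
```

In the loop above, this would take the place of the `grep mp3 | awk ... | awk ...` stage.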

This gave me a file, interviewmp3, containing a direct link to each MP3.
From here, it was a simple matter to loop through and download each one.

# Retrieve all MP3s.
for i in $(cat interviewmp3); do sleep 1; wget "$i"; done

And success! I downloaded hundreds of startup interviews, and can load them onto whatever devices I choose and listen to them whenever I want.
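With hundreds of files, a run is likely to be interrupted at some point. A minimal sketch of one way to resume: list only the URLs whose file is not already on disk. It assumes each file is saved under the last path component of its URL (no query strings), and `pending_downloads` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the URLs from a list file whose target file
# (the last path component) does not yet exist in the download directory.
pending_downloads() {
    local list=$1 dir=${2:-.} url
    # The "|| [ -n ... ]" keeps a final line that lacks a trailing newline.
    while IFS= read -r url || [ -n "$url" ]; do
        [ -n "$url" ] || continue                        # skip blank lines
        [ -e "$dir/${url##*/}" ] || printf '%s\n' "$url"
    done < "$list"
}

# Usage sketch (needs network access, so left commented out):
# pending_downloads interviewmp3 . | while IFS= read -r url; do
#     wget "$url"
#     sleep 1
# done
```

This only checks that a file exists, so a partially downloaded file would be skipped. Deleting the last file before re-running is the simplest workaround.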

2 Comments
  1. max
    Feb 18 2015

    lol can u share them?

    • Jun 2 2015

      I’d love to, but that would be copyright infringement.
