Loggerhead doesn't support linking to the raw content

Bug #605775 reported by John A Meinel
26
This bug affects 4 people
Affects Status Importance Assigned to Milestone
loggerhead
Fix Released
Medium
Max Kanat-Alexander

Bug Description

There are 2 bits involved.

1) Something like the 'download' link that doesn't require including the file id. For example, you can currently link to:
http://bazaar.launchpad.net/~bzr-pqm/bzr/bzr.dev/annotate/head:/BRANCH.TODO
but to download the raw content it is:
http://bazaar.launchpad.net/~bzr-pqm/bzr/bzr.dev/download/head:/BRANCH.TODO-20060103052123-79ac4969351c03a9/BRANCH.TODO

Which requires the file-id. (and the path, which is a bit strange)

So updating 'download' to not require file-id would be sufficient.

2) A way to view raw content in-the-browser, rather than always being an octet stream. So 'raw' vs 'download'.

I believe there is a constraint that Launchpad imposes, that we don't want to be a generic file hosting service. (we don't want people to host all of their web images in Launchpad, and link them on their html websites.)

However, one answer there is that everything in 'raw' is either treated as 'text/plain' or 'application/octet-stream'. (that avoids image/jpg, etc.)

Related branches

John A Meinel (jameinel)
Changed in launchpad-code:
importance: Undecided → Medium
status: New → Confirmed
affects: launchpad-code → loggerhead
Changed in loggerhead:
assignee: nobody → Max Kanat-Alexander (mkanat)
status: Confirmed → In Progress
Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

What I've done is I've added a "raw" controller that actually attempts to do its best to serve files with the right MIME type. It's remarkably fast--about 0.05 seconds to get a file's entire content and all associated information, even on a large branch like launchpad.

What I haven't yet done is added XSS protection, which is something I may do in a follow-up bug, since the work required for this was complex enough as it is.

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 605775] Re: Loggerhead doesn't support linking to the raw content

The security implications are pretty big; this may not fly.
Considerations: browser content sniffing and hostile content. Access
to private branch data and so forth.

I can't dig into it now, but I urge you to socialise the design and
implications with folk on the larger Launchpad team - they've learnt
the hard lessons, the hard way.

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Hey Robert. I'm totally familiar with the security implications--the Bugzilla Project had a bug on this before most people on the Internet were even aware that it could possibly be a problem. What I'm saying is that the security implications will be dealt with in a follow-up bug, and in this bug they will not be.

This is not going to be deployed on Launchpad until it's secure for Launchpad, this is just going into loggerhead trunk, which is not going onto Launchpad.

Note that most loggerhead installations have no concerns whatsoever about XSS, though, BTW. There's nothing dangerous you could do to loggerhead itself, in most situations. LP happens to have private branches and credentials, so that's different.

Revision history for this message
Martin Pool (mbp) wrote :

Some discussion on the mp https://code.launchpad.net/~mkanat/loggerhead/raw-controller/+merge/42675

I think we should (or should have) split the two parts of this bug.

1- urls based only on patch is clearly useful for many urls
2- downloads that are not attachments: we should be more clear about how this is supposed to be used.

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

The only URL that uses file ids by default still is the download URL. The raw controller listed in that MP is using paths like all of the other controllers, and once that MP goes in, it will be very simple to also make the download URL use paths. (The general architecture of loggerhead also still allows using file ids in the query string parameters if you want.)

As far as how the raw view is supposed to be used, I suppose there are two cases:

1) Somebody wants to see the raw content of the file quickly without any of the view or annotation issues.
2) Somebody wants to use loggerhead to serve some content.

At first I thought that #2 wasn't going to be feasible, but now I've discovered that the raw view is so fast that it could actually be done.

For #1, you could certainly say, "just serve everything as text/plain", but that doesn't actually solve the problem of XSS, because IE 7 and below will still sniff the content and render it as whatever the browser *thinks* it is. I believe it's only IE 8 and above that support X-Content-Type-Options.

So I figured I'd go with the most logical choice and attempt to serve the content with its actual, correct MIME type. That's particularly valuable for binary files like images or other media, which couldn't have a raw view otherwise.

One advantage to this also would be that it gives people the ability to rapidly get a single file out of bzr without having to check out an entire repository.

For the most part, controlling the MIME type of a file is only the illusion of security, which is worse than no security (because it makes people believe that they are secure when they are not).

The solution to the XSS problem is very much doable, and it would just involve having a secondary domain for serving raw content. I started to implement it as part of the above MP, but it turned out to be more complicated than I was expecting, so I wanted to save it for a second patch, since patches should generally be small and focused so that they can be polished and debugged appropriately (among many other important reasons to keep changes small and focused).

Revision history for this message
Martin Pool (mbp) wrote :
Download full text (3.3 KiB)

On 7 December 2010 15:51, Max Kanat-Alexander <email address hidden> wrote:
> The only URL that uses file ids by default still is the download URL.
> The raw controller listed in that MP is using paths like all of the
> other controllers, and once that MP goes in, it will be very simple to
> also make the download URL use paths. (The general architecture of
> loggerhead also still allows using file ids in the query string
> parameters if you want.)
>
> As far as how the raw view is supposed to be used, I suppose there are
> two cases:
>
> 1) Somebody wants to see the raw content of the file quickly without any of the view or annotation issues.
> 2) Somebody wants to use loggerhead to serve some content.
>
> At first I thought that #2 wasn't going to be feasible, but now I've
> discovered that the raw view is so fast that it could actually be done.

That is pretty cool.

> For #1, you could certainly say, "just serve everything as text/plain",
> but that doesn't actually solve the problem of XSS, because IE 7 and
> below will still sniff the content and render it as whatever the browser
> *thinks* it is. I believe it's only IE 8 and above that support X
> -Content-Type-Options.
>
> So I figured I'd go with the most logical choice and attempt to serve
> the content with its actual, correct MIME type. That's particularly
> valuable for binary files like images or other media, which couldn't
> have a raw view otherwise.
>
> One advantage to this also would be that it gives people the ability to
> rapidly get a single file out of bzr without having to check out an
> entire repository.
>
> For the most part, controlling the MIME type of a file is only the
> illusion of security, which is worse than no security (because it makes
> people believe that they are secure when they are not).
>
> The solution to the XSS problem is very much doable, and it would just
> involve having a secondary domain for serving raw content. I started to
> implement it as part of the above MP, but it turned out to be more
> complicated than I was expecting, so I wanted to save it for a second
> patch, since patches should generally be small and focused so that they
> can be polished and debugged appropriately (among many other important
> reasons to keep changes small and focused).

I wasn't suggesting just changing the mime type to text/plain, and I
agree that's not enough. What I was suggesting was to serve an html
page that contains nothing but the entity-escaped plain text of the
file. For "just show me the plain text" this seems like an easy/safe
solution, though it will be much less useful if you want a URL for a
machine download. However, in the second case probably attachment
disposition is enough.

So I think we have three cases:

1- I want to paste a URL to wget: use /download (is that and get it
as disposition: attachment
2- I want to see just the text in a web browser, without frills: use
/text and you get content-type: html, escaped
3- I want to serve binary assets direct from loggerhead: use /binary
and you get what this patch does, either from the same domain or a
different one. Launchpad would presumably have this off for the
moment; at any rate ...

Read more...

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Okay, that all makes sense, except that not all of the content in the VCS is text (although of course most of it is). When I ask for the raw content of a png, I as a user expect to actually view a png image in my browser. I wouldn't expect to have to go to some separate URL for that.

I suppose I could call the controllers /text/ and /content/ or something, but how would I even differentiate those in the UI? That would be confusing for the user. Also, I'd have to put a /text/ link on the inventory view (the /files browser) for every file in the inventory, because I can't reliably check the MIME type of every file when I'm displaying the contents of the inventory. So there would be /text/ links for binary files, which wouldn't make sense at all.

Revision history for this message
Martin Pool (mbp) wrote :

On 8 December 2010 02:32, Max Kanat-Alexander <email address hidden> wrote:
> Okay, that all makes sense, except that not all of the content in the
> VCS is text (although of course most of it is). When I ask for the raw
> content of a png, I as a user expect to actually view a png image in my
> browser. I wouldn't expect to have to go to some separate URL for that.

That's true.

I guess really there are two(?) types of URL: download, and view. You
should always be able to download and it should always give you c-d:
attachment; and in some cases you can view it inline.

What happens on the 'view it inline' case is going to depend on the
site configuration:

1- Don't worry about xss and are happy to just show whatever's in the
branch; too easy
2- Configure one or more separate download domains and let it be
pulled from there
3- Show content that's known to be safe. I think that in practice
determining what counts as "safe" is going to be a bit tricky, it
seems to me, because it needs to work around browser content sniffing,
but we can probably do it.
3a - Defang the content, by eg turning text into escaped html.

> I suppose I could call the controllers /text/ and /content/ or
> something, but how would I even differentiate those in the UI? That
> would be confusing for the user. Also, I'd have to put a /text/ link on
> the inventory view (the /files browser) for every file in the inventory,
> because I can't reliably check the MIME type of every file when I'm
> displaying the contents of the inventory. So there would be /text/ links
> for binary files, which wouldn't make sense at all.

I agree we don't want to check the type until the user actually asks
for it, so actually /text/ is a bad idea, as is enabling this only on
files that are safe to download.

--
Martin

Revision history for this message
Martin Pool (mbp) wrote :

So, using a separate download domain would be analogous to what we do in lp for bug attachment downloads; if it's acceptable there it should be ok here. Therefore we should have a config variable that allows you to choose between (from comment #8) number 1 and 2.

Revision history for this message
Martin Pool (mbp) wrote :
Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Yes, having a separate download domain(s) is exactly what I was going to do in a follow-up bug--I just didn't do it here because it was complex. If you want that complexity and the review of it to be mixed in to the current MP (which I don't think is a good idea) I can do that. (I'll go look at the MP comment too--haven't seen that yet.)

Revision history for this message
Martin Pool (mbp) wrote :

On 15 December 2010 16:53, Max Kanat-Alexander
<email address hidden> wrote:
> Yes, having a separate download domain(s) is exactly what I was going to
> do in a follow-up bug--I just didn't do it here because it was complex.
> If you want that complexity and the review of it to be mixed in to the
> current MP (which I don't think is a good idea) I can do that. (I'll go
> look at the MP comment too--haven't seen that yet.)

I think doing it in a separate landing is great, I just didn't want to
leave an inadvertent exposure open in the interim.

--
Martin

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Okay, so, I have a working XSS-protection branch, now, at lp:~mkanat/loggerhead/raw-prefix.

It still needs two enhancements/fixes, though:

* Right now if you specify both --prefix and --raw-prefix, --raw-prefix gets overridden by the PrefixMiddleware. Does anybody know how to disable PrefixMiddleware for just one specific response?

* It needs the ability to have different URLs for each branch.

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

I worked on it more today. Now it has the ability to have different URLs for each branch, but it still has two problems:

* It doesn't work OK with --prefix.
* It doesn't work behind a proxy.

These are both a result of the same problem--that PrefixMiddleware comes after the app creation, but the is the only place where we can know which parts of the path are a branch and which part are a controller.

Perhaps I'll have to put some code that sets environ['loggerhead.is_raw_controller'] during app creation, and write some middleware that does the redirection later.

Revision history for this message
Max Kanat-Alexander (mkanat) wrote :

Ah, nevermind! Was a bug in my code. Prefix and proxies work fine. There will be an MP coming up soon.

Changed in loggerhead:
status: In Progress → Fix Committed
Changed in loggerhead:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.